Collaborative Targeted Maximum Likelihood Estimation.
Collaborative Targeted Maximum Likelihood Estimation
Collaborative Targeted Maximum Likelihood Estimation (C-TMLE) is an extention of Targeted Maximum Likelihood Estimation (TMLE). It applies variable/model selection for nuisance parameter (e.g. the propensity score) estimation in a 'collaborative' way, by directly optimizing the empirical metric on the causal estimator.
In this package, we implemented the general template of C-TMLE, for the estimation of the average treatment effect (ATE).
The package also offers convenient functions for discrete C-TMLE for variable selection, and LASSO-C-TMLE for model selection of LASSO, in estimation of the propensity score (PS).
Installation
To install the CRAN release version of ctmle
:
install.packages('ctmle')
To install the development version (requires the devtools package):
devtools::install_github('jucheng1992/ctmle')
C-TMLE for variable selection
In this section, we start with examples of discrete C-TMLE for variable selection, using greedy forward searching, and scalable discrete C-TMLE with pre-ordering option.
library(ctmle)
#> Loading required package: SuperLearner
#> Loading required package: nnls
#> Super Learner
#> Version: 2.0-22
#> Package created on 2017-07-18
#> Loading required package: tmle
#> Welcome to the tmle package, version 1.2.0-5
#>
#> Use tmleNews() to see details on changes and bug fixes
#> Loading required package: glmnet
#> Loading required package: Matrix
#> Loading required package: foreach
#> Loaded glmnet 2.0-10
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(123)
N <- 1000
p = 5
Wmat <- matrix(rnorm(N * p), ncol = p)
beta1 <- 4+2*Wmat[,1]+2*Wmat[,2]+2*Wmat[,5]
beta0 <- 2+2*Wmat[,1]+2*Wmat[,2]+2*Wmat[,5]
tau <- 2
gcoef <- matrix(c(-1,-1,rep(-(3/((p)-2)),(p)-2)),ncol=1)
W <- as.matrix(Wmat)
g <- 1/(1+exp(W%*%gcoef /3))
A <- rbinom(N, 1, prob = g)
epsilon <-rnorm(N, 0, 1)
Y <- beta0 + tau * A + epsilon
# With initial estimate of Q
Q <- cbind(rep(mean(Y[A == 0]), N), rep(mean(Y[A == 1]), N))
time_greedy <- system.time(
ctmle_discrete_fit1 <- ctmleDiscrete(Y = Y, A = A, W = data.frame(Wmat), Q = Q,
preOrder = FALSE, detailed = TRUE)
)
ctmle_discrete_fit2 <- ctmleDiscrete(Y = Y, A = A, W = data.frame(Wmat),
preOrder = FALSE, detailed = TRUE)
time_preorder <- system.time(
ctmle_discrete_fit3 <- ctmleDiscrete(Y = Y, A = A, W = data.frame(Wmat), Q = Q,
preOrder = TRUE,
order = rev(1:p), detailed = TRUE)
)
Scalable (discrete) C-TMLE takes much less computation time:
time_greedy
#> user system elapsed
#> 1.589 0.045 1.646
time_preorder
#> user system elapsed
#> 0.994 0.012 1.008
Show the brief results from greedy CTMLE:
ctmle_discrete_fit1
#> C-TMLE result:
#> parameter estimate: 1.99472
#> estimated variance: 0.00838
#> p-value: <2e-16
#> 95% conf interval: (1.81533, 2.1741)
Summary function offers detial information of which variable is selected.
summary(ctmle_discrete_fit1)
#>
#> Number of candidate TMLE estimators created: 6
#> A candidate TMLE estimator was created at each move, as each new term
#> was incorporated into the model for g.
#> ----------------------------------------------------------------------
#> term added cleverCovar estimate cv-RSS cv-varIC cv-penRSS
#> cand 1 (intercept) 1 4.22 19.9 0.0788 14045
#> cand 2 X2 1 3.22 19.6 0.0851 13818
#> cand 3 X5 1 2.61 19.1 0.0870 13485
#> cand 4 X1 1 2.00 18.3 0.0955 12945
#> cand 5 X4 2 1.99 18.3 0.0950 12937
#> cand 6 X3 3 2.01 18.3 0.1008 12941
#> ----------------------------------------------------------------------
#> Selected TMLE estimator is candidate 5
#>
#> Each TMLE candidate was created by fluctuating the initial fit, Q0(A,W)=E[Y|A,W], obtained in stage 1.
#>
#> cand 1: Q1(A,W) = Q0(A,W) + epsilon1a * h1a
#> h1a is based on an intercept-only model for treatment mechanism g(A,W)
#>
#> cand 2: Q2(A,W) = Q0(A,W) + epsilon1b * h1b
#> h1b is based on a treatment mechanism model containing covariates X2
#>
#> cand 3: Q3(A,W) = Q0(A,W) + epsilon1c * h1c
#> h1c is based on a treatment mechanism model containing covariates X2, X5
#>
#> cand 4: Q4(A,W) = Q0(A,W) + epsilon1d * h1d
#> h1d is based on a treatment mechanism model containing covariates X2, X5, X1
#>
#> cand 5: Q5(A,W) = Q0(A,W) + epsilon1d * h1d + epsilon2 * h2 = Q4(A,W) + epsilon2 * h2,
#> h2 is based on a treatment mechanism model containing covariates X2, X5, X1, X4
#>
#> cand 6: Q6(A,W) = Q0(A,W) + epsilon1d * h1d + epsilon2 * h2 + epsilon3 * h3 = Q5(A,W) + epsilon3 * h3,
#> h3 is based on a treatment mechanism model containing covariates X2, X5, X1, X4, X3
#>
#> ----------
#> C-TMLE result:
#> parameter estimate: 1.99472
#> estimated variance: 0.00838
#> p-value: <2e-16
#> 95% conf interval: (1.81533, 2.1741)
LASSO-C-TMLE for model selection of LASSO
In this section, we introduce the LASSO-C-TMLE algorithm for model selection of LASSO in the estimation of the propensity score. We implemented three variations of the LASSO-C-TMLE algorithm. For simplicity, we call them C-TMLE1-3. See technical details in the corresponding references.
# Generate high-dimensional data
set.seed(123)
N <- 1000
p = 100
Wmat <- matrix(rnorm(N * p), ncol = p)
beta1 <- 4 + 2 * Wmat[,1] + 2 * Wmat[,2] + 2 * Wmat[,5] + 2 * Wmat[,6] + 2 * Wmat[,8]
beta0 <- 2 + 2 * Wmat[,1] + 2 * Wmat[,2] + 2 * Wmat[,5] + 2 * Wmat[,6] + 2 * Wmat[,8]
tau <- 2
gcoef <- matrix(c(-1,-1,rep(-(3/((p)-2)),(p)-2)),ncol=1)
W <- as.matrix(Wmat)
g <- 1/(1+exp(W%*%gcoef /3))
A <- rbinom(N, 1, prob = g)
epsilon <-rnorm(N, 0, 1)
Y <- beta0 + tau * A + epsilon
# With initial estimate of Q
Q <- cbind(rep(mean(Y[A == 0]), N), rep(mean(Y[A == 1]), N))
glmnet_fit <- cv.glmnet(y = A, x = W, family = 'binomial', nlambda = 20)
We start build a sequence of lambdas from the lambda selected by cross-validation, as the model selected by cv.glmnet would over-smooth w.r.t. the target parameter.
lambdas <- glmnet_fit$lambda[(which(glmnet_fit$lambda==glmnet_fit$lambda.min)):length(glmnet_fit$lambda)]
We fit C-TMLE1 algorithm by feed the algorithm with a vector of lambda, in decreasing order:
time_ctmlelasso1 <- system.time(
ctmle_fit1 <- ctmleGlmnet(Y = Y, A = A,
W = data.frame(W = W),
Q = Q, lambdas = lambdas, ctmletype=1,
family="gaussian",gbound=0.025, V=5)
)
We fit C-TMLE2 algorithm:
time_ctmlelasso2 <- system.time(
ctmle_fit2 <- ctmleGlmnet(Y = Y, A = A,
W = data.frame(W = W),
Q = Q, lambdas = lambdas, ctmletype=2,
family="gaussian",gbound=0.025, V=5)
)
For C-TMLE3, we need two gn estimators, one with lambda selected by cross-validation, and the other with lambda slightly different from the selected lambda:
gcv <- predict.cv.glmnet(glmnet_fit, newx=W, s="lambda.min",type="response")
gcv <- bound(gcv,c(0.025,0.975))
s_prev <- glmnet_fit$lambda[(which(glmnet_fit$lambda == glmnet_fit$lambda.min))] * (1+5e-2)
gcvPrev <- predict.cv.glmnet(glmnet_fit,newx = W,s = s_prev,type="response")
gcvPrev <- bound(gcvPrev,c(0.025,0.975))
time_ctmlelasso3 <- system.time(
ctmle_fit3 <- ctmleGlmnet(Y = Y, A = A, W = W, Q = Q,
ctmletype=3, g1W = gcv, g1WPrev = gcvPrev,
family="gaussian",
gbound=0.025, V = 5)
)
Les't compare the running time for each LASSO-C-TMLE
time_ctmlelasso1
#> user system elapsed
#> 15.005 0.104 15.266
time_ctmlelasso2
#> user system elapsed
#> 18.351 0.083 18.528
time_ctmlelasso3
#> user system elapsed
#> 0.005 0.000 0.006
Finally, we compare three C-TMLE estimates:
ctmle_fit1
#> C-TMLE result:
#> parameter estimate: 2.20368
#> estimated variance: 0.09796
#> p-value: 1.9124e-12
#> 95% conf interval: (1.59022, 2.81714)
ctmle_fit2
#> C-TMLE result:
#> parameter estimate: 2.16669
#> estimated variance: 0.05327
#> p-value: <2e-16
#> 95% conf interval: (1.71429, 2.61908)
ctmle_fit3
#> C-TMLE result:
#> parameter estimate: 2.02388
#> estimated variance: 0.04972
#> p-value: <2e-16
#> 95% conf interval: (1.58684, 2.46093)
Show which regularization parameter (lambda) is selected by C-TMLE1:
lambdas[ctmle_fit1$best_k]
#> [1] 0.004409285
In comparison, we show which regularization parameter (lambda) is selected by cv.glmnet:
glmnet_fit$lambda.min
#> [1] 0.03065303
Advanced topic: the general template of C-TMLE
In this section, we briefly introduce the general template of C-TMLE. In this function, the gn candidates could be a user-specified matrix, each column stand for the estimated PS for each unit. The estimators should be ordered by their empirical fit.
As C-TMLE requires cross-validation, it needs two gn estimate: one from cross-validated prediction, one from a vanilla prediction. For example, consider 5-folds cross-validation, where argument folds
is the list of indices for each folds, then the (i,j)-th element in input gn_candidates_cv
should be the predicted value of i-th unit, predicted by j-th unit, trained by other 4 folds where all of them do not contain i-th unit. gn_candidates
should be just the predicted PS for each estimator trained on the whole data.
We could easily use SuperLearner
package and build_gn_seq
function to easily achieve this:
lasso_fit <- cv.glmnet(x = as.matrix(W), y = A, alpha = 1, nlambda = 100, nfolds = 10)
lasso_lambdas <- lasso_fit$lambda[lasso_fit$lambda <= lasso_fit$lambda.min][1:5]
# Build SL template for glmnet
SL.glmnet_new <- function(Y, X, newX, family, obsWeights, id, alpha = 1,
nlambda = 100, lambda = 0,...){
# browser()
if (!is.matrix(X)) {
X <- model.matrix(~-1 + ., X)
newX <- model.matrix(~-1 + ., newX)
}
fit <- glmnet::glmnet(x = X, y = Y,
lambda = lambda,
family = family$family, alpha = alpha)
pred <- predict(fit, newx = newX, type = "response")
fit <- list(object = fit)
class(fit) <- "SL.glmnet"
out <- list(pred = pred, fit = fit)
return(out)
}
# Use a sequence of estimator to build gn sequence:
SL.cv1lasso <- function (... , alpha = 1, lambda = lasso_lambdas[1]){
SL.glmnet_new(... , alpha = alpha, lambda = lambda)
}
SL.cv2lasso <- function (... , alpha = 1, lambda = lasso_lambdas[2]){
SL.glmnet_new(... , alpha = alpha, lambda = lambda)
}
SL.cv3lasso <- function (... , alpha = 1, lambda = lasso_lambdas[3]){
SL.glmnet_new(... , alpha = alpha, lambda = lambda)
}
SL.cv4lasso <- function (... , alpha = 1, lambda = lasso_lambdas[4]){
SL.glmnet_new(... , alpha = alpha, lambda = lambda)
}
SL.library = c('SL.cv1lasso', 'SL.cv2lasso', 'SL.cv3lasso', 'SL.cv4lasso', 'SL.glm')
Construct the object folds
, which is a list of indices for each fold
V = 5
folds <-by(sample(1:N,N), rep(1:V, length=N), list)
Use folds
and SuperLearner template to compute gn_candidates
and gn_candidates_cv
gn_seq <- build_gn_seq(A = A, W = W, SL.library = SL.library, folds = folds)
#> Number of covariates in All is: 100
#> CV SL.cv1lasso_All
#> CV SL.cv2lasso_All
#> CV SL.cv3lasso_All
#> CV SL.cv4lasso_All
#> CV SL.glm_All
#> Number of covariates in All is: 100
#> CV SL.cv1lasso_All
#> CV SL.cv2lasso_All
#> CV SL.cv3lasso_All
#> CV SL.cv4lasso_All
#> CV SL.glm_All
#> Number of covariates in All is: 100
#> CV SL.cv1lasso_All
#> CV SL.cv2lasso_All
#> CV SL.cv3lasso_All
#> CV SL.cv4lasso_All
#> CV SL.glm_All
#> Number of covariates in All is: 100
#> CV SL.cv1lasso_All
#> CV SL.cv2lasso_All
#> CV SL.cv3lasso_All
#> CV SL.cv4lasso_All
#> CV SL.glm_All
#> Number of covariates in All is: 100
#> CV SL.cv1lasso_All
#> CV SL.cv2lasso_All
#> CV SL.cv3lasso_All
#> CV SL.cv4lasso_All
#> CV SL.glm_All
#> Non-Negative least squares convergence: TRUE
#> full SL.cv1lasso_All
#> full SL.cv2lasso_All
#> full SL.cv3lasso_All
#> full SL.cv4lasso_All
#> full SL.glm_All
Lets look at the output of build_gn_seq
gn_seq$gn_candidates %>% dim
#> [1] 1000 5
gn_seq$gn_candidates_cv %>% dim
#> [1] 1000 5
gn_seq$folds %>% length
#> [1] 5
Then we could use ctmleGeneral
algorithm. As input estimator is already trained, it is much faster than previous C-TMLE algorithms.
Note: we recommand use the same folds
as build_gn_seq
for ctmleGeneral
, to make cross-validation objective.
ctmle_general_fit1 <- ctmleGeneral(Y = Y, A = A, W = W, Q = Q,
ctmletype = 1,
gn_candidates = gn_seq$gn_candidates,
gn_candidates_cv = gn_seq$gn_candidates_cv,
folds = folds, V = 5)
ctmle_general_fit1
#> C-TMLE result:
#> parameter estimate: 2.19494
#> estimated variance: 0.08348
#> p-value: 3.0302e-14
#> 95% conf interval: (1.62865, 2.76122)
Citation
If you used ctmle
package in your research, please cite:
Ju, Cheng; Susan, Gruber; van der Laan, Mark J.; ctmle: Collaborative Targeted Maximum Likelihood Estimation. R package version 0.1.1, https://CRAN.R-project.org/package=ctmle.
@Manual{,
title = {ctmle: Collaborative Targeted Maximum Likelihood Estimation},
author = {Cheng Ju and Susan Gruber and Mark van der Laan},
year = {2017},
note = {R package version 0.1.1},
url = {https://CRAN.R-project.org/package=ctmle},
}
References (by inverse chronological order)
C-TMLE for Adaptive Propensity Score Truncation
Ju, Cheng, Schwab, Joshua, & van der Laan, Mark J. (2017). On Adaptive Propensity Score Truncation in Causal Inference. arXiv preprint arXiv:1707.05861 (2017).
LASSO-C-TMLE
Ju, Cheng; Wyss, Richard; Franklin, Jessica M.; Schneeweiss, Sebastian; Häggström, Jenny; van der Laan, Mark J.. "Collaborative-controlled LASSO for Constructing Propensity Score-based Estimators in High-Dimensional Data", arXiv preprint arXiv: 1706.10029 (2017). (To appear in Statistical Methods in Medical Research)
Scalable Discrete C-TMLE with Pre-ordering
Ju, Cheng; Gruber, Susan; Lendle, Samuel D.; Chambaz, Antoine; Franklin, Jessica M.; Wyss, Richard; Schneeweiss, Sebastian; and van der Laan, Mark J.. "Scalable Collaborative Targeted Learning for High-dimensional Data", Statistical Methods in Medical Research (2017), https://doi.org/10.1177/0962280217729845.
Discrete C-TMLE with Greedy Search
Susan, Gruber, and van rder Laan, Mark J.. "An Application of Collaborative Targeted Maximum Likelihood Estimation in Causal Inference and Genomics." The International Journal of Biostatistics 6.1 (2010): 1-31.
General Template of C-TMLE
van der Laan, Mark J., and Susan Gruber. "Collaborative double robust targeted maximum likelihood estimation." The international journal of biostatistics 6.1 (2010): 1-71.
C-TMLE for Model Selection
In preperation.