The need for causal inference arises in many different fields. Economists want to quantify the causal effect of an interest rate cut on the economy. Trade policy makers want to know whether increased tariffs cause changes in the trade deficit. Healthcare providers want to assess whether a therapy causes changes in specific patient outcomes. Conceptually, causal inference is all about the question of ‘what if’. People can ask themselves: what if I had taken chemotherapy rather than radiation to treat my cancer, would I be better now? Or what if I had gone to graduate school rather than entering the job market after graduation, would I have better career prospects? Economists may be curious: what if the Federal Reserve had raised the interest rate by 50bp rather than 25bp, would the inflation rate have been better controlled? The most straightforward way to answer these questions is: if we knew in advance what would happen under each of these actions, we could of course choose the one that benefits our goal the most. This leads to the Potential Outcome Framework that we will discuss in the next section.
The potential outcome framework, or Rubin–Neyman potential outcome framework (Donald B. Rubin 1974; Neyman 1923; Imbens and Rubin 2015), is the most canonical framework for causal inference studies. Suppose the treatment is binary, denoted as \(T \in \{1,-1\}\), with corresponding potential outcomes \(Y^{(1)}\) and \(Y^{(-1)}\), respectively. However, one can only observe a single realized outcome: the observed outcome is factual and the unobserved one is counterfactual. In other words, the outcome can only be either \(Y^{(1)}\) or \(Y^{(-1)}\), which is represented by \[ Y = 1[T = 1]Y^{(1)} + 1[T = -1]Y^{(-1)}. \] The individual causal effect, or individual treatment effect (ITE), is thus defined by \[\begin{equation} \tau_i = Y_i^{(1)} - Y_i^{(-1)} \tag{1} \end{equation}\] for subject \(i\), but it is unobservable in the real world.
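As a concrete illustration of why the ITE in (1) is unobservable, here is a minimal simulation sketch; the data-generating process is invented purely for illustration. Both potential outcomes exist for every subject, but only the factual one enters the observed data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Hypothetical data-generating process: we, as the simulator, see both
# potential outcomes; an analyst never would.
x = rng.normal(size=n)
y1 = 1.0 + x + rng.normal(scale=0.1, size=n)   # Y^(1), outcome under T = 1
y0 = x + rng.normal(scale=0.1, size=n)         # Y^(-1), outcome under T = -1
t = rng.choice([1, -1], size=n)                # realized treatment

# Observed outcome: Y = 1[T=1] Y^(1) + 1[T=-1] Y^(-1)
y = np.where(t == 1, y1, y0)

ite = y1 - y0   # the ITE of equation (1) -- computable only in simulation
print(np.round(ite, 2))   # in real data, only (x, t, y) is ever available
```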
To make the estimation of causal effects feasible, we have to make several assumptions:
Assumption 1. Stable Unit Treatment Value Assumption (SUTVA) (Donald B. Rubin 1980)
1) Treatment applied to one unit does not affect the outcome for other units (No Interference)
2) There is no hidden variation in treatment; that is, there are no different forms or versions of each treatment level (Consistency)
Assumption 2. Unconfoundedness/Ignorability/Strongly Ignorable Treatment Assignment (SITA) \[\begin{equation} (Y^{(1)}, Y^{(-1)}) \perp T \mid \mathbf X, \label{assump2} \end{equation}\]
Assumption 3. Common Support/Positivity \[\begin{equation} 0 < \Pr(T = 1 | \mathbf X = \mathbf x) =\pi(\mathbf x) < 1 \label{assump3} \end{equation}\] where \(\pi(\mathbf x)\) is called Propensity Score (PS).
In the following sections, we will discuss how to estimate the individual or average treatment effects under this potential outcome framework with these assumptions.
In general, most research on causal inference studies either the population-level treatment effect, i.e., the average treatment effect (ATE), or the subject-level/individual treatment effect (ITE), i.e., the conditional average treatment effect (CATE), which is interchangeable with the heterogeneous treatment effect (HTE).
Broadly speaking, the observations can come either from randomized experiments, like randomized controlled trials (RCTs), or from observational studies, i.e., so-called real world data (RWD). With different types of observations, the estimation methods can be very different.
Briefly speaking, the ATE can be easily obtained via well-designed RCTs or A/B tests (we will see why in the section Average Treatment Effect). That’s how pharmaceutical/biotech companies evaluate their drugs’ effectiveness. In recent years, a big trend in drug development, supported by the FDA, is to use RWD and real world evidence (RWE) to support as well as speed up regulatory approval. Since RWD is observational rather than randomized controlled, a sizable number of methods have been proposed, including Matching (Rosenbaum and Rubin 1983; Austin 2014), Subclassification (Rosenbaum and Rubin 1984), Weighting (Hirano, Imbens, and Ridder 2003; Austin 2011; Austin and Stuart 2015), Regression (Donald B. Rubin 1979; D. B. Rubin 1985; Hahn 1998; Gutman and Rubin 2013), or a mixture of them (Bang and Robins 2005; Van Der Laan and Rubin 2006).
Another hot topic in causal inference is the estimation of CATE/HTE/ITE from observational studies. This is usually challenging because, given the noise of observational studies and the subject-level resolution, the estimates can have large variance. Various methods have been proposed, including but not limited to causal boosting (Powers et al. 2018), causal forests (Athey and Imbens 2016), individual-treatment-rule-based methods that mostly involve outcome weighted learning (OWL) (Qian and Murphy 2011; Zhao et al. 2012; X. Zhou et al. 2017; Qi et al. 2020), and several meta-learners, such as the X-learner (Künzel et al. 2019) and the R-learner (Nie and Wager 2020). These methods easily specialize to CATE/HTE/ITE estimation in randomized trials by replacing the estimated propensity scores with fixed numbers.
Different scientific questions lead to the estimation of different causal estimands. As mentioned above, the ATE assesses the population/sub-population treatment effect, which is the key focus for drug companies. On the other hand, the CATE concerns the individual-level treatment effect, so it can be used for precision medicine or personalized recommendations. Here is a list of the acronyms of some commonly used causal estimands:
The ATE can be obtained directly from a well-designed randomized controlled trial (RCT) and reveals the population-level treatment effect. \[ \begin{aligned} \tau & = E[Y^{(1)} - Y^{(-1)}]\\ & = E[Y^{(1)}] - E[Y^{(-1)}] \quad\text{ (causation)}\\ & = E[Y^{(1)} \mid T = 1] - E[Y^{(-1)} \mid T = -1] \quad\text{ (randomization)}\\ & = E[Y \mid T = 1] - E[Y \mid T = -1] \quad\text{ (association)} \end{aligned} \]
The essence here is that randomization eliminates all confounding effects; in other words, the Unconfoundedness and Positivity assumptions hold automatically under randomized controlled experiments.
Remark: \(E[Y^{(1)}] - E[Y^{(-1)}]\) is the causal quantity of interest but is unobservable in the real world. By unconfoundedness and positivity, we infer this unobservable causal quantity from the observed data via \(E[Y \mid T = 1] - E[Y \mid T = -1]\), which is known as association. In general, association does not imply causation.
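As a quick sanity check of the identity above, the following sketch simulates a toy RCT (the outcome model and the true ATE of 2 are arbitrary choices for illustration) and recovers the ATE with a simple difference in arm means.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

x = rng.normal(size=n)
t = rng.choice([1, -1], size=n)        # randomization: T independent of potential outcomes
y1 = 2.0 + x + rng.normal(size=n)      # true ATE = 2 by construction
y0 = x + rng.normal(size=n)
y = np.where(t == 1, y1, y0)

# Association equals causation under randomization:
ate_hat = y[t == 1].mean() - y[t == -1].mean()
print(ate_hat)   # approximately 2
```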
Another commonly used causal estimand is the average treatment effect for the treated (ATT), which focuses on the population that receives the treatment \[ \tau^{\text{ATT}} = E[Y^{(1)} - Y^{(-1)}\mid T = 1]. \]
The propensity score (PS) (Rosenbaum and Rubin 1983) is defined as the conditional probability of receiving the treatment given pre-treatment covariates \(\mathbf X\). The PS is at the very core of causal analysis, especially in observational studies (though even in randomized trials, the PS can be used for covariate adjustment to obtain better results).
Though Matching is also a very popular approach to ATE estimation, here we discuss Weighting in detail.
The original idea for estimating the ATE (under Assumptions 1, 2, and 3) is to first find pairs of observations in the two arms whose covariates are close enough to each other, and then average the differences out: \[ \tau = E_{\mathcal X}\{E[Y|\mathbf X \in \mathcal X_j, T = 1]-E[Y|\mathbf X\in \mathcal X_j, T = -1]\} \] where \(\mathcal X_j\) is a subspace such that \(\bigcup_{j=1}^J \mathcal X_j = \mathcal X\) and \(\mathcal X_i \bigcap \mathcal X_j = \emptyset\) for any \(i\neq j\). One-to-one paired matches are hard to find, especially when \(\mathcal X\) is high-dimensional. So people proposed subclassification or coarsened matching, which also provide more robust estimation.
The matching idea culminated in the milestone work of Rosenbaum and Rubin (1983), who demonstrated that matching on the propensity score alone is sufficient, because \[ (Y^{(1)}, Y^{(-1)}) \perp T \mid \mathbf X \Rightarrow (Y^{(1)}, Y^{(-1)}) \perp T \mid \pi(\mathbf X ). \] The whole matching task is thus transformed from a potentially high-dimensional covariate space \(\mathcal X\) to a single scalar. \[ \begin{aligned} & E_{\pi(\mathbf X)}\{E[Y \mid T = 1, \pi(\mathbf X)] - E[Y \mid T = -1, \pi(\mathbf X)] \} \\ =& E_{\pi(\mathbf X)}\{E[Y^{(1)} \mid \pi(\mathbf X)] - E[Y^{(-1)} \mid \pi(\mathbf X)]\} \\ =& E[Y^{(1)} - Y^{(-1)}] = \tau \end{aligned} \] But how to obtain a good estimator of the propensity score \(\pi(\cdot)\) turns out to be the key problem.
It can be shown that (proof in the Appendix) \[ E[Y^{(t)}] = E\left[\frac{1[T = t]Y}{\Pr(T=t \mid \mathbf X)}\right] \] which leads to \[\begin{equation} \tau = E[Y^{(1)} - Y^{(-1)}] = E\left[\frac{1[T = 1]Y}{\pi(\mathbf X)}\right] - E\left[\frac{1[T = -1]Y}{1-\pi(\mathbf X)}\right]. \tag{2} \end{equation}\] Thus, the estimator of the ATE based on the observed dataset is \[ \hat\tau = \frac{1}{n}\sum_{i:t_i =1} \frac{y_i}{\hat \pi(\mathbf x_i)} - \frac{1}{n}\sum_{i:t_i =-1} \frac{y_i}{1-\hat \pi(\mathbf x_i)}. \] As this approach can be traced back to Horvitz and Thompson (1952), it is sometimes referred to as the Horvitz–Thompson (HT) estimator (Imai and Ratkovic 2014). Since weighting directly by the inverse of the propensity score may cause large variability when the estimated \(\pi(\mathbf x_i)\) is close to either 1 or 0, Hirano, Imbens, and Ridder (2003) proposed to standardize the weights: \[ \hat\tau = \sum_{i:t_i =1} \frac{y_i}{\hat \pi(\mathbf x_i)}\Big/\sum_{i:t_i =1}\frac{1}{\hat \pi(\mathbf x_i)} - \sum_{i:t_i =-1} \frac{y_i}{1-\hat \pi(\mathbf x_i)}\Big/\sum_{i:t_i =-1}\frac{1}{1-\hat \pi(\mathbf x_i)}. \] In our simulation studies, to distinguish it from HT, IPW refers to this standardized weighting method. Similarly, the ATT can be estimated by \[ \begin{aligned} \tau^{\text{ATT}} =& E[Y^{(1)} - Y^{(-1)}\mid T = 1] = E[Y^{(1)}\mid T=1] - E[Y^{(-1)}\mid T=1]\\ =& \frac{1}{\pi}\left(E[1[T=1]Y^{(1)}]+E[1[T=-1]Y^{(-1)}] - E\left[\frac{1[T = -1]Y}{1-\pi(\mathbf X)}\right]\right) \\ =& \frac{1}{\pi}E\left[\left(1-\frac{1[T = -1]}{1-\pi(\mathbf X)}\right)Y\right] = \frac{1}{\pi}E\left[\left(\frac{1[T = 1]-\pi(\mathbf X)}{1-\pi(\mathbf X)}\right)Y\right] \\ \approx& \frac{1}{n_1}\left(\sum_{i:t_i=1}y_i - \sum_{i:t_i=-1}\frac{\pi(\mathbf x_i)}{1-\pi(\mathbf x_i)}y_i\right) \quad \text{(finite sample estimation)} \end{aligned} \] where \(\pi = \Pr(T=1)\) and the second line comes from: \[ \begin{aligned} & E\left[\frac{1[T = -1]Y}{1-\pi(\mathbf X)}\right] = E[Y^{(-1)}] = E[Y^{(-1)}\mid T=1]\Pr(T=1) + E[Y^{(-1)}\mid T=-1]\Pr(T=-1) \\ \Rightarrow & E[Y^{(-1)}\mid T=1] = \frac{1}{\pi}\left(E\left[\frac{1[T = -1]Y}{1-\pi(\mathbf X)}\right] - E[1[T=-1]Y^{(-1)}]\right) \end{aligned} \]
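Below is a minimal sketch of both estimators, assuming a plain logistic regression for the propensity model (any other PS estimator could be plugged in) and treatment coded in \(\{1, -1\}\).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ht_and_ipw_ate(y, t, X):
    """Horvitz-Thompson and standardized (IPW) ATE estimators, t in {1, -1}."""
    ps = LogisticRegression().fit(X, t == 1).predict_proba(X)[:, 1]
    w1 = (t == 1) / ps          # inverse-propensity weights, treated arm
    w0 = (t == -1) / (1 - ps)   # inverse-propensity weights, control arm

    ht = np.mean(w1 * y) - np.mean(w0 * y)                           # unnormalized
    ipw = np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)  # normalized
    return ht, ipw
```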
J. M. Robins, Rotnitzky, and Zhao (1995) augmented IPW with a weighted average of the outcome models: \[\begin{align} \tau^{\text{AIPW}} =& \underbrace{E\left[\frac{1[T = 1]Y}{\pi(\mathbf X)}\right] - E\left[\frac{1[T = -1]Y}{1-\pi(\mathbf X)}\right]}_\text{IPW} \nonumber \\ &- \underbrace{E\left[\frac{1[T=1] -\pi(\mathbf X)}{\pi(\mathbf X)}\mu_1(\mathbf X) + \frac{1[T=1] -\pi(\mathbf X)}{1-\pi(\mathbf X)} \mu_{-1}(\mathbf X)\right]}_\text{Augmentation} \tag{3} \end{align}\] This AIPW is called “doubly robust” because it is consistent as long as either the treatment assignment mechanism or the outcome model is correctly specified (Kurz 2021). If the propensity score is correctly specified, then \(E(1[T=t] - \Pr(T=t\mid \mathbf X)) = E\left[E(1[T=t] - \Pr(T=t\mid \mathbf X)\mid \mathbf X)\right] = 0\), which simplifies AIPW to IPW no matter whether the response surface functions \(\mu_T(\cdot)\) are correct. On the other hand, if the response models \(\mu_T(\cdot)\) are correctly specified but the propensity score is not, AIPW reduces to the S- or T-learner (definitions are in the next section; proof in the Appendix).
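A sketch of the AIPW estimator in equation (3). The logistic PS model and linear outcome regressions are placeholder choices; in practice one would typically use flexible learners with cross-fitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def aipw_ate(y, t, X):
    """Doubly robust AIPW estimate of the ATE, following equation (3)."""
    ps = LogisticRegression().fit(X, t == 1).predict_proba(X)[:, 1]
    mu1 = LinearRegression().fit(X[t == 1], y[t == 1]).predict(X)    # mu_1(X)
    mu0 = LinearRegression().fit(X[t == -1], y[t == -1]).predict(X)  # mu_{-1}(X)

    ipw_part = np.mean((t == 1) * y / ps) - np.mean((t == -1) * y / (1 - ps))
    augmentation = np.mean(((t == 1) - ps) / ps * mu1
                           + ((t == 1) - ps) / (1 - ps) * mu0)
    return ipw_part - augmentation
```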
The very nature of weighting is that it helps ensure the distributions of the confounding covariates are balanced between the two observed arms. If covariate balance is guaranteed, the observed data can mimic randomized trial data and the treatment effect can be obtained directly from the observations. Imai and Ratkovic (2014) proposed to target covariate balance directly instead of weighting alone, and developed the Covariate Balancing Propensity Score (CBPS), a propensity score estimation method that yields PSs which improve covariate balance. The core idea of CBPS is to fit a logistic PS model (maximize the likelihood function) subject to covariate balancing constraints. Specifically, with \(T_i\) coded as \(\{0,1\}\) here, the logistic PS model is \[ \hat\beta = \arg\min_{\beta\in\Theta} \quad -\sum_{i=1}^N T_i\log\pi(\mathbf X_i;\beta)+(1-T_i)\log(1-\pi(\mathbf X_i;\beta)) \] which is equivalent to (taking the derivative w.r.t. \(\beta\)) \[ E\left[\frac{ T_i \tilde {\mathbf X}_i}{\pi(\mathbf X_i;\beta)} - \frac{ (1-T_i) \tilde {\mathbf X}_i}{1-\pi(\mathbf X_i;\beta)}\right] = 0 \] where \(\tilde {\mathbf X}_i = \pi'(\mathbf X_i;\beta)\), the first derivative of \(\pi(\mathbf X_i;\beta)\) with respect to \(\beta\). The balancing conditions are \[ E\left[\frac{T_i f(\mathbf X_i)}{\pi(\mathbf X_i ; \beta)}\right] = E\left[\frac{(1-T_i)f(\mathbf X_i)}{1 - \pi(\mathbf X_i ; \beta)}\right] \] where \(f(\cdot)\) can be any function one intends to balance, including the first moment \(\mathbf X\), second moment \(\mathbf X^2\), interactions \(\mathbf X_i \mathbf X_j, i\neq j\), etc. When \(f(\mathbf X_i) = \tilde {\mathbf X}_i\), CBPS is exactly the MLE of logistic regression. If \(\tilde {\mathbf X}_i = \mathbf X_i\) (a linear logistic model), the number of constraints equals the number of parameters, which is called the exact CBPS. But we can manually add more constraints to ensure that higher moments of the covariates are balanced as well; then we have more constraints than parameters, which is called the over-identified CBPS. However, “overidentifying restrictions generally improve asymptotic efficiency but may result in a poor finite sample performance” (Imai and Ratkovic 2014). An appealing property of CBPS is that even if the PS model is completely misspecified, the imposed moments (of \(X\), \(X_iX_j\)) are still balanced. CBPS largely avoids extreme propensity scores and is thus more robust and achieves better balance.
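Here is a bare-bones sketch of the exact (just-identified) CBPS with first-moment balance only, solved with a generic root finder. It codes \(T_i \in \{0,1\}\) as in the displays above and omits the GMM machinery needed for the over-identified case.

```python
import numpy as np
from scipy.optimize import root
from scipy.special import expit

def exact_cbps(X, t):
    """Exact CBPS sketch: choose beta so the inverse-propensity-weighted
    first moments of X balance between arms; t is coded in {0, 1} and X
    should contain a leading intercept column."""
    def balance(beta):
        ps = expit(X @ beta)
        # E_n[ T X / pi - (1 - T) X / (1 - pi) ] = 0, one equation per column
        w = t / ps - (1 - t) / (1 - ps)
        return X.T @ w / len(t)

    sol = root(balance, np.zeros(X.shape[1]), method="hybr")
    return expit(X @ sol.x)   # propensity scores satisfying exact balance
```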
CBPS is just one covariate-balancing-type method for obtaining the weights. There are many other methods following this idea (Hainmueller 2012; Chan, Yam, and Zhang 2016; Pirracchio and Carone 2018; Huling and Mak 2020), but they will not be introduced here.
Another estimator that is receiving growing attention for ATE estimation is the targeted maximum likelihood estimator (TMLE) (Van Der Laan and Rubin 2006), which enjoys the same double robustness as AIPW along with other desirable features such as efficiency and asymptotic linearity (which allows the construction of valid Wald-type confidence intervals). If either \(\hat\mu_t(\cdot)\) or \(\hat\pi(\cdot)\) is correctly specified, the estimator is consistent, while if both are consistently estimated, TMLE achieves the semi-parametric efficiency bound (Van Der Laan and Rubin 2006).
TMLE is a two-stage procedure. The first stage obtains initial estimates of the response surfaces/conditional mean responses \(\hat\mu_1(\mathbf X)\) and \(\hat\mu_{-1}(\mathbf X)\). If the initial estimator of the response surfaces is consistent, the TMLE remains consistent; if it is not, the subsequent targeting step provides an opportunity for TMLE to reduce any residual bias in the estimate of the parameter of interest. Specifically, in the second stage, TMLE solves the efficient influence curve estimating equation for the target parameter (here, the ATE), which fluctuates/corrects the initial estimate in the right direction as a one-step estimator. The whole idea is supported by semiparametric theory; here are some references: Semiparametric theory; Nonparametric efficiency theory in causal inference; influence functions.
\[ \hat\tau^{\text{TMLE}}(\mathbf X) = \hat\mu_1(\mathbf X) - \hat\mu_{-1}(\mathbf X) + \hat\varepsilon \left(\frac{1[T=1]}{\hat\pi(\mathbf X)} - \frac{1[T=-1]}{1-\hat\pi(\mathbf X)}\right) \] where \(\varepsilon\) is a fluctuation parameter that updates the initial estimate of the response surfaces \(\hat\mu_1(\mathbf X) - \hat\mu_{-1}(\mathbf X)\) using the information from the propensity score, just like other doubly robust estimators or the X-learner. \(\varepsilon\) can be estimated by regressing the residual \(Y - \hat E[Y\mid T, \mathbf X]\) on the ‘clever covariate’ \(H(T, \mathbf X) = \frac{1[T=1]}{\hat\pi(\mathbf X)} - \frac{1[T=-1]}{1-\hat\pi(\mathbf X)}\) without an intercept.
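A sketch of this targeting step. The linear and logistic nuisance fits are illustrative stand-ins (TMLE is usually paired with flexible learners such as Super Learner), and the fluctuation uses the no-intercept least-squares form described above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def tmle_ate(y, t, X):
    """One-step TMLE-style targeting sketch for the ATE, t in {1, -1}."""
    # Stage 1: initial nuisance estimates
    ps = LogisticRegression().fit(X, t == 1).predict_proba(X)[:, 1]
    mu = LinearRegression().fit(np.column_stack([X, t]), y)
    mu_obs = mu.predict(np.column_stack([X, t]))
    mu1 = mu.predict(np.column_stack([X, np.ones_like(t)]))
    mu0 = mu.predict(np.column_stack([X, -np.ones_like(t)]))

    # Stage 2: regress the residual on the clever covariate (no intercept)
    h = (t == 1) / ps - (t == -1) / (1 - ps)          # H(T, X)
    eps = np.sum(h * (y - mu_obs)) / np.sum(h ** 2)   # fluctuation parameter

    # Updated response surfaces evaluated at T = 1 and T = -1
    tau = (mu1 + eps / ps) - (mu0 - eps / (1 - ps))
    return tau.mean()
```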
After introducing all these methods, we would like to provide a high-level picture of how and when to use them. Usually, there are two steps involved in propensity score (PS) related methods for causal estimand estimation.
The first step is to estimate the propensity score. There are several commonly adopted branches:
In the second step, we plug the propensity score estimated in the first step into the estimator of the desired estimand. The common approaches for ATE/ATT include:
Sometimes also known as the heterogeneous treatment effect (HTE). Under Assumptions 1, 2, and 3, its definition in the two-arm scenario is \[ \begin{aligned} \tau(\mathbf x) &= E[Y^{(1)} - Y^{(-1)} \mid \mathbf X = \mathbf x] \\ &=E(Y \mid \mathbf X = \mathbf x, T = 1) - E(Y \mid \mathbf X = \mathbf x, T = -1) \end{aligned} \]
As discussed, CATE estimation in the randomized setting is generally a special case of the corresponding methods for observational studies. Nonetheless, we want to briefly introduce some interaction-oriented methods that are particularly suitable for the randomized setting.
Without loss of generality, the outcome \(Y\) can be written as \[ E(Y) = f(\mathbf X) + 1[T=1]g(\mathbf X), \] so that \(E(Y\mid \mathbf X = \mathbf x, T = 1) = f(\mathbf x)+g(\mathbf x)\), \(E(Y\mid \mathbf X = \mathbf x, T = -1) = f(\mathbf x)\), and therefore \(\tau(\mathbf x) = g(\mathbf x)\). So the problem of estimating the treatment effect, or subgroup identification, turns into figuring out \(g(\mathbf x)\). Notice that here \(1[T=1]\) is independent of the covariates \(\mathbf X\), reflecting the nature of a randomized trial.
Also be aware that the covariates entering \(f(\cdot)\) and \(g(\cdot)\) can be different. Denote by \(\mathcal Z\) and \(\mathcal V\) two subspaces of the original covariate space \(\mathcal X\), and write \[ E(Y) = f(\mathbf V) + 1[T=1]g(\mathbf Z), \] where \(\mathbf Z \in \mathcal Z, \mathbf V \in \mathcal V\), and \(\mathcal V\) and \(\mathcal Z\) may overlap. Generally, the covariates in \(\mathbf V\) are called diagnostic variables and those in \(\mathbf Z\) predictive variables (Imai, Ratkovic, et al. 2013). Some variables are predictive only, indicating that they only interact with the treatment; some are diagnostic only, meaning that they do not interact with the treatment but do impact the outcome; some are both predictive and diagnostic, while some play no role at all. See Figure (1) for an illustration.
A number of methods have been developed following this idea: Interaction trees (Su et al. 2008, 2009), GUIDE (Generalized unbiased interaction detection and estimation) (Loh 2002; Loh, He, and Man 2015), and MOB (Model-Based Recursive Partitioning) (Zeileis, Hothorn, and Hornik 2008; Seibold, Zeileis, and Hothorn 2016).
Q-learning gets its name because its objective function plays a role similar to that of the Q or reward function in reinforcement learning (Li, Wang, and Tu 2021), and it was first used in estimating optimal dynamic treatment regimes (Murphy 2003; J. M. Robins 2004; Schulte et al. 2014). The basic idea is to focus on the estimation of the response surfaces \(E[Y \mid \mathbf X, T]\), which is also known as g-computation (J. Robins 1986). This approach is therefore also called the parametric g-formula (Hernán and Robins 2010) in the literature.
Estimate a joint function \(\hat\mu(\mathbf X, T) = E[Y \mid \mathbf X, T]\); then \[\begin{equation} \hat\tau(\mathbf X) = \hat\mu(\mathbf X, 1) - \hat\mu(\mathbf X, -1). \tag{4} \end{equation}\] This approach is usually called the G-computation estimator, the parametric G-formula, or the S-learner. But \(T\) can be neglected or underweighted if \(\mathbf X\) is high-dimensional.
Estimate the response surfaces \(E[Y \mid \mathbf X, T]\) separately, i.e., one function \(\mu_1(\mathbf X) = E[Y \mid \mathbf X, T= 1]\) for \(T=1\) and another \(\mu_{-1}(\mathbf X) = E[Y \mid \mathbf X, T= -1]\) for \(T=-1\). Then \[\begin{equation} \hat\tau(\mathbf X) = \hat\mu_1(\mathbf X) - \hat\mu_{-1}(\mathbf X) \tag{5} \end{equation}\] This is called the T-learner. But it loses data efficiency, as only part of the data is used in each model.
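Minimal sketches of equations (4) and (5); the random forest base learner is an arbitrary plug-in choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def s_learner(y, t, X, Xnew):
    """S-learner / parametric g-formula, eq. (4): one joint model mu(X, T)."""
    m = RandomForestRegressor().fit(np.column_stack([X, t]), y)
    ones = np.ones(len(Xnew))
    return (m.predict(np.column_stack([Xnew, ones]))
            - m.predict(np.column_stack([Xnew, -ones])))

def t_learner(y, t, X, Xnew):
    """T-learner, eq. (5): one response surface per arm."""
    m1 = RandomForestRegressor().fit(X[t == 1], y[t == 1])
    m0 = RandomForestRegressor().fit(X[t == -1], y[t == -1])
    return m1.predict(Xnew) - m0.predict(Xnew)
```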
The basic idea of advantage learning, or A-learning, is to estimate the treatment effect \(\tau(\mathbf x)\) directly, rather than estimating the response surfaces as in Q-learning, which can be deemed an indirect approach. Here are some selected approaches in this camp.
For a 50:50 randomized trial, we can make the following transformation (Signorovitch 2007) \[ Y^* = \begin{cases} 2Y \quad &\text{if }\ T=1 \\ -2Y \quad &\text{if }\ T=-1 \end{cases}, \] or equivalently, \(Y^* = 2TY\); then we may notice that \[ \begin{aligned} E(Y^*\mid \mathbf X) & = E(Y^*\mid \mathbf X, T = 1)\Pr(T=1) + E(Y^*\mid \mathbf X, T = -1)\Pr(T=-1) \\ & = E(2Y\mid \mathbf X, T = 1)/2 + E(-2Y\mid \mathbf X, T = -1)/2 \\ & = E(Y\mid \mathbf X, T = 1) - E(Y\mid \mathbf X, T = -1) \\ & = \tau(\mathbf X). \end{aligned} \] This means that we can directly fit a model for \(Y^*\) using \(\mathbf X\), and it will return an unbiased estimator of the treatment effect (even though the variance can be very large). Similarly, from equation (2), we know that \[\begin{equation} \tau(\mathbf x) = E\left[ \frac{TY}{T\pi(\mathbf X) + (1-T)/2} \;\middle|\; \mathbf X = \mathbf x \right] \tag{6} \end{equation}\] The corresponding transformation, or modification, of the outcome is \(Y^* = \frac{TY}{T\pi(\mathbf X) + (1-T)/2}\), and the MOM-IPW loss function is \[\begin{equation} E[l(\hat\tau, \tau)] = E\left[ \left\{ \hat\tau(\mathbf X)- \frac{TY}{T\pi(\mathbf X) + (1-T)/2} \right\}^2 \right] \tag{7} \end{equation}\]
Notably, this method, though simple and straightforward, suffers from restrictions on the outcome type and from relatively large variance.
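A sketch of the MOM-IPW recipe in equations (6)–(7): form the modified outcome \(Y^*\) and regress it on \(\mathbf X\) with any off-the-shelf regressor (gradient boosting here is an arbitrary choice; `ps` can be the known randomization probability or an estimate).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def mom_ipw_cate(y, t, X, ps):
    """Modified-outcome (MOM-IPW) CATE sketch, eqs. (6)-(7), t in {1, -1}."""
    denom = t * ps + (1 - t) / 2   # equals pi(X) if t = 1, 1 - pi(X) if t = -1
    y_star = t * y / denom         # modified outcome with E[Y* | X] = tau(X)
    return GradientBoostingRegressor().fit(X, y_star)   # predict() gives tau-hat(x)
```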
Observing that we only need to optimize a loss function such as (7), Tian et al. (2014) modified the MOM method by adjusting the proxy loss function slightly, \[ E[\{\hat\tau(X) - Y^*\}^2] = E[c^2\{\hat\tau(X)/c - Y\}^2], \] and then minimizing the right-hand side, which is the MCM loss function, also targeting \(\hat\tau(X)\) directly; here \(W = T \in \{1,-1\}\) denotes the treatment. For example, the MCM version of the loss function for a 50:50 randomized trial is \[ E\left[ \{\hat\tau(X)\cdot W/2-Y \}^2 \right]. \] For the MOM-IPW version, with some modification of (7), the loss function for the MCM-IPW version is derived as \[ E\left[ \left\{ \hat\tau(X) \cdot [(W+1)/2 - \pi(X)]- Y \right\}^2 \right] \] and with further adjustment, it turns out to be \[ E\left[ \frac{\left\{ \hat\tau(X)\cdot W- Y \right\}^2}{W\pi(X) + (1-W)/2} \right]. \] So we no longer need to worry about the outcome type.
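A sketch of an MCM-type fit. Keeping the \(\hat\tau(X)\cdot W/2\) parameterization from the 50:50 display (so the fitted function targets \(\tau(\mathbf x)\) itself) and using \(W^2 = 1\), an inverse-propensity-weighted MCM loss reduces to a weighted regression of the pseudo-outcome \(2WY\) on \(\mathbf X\); the linear model is a placeholder.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def mcm_ipw_cate(y, t, X, ps):
    """MCM-IPW sketch: since t^2 = 1, w * (tau * t/2 - y)^2 is proportional
    to w * (tau - 2 t y)^2, i.e., weighted least squares on the
    pseudo-outcome 2 t y; t in {1, -1}."""
    w = 1.0 / (t * ps + (1 - t) / 2)   # 1/pi on treated, 1/(1 - pi) on control
    return LinearRegression().fit(X, 2 * t * y, sample_weight=w)
```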
The R-learner (Nie and Wager 2020) can be considered a very special case of MCM, though it does not follow the same logic as MCM. It adopts Robinson’s decomposition (Robinson 1988) to connect the CATE with the observed outcome \[\begin{equation} E[Y\mid \mathbf X, T] = m(\mathbf X) + (1[T = 1]- \pi(\mathbf X))\tau(\mathbf X) \tag{8} \end{equation}\] where \(m(\mathbf X) = E[Y \mid \mathbf X]\). J. Zhou, Zhang, and Tu extended the R-learner to multi-armed settings without the recommendation inconsistency issue that is naturally embedded in most approaches under the A-learning framework.
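A sketch of the two-stage R-learner implied by equation (8): cross-fit the nuisances \(m(\cdot)\) and \(\pi(\cdot)\), then minimize the R-loss as a weighted regression on a pseudo-outcome. All base learners here are placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def r_learner(y, t, X):
    """R-learner sketch built on Robinson's decomposition, eq. (8)."""
    w = (t == 1).astype(float)
    m_hat = cross_val_predict(GradientBoostingRegressor(), X, y, cv=5)
    pi_hat = cross_val_predict(LogisticRegression(), X, w, cv=5,
                               method="predict_proba")[:, 1]

    resid = w - pi_hat             # may be near zero if pi-hat is extreme
    pseudo = (y - m_hat) / resid   # pseudo-outcome for the final regression
    return GradientBoostingRegressor().fit(X, pseudo, sample_weight=resid ** 2)
```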
The X-learner (Künzel et al. 2019) enjoys the simplicity of the T-learner but fixes its data efficiency issue by targeting the treatment effects rather than the response surfaces. The main procedure is the following:

1) Fit the response surfaces \(\hat\mu_1(\cdot)\) and \(\hat\mu_{-1}(\cdot)\) on the two arms separately, as in the T-learner.
2) Impute the individual treatment effects: \(\tilde D_i^{1} = Y_i - \hat\mu_{-1}(\mathbf X_i)\) for treated subjects and \(\tilde D_i^{-1} = \hat\mu_1(\mathbf X_i) - Y_i\) for control subjects.
3) Regress the imputed effects on the covariates within each arm to obtain \(\hat\tau_1(\cdot)\) (from the treated) and \(\hat\tau_{-1}(\cdot)\) (from the controls).
4) Combine the two with a weighting function \(g(\mathbf x)\in[0,1]\), often taken to be the propensity score: \(\hat\tau(\mathbf x) = g(\mathbf x)\hat\tau_{-1}(\mathbf x) + (1-g(\mathbf x))\hat\tau_1(\mathbf x)\).
Notably, when calculating \(\hat\tau_1(\cdot)\), for example, the X-learner uses the whole dataset, including instances from both \(T=1\) and \(T=-1\) (the latter entering through \(\hat\mu_{-1}\)), which is deemed to fix the data efficiency issue of the T-learner approach.
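A compact sketch of the four steps; the base learners and the choice \(g(\mathbf x) = \hat\pi(\mathbf x)\) follow the paper’s suggestion but are otherwise arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression

def x_learner(y, t, X, Xnew):
    """X-learner sketch (Kunzel et al. 2019), t in {1, -1}."""
    treated, control = t == 1, t == -1

    # Step 1: T-learner-style response surfaces
    mu1 = RandomForestRegressor().fit(X[treated], y[treated])
    mu0 = RandomForestRegressor().fit(X[control], y[control])

    # Step 2: imputed effects -- each arm borrows the other arm's model
    d1 = y[treated] - mu0.predict(X[treated])
    d0 = mu1.predict(X[control]) - y[control]

    # Step 3: regress the imputed effects on covariates
    tau1 = RandomForestRegressor().fit(X[treated], d1)
    tau0 = RandomForestRegressor().fit(X[control], d0)

    # Step 4: propensity-score-weighted combination
    g = LogisticRegression().fit(X, treated).predict_proba(Xnew)[:, 1]
    return g * tau0.predict(Xnew) + (1 - g) * tau1.predict(Xnew)
```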
Most of the previous methods target CATE estimation; the optimal treatment can then be found automatically by inspecting the estimates. However, obtaining consistent and unbiased estimates of the CATE is difficult in practice. If we only need to know which treatment is optimal, we actually do not need to estimate the CATE itself but only the treatment assignment rule, or so-called individual treatment rule (ITR), \(d(\mathbf x) = \operatorname{sgn}\{\tau(\mathbf x)\}\), which is a mapping \(d: \mathcal X \rightarrow \mathcal T\). In practice, we then estimate \(d(\cdot)\) directly rather than \(\tau(\cdot)\). See Figure (2) for a demonstration of the difference between Q-learning, A-learning, and ITR.
Recall that the ITR is a mapping from the covariate space to the treatment space, \(d: \mathcal X \rightarrow \mathcal T\). According to Qian and Murphy (2011) and Zhao et al. (2012), the value function under an ITR \(d(\mathbf X)\) is \[ V(d) =\mathbb E\left[Y^{(d(\mathbf X))}\right] = \mathbb E\left[\frac{1[T = d(\mathbf X)]Y}{1[T=1]\pi(\mathbf X) + 1[T=-1](1-\pi(\mathbf X))}\right] \] and the optimal ITR is the one corresponding to the largest value function \[ d^{*}(\cdot) = \arg\max_{d(\cdot)} V(d). \] Notice that this objective involves an indicator function, i.e., a discontinuous loss, which may incur numerical difficulty. So far, various methods have been proposed in the literature to obtain the ITR following this idea (Qian and Murphy 2011; Zhao et al. 2012; Xu et al. 2015; Zhu et al. 2017; X. Zhou et al. 2017; Zhang and Zhang 2018; Qi et al. 2020). Due to its structure, which resembles propensity score weighting applied to the outcome \(Y\), this approach is also known as outcome weighted learning, or O-learning (Zhao et al. 2012).
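A sketch of the weighted-classification reduction behind O-learning, using a hinge-loss (linear SVM) surrogate as in Zhao et al. (2012). Shifting \(Y\) to be nonnegative is a common device; it changes the value of every rule by the same constant and keeps the weights valid.

```python
import numpy as np
from sklearn.svm import SVC

def owl_itr(y, t, X, ps):
    """Outcome weighted learning sketch: classify T with weights Y / P(T|X)."""
    p_obs = np.where(t == 1, ps, 1 - ps)     # P(T = t_i | X_i)
    w = (y - y.min() + 1e-3) / p_obs         # nonnegative outcome weights
    clf = SVC(kernel="linear").fit(X, t, sample_weight=w)
    return clf   # clf.predict(x_new) is the estimated rule d-hat(x)
```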
There are many tree-based methods, such as virtual twins (Foster 2013), causal tree/forest (Athey and Imbens 2016; Wager and Athey 2018), and causal boosting/PTO forest/causal MARS (Powers et al. 2018). Actually, at a high level, most of them can be categorized as S- or T-learners.
(For more details on CATE/HTE estimation, please refer to another post Meta-Learners)
\[ \begin{aligned} & E\left[\frac{1[T = t]Y}{\Pr(T=t \mid \mathbf X)}\right] \\ =& E\left[ E\left(\frac{1[T = t]Y}{\Pr(T=t \mid \mathbf X)} \mid \mathbf X\right)\right] \\ =& E\left[ \frac{E\left(1[T = t]\sum_c1[T=c]Y^{(c)} \mid \mathbf X \right)}{\Pr(T=t \mid \mathbf X)} \right] \\ =& E\left[ \frac{\sum_a E\left(1[T = t]\sum_c1[T=c]Y^{(c)} \mid T = a, \mathbf X \right)\Pr(T = a\mid \mathbf X)}{\Pr(T=t \mid \mathbf X)} \right] \\ =& E\left[ \frac{E\left(Y^{(t)} \mid T = t, \mathbf X \right)\Pr(T = t\mid \mathbf X)}{\Pr(T=t \mid \mathbf X)} \right] \\ =& E\left[ E\left(Y^{(t)} \mid T = t, \mathbf X \right) \right] \\ =& E\left[ E\left(Y^{(t)} \mid \mathbf X \right) \right] \text{ (unconfoundedness)} \\ =& E\left[ Y^{(t)} \right]. \end{aligned} \]
To demonstrate that AIPW can be consistent even when the propensity score is not correctly specified, we only need to focus on the general element \(E\left[\frac{1[T = t]Y}{\pi^{(t)}(\mathbf X)} - \frac{1[T=t] -\pi^{(t)}(\mathbf X)}{\pi^{(t)}(\mathbf X)} \mu_t(\mathbf X)\right]\).
When the propensity score is wrong, i.e., \(\pi^{(t)}(\mathbf X)\neq \Pr(T = t\mid \mathbf X)\), but the response surface estimate is correct, i.e., \(\mu_t(\mathbf X) = E(Y^{(t)} \mid \mathbf X)\), \[ \begin{aligned} & E\left[\frac{1[T = t]Y}{\pi^{(t)}(\mathbf X)} - \frac{1[T=t]}{\pi^{(t)}(\mathbf X)} \mu_t(\mathbf X) + \mu_t(\mathbf X)\right] \\ =& E\left[E\left(\frac{1[T = t]Y}{\pi^{(t)}(\mathbf X)} - \frac{1[T=t]}{\pi^{(t)}(\mathbf X)} \mu_t(\mathbf X) + \mu_t(\mathbf X) \mid \mathbf X \right)\right] \\ =& E\left[ \frac{E\left(1[T = t](Y - \mu_t(\mathbf X)) \mid \mathbf X\right)}{\pi^{(t)}(\mathbf X)}\right] + E[E(Y^{(t)} \mid \mathbf X)] \\ =& E\left[ \frac{\sum_a E\left(1[T = t](Y - \mu_t(\mathbf X)) \mid T = a,\mathbf X\right)\Pr(T = a \mid \mathbf X)}{\pi^{(t)}(\mathbf X)}\right] + E\left[Y^{(t)}\right] \\ =& E\left[ \frac{E\left(Y - \mu_t(\mathbf X) \mid T = t,\mathbf X\right)\Pr(T = t \mid \mathbf X)}{\pi^{(t)}(\mathbf X)}\right] + E\left[Y^{(t)}\right] \\ =& E\left[ \frac{\Pr(T = t \mid \mathbf X)}{\pi^{(t)}(\mathbf X)}\underbrace{\left(E\left(Y^{(t)} \mid \mathbf X\right) - \mu_t(\mathbf X) \right)}_{=0} \right] + E\left[Y^{(t)}\right] \text{ (unconfoundedness)} \\ =& E\left[Y^{(t)}\right] . \end{aligned} \] Therefore, in this situation, AIPW in equation (3) becomes \(\tau^{\text{AIPW}}=E\left[Y^{(1)}\right] - E\left[Y^{(-1)}\right] = \tau\).