Notice: the causal inference described here follows the Potential Outcome Framework (Neyman 1923; Rubin 1974). For the other major camp of causal inference studies, the Structural Causal Model (SCM), please refer to Pearl (2009).

Heterogeneous Treatment Effect Estimation

1. Introduction

The need for causal inference arises in many different fields. Economists want to quantify the causal effect of an interest rate cut on the economy. Trade policy makers want to know whether increased tariffs cause changes in the trade deficit. Healthcare providers want to assess whether a therapy causes changes in specific patient outcomes. In this proposal, I examine the estimation of the causal effects of therapeutic treatments using observational data in a population where the treatment effects vary by observed patient characteristics. Such estimation is essential for the practice of precision medicine, which by definition targets patients with specific characteristics to maximize the therapeutic benefit.

This work is proposed within the Neyman-Rubin potential outcome framework (Neyman 1923; Rubin 1974; Imbens and Rubin 2015). An essential feature of the potential outcome concept is that a given individual has multiple potential outcomes, one under each treatment; but since the subject receives only one treatment, only one of these potential outcomes is realized. The inability to observe all potential outcomes for a subject makes it difficult to assess the effect of the treatment on that subject.

Denote the potential outcome under treatment \(T\) as \(Y^{(T)}\), where \(T\in\{0,1,2,...,K\}\), corresponding to \(K+1\) potential treatments. In most studies, people simply set \(K=1\); that is, there are only two possible treatments, namely treatment (\(T = 1\)) and control (\(T = 0\)). The two-arm scenario will be discussed in the Introduction and Existing Methods sections and then extended to multiple arms in the Proposed Method section. To proceed with the estimation of causal effects, we need the Stable Unit Treatment Value Assumption (SUTVA, Rubin (1980)):
Assumption 1. SUTVA
1) Treatment applied to one unit does not affect the outcome for other units;
2) There is no hidden variation of a single treatment, for example, different forms or versions of each treatment level;
and then the observed outcome for unit \(i\) could be expressed as \[ Y_i = \left\{ \begin{array}{ll} Y_i^{(0)} & \quad \text{if} \quad T_i = 0 \\ Y_i^{(1)} & \quad \text{if} \quad T_i = 1 \end{array} \right. \] or equivalently,

\[\begin{equation} Y_i = T_i Y_i^{(1)} + (1 - T_i) Y_i^{(0)}. \tag{1} \end{equation}\]

The causal (or treatment) effect is defined by comparing potential outcomes. For unit \(i\), the individual treatment effect (ITE) \(\xi_i\) is thus \[\begin{equation*} \xi_i = Y_i^{(1)} - Y_i^{(0)}. \end{equation*}\] Clearly, the definition of the causal effect does not depend on which potential outcome is observed. Moreover, the ITE must be measured on the same unit at the same time point post-treatment, so a direct observation and measurement of the ITE is infeasible. However, we are able to estimate average treatment effects under plausible assumptions, which will be discussed in subsection 1.2. Here we first denote the expected ITE for a specific unit \(i\) with covariate vector \(X_i\), or individual average treatment effect (IATE), as \[ \tau(x) = E\left[ \xi_i \mid X_i = x \right]. \] The IATE conditions on every single covariate and is the finest case of the conditional average treatment effect (CATE), or heterogeneous treatment effect. The CATE, more generally, may condition on only the subset of covariates that are of interest; it is the main target of the following discussions. The coarsest summary is the average treatment effect (ATE), which measures the population-level treatment effect without taking subject features into consideration, denoted \(\tau\), with \[ \tau = E\left[\xi_i\right]. \]

For any observed potential outcome, we need to consider why this outcome was realized instead of the others. Clearly, if the treatment is not randomly assigned, there must be some reason for a unit to receive that treatment, which means the treatment assignment is informative. Therefore, the estimation of the ATE or CATE differs under different assignment mechanisms. Randomized experiments are discussed in subsection 1.1 and observational studies in subsection 1.2.

1.1 Randomized Experiments

In a randomized experiment, the treatment assignment is completely random, which means \[ T_i \perp (X_i, Y_i^{(0)}, Y_i^{(1)}), \] or equivalently, \[ E[Y^{(k)}] = E[Y^{(k)} \mid T] . \] Then the ATE can be calculated by \[ \tau = E[Y_i^{(1)} - Y_i^{(0)}] = E[Y_i^{(1)} \mid T_i = 1] - E[Y_i^{(0)} \mid T_i = 0] = E[Y_i \mid T_i = 1] - E[Y_i \mid T_i = 0], \] where the last equality comes from equation (1). Thus, the ATE can be obtained simply by taking the difference of the two group averages. This is how clinical trials measure the treatment effect of the proposed therapy against the placebo or standard therapy. However, even when the population treatment effect is not significant, the drug could still benefit a subgroup of the patients, which can hardly be detected in traditional clinical trials.
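The difference-in-means calculation above can be illustrated on simulated data. The following minimal NumPy sketch (the data-generating process is hypothetical, with a true ATE of 2 built in) constructs both potential outcomes, reveals only one per unit via equation (1), and recovers the ATE from the two group averages:

```python
import numpy as np

# Simulated 1:1 randomized experiment (hypothetical data, for illustration only).
# True potential outcomes: Y(0) = X + noise, Y(1) = X + 2 + noise, so true ATE = 2.
rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=n)
Y0 = X + rng.normal(scale=0.5, size=n)
Y1 = X + 2.0 + rng.normal(scale=0.5, size=n)
T = rng.binomial(1, 0.5, size=n)          # fully random assignment
Y = T * Y1 + (1 - T) * Y0                 # equation (1): only one outcome observed

# ATE = E[Y | T=1] - E[Y | T=0], valid because T is independent of (X, Y0, Y1)
ate_hat = Y[T == 1].mean() - Y[T == 0].mean()
```

Because the simulation knows both potential outcomes, one can verify that the group-mean difference lands close to the built-in effect of 2.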

In order to see subgroup treatment effects, we need to estimate the CATE. The estimation of the CATE in randomized experiments is a simplified version of that in observational studies, which just replaces the propensity score (to be introduced in subsection 1.2) by a constant value.

1.2 Observational Studies

In observational studies, the treatment assignment is related to the outcome and no longer fully random, which means \[ E[Y^{(k)}] \neq E[Y^{(k)} \mid T], \] and hence the ATE is no longer trivial to obtain. To make the problem tractable, a stronger assumption is usually imposed in the literature, namely unconfoundedness, or ignorability:
Assumption 2. Unconfoundedness \[\begin{equation} (Y^{(1)}, Y^{(0)}) \perp T \mid X, \tag{2} \end{equation}\] which implies that there are no unobserved confounding covariates. This is the fundamental assumption frequently made in observational studies to rule out potential confounders, since unobserved confounders do not balance out across arms the way they do in randomized experiments. It is often paired with another assumption:
Assumption 3. Common Support \[\begin{equation} 0 < P(T_i = 1 \mid X_i = x) =\pi(x) < 1, \tag{3} \end{equation}\] which, in combination with Assumption 2, is called strong ignorability. Assumption 3 is a common support condition: in every corner of the covariate space, no arm dominates. This assumption is often required in matching, weighting, and subclassification related approaches to avoid extreme values of the propensity score (formally introduced in the next section).

With Assumption 2, consistency can be deduced: \[ E[Y\mid T=1, X] = E[Y^{(1)}\cdot T + Y^{(0)}\cdot(1-T)\ \mid T=1, X] = E[Y^{(1)}\mid T=1, X], \] and similarly, \[ E[Y\mid T=0, X] = E[Y^{(0)}\mid T=0, X]. \] Then the population ATE can be written (for a proof, please refer to the Appendix) as either \[\begin{equation} \tau = E_X\{ E[Y\mid T = 1, X] - E[Y\mid T = 0, X]\}, \tag{4} \end{equation}\] or \[\begin{equation} \tau = E \left[\frac{TY}{\pi(X)} \right] - E \left[\frac{(1 - T)Y}{1-\pi(X)} \right] = E\left[ \frac{T-\pi(X)}{\pi(X)(1-\pi(X))}Y \right], \tag{5} \end{equation}\] where \(\pi(x) = P(T=1 \mid X = x)\) is called the propensity score and measures the probability of receiving treatment. Similarly, the expression for the CATE is either \[\begin{equation} \tau(x) = E[Y\mid T = 1, X=x] - E[Y\mid T = 0, X=x], \tag{6} \end{equation}\] or \[\begin{equation} \tau(x) = E \left[\frac{TY}{\pi(X)} \;\middle|\; X = x \right] - E \left[\frac{(1-T)Y}{1-\pi(X)} \;\middle|\; X=x \right] = E\left[ \frac{T-\pi(X)}{\pi(X)(1-\pi(X))}Y \;\middle|\; X = x\right]. \tag{7} \end{equation}\] In some literature, the treatment indicator is denoted as \(W = 2T-1 \in \{-1,1\}\); then equations (5) and (7) can be expressed simply as \[\begin{align} \tau &= E\left[ \frac{WY}{W\pi(X) + (1-W)/2} \right], \tag{8} \\ \tau(x) &= E\left[ \frac{WY}{W\pi(X) + (1-W)/2} \;\middle|\; X = x \right]. \tag{9} \end{align}\]
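The IPW identity (5) can be checked numerically. The sketch below (hypothetical simulated data; a constant true effect of 2 is built in so the ATE is known) shows that the naive group-mean difference is confounded when assignment depends on \(X\), while weighting by the true propensity score recovers the ATE:

```python
import numpy as np

# Observational-data sketch (hypothetical): assignment depends on X, so the
# naive group-mean difference is biased, while the IPW estimator of equation (5)
# with the true propensity score recovers the ATE (= 2 here by construction).
rng = np.random.default_rng(1)
n = 50_000
X = rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(-X))             # true propensity score pi(X)
T = rng.binomial(1, pi)
Y0 = X + rng.normal(scale=0.5, size=n)
Y1 = X + 2.0 + rng.normal(scale=0.5, size=n)
Y = T * Y1 + (1 - T) * Y0

naive = Y[T == 1].mean() - Y[T == 0].mean()                   # confounded by X
ipw = np.mean(T * Y / pi) - np.mean((1 - T) * Y / (1 - pi))   # equation (5)
```

Here `naive` overshoots the true effect (treated units have systematically larger \(X\)) while `ipw` is close to 2.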

Almost every work aiming at estimating the ATE or CATE starts from one of the above equations. For example, equations (4) and (6) are the foundation of conditional mean response methods, whereby the target is transformed into estimating the conditional mean responses \(\mu_k(x) = E[Y \mid T = k, X=x], k \in \{0,1\}\). Equations (5), (7), (8), and (9) are the prototypes of inverse probability weighting (IPW) methods, which have been frequently used in modified outcome methods (MOM) and modified covariate methods (MCM). For instance, if both arms are equally allocated, that is \(\pi = 0.5\), then equation (9) becomes \[ \tau(x) = E\left[ 2WY \mid X=x \right], \] which is the origin of the MOM illustrated in Signorovitch (2007) and Tian et al. (2014). More details of these methods will be given in the next section.

The focus of this dissertation is the estimation of CATE in observational studies.

2 Existing Methods

Matching is the most straightforward and original way of seeking the CATE. Under Assumptions 1, 2, and 3, for a given individual, as long as we can find another individual with exactly the same covariates \(X\) in the other treatment arm, the expected difference in response between these two individuals equals \(\tau(X)\): \[ E[Y \mid T = 1, X = x] - E[Y \mid T= 0, X = x] = E[Y^{(1)} \mid X = x] - E[Y^{(0)} \mid X = x] = \tau(x). \] However, in real observational studies, such a perfectly matched pair may not be easy to find, and even when one is found, a single paired difference is not the expected difference and may carry large uncertainty.

Since it is almost impossible to find a perfect match, especially in a high dimensional space, people have tried to match only on important covariates to ease the process. This branch of studies was eventually settled by the milestone work of Rosenbaum and Rubin (1983), who showed that matching on the propensity score alone is sufficient: \[ E[Y \mid T = 1, \pi(x)] - E[Y \mid T= 0, \pi(x)] = E[Y^{(1)} \mid \pi(x)] - E[Y^{(0)} \mid \pi(x)] = \tau(\pi(x)) \] under Assumption 2. Since then, the matching task has been transformed from a high dimensional space to a one-dimensional one, but how to obtain a good estimator of the propensity score \(\pi(X)\) becomes the key problem.
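A minimal sketch of propensity score matching follows (hypothetical simulated data with a constant effect of 2; real implementations add calipers, matching without replacement, bias corrections, etc.). Each treated unit is paired with the control unit whose propensity score is closest:

```python
import numpy as np

# One-to-one nearest-neighbour matching on the propensity score, a toy version
# of the Rosenbaum-Rubin idea. The matched-pair average estimates the effect on
# the treated; with a constant effect (= 2 here) that coincides with the ATE.
rng = np.random.default_rng(2)
n = 4_000
X = rng.normal(size=(n, 5))
pi = 1.0 / (1.0 + np.exp(-X[:, 0]))       # assignment driven by the 1st covariate
T = rng.binomial(1, pi)
Y = X[:, 0] + 2.0 * T + rng.normal(scale=0.5, size=n)

treated, control = np.where(T == 1)[0], np.where(T == 0)[0]
# For each treated unit, match the control unit with the closest propensity score
matches = control[np.abs(pi[treated][:, None] - pi[control][None, :]).argmin(axis=1)]
ate_match = np.mean(Y[treated] - Y[matches])
```

Note that matching happens in one dimension (the score) even though \(X\) has five coordinates, which is exactly the dimension reduction the Rosenbaum-Rubin result delivers.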

However, paired matching always suffers from the large variance of a single-pair estimate. This leads to the subclassification approach, which seeks a group of units sharing similar (or, in the ideal scenario, identical) propensity scores. As long as at least one unit in the group belongs to the other treatment arm, the difference between the averaged outcomes within each subgroup equals the desired treatment effect at the given propensity score. Averaging helps reduce the estimation error, but if there are not enough samples at each propensity level, the benefit is quite limited.

Matching and subclassification based on the propensity score are tailored mainly for the estimation of the ATE. After matching/subclassification at each propensity level, they average over all samples to obtain the final ATE, \[ \begin{aligned} E_{\pi(x)}\{E[Y \mid T = 1, \pi(x)] - E[Y \mid T = 0, \pi(x)] \} &= E_{\pi(x)}\{E[Y^{(1)} \mid \pi(x)] - E[Y^{(0)} \mid \pi(x)]\} \\ &= E[Y^{(1)} - Y^{(0)}] = \tau, \end{aligned} \] which is not hurt much by gaps along \(\pi(x)\). In the estimation of the CATE, however, there will be many 'holes' in the estimator, since \(\pi(x)\) can take infinitely many values in \([0,1]\), making prediction much less feasible.

Hirano, Imbens, and Ridder (2003) borrow the strength of equation (5) to estimate the ATE directly, which only requires plugging in a pre-estimated propensity score \(\hat\pi(X)\), without matching or subclassification: \[\begin{equation} \tau = \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}}\left[ \frac{T_i-\hat\pi(X_i)}{\hat\pi(X_i)(1-\hat\pi(X_i))}Y_i \right]. \tag{10} \end{equation}\] They also claim that this yields an efficient estimate of the ATE. Since the propensity score appears in the denominator to weight the outcome, this method is called weighting, or Inverse Probability Weighting (IPW). Much of the later literature on estimating the CATE follows this idea. But the key flaw of this method is the large variability and computational instability introduced by propensity scores near the boundary values 0 and 1.
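The boundary problem just mentioned is usually handled in practice by clipping (trimming) the estimated score away from 0 and 1. The sketch below (hypothetical data with strong confounding so that extreme scores actually occur; the clipping level 0.01 is an arbitrary illustrative choice, not a recommendation) implements the plug-in estimator (10) with clipping:

```python
import numpy as np

# Equation (10) with a plug-in propensity score. Strong confounding pushes
# pi(X) toward 0 and 1; clipping the score bounds the weights at 1/0.01 = 100,
# trading a little bias for a large variance reduction.
rng = np.random.default_rng(3)
n = 20_000
X = rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(-2.0 * X))       # strong confounding -> extreme scores
T = rng.binomial(1, pi)
Y = X + 2.0 * T + rng.normal(scale=0.5, size=n)   # constant true effect = 2

pi_hat = np.clip(pi, 0.01, 0.99)          # stand-in for a fitted model's output
tau_hat = np.mean((T - pi_hat) / (pi_hat * (1 - pi_hat)) * Y)
```

Even with clipping the sampling variability of `tau_hat` remains visibly larger than in the mildly confounded example earlier, which is the instability the text refers to.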

To summarize, to estimate the CATE, or to recommend the optimal treatment, there are mainly two problems to solve: 1) how to define the loss metric, and 2) how to model and estimate the quantities and/or nuisance parameters in the loss function. The loss function can be defined either directly or indirectly, as will be discussed in subsection 2.1. For example, in methods described by equation (10), the loss function actually measures the loss in the estimated propensity score, which is an indirect approach. Moreover, proper modeling and estimation of \(\pi(x)\) is the key to success. Current methods include model based methods, kernel based methods, and other machine learning methods including, but not limited to, regression trees, classification trees, and support vector machines (SVM).

The structure of this section is as follows. The loss metrics frequently adopted in the current literature are discussed in subsection 2.1. In subsection 2.2, we demonstrate the estimation algorithms used to solve the loss functions.

2.1 Loss Metric

The most distinctive and difficult aspect of causal inference is that the true IATE can never be observed. So no matter what estimation algorithm is used–parametric or non-parametric–a direct evaluation of the loss in the estimated treatment effect \(\hat\tau(X)\) is infeasible. Following the literature in this field, we measure the loss between \(\hat\tau(X)\) and \(\tau(X)\), denoted \(l(\hat\tau, \tau)\), by the Mean Squared Error (MSE) \[\begin{equation} \mathbb{E}[l(\hat\tau, \tau)] = \frac{1}{|\mathcal{D}|}\sum_{i \in \mathcal{D}} \left(\hat\tau(x_i) - \tau(x_i) \right)^2, \tag{11} \end{equation}\] where \(\tau(\cdot)\) is the true but unknown CATE, \(\mathcal{D}\) is the observed dataset, and \(|\mathcal{D}|\) is its size. Generally, there are two main branches of work on this problem: avoid directly calculating the \(\tau\)-loss function, named Indirect Loss Function, or approximately evaluate it, called Direct Loss Function Approximation.

Indirect Loss Function

Propensity Score Based Methods

The aforementioned matching, weighting, and subclassification methods belong to this category. In these methods, an accurate estimation of the propensity score is the essence, but estimating \(\tau(x)\) accurately is not the direct goal. Due to the conditional independence, the estimation of \(\pi\) has nothing to do with the outcome \(Y\), which may be a shortcoming, since the observed information is not fully used. The corresponding loss function is \[ \arg\min_\gamma -\sum_{i \in \mathcal{D}} \log[W_i \pi(X_i;\gamma) + (1-W_i)/2], \] where \(W_i = 2T_i -1 \in \{-1,1\}\). If the propensity score (PS) adopts the logit link, then this is equivalent to the logit loss function: \[ \arg\min_\gamma \sum_{i \in \mathcal{D}} \log\left[1 + e^{-W_i X_i^T\gamma}\right]. \] In many later studies, estimating \(\hat\pi(X)\) is just an intermediate but inevitable step.
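The logit loss above can be minimized with any convex optimizer; the sketch below (hypothetical simulated data, plain gradient descent with an arbitrary fixed step size) recovers the coefficients of a correctly specified logistic propensity model. In practice one would simply call an off-the-shelf logistic regression routine:

```python
import numpy as np

# Fitting the propensity score by minimizing the logit loss sum log(1+e^{-W x'g})
# with plain gradient descent (illustrative; step size and iteration count are
# ad hoc choices for this toy problem).
rng = np.random.default_rng(4)
n = 5_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 covariates
gamma_true = np.array([0.2, 1.0, -0.5])
pi = 1.0 / (1.0 + np.exp(-X @ gamma_true))
T = rng.binomial(1, pi)
W = 2 * T - 1                                               # W in {-1, 1}

gamma = np.zeros(3)
for _ in range(2000):
    margin = W * (X @ gamma)
    # gradient of mean log(1 + exp(-margin)): -W * x * sigmoid(-margin)
    grad = -(X * (W / (1.0 + np.exp(margin)))[:, None]).mean(axis=0)
    gamma -= 0.5 * grad                                     # fixed step size
pi_hat = 1.0 / (1.0 + np.exp(-X @ gamma))
```

The fitted `gamma` approaches `gamma_true` up to sampling error, so `pi_hat` is a usable plug-in for the IPW formulas.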

Conditional Mean Based Methods

Another set of straightforward methods follows directly from equation (6): the estimation of the CATE \(\tau(X)\) is accomplished by estimating the conditional means \(\mu_k(X), k \in \{0,1\}\), \[ \hat\tau(X) = \hat\mu_1(X) - \hat\mu_0(X). \] There are two different approaches to estimating \(\mu_k(X)\). One is to train separate models for the treatment group and the control group. The other is to train a single model that treats the treatment indicator as an additional covariate. The former is sometimes named the T-learner (for two-learner) and the latter the S-learner (for single-learner), following Künzel et al. (2019). The objective is then either to minimize \[ \arg\min_{\beta^{(k)}} \mathbb{E}\left\{\left(Y - \mu_k(X;\beta^{(k)}) \right)^2\right\}, \quad k \in \{0, 1\} \] for the T-learner, where \(\beta^{(k)}\) is the set of parameters of model \(\mu_k(\cdot)\), or \[ \arg\min_\beta \mathbb{E}\left\{\left(Y - \mu(X, T;\beta) \right)^2\right\} \] for the S-learner, where \(\mu(X, T=k;\beta) = \mu_k(X;\beta)\) and \(\beta\) are the parameters of \(\mu(\cdot,\cdot)\). Since neither loss metric focuses directly on the CATE, we categorize these methods in the indirect loss function branch.
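Both learners can be sketched with ordinary least squares standing in for the regression learner (hypothetical simulated data; the heterogeneous effect \(\tau(x) = 1 + 2x_2\) is built in so the two estimates can be compared to the truth):

```python
import numpy as np

# T-learner vs S-learner with OLS as the base learner (a minimal sketch; any
# regression method can play the role of mu_k).
rng = np.random.default_rng(5)
n = 8_000
X = rng.normal(size=(n, 2))
T = rng.binomial(1, 0.5, size=n)
# mu_0(x) = x1, mu_1(x) = x1 + 1 + 2*x2, so tau(x) = 1 + 2*x2
Y = X[:, 0] + T * (1.0 + 2.0 * X[:, 1]) + rng.normal(scale=0.5, size=n)

def design(Z):                            # add an intercept column
    return np.column_stack([np.ones(len(Z)), Z])

# T-learner: one OLS fit per arm, then take the difference of predictions
b1, *_ = np.linalg.lstsq(design(X[T == 1]), Y[T == 1], rcond=None)
b0, *_ = np.linalg.lstsq(design(X[T == 0]), Y[T == 0], rcond=None)
tau_T = design(X) @ b1 - design(X) @ b0

# S-learner: one fit with T and T-by-X interactions as extra columns;
# mu(x,T) = b0 + b1*x1 + b2*x2 + b3*T + b4*T*x1 + b5*T*x2, so tau = b3 + b4*x1 + b5*x2
D = np.column_stack([design(X), T, T[:, None] * X])
b, *_ = np.linalg.lstsq(D, Y, rcond=None)
tau_S = b[3] + X @ b[4:]
```

With the interaction columns included, both learners recover \(\tau(x)\) here; the S-learner without interactions would instead force a constant effect, which is one practical difference between the two.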

Any machine learning algorithm can be used to fit the conditional means, for example, kernel regression models with an \(L_1\)-penalty, regression trees, or Bayesian Additive Regression Trees (BART, Chipman et al. (2010)). This branch of methods seems reasonable, but it suffers from the fact that minimizing the loss function of \(\mu_k(x)\) does not guarantee the optimal estimation of \(\tau(x)\). The MSE of \(\hat\tau(X)\) can be expanded as follows: \[\begin{align*} E[\tau(X) - \hat\tau(X)]^2 &= E[(\mu_1(X)-\hat\mu_1(X)) - (\mu_0(X) - \hat\mu_0(X))]^2 \\ &= \text{MSE}(\mu_1) + \text{MSE}(\mu_0) - 2E[(\mu_1(X)-\hat\mu_1(X))(\mu_0(X) - \hat\mu_0(X))], \end{align*}\] where \(\text{MSE}(\mu_k) = \mathbb{E}[(\mu_k(X)-\hat\mu_k(X))^2]\). Notice that minimizing \(\text{MSE}(\mu_k)\) is not equivalent to minimizing \(\mathbb{E}[l(\hat\tau, \tau)]\), and no matter how well we fit \(\mu_k(X)\), we still have no way of assessing the quality of \(\hat\tau(X)\).

Propensity Score + Conditional Mean

Most research works involve estimating both the response surface/conditional mean and the treatment assignment mechanism/propensity score for observational studies. For example, Robins, Hernan, and Brumback (2000) propose a weighted least squares (WLS) regression estimator for modeling the conditional mean response, where the weight is the inverse of the propensity score. However, for these methods, if the model of \(\mu_k(\cdot)\) or \(\pi(\cdot)\) is misspecified, the results are no longer consistent (Kang, Schafer, and others 2007; Freedman and Berk 2008; Imai and Ratkovic 2014). So the doubly robust concept was proposed by Bang and Robins (2005); it involves the estimation of both \(\mu_k(\cdot)\) and \(\pi(\cdot)\), and as long as one of the two models is correctly specified, the doubly robust estimator remains consistent. That means people have two chances of guessing the true model, and the odds of obtaining a consistent estimator increase accordingly. The doubly robust (DR) estimator of \(\tau\) is actually a combination of equations (4) and (5): \[\begin{equation} \tau_{DR}(X) = \mu_1(X) - \mu_0(X) + \frac{T(Y-\mu_1(X))}{\pi(X)} - \frac{(1 - T)(Y-\mu_0(X))}{1 - \pi(X)}. \tag{12} \end{equation}\] \(\hat\tau_{DR}(X)\) is obtained by plugging in the pre-estimated \(\hat\mu_k(X)\) and \(\hat\pi(X)\). However, the biggest problem is that once both models are misspecified, the DR results can be even worse than those of other comparable methods (Kang, Schafer, and others 2007; Imai and Ratkovic 2014).
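The double robustness of (12) can be seen in simulation. In the sketch below (hypothetical data, constant true effect of 2), the outcome models are deliberately misspecified as plain arm means while the propensity score is correct, and the averaged DR score still recovers the ATE:

```python
import numpy as np

# Doubly robust ATE via equation (12): wrong outcome models (intercept-only)
# plus a correct propensity score still give a consistent estimate, while the
# naive arm-mean difference is biased by confounding.
rng = np.random.default_rng(6)
n = 30_000
X = rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(-X))             # true, "correctly specified" score
T = rng.binomial(1, pi)
Y = X + 2.0 * T + rng.normal(scale=0.5, size=n)

mu1_bad = np.full(n, Y[T == 1].mean())    # misspecified outcome models:
mu0_bad = np.full(n, Y[T == 0].mean())    # ignore X entirely
tau_dr = np.mean(
    mu1_bad - mu0_bad
    + T * (Y - mu1_bad) / pi
    - (1 - T) * (Y - mu0_bad) / (1 - pi)
)
```

Swapping the roles (correct outcome models, wrong score) would work equally well, which is exactly the "two chances" property; only when both are wrong does the guarantee disappear.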

Individual Treatment Rule (ITR) Based Methods

In many applications, an accurate treatment effect estimate may not be the primary interest, and people only want to know which treatment is best for an individual. This leads to a different idea: not estimating the treatment effect directly but seeking the optimal treatment assignment. The individual treatment rule (ITR) based methods pursue the best overall outcome. If larger \(Y\) is preferred, then the objective function is \[ \arg\max_g V(g), \] where \[ V(g) = E_g[Y] = \left\{ \begin{array}{ll} E[\mu_1(X)g(X) + \mu_0(X)(1-g(X))] & \quad \text{Plain Version} \\ E\left[ \frac{1[T = g(X)]}{\pi^c(X) }Y\right] & \quad \text{IPW Version} \\ E\left[ \frac{1[T = g(X)]}{\pi^c(X)}Y - \frac{1[T = g(X)]-\pi^c(X)}{\pi^c(X)}\{\mu_1(X)g(X)+\mu_0(X)(1-g(X))\}\right] & \quad \text{DR Version} \end{array} \right. \] and \(\pi^c(X) = W\pi(X) + (1-W)/2\), where \(W = 2T-1 \in \{-1,1\}\).
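The IPW version of \(V(g)\) is easy to evaluate for any candidate rule: units whose observed treatment agrees with the rule are reweighted by \(1/\pi^c\). A minimal sketch (hypothetical randomized data, so \(\pi^c = 1/2\) for everyone; the effect is \(x_2\), so the optimal rule treats exactly when \(x_2 > 0\)):

```python
import numpy as np

# Evaluating candidate rules with the IPW value V(g) = E[ 1[T=g(X)] Y / pi^c ].
rng = np.random.default_rng(7)
n = 40_000
X = rng.normal(size=(n, 2))
T = rng.binomial(1, 0.5, size=n)          # randomized, so pi^c = 1/2 for all units
Y = X[:, 0] + T * X[:, 1] + rng.normal(scale=0.5, size=n)   # effect of T is x2

def value_ipw(g):                         # g: array of 0/1 recommendations
    agree = (T == g)
    return np.mean(agree * Y / 0.5)

v_opt = value_ipw((X[:, 1] > 0).astype(int))   # treat iff x2 > 0
v_none = value_ipw(np.zeros(n, dtype=int))     # never treat
```

Searching over rules \(g\) with this value function is exactly the non-convex optimization discussed next; here we only evaluate two fixed rules and observe that the sign rule dominates never-treating.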

Apparently, one challenge in ITR based methods is the optimization of a non-convex, non-continuous objective function, which is not easily solved by off-the-shelf algorithms. Besides, a CATE estimator is not readily obtained as a by-product, which is a drawback if people want the CATE as well. Qian and Murphy (2011) use the plain version and avoid the computational issue by first obtaining conditional mean estimators and then determining the optimal region \(g(\cdot)\). The error accumulates over the two-step estimation procedure, and the final regime may be sensitive to \(\hat\mu(X)\). Zhang et al. (2012) worked on the IPW (inverse probability weighting) and DR (doubly robust) versions and provide two approaches to solving the objective function: one is grid search, which is almost impossible in high dimensional cases; the other is a genetic algorithm (Goldberg 1989) implemented in the R package rgenoud (Mebane and Sekhon 2011). Zhao et al. (2012) transformed the objective function slightly, put it into the framework of the SVM, and successfully solved it. But the SVM struggles with multiclass classification, so this method cannot easily be extended to multiple-treatment studies; furthermore, the SVM does not handle imbalanced allocation well and is also weak in \(p \gg n\) cases. Chen et al. (2017) provided a general framework for this set of problems that can also quantify the treatment effect (not just the regime). What they did is essentially connect optimal regime finding to direct loss function calculation (discussed in the next subsection), which closes this pipeline.

Direct Loss Function Approximation

The main idea is that even though we cannot observe \(\tau(X)\) directly, we can replace the unknown \(\tau(X)\) in equation (11) by some available proxy with the property that minimizing the new loss function is equivalent to minimizing the target loss function (11). Most methods discussed in this subsection fall into the category of modified variate methods (MVM), which consist of modified outcome methods (MOM) and modified covariate methods (MCM).

Modified Outcome Methods (MOM)

Suppose \(Y^* = cY\) satisfies \(\tau(X) = E[Y^* \mid X]\). We are then able to show that minimizing the proxy loss function \[ E[l(\hat\tau, Y^*)] = E[\{\hat\tau(X) - Y^*\}^2] = E[\{\hat\tau(X) - \tau(X) + \tau(X) - Y^*\}^2] = E[\{\hat\tau(X) - \tau(X) \}^2] + E[\{\tau(X) - Y^*\}^2] \] is equivalent to minimizing \(E[l(\hat\tau, \tau)]\), since the cross term vanishes by \(E[Y^* \mid X] = \tau(X)\) and the last term does not depend on \(\hat\tau\). As the modification turns the outcome \(Y\) into \(Y^*\), this approach is named the modified outcome method.

Signorovitch (2007) first explored this idea in his PhD dissertation, based on the relationship \(\tau(X) = E[2WY \mid X]\). The objective is then to minimize the loss function \[\begin{equation} E\left[ \{\hat\tau(X)-2WY \}^2 \right]. \tag{13} \end{equation}\] If we further assume a linear structure for the CATE, that is \(\tau(X) = X^T\beta\), then the minimization can be done simply by fitting the linear model \[\begin{equation} 2WY = X^T\beta + \varepsilon \end{equation}\] when the outcome \(Y\) is continuous. But \(\tau(X) = E[2WY \mid X]\) holds only in 1:1 randomized experiments, so an IPW version of MOM was designed for observational studies. Based on equation (7), the loss function for the IPW version of MOM is straightforward: \[\begin{equation} E[l(\hat\tau, \tau)] = E\left[ \left\{ \hat\tau(X)- \frac{T - \pi(X)}{\pi(X)(1-\pi(X))}Y \right\}^2 \right], \tag{14} \end{equation}\] where \(\pi(X)\) is usually pre-estimated separately. Just as with other IPW based methods, the results are sensitive to the propensity score and no longer consistent when the model of \(\pi(\cdot)\) is misspecified (Knaus, Lechner, and Strittmatter 2018).
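Under a linear CATE, the MOM fit for (13) is just a regression of the modified outcome \(2WY\) on \(X\). A minimal sketch on hypothetical 1:1 randomized data (true \(\tau(x) = 1 + 2x_2\) built in):

```python
import numpy as np

# MOM in a 1:1 randomized study: since tau(X) = E[2WY | X], regressing the
# modified outcome 2WY on X recovers the linear CATE tau(x) = 1 + 2*x2.
rng = np.random.default_rng(8)
n = 20_000
X = rng.normal(size=(n, 2))
T = rng.binomial(1, 0.5, size=n)
W = 2 * T - 1
Y = X[:, 0] + T * (1.0 + 2.0 * X[:, 1]) + rng.normal(scale=0.5, size=n)

D = np.column_stack([np.ones(n), X])                   # intercept + covariates
beta, *_ = np.linalg.lstsq(D, 2 * W * Y, rcond=None)   # fit 2WY = D beta + eps
tau_mom = D @ beta
```

Note that the modified outcome \(2WY\) is far noisier than \(Y\) itself (its conditional variance is inflated by the transformation), which is the usual price MOM pays for unbiasedness.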

Similarly, adopting equation (12), one can develop a DR version of MOM. We do not further explore the possible proxies here. The main issue with MOM is that the outcome is required to be continuous, so it cannot be extended to other types of outcomes. The MCM methods were proposed to fix this issue.

Modified Covariate Methods (MCM)

MCM was first developed in Tian et al. (2014), who modified the MOM methods by slightly adjusting the proxy loss function \[ E[\{\hat\tau(X) - Y^*\}^2] = E[c^2\{\hat\tau(X)/c - Y\}^2] \] and then minimizing the right-hand side, which is the MCM loss function, also targeting \(\hat\tau(X)\) directly. This branch of methods is fundamentally similar to MOM but makes no modification to the outcome, resulting in no restriction on \(Y\). Under the linear assumption \(\hat\tau(X) = X^T\beta\), the original covariates \(X\) are turned into \(X/c\), which is where the name modified covariate comes from. Furthermore, the CATE can then be easily obtained by weighted least squares, with \(c^2\) as the weights.

For example, the MCM version of loss function (13) is \[ E\left[ \{\hat\tau(X)\cdot W/2-Y \}^2 \right]. \] For the MOM-IPW version, with some modification of (14), the loss function for the MCM-IPW version is derived as \[ E\left[ \left\{ \hat\tau(X) \cdot [(W+1)/2 - \pi(X)]- Y \right\}^2 \right], \] and with further adjustment, it becomes \[ E\left[ \frac{\left\{ \hat\tau(X)\cdot W- Y \right\}^2}{W\pi(X) + (1-W)/2} \right]. \] These two loss functions are exactly the two methods proposed by Chen et al. (2017) after generalizing the ITR method.

Note that only linear structures for \(\hat\tau(X)\) in MVM based methods (Tian et al. 2014; Nie and Wager 2017; Knaus, Lechner, and Strittmatter 2018) have been studied in the literature, due to their simplicity and ease of computation, though they are limited in scope. For more complicated proposed structures of \(\hat\tau(X)\), general optimization algorithms can be applied directly to these loss functions.

R-learner

The R-learner (Nie and Wager 2017) is the only method that does not directly rely on equations (4) - (9); technically, however, the R-learner can still be regarded as a MOM based method. We highlight it here because it is different from the previous methods and, more importantly, it builds the technical foundation of our proposed method. The R-learner is based mainly on Robinson's transformation (Robinson 1988) \[ Y - m(X) = \{T - \pi(X)\}\tau(X) + \varepsilon, \] where the marginal mean function \(m(X) = E[Y \mid X]\). The loss function then aims to minimize \[ \arg\min_\tau E[\varepsilon^2(\tau)]=\arg\min_\tau E[\left\{Y - m(X) - (T - \pi(X))\tau(X)\right\}^2]. \] After some algebra, this is equivalent to \[ \arg\min_\beta E\left[(T - \pi(X))^2\left\{ \hat\tau(X;\beta) - \frac{Y - m(X)}{T - \pi(X)}\right\}^2 \right], \] which is a typical form of MOM loss function. In their work, Nie and Wager also assume a linear structure for \(\hat\tau(X)\) and estimate it with a two-step procedure, first calculating \(\hat{m}(X)\) and \(\hat\pi(X)\).
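The two-step procedure can be sketched as follows (hypothetical randomized data so \(\pi = 0.5\) is known exactly; the nuisance \(m(X)\) is fit by OLS on a small basis that happens to nest the truth, and the second-step weighted least squares reduces to an ordinary regression on the scaled residuals):

```python
import numpy as np

# Two-step R-learner sketch: (1) estimate m(x) = E[Y|X] (pi(x) = 0.5 is known
# here since the toy study is randomized); (2) fit a linear tau by least
# squares on the Robinson residuals Y - m(X) against (T - pi) * basis(X).
rng = np.random.default_rng(9)
n = 20_000
X = rng.normal(size=(n, 2))
T = rng.binomial(1, 0.5, size=n)
Y = np.sin(X[:, 0]) + T * (1.0 + 2.0 * X[:, 1]) + rng.normal(scale=0.5, size=n)

# Step 1: nuisance estimates (basis chosen to contain the true m for this toy)
B = np.column_stack([np.ones(n), np.sin(X[:, 0]), X[:, 1]])
m_hat = B @ np.linalg.lstsq(B, Y, rcond=None)[0]
pi_hat = np.full(n, 0.5)

# Step 2: minimize E[{(Y - m) - (T - pi) tau(X)}^2] over linear tau(x) = d'beta
D = np.column_stack([np.ones(n), X])
Z = (T - pi_hat)[:, None] * D                  # regressors scaled by (T - pi)
beta, *_ = np.linalg.lstsq(Z, Y - m_hat, rcond=None)
tau_r = D @ beta
```

The built-in effect is \(\tau(x) = 1 + 2x_2\), and `beta` recovers \((1, 0, 2)\) up to sampling error. In real applications both nuisances would be cross-fitted with flexible learners rather than fit on the same data with a hand-picked basis.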

Overall, the direct loss function based methods are usually more popular, as they directly measure the loss of \(\hat\tau(x)\). On the other hand, compared to indirect loss functions, the requirement of additional assumptions to model \(\tau\) can be a serious drawback for this set of methods. Many methods, and almost all the direct loss function based methods, involve a two-step procedure: first estimate \(\hat\mu_k(\cdot)\), \(\hat\pi(\cdot)\), or \(\hat{m}(\cdot)\), and then plug them into the final loss function to obtain \(\hat\tau(\cdot)\). So the accuracy of these nuisance estimators is critical to the consistency of the treatment effect estimation.

2.2 Estimation Algorithm

Constructing the objective loss function transforms the causal inference problem in observational studies into a mathematical optimization problem. The next step is to obtain the estimators that appear in the loss function: \(\hat\mu_k(x)\), \(\hat\pi(x)\), \(\hat{m}(x)\), and \(\hat\tau(x)\). The forthcoming discussion will focus on parametric methods, kernel-based methods, and machine learning algorithms, mainly tree-based algorithms.

Parametric Methods

The estimation is generally carried out with a generalized linear model (GLM). For instance, it is very common to estimate the propensity score by a logistic regression \[ \text{logit}(\pi) = X^T\beta. \] In many indirect loss metric methods, and for essentially all the direct loss metrics, people often estimate the CATE \(\tau(X)\) by a linear model \(\tau(X) = X^T\beta\), along with the LASSO to accomplish variable selection. But it is not very common to see a linear model used directly in the estimation of the conditional means. Parametric models enjoy the benefits of simplicity in both theoretical proofs and computation, but are restricted to the specified models.

Kernel-based Methods

As one of the most popular non-parametric approaches, kernel-based models are readily used for estimating the propensity score, for example, \[ \hat\pi(x) =\frac{\sum_{i = 1}^n T_iK\left(\frac{X_i - x}{h} \right) /nh^k}{\sum_{i = 1}^n K\left(\frac{X_i - x}{h} \right)/nh^k} \] in Abrevaya, Hsu, and Lieli (2015). Plugging this estimator into equation (7), \(\hat\tau(x)\) can then be obtained by a locally weighted approach. For the direct loss metrics, Chen et al. (2017) adopted a B-spline based additive model to estimate \(\hat\tau(x)\). An \(L_1\) penalty is also commonly used in kernel based methods to control model complexity.
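The estimator above is a Nadaraya-Watson smoother of \(T\) on \(X\) (the \(nh^k\) factors cancel between numerator and denominator). A one-dimensional sketch on hypothetical data, with a Gaussian kernel and an arbitrary bandwidth:

```python
import numpy as np

# Nadaraya-Watson estimate of the propensity score with a Gaussian kernel;
# the normalizing nh^k factors cancel, leaving a weighted average of T.
rng = np.random.default_rng(11)
n = 3_000
X = rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(-X))             # true score: pi(0) = 0.5
T = rng.binomial(1, pi)

def pi_kernel(x, h=0.3):                  # h is an illustrative bandwidth choice
    w = np.exp(-0.5 * ((X - x) / h) ** 2) # Gaussian kernel K((X_i - x)/h)
    return (T * w).sum() / w.sum()

pi_hat_0 = pi_kernel(0.0)                 # true value is 0.5
```

Bandwidth selection (and the curse of dimensionality for covariates of dimension \(k\)) is the practical difficulty this family of methods faces.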

Tree-based Methods

Athey and Imbens (2015), Athey and Imbens (2016), Wager and Athey (2018), and Athey et al. (2019) provided a very comprehensive view of incorporating tree-related algorithms (CART, Breiman (2017); bagging, Breiman (1996); boosting, Freund, Schapire, and others (1996); gradient boosting, Friedman (2001)) into the estimation of the CATE. These methods can be further split into two branches: regression trees and causal trees. In our framework, regression trees are viewed as using an indirect loss metric during tree construction, while causal trees adopt a direct loss metric.

The regression tree based methods follow the idea of estimating \(\mu_k(x)\) by CART or random forests. The idea is no different from using kernel based methods to estimate the conditional means, just with a different learning algorithm. Tree-related algorithms more or less conduct subclassification, so that each leaf contains homogeneous subjects and the outcome can be estimated by the leaf mean. One can build two trees, each for one treatment group (corresponding to the T-learner), or, following the S-learner idea, a single tree in which the treatment indicator can be used to split. These methods provide a non-parametric estimator of the conditional mean response \(\hat\mu_k(x)\), and the estimated treatment effect is simply \(\hat\tau(x) = \hat\mu_1(x) - \hat\mu_0(x)\).

Aside from estimating conditional means by traditional regression tree methods, the Bayesian Additive Regression Trees (BART) method was adopted in Hill (2011). BART (Chipman, George, and McCulloch 1998, 2002; Chipman et al. 2010) has a structure similar to boosted trees (Freund, Schapire, and others 1996), but each tree is considered a function mapping covariates to the outcome. So each tree is a model, and the tree structures, including the depth of the tree and the mean values at the final leaves, are the model parameters. With the assumption that units in the same leaf are normally distributed, the likelihood of a set of trees can be written out, and with proper prior assumptions on the parameters, the posterior distribution is computed via an iterative Bayesian backfitting MCMC algorithm. Even though this method adopts the T-learner, an indirect loss metric, it is one of the top performers (in terms of bias and RMSE) among 15 black-box algorithms for estimating the CATE across 20 data sets, according to Dorie et al. (2019).

Causal trees, on the other hand, no longer target homogeneous outcomes during tree construction but instead look at the causal effect directly. Therefore, the loss functions minimized at each step are direct loss metrics. For instance, consider the MOM-IPW case (Athey and Imbens 2015; Powers et al. 2018), \[ Y^* \leftarrow \frac{T - \tilde\pi(X)}{\tilde\pi(X)(1 - \tilde\pi(X))}Y \] where \(\tilde\pi(X)\) usually needs to be estimated separately. If we directly build a tree for this transformed outcome \(Y^*\) and adopt MSE as the split rule, then the loss function at each step is \[ \sum_{l=1}^L \sum_{i \in \mathcal{S}_l} \left( Y_i^* - \bar{Y}_i^*\right)^2 = \sum_{i} \left( Y_i^* - \hat{\tau}(X_i)\right)^2, \] where \(\mathcal{S}_l\) represents the set of units in node \(l\), and \(\bar{Y}_i^*\) is the mean of \(Y^*\) within the node containing unit \(i\), \[ \bar{Y}_i^* = \frac{1}{|\mathcal{S}_l|}\sum_{i \in\mathcal{S}_l}Y^*_i = \hat{\tau}(X_i). \] As we have already seen in the previous section, this loss function leads to the target of minimizing the loss of \(\hat\tau(X)\).
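A small simulation (ours, with the true propensity \(\tilde\pi(X)\) assumed known for simplicity) illustrates why the transformed outcome works: \(E[Y^* \mid X] = \tau(X)\), so a leaf mean of \(Y^*\) estimates the treatment effect within the leaf.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
x = rng.uniform(0.0, 1.0, n)
pi = 0.3 + 0.4 * x                       # true propensity, assumed known here
t = (rng.uniform(0.0, 1.0, n) < pi).astype(float)
tau_true = 2.0 * x                       # true CATE
y = x + t * tau_true + rng.normal(0.0, 0.1, n)

# Modified outcome (MOM-IPW): E[Y* | X] = tau(X)
y_star = (t - pi) / (pi * (1.0 - pi)) * y

# A leaf mean of Y* then estimates tau within the leaf; pretend one node is:
in_leaf = (x > 0.45) & (x < 0.55)
tau_leaf = y_star[in_leaf].mean()        # should be near tau(0.5) = 1.0
```

Note the propensity appears in the denominator of \(Y^*\), which inflates the variance of the transformed outcome when \(\tilde\pi(X)\) is near 0 or 1; this is one motivation for the loss metric proposed in Section 3.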

This direct application is theoretically straightforward, but it may incur bias if we directly plug the pre-estimated propensity score into the adjustment of the outcomes for the subpopulation in a single leaf. A weighted-average way of estimating \(\hat\tau(x)\) is therefore proposed in Athey and Imbens (2015). In the same paper, they also formally named a tree-based algorithm the causal tree, which uses the transformed outcome as the response and maximizes the total treatment effect as the objective. Hence the loss function can be set as \[ Loss = -\sum_i \hat\tau(X_i)^2. \]

Notice that the above estimation methods can be used simultaneously to solve a single objective function. For example, Nie and Wager (2017) used a logistic regression with LASSO to estimate the propensity score and a linear model for \(\tau(X)\), but kernel-based methods for the other nuisance parameters. In Athey and Imbens (2015), they first estimated propensity scores from logistic models and then plugged the estimators into the tree-based algorithms.

Support Vector Machine (SVM)

SVM has been used mainly for solving the ITR problem due to its strength in handling discontinuous loss functions (Zhao et al. 2012; Imai, Ratkovic, and others 2013). However, SVM is usually hard to extend to multiple-treatment scenarios, so we will not explore SVM further, considering that our target is the recommendation of the optimal treatment among a number of alternatives.

3 Proposed Method

The current methods mainly discuss the scenario of two arms, treatment versus control. Even though in much of the literature the authors claim that their methods could be extended to multiple-treatment cases, none of them has really implemented it. Our proposed method is tailored for this more complicated scenario and at the same time provides an alternative approach to the existing methods. Following the structure of Section 2, the proposed method will also be presented in two parts: the loss metric and the estimation algorithm.

Following the notation in Section 1, assume there are a total of \(K+1\) treatments, \(T\in\{0,1,2,...,K\}\), and denote by \(T=0\) the standard therapy or placebo. The covariate-specific causal effect of the \(k^{th}\) treatment against the standard therapy is \[ \tau^{(k)}(X) = E[Y^{(k)} - Y^{(0)}|X], \] and our goal is to recommend a therapy based on covariates \(X\) that maximizes the treatment effect \[ \text{Recommended Treatment } T^* = \arg\max_k \tau^{(k)}(X). \]

3.1 Proposed Loss Metric

We extend the R-learner (Nie and Wager 2017) to the multiple-treatment scenario following Robinson's decomposition (Robinson 1988). We adopt this loss metric because 1) it directly targets minimizing the estimation error of \(\tau(x)\); 2) it holds for any outcome distribution, including binary outcomes, so we do not need any extra modification such as the MVM metric; 3) the propensity scores do not appear in the denominators, which avoids potential computing issues as well as additional variability.

Under Assumption 1,
\[ \begin{aligned} E[Y \mid T, X] &= E\left[ \sum_{k=0}^{K} 1[T=k]Y^{(k)} \;\middle|\; T, X \right] = \sum_{k=0}^{K} 1[T=k] E\left[ Y^{(k)} \mid T, X \right] \\ &= \sum_{k=0}^{K} 1[T=k] \left(\tau^{(k)}(X) + E[Y^{(0)} \mid X] \right), \quad \tau^{(0)}(X) \equiv 0, \\ &= E[Y^{(0)}\mid X] + \sum_{k=1}^{K} 1[T=k] \tau^{(k)}(X), \end{aligned} \] since \(\sum_{k=0}^{K} 1[T=k] = 1\). We can rewrite this term by plugging in the conditional mean outcome \[ m(X) = E[Y \mid X] = E\left[ \sum_{k=0}^{K} 1[T=k]Y^{(k)} \;\middle|\; X \right] = E[Y^{(0)}\mid X] + \sum_{k=1}^{K} \pi^{(k)}(X) \tau^{(k)}(X), \] which implies \[ E[Y \mid T, X] = m(X) + \sum_{k=1}^{K} \left(1[T=k] - \pi^{(k)}(X)\right) \tau^{(k)}(X), \] where the propensity score for the \(k^{th}\) treatment is denoted \(\pi^{(k)}(X) = E[1[T = k] \mid X] = \Pr[T = k \mid X]\). Hence, the estimator \(\hat\tau^{(\cdot)}(\cdot)\) can be obtained by minimizing the following loss function \[\begin{equation} \arg\min_\tau \left\{ E\left[ \left(Y - m(X) - \sum_{k=1}^{K} \left(1[T=k] - \pi^{(k)}(X)\right) \tau^{(k)}(X) \right)^2 \right] \right\}. \tag{15} \end{equation}\]
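The decomposition above can be checked numerically. The snippet below (an illustration with arbitrary made-up values) verifies that \(E[Y \mid T, X] = m(X) + \sum_k \left(1[T=k] - \pi^{(k)}(X)\right)\tau^{(k)}(X)\) holds for every treatment arm at a fixed \(X\), with \(K = 2\).

```python
import numpy as np

# One fixed X, K = 2 active treatments plus the standard therapy T = 0.
mu0 = 1.0                               # E[Y^(0) | X]
tau = np.array([0.8, -0.3])             # tau^(1)(X), tau^(2)(X)
pi = np.array([0.3, 0.2])               # pi^(1)(X), pi^(2)(X); pi^(0)(X) = 0.5

m = mu0 + pi @ tau                      # m(X) = E[Y^(0)|X] + sum_k pi^(k) tau^(k)

for T in (0, 1, 2):
    ind = np.array([T == 1, T == 2], dtype=float)
    lhs = mu0 + ind @ tau               # E[Y | T, X] from the first display
    rhs = m + (ind - pi) @ tau          # decomposition entering the loss (15)
    assert np.isclose(lhs, rhs)
```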

3.2 Proposed Estimation Algorithm

To accomplish the optimization, we adopt a two-step estimation process according to the structure of the loss function. First we estimate \(m(X)\) and \(\pi^{(\cdot)}(X)\), and then plug the estimators into equation (15). In addition, we need to specify a class of functions for the causal effects \(\hat\tau^{(\cdot)}(X)\).

In order to handle the potentially high-dimensional structure of the data, we adopt the Generalized Additive Model (GAM) with a B-spline basis for the estimation of both the conditional mean and the propensity scores. For modeling the causal treatment effects, we combine the GAM with a tree-based learning algorithm to utilize the strengths of both methods.

Estimation of Conditional Mean Outcome \(m(X)\)

Suppose there are \(P\) covariates; then the conditional mean outcome can be modeled by \[ m(X) = \alpha + f_1(X_1) + f_2(X_2) + \cdots + f_P(X_P) \] where \[ f_p(x) = \sum_{l=1}^{q_n} a_{lp}B_l(x). \]
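A minimal sketch of this first-stage fit, assuming covariates scaled to \([0,1]\), using `scipy` to build a clamped cubic B-spline basis \(B_l\) and ordinary least squares for the additive fit; the data-generating model here is made up for illustration.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(2)
n, P = 2000, 2
X = rng.uniform(0.0, 1.0, (n, P))                 # covariates scaled to [0, 1]
m_true = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2
y = m_true + rng.normal(0.0, 0.1, n)

def bspline_design(x, n_interior=5, degree=3):
    """Columns B_1(x), ..., B_{q_n}(x) of a clamped B-spline basis on [0, 1]."""
    interior = np.linspace(0.0, 1.0, n_interior + 2)[1:-1]
    t = np.r_[[0.0] * (degree + 1), interior, [1.0] * (degree + 1)]
    q = len(t) - degree - 1                       # number of basis functions q_n
    return np.column_stack([BSpline(t, np.eye(q)[l], degree)(x) for l in range(q)])

# Additive design: intercept alpha plus one basis expansion f_p per covariate.
D = np.column_stack([np.ones(n)] + [bspline_design(X[:, p]) for p in range(P)])
coef, *_ = np.linalg.lstsq(D, y, rcond=None)      # least-squares GAM fit
m_hat = D @ coef
rmse = np.sqrt(np.mean((m_hat - m_true) ** 2))
```

Since each B-spline basis sums to one, the design is rank-deficient with an intercept; the minimum-norm least-squares solution still yields unique fitted values \(\hat m(X)\).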

Estimation of Propensity Score \(\pi^{(\cdot)}(X)\)

For each propensity score, since the treatment indicator \(1[T=k]\) is binary, we adopt the logit link function \[ \log \frac{\pi^{(k)}(X)}{1 - \pi^{(k)}(X)} = \beta^{(k)} + g^{(k)}_1(X_1) + g^{(k)}_2(X_2) + \cdots + g^{(k)}_P(X_P) \] where \[ g^{(k)}_p(x) = \sum_{l=1}^{q_n} b^{(k)}_{lp}B_l(x). \]
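Once the basis expansion is fixed, each propensity model is an ordinary logistic regression. A sketch fitting one arm's model by iteratively reweighted least squares (IRLS, i.e., Newton-Raphson) follows, with a single linear term standing in for the spline expansion \(g^{(k)}_p\); the simulated design is ours.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3000
x = rng.uniform(-1.0, 1.0, n)
eta_true = 0.5 + 1.5 * x                  # beta^(k) + g^(k)(x), linear for the sketch
p_true = 1.0 / (1.0 + np.exp(-eta_true))
tk = (rng.uniform(0.0, 1.0, n) < p_true).astype(float)   # 1[T = k]

# Fit the logit model by IRLS; replacing the column x with B-spline
# columns gives the GAM version of the propensity model.
D = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-D @ beta))   # current fitted probabilities
    W = p * (1.0 - p)                     # IRLS weights
    beta += np.linalg.solve(D.T @ (W[:, None] * D), D.T @ (tk - p))

pi_hat = 1.0 / (1.0 + np.exp(-D @ beta))  # estimated propensity pi^(k)(x)
```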

Proposed Models for Causal Treatment Effects \(\tau^{(\cdot)}(X)\)

Nie and Wager (2017) assume a linear structure for \(\tau(X)\) for simplicity of proof. For a better fit, free from such model restrictions, we adopt a GAM to model the causal treatment effects \[ \tau^{(k)}(X) = \gamma^{(k)} + h^{(k)}_1(X_1) + h^{(k)}_2(X_2) + \cdots + h^{(k)}_P(X_P) \] where \[ h^{(k)}_p(x) = \sum_{l=1}^{q_n} c^{(k)}_{lp}B_l(x). \]

In the two-step estimation procedure, after obtaining \(\hat{m}(X)\) and \(\hat\pi^{(\cdot)}(X)\) in the first step, minimizing the loss function yields the estimates of \((\underline\gamma, \underline c)\) \[\begin{equation} (\underline{\hat{\gamma}}, \underline{\hat{c}}) = \arg\min_{\underline\gamma, \underline c} \sum_{i=1}^n \left(Y_i - \hat{m}(X_i) - \sum_{k=1}^{K} \left(1[T_i=k] - \hat\pi^{(k)}(X_i)\right) \tau^{(k)}(X_i;\underline\gamma, \underline c) \right)^2 . \tag{16} \end{equation}\] The estimates of covariate-specific causal effects are then given by \[\begin{equation} \hat\tau^{(k)}(X) = \hat\gamma^{(k)} + \sum_{p = 1}^P\left(\sum_{l=1}^{q_n}\hat{c}^{(k)}_{lp}B_l(X_p)\right). \tag{17} \end{equation}\]
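Given \(\hat{m}\) and \(\hat\pi^{(\cdot)}\), equation (16) is an ordinary least-squares problem in \((\underline\gamma, \underline c)\): the design columns are \(\left(1[T=k] - \hat\pi^{(k)}(X)\right)\) times the basis functions. A sketch with oracle nuisances, \(K = 2\), and a linear basis standing in for the B-spline expansion (all simulated values are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 5000, 2
x = rng.uniform(0.0, 1.0, n)

pi = np.column_stack([np.full(n, 0.3), np.full(n, 0.3)])   # pi^(1), pi^(2); pi^(0) = 0.4
u = rng.uniform(0.0, 1.0, n)
t = (u < 0.3).astype(int) + 2 * ((u >= 0.3) & (u < 0.6)).astype(int)

tau_true = np.column_stack([1.0 + x, 0.5 - x])             # tau^(k)(x) = gamma^(k) + c^(k) x
mu0 = np.sin(x)                                            # E[Y^(0) | x]
ind = np.column_stack([t == 1, t == 2]).astype(float)      # 1[T = k]
y = mu0 + (ind * tau_true).sum(axis=1) + rng.normal(0.0, 0.1, n)

m = mu0 + (pi * tau_true).sum(axis=1)                      # oracle nuisances for the sketch

# Equation (16) as least squares: columns are (1[T=k] - pi_k) * basis(x).
w = ind - pi
basis = np.column_stack([np.ones(n), x])                   # stand-in for the B-spline basis
D = np.column_stack([w[:, [k]] * basis for k in range(K)])
coef, *_ = np.linalg.lstsq(D, y - m, rcond=None)
tau_hat = np.column_stack([basis @ coef[2 * k:2 * k + 2] for k in range(K)])
```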

Instead of directly optimizing the loss function (16) with a LASSO algorithm as in Nie and Wager (2017), we adopt a tree-based recursive partitioning algorithm to find the optimal causal treatment effect. The idea is to recursively split the nodes to grow a tree that achieves homogeneous leaves upon completion; within each node, we estimate the treatment effects by equation (17). The details of generating trees will be demonstrated in Section 3.3.

The main reason to adopt a tree-based algorithm is that its natural feature of generating homogeneous leaves and assigning heterogeneous units to different leaves perfectly matches our requirement of seeking heterogeneous treatment effects. Our ultimate goal is to find the optimal covariate-specific treatment, rather than the most accurate estimation of the treatment effects, which is almost impossible. So we prefer not to solve equation (16) directly by LASSO but to subclassify homogeneous patients.

Our proposed method also differs from the existing tree-based algorithms. Causal-tree-based algorithms split a node according to the estimate of \(\tau(X)\) in each node under the assumption of homogeneity. However, at the earlier stages of tree generation, the homogeneity assumption is usually invalid, which leads to bias in \(\hat\tau(X)\) and undermines the final estimators. In our method, we estimate the treatment effect using a GAM at each node without the need for homogeneity, avoiding this issue. Besides, we propose an adaptive approach, which estimates the nuisance parameters \(m(X)\) and \(\pi^{(\cdot)}(X)\) within each node; this is very different from most of the current literature, which obtains the estimators from the whole data set (Athey and Imbens 2015; Powers et al. 2018; Wager and Athey 2018). Our method is also model-free compared to Su et al. (2009), who split the tree by a statistic testing the interaction effect between treatments and covariates in a linear model.

Aside from these advantages over the existing methods, adopting a tree structure helps to alleviate the potential problems of the GAM. The GAM is known for its great flexibility but has several limitations: 1) its additive structure is not able to account for interaction effects; 2) a GAM with B-spline regression is prone to overfitting. For the first issue, the tree structure can capture interactions and non-linearity even though \(\tau^{(\cdot)}(X)\) is estimated additively in each leaf. A common way to address overfitting is to add penalty terms to the loss function, but that could be computationally very expensive in our case. Here we adopt the idea of random forests and subsample the covariates during the growth of each single tree, which mitigates overfitting and simplifies the optimization as well.

3.3 Estimation Process

Denote \(\underline\tau = (\tau^{(1)}, \tau^{(2)}, ..., \tau^{(K)})\), and the estimated individual treatment effects from all treatment arms for the \(i^{th}\) subject are \(\underline{\hat\tau_i} = \left(\hat\tau^{(1)}(X_i), \hat\tau^{(2)}(X_i), ..., \hat\tau^{(K)}(X_i) \right)\). We propose the following algorithm to generate a single tree:

  • Step 1: Split the current node into two parts, a left subnode and a right subnode, at a given value of one covariate. The samples in the left and right subnodes are denoted by the sets \(\mathcal{S}_L\) and \(\mathcal{S}_R\), with sizes \(N_L\) and \(N_R\), respectively. At each subnode, we first fit \(\hat{m}(X)\) and \(\hat\pi^{(k)}(X), k = 1, 2, ..., K\), and then obtain the model parameters \((\underline{\hat{\gamma}}, \underline{\hat{c}})\) by solving equation (16). The estimates of the individual treatment effects \(\underline{\hat\tau_i}\) are then available from equation (17); we call these 'pseudo' causal effects, since they are mainly used to facilitate the tree build-up. Denote the \(N_L\) 'pseudo' outcomes in the left subnode by \[ \hat\tau_L = \{\underline{\hat\tau_i} : i \in \mathcal{S}_L\} \] and for the right subnode, \[ \hat\tau_R = \{\underline{\hat\tau_i} : i \in \mathcal{S}_R\}. \]

  • Step 2: We then calculate the two-sample Hotelling's T-Square statistic to test the null hypothesis that the average treatment effect is the same in the left and right subnodes, that is, \(H_0: \tau_L = \tau_R\). This T-Square statistic is \[ T_H^2 = (\bar{\hat\tau}_L - \bar{\hat\tau}_R)^T \left\{S_p \left( \frac{1}{N_L} + \frac{1}{N_R} \right) \right\}^{-1} (\bar{\hat\tau}_L - \bar{\hat\tau}_R) \] where \[ \begin{aligned} \bar{\hat\tau}_L &= \frac{1}{N_L}\sum_{i \in \mathcal{S}_L} \underline{\hat\tau_i}, \\ \bar{\hat\tau}_R &= \frac{1}{N_R}\sum_{i \in \mathcal{S}_R} \underline{\hat\tau_i}, \\ S_p &= \frac{(N_L - 1)S_L + (N_R -1)S_R}{N_L + N_R - 2} \end{aligned} \] with the sample covariance matrices \[ \begin{aligned} S_L &= \frac{1}{N_L - 1} \sum_{i \in \mathcal{S}_L} \left( \underline{\hat\tau_i} - \bar{\hat\tau}_L \right)\left( \underline{\hat\tau_i} - \bar{\hat\tau}_L \right)^T, \\ S_R &= \frac{1}{N_R - 1} \sum_{i \in \mathcal{S}_R} \left( \underline{\hat\tau_i} - \bar{\hat\tau}_R \right)\left( \underline{\hat\tau_i} - \bar{\hat\tau}_R \right)^T. \end{aligned} \] Since Hotelling's two-sample T-squared statistic is related to the F-distribution by \[ \frac{N_L+N_R-K-1}{(N_L+N_R-2)K} T_H^2 \sim F(K, N_L+N_R-K-1), \] the p-value of the test, denoted \(p_{val}\), is computed accordingly. However, a simple comparison of p-values at all possible split values is not fair unless we adjust for the total number of distinct values \(m\) in the covariate (to handle multiple comparisons) and the depth of the node \(d\) (to pursue a balanced tree and avoid overfitting). We recommend using the adjusted logworth, \[ a.logworth = -\log_{10}(2^d\cdot m\cdot p_{val}). \] After going through all the potential split points, the one that provides the largest \(a.logworth\) is adopted as the next split.
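Step 2 can be sketched directly; the helper below (our own naming) computes \(T_H^2\), the F-based p-value, and the adjusted logworth for one candidate split, using the usual \(1/(N-1)\) sample covariances for the pooled estimate \(S_p\).

```python
import numpy as np
from scipy import stats

def split_score(tau_L, tau_R, depth, n_values):
    """Hotelling two-sample T^2 and adjusted logworth for one candidate split.
    tau_L, tau_R: (N_L x K), (N_R x K) arrays of 'pseudo' effects;
    depth: node depth d; n_values: number m of distinct split values."""
    nL, K = tau_L.shape
    nR = tau_R.shape[0]
    diff = tau_L.mean(axis=0) - tau_R.mean(axis=0)
    SL = np.cov(tau_L, rowvar=False)          # 1/(N_L - 1) convention
    SR = np.cov(tau_R, rowvar=False)
    Sp = ((nL - 1) * SL + (nR - 1) * SR) / (nL + nR - 2)
    T2 = diff @ np.linalg.solve(Sp * (1 / nL + 1 / nR), diff)
    F = (nL + nR - K - 1) / ((nL + nR - 2) * K) * T2
    p_val = stats.f.sf(F, K, nL + nR - K - 1)
    return T2, -np.log10(2 ** depth * n_values * p_val)

# Two subnodes whose first effect coordinate clearly differs:
rng = np.random.default_rng(6)
tau_L = rng.normal(0.0, 1.0, (200, 2))
tau_R = rng.normal(0.0, 1.0, (200, 2)) + np.array([1.0, 0.0])
T2, a_logworth = split_score(tau_L, tau_R, depth=2, n_values=10)
```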

  • Step 3: Repeat Steps 1 and 2 until no further split can be made, or a certain stopping rule is met.

  • Step 4: After completing the growth of the tree, we need to prune it to the right size to avoid overfitting. Pruning and cross-validation can be a burden when statistical tests are used for splitting (Zeileis, Hothorn, and Hornik 2008; Athey and Imbens 2016). In our work, we propose to apply a different criterion for validation and pruning so as to achieve 'honest' trees as proposed by Athey and Imbens (2016). Instead of the T-Square statistic, we adopt a direct measure of treatment effect, \[ \sum_{i \in \{\mathcal{S}_L, \mathcal{S}_R\}} \max\{\underline{\hat\tau_i}\}^2, \] to evaluate whether a split helps to maximize the overall benefit. The splits that do not improve the treatment effects in the validation set are removed.

  • Step 5: In the final tree, each terminal node defines a sub-population homogeneous in treatment effects. Thus, the covariate-specific treatment effects for all \(K\) treatments can be treated as constants instead of being modeled by the GAM, and estimated by minimizing the loss function \[ \underline{\hat\tau}(X) = \arg\min_\tau \sum_{i \in \{\mathcal{S}:X\in \mathcal{S}\}} \left(Y_i - \hat{m}(X_i) - \sum_{k=1}^{K} \left(1[T_i=k] - \hat\pi^{(k)}(X_i)\right) \tau^{(k)} \right)^2. \]

To reduce the variability of the estimators from a single tree, we repeat Steps 1 - 5 to generate a total of \(B\) trees, where the samples and covariates used for each tree are subsampled from the whole set. The final treatment effect for the \(i^{th}\) subject is thus the average of the outcomes from the \(B\) single trees \[ \underline{\bar{\hat{\tau}}_i} = (\bar{\hat\tau}_i^{(1)}, \bar{\hat\tau}_i^{(2)}, ..., \bar{\hat\tau}_i^{(K)}) \] and the recommended treatment is the one that provides the optimal effect: \(T^* = \{k: \bar{\hat\tau}_i^{(k)} = \max(\bar{\hat\tau}_i^{(1)}, \bar{\hat\tau}_i^{(2)}, ..., \bar{\hat\tau}_i^{(K)})\}\).
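The final aggregation is a plain average over trees followed by an argmax; a toy sketch with made-up per-tree estimates standing in for Steps 1 - 5:

```python
import numpy as np

rng = np.random.default_rng(4)
B, K = 50, 3
# tau_b[b, k]: tree b's estimate of tau^(k)(X_i) for one subject i,
# standing in for the output of Steps 1-5 on a subsample of data and covariates.
tau_b = np.array([0.2, 0.8, -0.1]) + rng.normal(0.0, 0.3, (B, K))

tau_bar = tau_b.mean(axis=0)              # average over the B trees
T_star = int(np.argmax(tau_bar)) + 1      # arms are indexed 1..K here
```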

4 Application

In this section, I will introduce the application of the proposed method to seeking the optimal antihypertensive therapies. Nowadays, hundreds of different types of antihypertensives are available on the market. The benefit of an antihypertensive varies across people due to different personal characteristics. To make things worse, for some patients only a combination of several antihypertensive drugs, which we call an antihypertensive therapy, is effective. Our target is to find the best antihypertensive therapy according to a patient's characteristics using our proposed method. In the following sections, we will introduce our data system, the definition of the cohort, and the potential problems when applying our method to real observational data.

4.1 Data Source

Our electronic medical record (EMR) data is collected from clinical records stored in the Indiana Network for Patient Care (INPC). The INPC includes most of the Regenstrief Medical Record System (RMRS) clinical data from Wishard Health Services, Indiana University (IU) hospitals, and Methodist Hospital. RMRS is one of the nation’s first electronic medical record systems and the keystone of many activities at Regenstrief Institute. The data in the RMRS have been used previously for health care research and management purposes.

INPC is a 17-year-old health information exchange operated in Indiana. Regenstrief Institute made a pioneering commitment to standards, interoperability, and the interchange of clinical data for clinical, public health, and research purposes. Investigators at the Regenstrief Institute created the Indianapolis Network for Patient Care (INPC) in 1995 with the goal of providing clinical information at the point of care for the treatment of patients. The INPC includes clinical data from over 80 hospitals, the public health departments, local laboratories and imaging centers, and a few large-group practices closely tied to hospital systems, and plans are to expand these efforts.

The INPC repository now carries 660 million discrete observations; 14.5 million text reports; 45 million radiology images; and 450,000 EKG tracings. These are now growing at the respective rates of 88 million, 2 million, 25 million, and 80,000 per year.

The Regenstrief data system standardizes all clinical data, and laboratory test results are mapped to a set of common test codes (LOINC) with standard units of measure for patient care, public health, and research purposes. Each institution has the same file structure in the Regenstrief system and shares the same term dictionary, which contains the codes, names (and other attributes) for tests, drugs, coded answers, etc.

4.2 Cohort Definition

We will identify the cohort from medical records stored in the INPC. We will use ICD-9 codes to identify cases of hypertension. To alleviate the concern about underdiagnosis of hypertension, we also examined recorded blood pressures in the medical records. An individual is considered hypertensive if his/her blood pressure meets the criteria for hypertension set forth by the Seventh Report of the Joint National Committee on Prevention, Detection, Evaluation, and Treatment of High Blood Pressure (JNC 7).

To meet the Standards for Privacy of Individually Identifiable Health Information (“Privacy Rules”) set by the Health Insurance Portability and Accountability Act of 1996, we sought and obtained permission from Indiana University Institutional Review Board to use de-identified data in this research. To comply with the privacy rules, we removed all identifiable information from the study data, including names, addresses, telephone numbers, social-security numbers, hospital identification numbers, etc. We also removed all calendar dates from the medical records and only kept calendar years of the recorded events such as medication prescriptions and dispensing, laboratory tests, outpatient and inpatient visits. To maintain the temporal sequence of recorded events, we designate the date of the initial diagnosis as the index date for that patient, i.e., time zero. All other record dates are indicated by the number of calendar days from the index date. We recognize that the determination of time of “initial” diagnosis is not foolproof as it is conceivable that the patient could have been diagnosed earlier than our medical records showed.

Our cohort of hypertension patients was extracted for the time period between 2004 and 2012. A total of 812,571 patients were identified as hypertensive. However, considering the incompleteness of the records, the total data available for use will be far less than this number, perhaps around 20,000 to 30,000 patients.

4.3 Potential Problems of EMR Data Application

  • Missing Data. Missing data is always an issue in observational studies, and it is even more severe for EMR data. For many variables, for example, the record of whether a patient has taken antibiotics recently, only a small portion (<1%) has the answer 'yes', while a large portion is missing. We need to make the assumption that those missing values are actually 'no'.

  • Drug Adherence. In observational studies, we have no way of knowing whether patients take their medicine regularly as required. So we adopt two frequently used adherence measures: the medication possession ratio (MPR) and the proportion of days covered (PDC). Both MPR and PDC are validated adherence measures accepted by the International Society for Pharmacoeconomics and Outcomes Research (ISPOR). We calculate MPR and PDC using the medication transaction and/or pharmacy claim data stored in the electronic records of the INPC.

    The drug-level adherence may be used as a covariate in the estimation of the propensity scores. Patient-level drug adherence may not make much sense as a personal characteristic, since we would not want to find therapies that work only for patients who do not adhere to the prescription; rather, we may use it to stratify the cohort and focus only on the sub-cohort with an appropriate level of drug adherence.
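For concreteness, a simple sketch of the two adherence measures computed from fill dates (days since the index date) and days supplied; the helper names are ours, and the definitions follow the usual conventions (PDC counts each calendar day at most once, while MPR sums the days supplied).

```python
import numpy as np

def pdc(fill_days, days_supply, period_days):
    """Proportion of days covered: fraction of days in [0, period_days)
    covered by at least one fill (overlapping fills count once)."""
    covered = np.zeros(period_days, dtype=bool)
    for start, supply in zip(fill_days, days_supply):
        covered[start:start + supply] = True
    return covered.mean()

def mpr(fill_days, days_supply, period_days):
    """Medication possession ratio: total days supplied / days in period
    (overlapping fills are double counted, so MPR can exceed 1)."""
    return sum(days_supply) / period_days

# Two 30-day fills that overlap by 10 days over a 60-day window:
pdc_val = pdc([0, 20], [30, 30], 60)   # days 0-49 covered -> 50/60
mpr_val = mpr([0, 20], [30, 30], 60)   # 60 days supplied / 60 days
```

The overlapping-fill example shows why the two measures can disagree, which is one reason both are commonly reported.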

  • Treatment Duration. The strict definition of a causal effect is the difference of the potential outcomes for the same person at the same time point after treatment. To accomplish the estimation, we have already loosened this constraint to compare the potential outcomes of different patients. For observational studies, we lose control over the treatment duration as well. This may raise some criticism, but considering the immediate effect of antihypertensives, the outcomes may not change much with the duration, which alleviates the issue.

Appendix

Proof of Equation (4): \[ \begin{aligned} \tau &\equiv E[Y^{(1)}] - E[Y^{(0)}] \\ &= E_X\{ E[Y^{(1)}|X] - E[Y^{(0)}|X]\} \\ &= E_X\{ E[Y^{(1)}|T = 1, X] - E[Y^{(0)}|T = 0, X]\} \ (Ignorability) \\ &= E_X\{ E[Y|T = 1, X] - E[Y|T = 0, X]\} \ (Consistency) \end{aligned} \] Proof of Equation (5): \[ \begin{aligned} \tau &= E_X\{ E[Y|T = 1, X] - E[Y|T = 0, X]\} \\ &= E_X\left\{ \frac{E[Y|T = 1, X]\cdot E[T = 1|X]}{E[T = 1|X]} \right\} - E_X\left\{ \frac{E[Y|T = 0, X]\cdot E[T = 0|X]}{E[T = 0|X]} \right\} \\ &= E_X \left\{ E \left[\frac{TY}{\pi(X)}|X\right] \right\} - E_X \left\{ E \left[\frac{(1-T)Y}{1-\pi(X)}|X\right] \right\} \\ &= E \left[\frac{TY}{\pi(X)} \right] - E \left[\frac{(1 - T)Y}{1-\pi(X)} \right] . \end{aligned} \]

References

Abrevaya, Jason, Yu-Chin Hsu, and Robert P Lieli. 2015. “Estimating Conditional Average Treatment Effects.” Journal of Business & Economic Statistics 33 (4): 485–505.

Athey, Susan, and Guido Imbens. 2016. “Recursive Partitioning for Heterogeneous Causal Effects.” Proceedings of the National Academy of Sciences 113 (27): 7353–60.

Athey, Susan, and Guido W Imbens. 2015. “Machine Learning Methods for Estimating Heterogeneous Causal Effects.” Stat 1050 (5): 1–26.

Athey, Susan, Julie Tibshirani, Stefan Wager, and others. 2019. “Generalized Random Forests.” The Annals of Statistics 47 (2): 1148–78.

Bang, Heejung, and James M Robins. 2005. “Doubly Robust Estimation in Missing Data and Causal Inference Models.” Biometrics 61 (4): 962–73.

Breiman, Leo. 1996. “Bagging Predictors.” Machine Learning 24 (2): 123–40.

———. 2017. Classification and Regression Trees. Routledge.

Chen, Shuai, Lu Tian, Tianxi Cai, and Menggang Yu. 2017. “A General Statistical Framework for Subgroup Identification and Comparative Treatment Scoring.” Biometrics 73 (4): 1199–1209.

Chipman, Hugh A, Edward I George, and Robert E McCulloch. 1998. “Bayesian Cart Model Search.” Journal of the American Statistical Association 93 (443): 935–48.

———. 2002. “Bayesian Treed Models.” Machine Learning 48 (1-3): 299–320.

Chipman, Hugh A, Edward I George, Robert E McCulloch, and others. 2010. “BART: Bayesian Additive Regression Trees.” The Annals of Applied Statistics 4 (1): 266–98.

Dorie, Vincent, Jennifer Hill, Uri Shalit, Marc Scott, Dan Cervone, and others. 2019. “Automated Versus Do-It-Yourself Methods for Causal Inference: Lessons Learned from a Data Analysis Competition.” Statistical Science 34 (1): 43–68.

Freedman, David A, and Richard A Berk. 2008. “Weighting Regressions by Propensity Scores.” Evaluation Review 32 (4): 392–409.

Freund, Yoav, Robert E Schapire, and others. 1996. “Experiments with a New Boosting Algorithm.” In Icml, 96:148–56. Citeseer.

Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics, 1189–1232.

Hill, Jennifer L. 2011. “Bayesian Nonparametric Modeling for Causal Inference.” Journal of Computational and Graphical Statistics 20 (1): 217–40.

Hirano, Keisuke, Guido W Imbens, and Geert Ridder. 2003. “Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score.” Econometrica 71 (4): 1161–89.

Imai, Kosuke, and Marc Ratkovic. 2014. “Covariate Balancing Propensity Score.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 (1): 243–63.

Imai, Kosuke, Marc Ratkovic, and others. 2013. “Estimating Treatment Effect Heterogeneity in Randomized Program Evaluation.” The Annals of Applied Statistics 7 (1): 443–70.

Imbens, Guido W, and Donald B Rubin. 2015. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.

Kang, Joseph DY, Joseph L Schafer, and others. 2007. “Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data.” Statistical Science 22 (4): 523–39.

Knaus, Michael, Michael Lechner, and Anthony Strittmatter. 2018. “Machine Learning Estimation of Heterogeneous Causal Effects: Empirical Monte Carlo Evidence.”

Künzel, Sören R, Jasjeet S Sekhon, Peter J Bickel, and Bin Yu. 2019. “Metalearners for Estimating Heterogeneous Treatment Effects Using Machine Learning.” Proceedings of the National Academy of Sciences 116 (10): 4156–65.

Neyman, Jerzy S. 1923. “On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9. (Translated and Edited by D. M. Dabrowska and T. P. Speed, Statistical Science (1990), 5, 465–480).” Annals of Agricultural Sciences 10: 1–51.

Nie, Xinkun, and Stefan Wager. 2017. “Quasi-Oracle Estimation of Heterogeneous Treatment Effects.” arXiv Preprint arXiv:1712.04912.

Pearl, Judea. 2009. Causality. Cambridge University Press.

Powers, Scott, Junyang Qian, Kenneth Jung, Alejandro Schuler, Nigam H Shah, Trevor Hastie, and Robert Tibshirani. 2018. “Some Methods for Heterogeneous Treatment Effect Estimation in High Dimensions.” Statistics in Medicine 37 (11): 1767–87.

Qian, Min, and Susan A Murphy. 2011. “Performance Guarantees for Individualized Treatment Rules.” Annals of Statistics 39 (2): 1180.

Robins, James M, Miguel Angel Hernan, and Babette Brumback. 2000. “Marginal Structural Models and Causal Inference in Epidemiology.” LWW.

Robinson, Peter M. 1988. “Root-N-Consistent Semiparametric Regression.” Econometrica: Journal of the Econometric Society, 931–54.

Rosenbaum, Paul R, and Donald B Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55.

Rubin, Donald B. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology 66 (5): 688.

———. 1980. “Randomization Analysis of Experimental Data: The Fisher Randomization Test Comment.” Journal of the American Statistical Association 75 (371): 591–93.

Signorovitch, James Edward. 2007. “Identifying Informative Biological Markers in High-Dimensional Genomic Data and Clinical Trials.” PhD thesis, Harvard University.

Su, Xiaogang, Chih-Ling Tsai, Hansheng Wang, David M Nickerson, and Bogong Li. 2009. “Subgroup Analysis via Recursive Partitioning.” Journal of Machine Learning Research 10 (Feb): 141–58.

Tian, Lu, Ash A Alizadeh, Andrew J Gentles, and Robert Tibshirani. 2014. “A Simple Method for Estimating Interactions Between a Treatment and a Large Number of Covariates.” Journal of the American Statistical Association 109 (508): 1517–32.

Wager, Stefan, and Susan Athey. 2018. “Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests.” Journal of the American Statistical Association 113 (523): 1228–42.

Zeileis, Achim, Torsten Hothorn, and Kurt Hornik. 2008. “Model-Based Recursive Partitioning.” Journal of Computational and Graphical Statistics 17 (2): 492–514.

Zhang, Baqun, Anastasios A Tsiatis, Eric B Laber, and Marie Davidian. 2012. “A Robust Method for Estimating Optimal Treatment Regimes.” Biometrics 68 (4): 1010–8.

Zhao, Yingqi, Donglin Zeng, A John Rush, and Michael R Kosorok. 2012. “Estimating Individualized Treatment Rules Using Outcome Weighted Learning.” Journal of the American Statistical Association 107 (499): 1106–18.