Notice: the causal inference described here follows the Potential Outcome Framework (Neyman 1923; Rubin 1974). For the other major camp of causal inference studies, i.e., the Structural Causal Model (SCM) framework, please refer to Pearl (2009).
Heterogeneous Treatment Effect Estimation
1. Introduction
The need for causal inference arises in many different fields. Economists want to quantify the causal effect of an interest rate cut on the economy. Trade policy makers want to know whether increased tariffs cause changes in the trade deficit. Healthcare providers want to assess whether a therapy causes changes in specific patient outcomes. In this proposal, I examine the estimation of the causal effects of therapeutic treatments using observational data in a population where the treatment effects vary by observed patient characteristics. Such estimation is essential for the practice of precision medicine, which by definition targets patients with specific characteristics to maximize the therapeutic benefit.
This work is proposed within the Neyman-Rubin potential outcome framework (Neyman 1923; Rubin 1974; Imbens and Rubin 2015). An essential feature of the potential outcome concept is that a given individual could have multiple potential outcomes, each under a different treatment; but since the subject receives only one treatment, only one of these potential outcomes is realized. The inability to observe all potential outcomes for the same subject makes it difficult to assess the treatment effect at the individual level.
Denote the potential outcome from treatment
Assumption 1. SUTVA
1) Treatment applied to one unit does not affect the outcome for other units;
2) There is no hidden variation in treatment, for example, different forms or versions of each treatment level.
and then the observed outcome for unit
The causal (or treatment) effect is defined by the comparison between potential outcomes. For unit
For any observed potential outcome, we need to consider why this outcome was realized instead of the others. Clearly, if the treatment is not randomly assigned, there must be some reason for a unit to receive that treatment, which means the treatment assignment is informative. Therefore, the estimation of the ATE or CATE differs under different assignment mechanisms. Randomized experiments are discussed in subsection 1.1 and observational studies in subsection 1.2.
1.1 Randomized Experiments
In a randomized experiment, the treatment assignment is completely random, which means
In order to assess subgroup treatment effects, we need to estimate the CATE. The estimation of CATE in randomized experiments is a simplified version of that in observational studies, in which the propensity score (introduced in subsection 1.2) is replaced by a constant.
1.2 Observational Studies
In observational studies, the treatment assignment is related to the outcome and no longer fully random, which means
Assumption 2. Unconfoundedness
Assumption 3. Common Support
With assumption (2), consistency could be deduced
Almost every work aimed at estimating the ATE or CATE starts from one of the above equations. For example, equation (4) and CATE1 form the foundation of conditional mean response methods, whereby the target is transformed into estimating the conditional mean response
The focus of this dissertation is the estimation of CATE in observational studies.
2 Existing Methods
Matching is the most straightforward and original way of estimating the CATE. Under Assumptions 1, 2, and 3, for an individual, as long as we can find another unit with exactly the same covariates
Since it is almost impossible to find a perfect match, especially in a high-dimensional covariate space, researchers have sought ways to match only on important covariates to ease the process. This branch of studies culminated in the milestone work of Rosenbaum and Rubin (1983), who showed that matching on the propensity score alone is sufficient
However, with paired matching, one always suffers from the larger variance that comes from a single estimate. This leads to the subclassification approach, which seeks a group of units sharing similar (or, in the perfect scenario, identical) propensity scores. As long as at least one unit belongs to the other treatment group, the difference between the averaged outcomes within each subgroup equals the desired treatment effect at the given propensity score. Taking the average helps reduce the estimation error, but if there are not enough samples at each propensity level, the benefit is quite limited.
Matching or subclassification based on the propensity score is tailored mainly to the estimation of the ATE. After matching/subclassification at each covariate level, the average is taken over all samples to obtain the final ATE
Hirano, Imbens, and Ridder (2003) exploit equation (5) to estimate the ATE directly, which only requires plugging in the pre-estimated propensity score
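For concreteness, the snippet below is a minimal sketch of such a plug-in inverse-propensity-weighting estimator of the ATE; the function name and arguments are illustrative, and `e_hat` stands for whatever pre-estimated propensity scores are plugged in.

```python
import numpy as np

def ipw_ate(y, t, e_hat):
    """Inverse-propensity-weighted estimate of the average treatment effect.

    y     : observed outcomes
    t     : binary treatment indicators (0/1)
    e_hat : pre-estimated propensity scores P(T = 1 | X)
    """
    y, t, e_hat = map(np.asarray, (y, t, e_hat))
    # Weight treated outcomes by 1/e(X) and control outcomes by 1/(1 - e(X)).
    return np.mean(t * y / e_hat - (1 - t) * y / (1 - e_hat))
```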
To summarize, to estimate the CATE, or to recommend the optimal treatment, there are two main problems to solve: 1) how to define the loss metric and 2) how to model and obtain the estimators and/or nuisance parameters in the loss function. The loss function can be defined in either a direct or an indirect way, which will be discussed in subsection 2.1. For example, in the methods described by equation (10), the loss function actually measures the loss in the estimated propensity score, which is an indirect way. Moreover, a proper modeling and estimation of
The structure of this section is as follows. The loss metrics frequently adopted in the current literature are discussed in subsection 2.1. In subsection 2.2, we demonstrate the estimation algorithms used to solve the loss functions.
2.1 Loss Metric
The most distinctive and difficult aspect of causal inference is that the true IATE can never be observed. So no matter which estimation algorithm is used, whether parametric or non-parametric, a direct evaluation of the loss in the estimated treatment effect
Indirect Loss Function
Propensity Score Based Methods
The aforementioned matching, weighting, and subclassification methods belong to this category. Among these methods, an accurate estimation of the propensity score is essential, but estimating
Conditional Mean Based Methods
Another set of straightforward methods follows directly from equation (6), whereby the estimation of the CATE
Any machine learning algorithm could be used to fit the conditional means, for example, kernel regression models with
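As a hedged illustration of this conditional-mean route (a "T-learner" style fit-then-difference scheme), the sketch below fits one outcome model per arm with an off-the-shelf learner and differences the predictions; the random forest learner and all names are placeholders, not the specific models discussed above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def t_learner_cate(x, y, t, x_new):
    """Fit one conditional-mean model per arm and difference the predictions."""
    x, y, t = np.asarray(x), np.asarray(y), np.asarray(t)
    mu1 = RandomForestRegressor().fit(x[t == 1], y[t == 1])  # E[Y | X, T = 1]
    mu0 = RandomForestRegressor().fit(x[t == 0], y[t == 0])  # E[Y | X, T = 0]
    return mu1.predict(x_new) - mu0.predict(x_new)           # CATE estimate at x_new
```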
Propensity Score + Conditional Mean
Most research works involve estimating both the response surface/conditional mean and the treatment assignment mechanism/propensity score in observational studies. For example, Robins, Hernan, and Brumback (2000) propose a weighted least squares (WLS) regression estimator for the conditional mean response, with weights equal to the inverse of the propensity score. However, for these methods, if the model of
Individual Treatment Rule (ITR) Based Methods
In many applications, an accurate treatment effect estimate may not be the primary interest; people only want to know which treatment is best for an individual. This leads to a different idea: instead of estimating the treatment effect directly, one seeks the optimal treatment assignment. Individual treatment rule (ITR) based methods pursue the best overall outcome. If larger
One challenge in ITR based methods is the optimization of a non-convex and discontinuous objective function, which is usually not easy to solve with off-the-shelf algorithms. Besides, an estimator of the CATE is not readily obtained as a by-product, which may be a drawback if the CATE is wanted as well. Qian and Murphy (2011) use the plain version and avoid the computational issue by first obtaining conditional mean estimators and then determining the optimal region
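As a hedged sketch of the value criterion that ITR methods target, the snippet below computes the standard inverse-propensity-weighted estimate of the mean outcome that would be attained if everyone followed a candidate rule; the names are illustrative and this is not presented as Qian and Murphy's exact estimator.

```python
import numpy as np

def ipw_value(y, t, e_hat, d):
    """IPW estimate of the mean outcome under a candidate binary rule d (0/1 per unit)."""
    y, t, e_hat, d = map(np.asarray, (y, t, e_hat, d))
    # Probability of receiving the treatment actually observed.
    p_obs = t * e_hat + (1 - t) * (1 - e_hat)
    follow = (t == d).astype(float)      # indicator that the observed treatment follows the rule
    return np.mean(follow * y / p_obs)
```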
Direct Loss Function Approximation
The main idea is that even though we cannot observe
Modified Outcome Methods (MOM)
Suppose there exists
Signorovitch (2007) first explores this idea in his PhD dissertation, based on the relationship
Similarly, adopting equation (12), we can develop the doubly robust (DR) version of MOM. We do not further explore the possible proxies here. The main issue with MOM is that the outcome is required to be continuous, so the approach cannot be extended to other types of outcomes. Therefore, MCM methods were proposed to fix this issue.
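As a hedged illustration of the modified outcome idea, the sketch below constructs one commonly used IPW-type proxy outcome, whose conditional mean equals the CATE, and regresses it on the covariates; the proxy shown is a standard choice and is not necessarily the exact proxy displayed above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def mom_ipw_cate(x, y, t, e_hat):
    """Regress an IPW modified outcome on X; its conditional mean is the CATE."""
    y, t, e_hat = map(np.asarray, (y, t, e_hat))
    # Proxy outcome whose conditional expectation given X equals tau(X).
    y_star = t * y / e_hat - (1 - t) * y / (1 - e_hat)
    # Any regression learner can be used here; tau_hat(x) = model.predict(x).
    return RandomForestRegressor().fit(x, y_star)
```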
Modified Covariate Methods (MCM)
MCM was first developed by Tian et al. (2014), who modified MOM methods by slightly adjusting the proxy loss function
For example, the MCM version of the loss function (13) is thus
Note that only the linear structure on
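For reference, a commonly cited form of the MCM objective for a continuous outcome, written in our own notation with $T_i\in\{-1,1\}$, propensity $\pi(x)=P(T=1\mid X=x)$, and working function $f$, is

$$
\hat{f} = \arg\min_{f}\; \frac{1}{n}\sum_{i=1}^{n} w_i\Big(Y_i - \tfrac{1}{2}\,T_i\, f(X_i)\Big)^{2},
\qquad
w_i = \frac{1}{\pi(X_i)\,\mathbb{1}\{T_i=1\} + \{1-\pi(X_i)\}\,\mathbb{1}\{T_i=-1\}},
$$

whose population minimizer satisfies $f^{*}(x)=E[Y\mid T=1,X=x]-E[Y\mid T=-1,X=x]$. This is offered only as a hedged rendering and may differ in detail from the display referenced above.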
R-learner
The R-learner (Nie and Wager 2017) is the only method that does not directly rely on equations (4)-(9). Technically, however, the R-learner can still be regarded as a MOM based method. We highlight it here because it differs from the previous methods and, more importantly, it builds the technical foundation of our proposed method. The R-learner is mainly based on Robinson's (1988) transformation
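In our notation, with $m(x)=E[Y\mid X=x]$, $e(x)=P(T=1\mid X=x)$, and $\tau(x)$ the CATE, the R-learner objective built on this transformation can be written as

$$
\hat{\tau}(\cdot)=\arg\min_{\tau}\;\Bigg\{\frac{1}{n}\sum_{i=1}^{n}\Big[\big(Y_i-\hat{m}(X_i)\big)-\big(T_i-\hat{e}(X_i)\big)\,\tau(X_i)\Big]^{2}+\Lambda_n\big(\tau(\cdot)\big)\Bigg\},
$$

where $\hat{m}$ and $\hat{e}$ are (typically cross-fitted) estimates of the nuisance functions and $\Lambda_n(\cdot)$ is a regularization term.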
Overall, direct loss function based methods are usually more popular, as they directly measure the loss of
2.2 Estimation Algorithm
Constructing the objective loss function transforms the causal inference problem in observational studies into a mathematical optimization problem. So the next step is to obtain the estimators that appear in the loss function, namely
Parametric Methods
The estimation is generally carried out with generalized linear models (GLMs). For instance, it is very common to estimate the propensity score by a logistic regression
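A minimal sketch of this step, assuming a numeric design matrix `x` and a 0/1 treatment vector `t`; the scikit-learn call is illustrative:

```python
from sklearn.linear_model import LogisticRegression

def fit_propensity_logistic(x, t):
    """Fit P(T = 1 | X) by logistic regression and return e_hat for each row of x."""
    model = LogisticRegression(max_iter=1000).fit(x, t)
    return model.predict_proba(x)[:, 1]
```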
Kernel-based Methods
As one of the most popular non-parametric methods, the kernel-based model is readily used for estimating the propensity score, for example,
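Since the display that followed is missing here, the snippet below gives a hedged sketch of one such kernel smoother: a Nadaraya-Watson estimate of the propensity score with a Gaussian kernel and an illustrative bandwidth `h`.

```python
import numpy as np

def kernel_propensity(x_train, t_train, x_query, h=1.0):
    """Nadaraya-Watson estimate of P(T = 1 | X = x) with a Gaussian kernel.

    x_train, x_query : 2-D arrays of covariates; t_train : 0/1 treatment indicators.
    """
    x_train, t_train, x_query = map(np.asarray, (x_train, t_train, x_query))
    # Pairwise squared distances between query points and training points.
    d2 = ((x_query[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=-1)
    w = np.exp(-d2 / (2 * h ** 2))          # Gaussian kernel weights
    return (w * t_train).sum(axis=1) / w.sum(axis=1)   # locally weighted average of T
```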
Tree-based Methods
Athey and Imbens (2015), Athey and Imbens (2016), Wager and Athey (2018), and Athey et al. (2019) provide a very comprehensive view of incorporating tree-related algorithms (CART, Breiman (2017); bagging, Breiman (1996); boosting, Freund, Schapire, and others (1996); gradient boosting, Friedman (2001)) into the estimation of the CATE. These methods can be further split into two branches: regression trees and causal trees. In our framework, regression trees are deemed to use an indirect loss metric during tree construction, while causal trees adopt a direct loss metric.
The regression trees based methods follow the idea of estimating
Aside from estimating conditional means by traditional regression tree methods, the Bayesian Additive Regression Trees (BART) method was also adopted in Hill (2011). BART (Chipman, George, and McCulloch 1998, 2002; Chipman et al. 2010) has a structure similar to boosted trees (Freund, Schapire, and others 1996), but each tree is considered a function mapping covariates to outcomes. So each tree is a model, and the tree structures, including the depth of the tree and the mean values at the terminal leaves, are the model parameters. Under the assumption that units in the same leaf are normally distributed, the likelihood for a set of trees can be written out, and with proper prior assumptions on the parameters, the posterior distribution is computed via an iterative Bayesian backfitting MCMC algorithm. Even though this method adopts a T-learner, an indirect loss metric, it is one of the top performers (based on bias and RMSE) among 15 black-box algorithms in estimating the CATE across 20 data sets, according to Dorie et al. (2019).
Causal trees, on the other hand, no longer target homogeneous outcomes during tree construction but instead target the causal effect. Therefore, the loss functions minimized at each step are direct loss metrics. For instance, consider the MOM-IPW case (Athey and Imbens 2015; Powers et al. 2018),
This direct application is theoretically straightforward, but it may introduce some bias if the pre-estimated propensity score is directly plugged in to adjust the outcomes of the subpopulation in a single leaf. So a weighted-average way of estimating
Notice that the above estimation methods can be used simultaneously to solve a single objective function. For example, Nie and Wager (2017) use a logistic regression with LASSO to estimate the propensity score and a linear model for
Support Vector Machine (SVM)
SVM is only used to solve the ITR problem, owing to its strength in handling discontinuous loss functions (Zhao et al. 2012; Imai, Ratkovic, and others 2013). But SVMs are usually hard to extend to multiple-treatment scenarios, so we will not explore SVM further, considering that our target is the recommendation of the optimal treatment among a number of alternatives.
3 Proposed Method
The current methods mainly address the two-arm scenario, treatment versus control. Even though in much of the literature authors claim that their methods can be extended to multiple-treatment cases, none of them has actually implemented such an extension. Our proposed method is tailored for this more complicated scenario and at the same time provides an alternative to the existing methods. Following the structure of Section 2, the proposed method will also be presented in two parts: loss metric and estimation algorithm.
Following the notation in Section 1, assuming there are a total of
3.1 Proposed Loss Metric
We extend the R-learner (Nie and Wager 2017) to the multiple-treatment scenario following Robinson's decomposition (Robinson 1988). The reasons we adopt this loss metric are that 1) it directly targets minimizing the estimation error of
Under Assumption 1,
3.2 Proposed Estimation Algorithm
To accomplish the optimization, we adopt a two-step estimation process according to the shape of the loss function. First we need to estimate
In order to handle the potentially high-dimensional structure of the data, we adopt a Generalized Additive Model (GAM) with a B-spline basis for the estimation of both the conditional mean and the propensity score. For modeling the causal treatment effect, we combine the GAM with a tree-based learning algorithm to utilize the strengths of both methods.
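A minimal sketch of the additive B-spline idea for the conditional mean, assuming a numeric covariate matrix `x` and outcome vector `y`; the scikit-learn pipeline is a stand-in for whatever GAM software is ultimately used:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import RidgeCV

def fit_additive_spline_mean(x, y):
    """Additive B-spline fit for the conditional mean E[Y | X]."""
    # Expand each covariate into a cubic B-spline basis, then fit a penalized
    # linear model on the stacked basis columns (an additive structure).
    model = make_pipeline(
        SplineTransformer(degree=3, n_knots=8),
        RidgeCV(),  # ridge penalty guards against overfitting of the spline basis
    )
    return model.fit(x, y)
```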
Estimation of Conditional Mean Outcome
Suppose there are
Estimation of Propensity Score
For each propensity score, due to the binary outcomes, we adopt the logit link function
Proposed Models for Causal Treatment Effects
Nie and Wager (2017) simply assume a linear structure of
In the two-step estimation procedure, after obtaining
Instead of directly optimizing the loss function (16) with a LASSO algorithm as in Nie and Wager (2017), we adopt a tree-based recursive partitioning algorithm to find the optimal causal treatment effect. The idea is to recursively split nodes to grow a tree whose leaves are homogeneous upon completion; within each node, we estimate the treatment effects by equation (17). The details of generating trees are further demonstrated in Section 3.3.
The main reason for adopting a tree-based algorithm is that its natural tendency to generate homogeneous leaves and assign heterogeneous units to different leaves perfectly matches our requirement of seeking heterogeneous treatment effects. Our ultimate goal is to find the optimal covariate-specific treatment, rather than the most accurate estimate of the treatment effects, which is almost impossible. So we prefer not to solve equation (16) directly by LASSO but to subclassify homogeneous patients.
Our proposed method also differs from the existing tree-based algorithms. Causal tree based algorithms split a node according to the estimate of
Aside from those advantages compared to the existing methods, adopting a tree structure helps alleviate the potential problems of GAM. GAM is known for its great flexibility but has several limitations: 1) its additive structure is not able to account for interaction effects; 2) GAM with B-spline regression is prone to overfitting. For the first issue, the tree structure can capture the non-linearity even
3.3 Estimation Process
Denote
Step 1: Split the current node into two parts, a left subnode and a right subnode, at a given value of one covariate. The samples in the left and right subnodes are denoted by two sets, each with its own size. At each subnode, we first fit the working models and then obtain the model parameters by solving equation (16). The individual treatment effects are then obtained from equation (17); we call them ‘pseudo’ causal effects, since they are mainly used to facilitate the tree build-up. Denote the obtained ‘pseudo’ outcomes in the left subnode and in the right subnode accordingly.

Step 2: Calculate the two-sample Hotelling's T-squared statistic to test the null hypothesis that the average treatment effect is the same in the left and right subnodes. Since Hotelling's two-sample T-squared statistic is related to the F-distribution, the p-value of the test is computed accordingly. However, a simple comparison of p-values across all possible split values is not fair unless we adjust for the total number of distinct values of the covariate (to handle multiple comparisons) and the depth of the node (to pursue a balanced tree and avoid overfitting). We therefore recommend using the adjusted logworth. After going through all the potential split points, the one providing the largest adjusted logworth is adopted as the next split.

Step 3: Repeat Steps 1 and 2 until no further split can be made, or a certain stopping rule is met.
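As a hedged sketch of the split criterion in Steps 1-2, the snippet below computes the two-sample Hotelling's T-squared statistic on the ‘pseudo’ effects in the two candidate subnodes, converts it to an F-based p-value, and returns a logworth; the Bonferroni-style adjustment across candidate split points is one plausible choice, not necessarily the exact adjustment proposed here.

```python
import numpy as np
from scipy import stats

def split_logworth(tau_left, tau_right, n_candidate_splits=1):
    """Hotelling's two-sample T^2 test on multivariate 'pseudo' effects,
    returned as an adjusted logworth = -log10(p-value) - log10(#candidate splits)."""
    tau_left, tau_right = np.atleast_2d(tau_left), np.atleast_2d(tau_right)
    n1, p = tau_left.shape
    n2, _ = tau_right.shape
    diff = tau_left.mean(axis=0) - tau_right.mean(axis=0)
    # Pooled covariance matrix of the pseudo effects.
    s_pool = ((n1 - 1) * np.cov(tau_left, rowvar=False) +
              (n2 - 1) * np.cov(tau_right, rowvar=False)) / (n1 + n2 - 2)
    s_pool = np.atleast_2d(s_pool)
    t2 = (n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(s_pool, diff)
    # T^2 relates to an F distribution with (p, n1 + n2 - 1 - p) degrees of freedom.
    f_stat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    p_value = stats.f.sf(f_stat, p, n1 + n2 - 1 - p)
    # Illustrative multiple-comparison adjustment across candidate split points.
    return -np.log10(p_value) - np.log10(n_candidate_splits)
```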
Step 4: After completing the growth of the tree, we need to prune the tree to the right size to avoid overfitting. Pruning and cross-validation can be a burden when statistical tests are used for splitting (Zeileis, Hothorn, and Hornik 2008; Athey and Imbens 2016). In our work, we propose to apply a different criterion for validation and pruning in order to achieve ‘honest’ trees as proposed by Athey and Imbens (2016). Instead of the T-squared statistic, we adopt a direct measure of treatment effect
to evaluate whether a split helps to maximize the overall benefit. Splits that do not improve the treatment effects in the validation set are removed.

Step 5: For the final tree, each terminal node defines a sub-population with homogeneous treatment effects. Thus, the covariate-specific treatment effects among all treatments can be treated as constants instead of being modeled by GAM, and are estimated by minimizing the loss function
To reduce the variability of the estimators from a single tree, we repeat Steps 1-5 to generate a total of
4 Application
In this section, I introduce the application of the proposed method to seeking the optimal antihypertensive therapies. Nowadays, hundreds of different types of antihypertensives are available on the market. The benefit of antihypertensives varies across people owing to different personal characteristics. To make things worse, for some patients, only a combination of several antihypertensive drugs, which we call an antihypertensive therapy, is effective. Our target is to find the best antihypertensive therapy according to the patient's characteristics using our proposed method. In the following sections, we introduce our data system, the cohort definition, and potential problems when applying our method to real observational data.
4.1 Data Source
Our electronic medical record (EMR) data is collected from clinical records stored in the Indiana Network for Patient Care (INPC). The INPC includes most of the Regenstrief Medical Record System (RMRS) clinical data from Wishard Health Services, Indiana University (IU) hospitals, and Methodist Hospital. RMRS is one of the nation’s first electronic medical record systems and the keystone of many activities at Regenstrief Institute. The data in the RMRS have been used previously for health care research and management purposes.
INPC is a 17-year-old health information exchange operated in Indiana. Regenstrief Institute made a pioneering commitment to standards, interoperability, and the interchange of clinical data for clinical, public health, and research purposes. Investigators at the Regenstrief Institute created the Indianapolis Network for Patient Care (INPC) in 1995 with the goal of providing clinical information at the point of care for the treatment of patients. The INPC includes clinical data from over 80 hospitals, the public health departments, local laboratories and imaging centers, and a few large-group practices closely tied to hospital systems, and plans are to expand these efforts.
The INPC repository now carries 660 million discrete observations; 14.5 million text reports; 45 million radiology images; and 450,000 EKG tracings. These are now growing at the respective rates of 88 million, 2 million, 25 million, and 80,000 per year.
The Regenstrief data system standardizes all clinical data, and laboratory test results are mapped to a set of common test codes (LOINC) with standard units of measure for patient care, public health, and research purposes. Each institution has the same file structure in the Regenstrief system and shares the same term dictionary, which contains the codes, names (and other attributes) for tests, drugs, coded answers, etc.
4.2 Cohort Definition
We will identify the cohort from medical records stored in the INPC. We will use ICD-9 codes to identify cases of hypertension. To alleviate the concern about underdiagnosis of hypertension, we will also examine recorded blood pressures in the medical records. An individual is considered hypertensive if his/her blood pressure meets the criteria for hypertension set forth by the Seventh Report of the Joint National Committee on Prevention, Detection, Evaluation, and Treatment of High Blood Pressure (JNC 7).
To meet the Standards for Privacy of Individually Identifiable Health Information (“Privacy Rules”) set by the Health Insurance Portability and Accountability Act of 1996, we sought and obtained permission from Indiana University Institutional Review Board to use de-identified data in this research. To comply with the privacy rules, we removed all identifiable information from the study data, including names, addresses, telephone numbers, social-security numbers, hospital identification numbers, etc. We also removed all calendar dates from the medical records and only kept calendar years of the recorded events such as medication prescriptions and dispensing, laboratory tests, outpatient and inpatient visits. To maintain the temporal sequence of recorded events, we designate the date of the initial diagnosis as the index date for that patient, i.e., time zero. All other record dates are indicated by the number of calendar days from the index date. We recognize that the determination of time of “initial” diagnosis is not foolproof as it is conceivable that the patient could have been diagnosed earlier than our medical records showed.
Our cohort of hypertension patients was extracted for a five-year time period between 2004 and 2012. A total of 812,571 patients were identified as hypertensive. But considering data incompleteness, the total data available for use will be far less than this number, perhaps around 20k to 30k patients.
4.3 Potential Problems of EMR Data Application
Missing Data. Missing data is always an issue in observational studies, and it is even more severe for EMR data. For many variables, for example, the record of whether a patient has recently had an antibiotic, only a small portion (<1%) has the answer ‘yes’, while a large portion is missing. We need to make a proper assumption that the missing values are actually ‘no’.
Drug Adherence. In observational studies, we have no idea whether patients take their medicine regularly as required. So we adopt two frequently used adherence measures: the medication possession ratio (MPR) and the proportion of days covered (PDC). Both MPR and PDC are validated adherence measures accepted by the International Society for Pharmacoeconomics and Outcomes Research (ISPOR). We calculate MPR and PDC using medication transaction and/or pharmacy claim data stored in the electronic records of the INPC.
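A minimal sketch of how these two measures are commonly computed from dispensing records, assuming each fill is recorded as an integer day offset plus an integer days'-supply value within an observation window; the exact operational definitions used in the study may differ.

```python
import numpy as np

def mpr_and_pdc(fill_days, days_supply, window_days):
    """fill_days: nonnegative day offsets of each fill from the window start;
    days_supply: days' supply dispensed at each fill; window_days: window length."""
    # MPR: total days' supply dispensed divided by the length of the window.
    mpr = np.sum(days_supply) / window_days
    # PDC: fraction of distinct days in the window covered by at least one fill.
    covered = np.zeros(window_days, dtype=bool)
    for start, supply in zip(fill_days, days_supply):
        covered[start:start + supply] = True
    pdc = covered.mean()
    return mpr, pdc
```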
Drug-level adherence may be used as a covariate in the estimation of the propensity score. Patient-level drug adherence, however, may not make sense as a personal characteristic, since we would not want to find therapies that work for patients who do not adhere to the prescription; rather, we may use it to stratify the cohort and focus only on the sub-cohort with an appropriate level of drug adherence.
Treatment Duration. The strict definition of a causal effect measures the difference in potential outcomes for the same person at the same time point after treatment. To accomplish the estimation, we have already loosened this constraint to compare potential outcomes from different patients. But in observational studies, we lose control over the treatment duration as well. This may raise some criticism, but considering the immediate effect of antihypertensives, the outcomes may not change much with duration, which alleviates the issue.
References
Abrevaya, Jason, Yu-Chin Hsu, and Robert P Lieli. 2015. “Estimating Conditional Average Treatment Effects.” Journal of Business & Economic Statistics 33 (4): 485–505.
Athey, Susan, and Guido Imbens. 2016. “Recursive Partitioning for Heterogeneous Causal Effects.” Proceedings of the National Academy of Sciences 113 (27): 7353–60.
Athey, Susan, and Guido W Imbens. 2015. “Machine Learning Methods for Estimating Heterogeneous Causal Effects.” Stat 1050 (5): 1–26.
Athey, Susan, Julie Tibshirani, Stefan Wager, and others. 2019. “Generalized Random Forests.” The Annals of Statistics 47 (2): 1148–78.
Bang, Heejung, and James M Robins. 2005. “Doubly Robust Estimation in Missing Data and Causal Inference Models.” Biometrics 61 (4): 962–73.
Breiman, Leo. 1996. “Bagging Predictors.” Machine Learning 24 (2): 123–40.
———. 2017. Classification and Regression Trees. Routledge.
Chen, Shuai, Lu Tian, Tianxi Cai, and Menggang Yu. 2017. “A General Statistical Framework for Subgroup Identification and Comparative Treatment Scoring.” Biometrics 73 (4): 1199–1209.
Chipman, Hugh A, Edward I George, and Robert E McCulloch. 1998. “Bayesian Cart Model Search.” Journal of the American Statistical Association 93 (443): 935–48.
———. 2002. “Bayesian Treed Models.” Machine Learning 48 (1-3): 299–320.
Chipman, Hugh A, Edward I George, Robert E McCulloch, and others. 2010. “BART: Bayesian Additive Regression Trees.” The Annals of Applied Statistics 4 (1): 266–98.
Dorie, Vincent, Jennifer Hill, Uri Shalit, Marc Scott, Dan Cervone, and others. 2019. “Automated Versus Do-It-Yourself Methods for Causal Inference: Lessons Learned from a Data Analysis Competition.” Statistical Science 34 (1): 43–68.
Freedman, David A, and Richard A Berk. 2008. “Weighting Regressions by Propensity Scores.” Evaluation Review 32 (4): 392–409.
Freund, Yoav, Robert E Schapire, and others. 1996. “Experiments with a New Boosting Algorithm.” In Icml, 96:148–56. Citeseer.
Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics, 1189–1232.
Hill, Jennifer L. 2011. “Bayesian Nonparametric Modeling for Causal Inference.” Journal of Computational and Graphical Statistics 20 (1): 217–40.
Hirano, Keisuke, Guido W Imbens, and Geert Ridder. 2003. “Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score.” Econometrica 71 (4): 1161–89.
Imai, Kosuke, and Marc Ratkovic. 2014. “Covariate Balancing Propensity Score.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 (1): 243–63.
Imai, Kosuke, Marc Ratkovic, and others. 2013. “Estimating Treatment Effect Heterogeneity in Randomized Program Evaluation.” The Annals of Applied Statistics 7 (1): 443–70.
Imbens, Guido W, and Donald B Rubin. 2015. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.
Kang, Joseph DY, Joseph L Schafer, and others. 2007. “Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data.” Statistical Science 22 (4): 523–39.
Knaus, Michael, Michael Lechner, and Anthony Strittmatter. 2018. “Machine Learning Estimation of Heterogeneous Causal Effects: Empirical Monte Carlo Evidence.”
Künzel, Sören R, Jasjeet S Sekhon, Peter J Bickel, and Bin Yu. 2019. “Metalearners for Estimating Heterogeneous Treatment Effects Using Machine Learning.” Proceedings of the National Academy of Sciences 116 (10): 4156–65.
Neyman, Jerzy S. 1923. “On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9 (translated and edited by D. M. Dabrowska and T. P. Speed, Statistical Science (1990), 5, 465–480).” Annals of Agricultural Sciences 10: 1–51.
Nie, Xinkun, and Stefan Wager. 2017. “Quasi-Oracle Estimation of Heterogeneous Treatment Effects.” arXiv Preprint arXiv:1712.04912.
Pearl, Judea. 2009. Causality. Cambridge university press.
Powers, Scott, Junyang Qian, Kenneth Jung, Alejandro Schuler, Nigam H Shah, Trevor Hastie, and Robert Tibshirani. 2018. “Some Methods for Heterogeneous Treatment Effect Estimation in High Dimensions.” Statistics in Medicine 37 (11): 1767–87.
Qian, Min, and Susan A Murphy. 2011. “Performance Guarantees for Individualized Treatment Rules.” Annals of Statistics 39 (2): 1180.
Robins, James M, Miguel Angel Hernan, and Babette Brumback. 2000. “Marginal Structural Models and Causal Inference in Epidemiology.” Epidemiology 11 (5): 550–60.
Robinson, Peter M. 1988. “Root-N-Consistent Semiparametric Regression.” Econometrica: Journal of the Econometric Society, 931–54.
Rosenbaum, Paul R, and Donald B Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55.
Rubin, Donald B. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology 66 (5): 688.
———. 1980. “Randomization Analysis of Experimental Data: The Fisher Randomization Test Comment.” Journal of the American Statistical Association 75 (371): 591–93.
Signorovitch, James Edward. 2007. “Identifying Informative Biological Markers in High-Dimensional Genomic Data and Clinical Trials.” PhD thesis, Harvard University.
Su, Xiaogang, Chih-Ling Tsai, Hansheng Wang, David M Nickerson, and Bogong Li. 2009. “Subgroup Analysis via Recursive Partitioning.” Journal of Machine Learning Research 10 (Feb): 141–58.
Tian, Lu, Ash A Alizadeh, Andrew J Gentles, and Robert Tibshirani. 2014. “A Simple Method for Estimating Interactions Between a Treatment and a Large Number of Covariates.” Journal of the American Statistical Association 109 (508): 1517–32.
Wager, Stefan, and Susan Athey. 2018. “Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests.” Journal of the American Statistical Association 113 (523): 1228–42.
Zeileis, Achim, Torsten Hothorn, and Kurt Hornik. 2008. “Model-Based Recursive Partitioning.” Journal of Computational and Graphical Statistics 17 (2): 492–514.
Zhang, Baqun, Anastasios A Tsiatis, Eric B Laber, and Marie Davidian. 2012. “A Robust Method for Estimating Optimal Treatment Regimes.” Biometrics 68 (4): 1010–8.
Zhao, Yingqi, Donglin Zeng, A John Rush, and Michael R Kosorok. 2012. “Estimating Individualized Treatment Rules Using Outcome Weighted Learning.” Journal of the American Statistical Association 107 (499): 1106–18.