Preamble
One of the main tenets of practical data science is performing statistical inference on data sets that are assumed to represent populations of activity, natural or man-made. A specific case is causal inference, which is probably the core interest of decision makers and the main reason why businesses and industry fund data science projects in the first place: either to find the cause of an event or to understand the impact of data science and AI products. The recent popular book, The Book of Why, once again brought causal inference to a mainstream audience [pearl2018]. However, the nightmare for a data scientist is that there is a fundamental limitation on performing such inference. It stems from the fact that the two conditions needed to test a causal link, i.e., to establish cause and effect, cannot both occur on a given data point. This is the so-called FPCI, the fundamental problem of causal inference [Holland86], and it can only be resolved approximately. A data point cannot be both treated and not treated simultaneously. For example, a patient cannot both receive a treatment and not receive it at the same time, and a customer cannot be both included in a campaign and excluded from it at the same time. This post outlines possible approaches to tackling the problem, with pointers to resources and software tools. At its core this is a missing data problem, but note that the missingness here originates not from a measurement error but from a data point that fundamentally does not exist, so it is not merely a simple data imputation issue. A more extensive resource list, called looper, is available on GitHub: A resource list for causality in statistics, data science and physics [looper].
Figure: Looper, the movie, revolves around a causal loop (Wikimedia)
Rubin causal model: A-null test with the inherent contradiction
The question of why, as in what is the cause of an event, is inherently a physics question and leads into the space-time concepts of general relativity and into classical mechanics in general. In popular time-travel movies such as Looper (see Figure), causality loops create curious, non-intuitive phenomena. From a data analysis perspective, the Rubin causal model [Rubin1974] asserts that the causal effect can be quantified with an A-null test. What does this mean? The effect of a cause, i.e., a treatment (a term with a medical connotation), on an event, i.e., the effect or the outcome, can be quantified by the algebraic difference between the expected value of the outcome and its counterfactual, a fancy name for what the outcome would have been in the absence of the treatment. Unfortunately, this is inherently contradictory, as mentioned above: it is the so-called FPCI, the fundamental problem of causal inference [Holland86]. On a single sample, i.e., a single event, the A-null test cannot be applied because the counterfactual outcome is necessarily missing. To estimate the A-null test, two groups are constructed so that both are drawn from the same population; recall the law of large numbers. By design, one group receives the treatment and the other does not. Over time, the effect of the treatment is measured on both groups and Rubin's A-null test is applied. Not surprisingly, these are called the treatment and control groups, and the quantity estimated by this procedure is called the Average Treatment Effect (ATE). The mathematical exposition of this procedure is well established and can be found in standard texts, see [looper], but mathematically the ATE reads $\mathbb{E}(Y_{i} | T_{i} = 1) - \mathbb{E}(Y_{i} | T_{i} = 0)$.
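To make the ATE formula concrete, here is a minimal Python sketch on simulated data. Everything in it is an assumption made for illustration: the function names simulate_trial and estimate_ate, the Gaussian baseline outcome, the constant treatment effect of 1.5 and the coin-flip assignment do not come from any of the cited texts. It only illustrates that, under random assignment, the difference of group means approximates the effect even though each unit reveals just one of its two potential outcomes.

```python
import random
import statistics


def simulate_trial(n, true_effect, seed=0):
    """Simulate a randomized experiment with a known, constant treatment effect.

    Hypothetical data-generating process, for illustration only.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        baseline = rng.gauss(10.0, 2.0)  # Y_i(0): outcome without treatment
        if rng.random() < 0.5:
            # T_i = 1: only the treated outcome Y_i(1) is observed
            treated.append(baseline + true_effect)
        else:
            # T_i = 0: only the untreated outcome Y_i(0) is observed
            control.append(baseline)
    return treated, control


def estimate_ate(treated, control):
    """Difference in group means: E(Y | T = 1) - E(Y | T = 0)."""
    return statistics.mean(treated) - statistics.mean(control)


treated, control = simulate_trial(n=20000, true_effect=1.5)
ate_hat = estimate_ate(treated, control)
print(f"estimated ATE: {ate_hat:.3f} (assumed true effect: 1.5)")
```

Because assignment is random, both groups share the same baseline distribution, so the estimate should land close to the assumed effect; with a small n it will fluctuate noticeably.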
Matching or balancing
Randomized controlled trials (RCTs) construct control and treatment groups. To compute the ATE robustly, the statistical properties of the covariates, i.e., the set of features used in the study, must be similar in the control and treatment groups. Any technique that tries to address this balancing issue is called matching [stuart2010]. A prominent example in this direction is propensity score matching, which uses similarity distances between covariates and is implemented in R via the Matching package, which is based on genetic search.
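As a rough illustration of what matching buys us, the sketch below applies nearest-neighbour matching on a single covariate in pure Python. This is a deliberately simpler stand-in: it is not propensity score matching and not the genetic search of the R package. The data-generating process, the function names and the true effect of 1.0 are all hypothetical. Treatment is made more likely for units with a large covariate value, so the groups are unbalanced and the naive difference of means is biased.

```python
import bisect
import random
import statistics


def simulate_observational(n, true_effect, seed=1):
    """Hypothetical non-randomized data: the covariate x drives both T and Y."""
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        x = rng.random()                              # covariate in [0, 1)
        t = 1 if rng.random() < 0.2 + 0.6 * x else 0  # unbalanced assignment
        y = 5.0 * x + true_effect * t + rng.gauss(0.0, 0.5)
        units.append((x, t, y))
    return units


def naive_difference(units):
    """Unadjusted difference of group means (ignores covariate imbalance)."""
    treated = [y for _, t, y in units if t == 1]
    control = [y for _, t, y in units if t == 0]
    return statistics.mean(treated) - statistics.mean(control)


def matched_difference(units):
    """Pair each treated unit with the control closest in x (with replacement)."""
    controls = sorted((x, y) for x, t, y in units if t == 0)
    control_x = [x for x, _ in controls]
    diffs = []
    for x, t, y in units:
        if t != 1:
            continue
        pos = bisect.bisect_left(control_x, x)
        # The nearest control is one of the neighbours of the insertion point.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(controls)]
        best = min(candidates, key=lambda i: abs(control_x[i] - x))
        diffs.append(y - controls[best][1])
    return statistics.mean(diffs)


units = simulate_observational(n=4000, true_effect=1.0)
naive = naive_difference(units)
matched = matched_difference(units)
print(f"naive: {naive:.3f}, matched: {matched:.3f} (assumed true effect: 1.0)")
```

In this toy setup the naive estimate overstates the effect because treated units have larger covariate values on average, while the matched estimate compares like with like. With several covariates, one would typically match on a one-dimensional summary such as the propensity score instead of on the raw covariate.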
Imputation approach: Causal inference as a missing data problem
One way to address the FPCI is to impute the missing data: predict the outcome a treated sample would have had as a control, and vice versa. With advanced imputation techniques, one can resolve the FPCI approximately. One prominent example is Multiple Imputation, implemented in the R package mice.
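The imputation idea can be sketched in a few lines of Python. Note the assumptions: this is single regression imputation with one straight-line fit per group, not Multiple Imputation and not what the mice package does; the simulated data, the helper names fit_line and impute_ate, and the true effect of 1.0 are hypothetical. For every unit, the unobserved potential outcome is filled in with a prediction from the model fitted on the other group, and the individual differences are then averaged.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def impute_ate(units):
    """units: list of (x, t, y). Impute each unit's missing potential outcome."""
    a1, b1 = fit_line([x for x, t, _ in units if t == 1],
                      [y for _, t, y in units if t == 1])
    a0, b0 = fit_line([x for x, t, _ in units if t == 0],
                      [y for _, t, y in units if t == 0])
    effects = []
    for x, t, y in units:
        y1 = y if t == 1 else a1 + b1 * x  # observed or imputed Y(1)
        y0 = y if t == 0 else a0 + b0 * x  # observed or imputed Y(0)
        effects.append(y1 - y0)
    return statistics.mean(effects)


# Hypothetical observational data: x influences both treatment and outcome.
rng = random.Random(2)
units = []
for _ in range(4000):
    x = rng.random()
    t = 1 if rng.random() < 0.2 + 0.6 * x else 0
    y = 5.0 * x + 1.0 * t + rng.gauss(0.0, 0.5)  # assumed true effect: 1.0
    units.append((x, t, y))

ate_imputed = impute_ate(units)
print(f"imputation-based ATE: {ate_imputed:.3f} (assumed true effect: 1.0)")
```

The estimate is only as good as the model used to predict the missing outcomes; here the straight-line fit matches the simulated data by construction, which real data will rarely grant.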
Summary
We have reviewed the fundamental problem of causal inference (FPCI). For causality resources, a resource list called looper is available on GitHub: A resource list for causality in statistics, data science and physics [looper]; please send a pull request for additions. Note that causality is a large research area, ranging from Bayesian networks and Pearl's do-calculus to uplift modelling.
References
[pearl2018] Judea Pearl and Dana Mackenzie, The Book of Why: The New Science of Cause and Effect, Basic Books; 1 edition (May 15, 2018)
[Holland86] Holland, Paul W. (1986). "Statistics and Causal Inference". J. Amer. Statist. Assoc. 81 (396): 945–960.
[Rubin1974] Rubin DB (1974). “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology, 66, 688–701.
[stuart2010] Elizabeth A. Stuart, Matching methods for causal inference: A review and a look forward. Stat Sci. 2010 Feb 1; 25(1): 1–21.
[looper] A resource list for causality in statistics, data science and physics, http://github.com/msuzen/looper