For most of human history, the vast majority of childbirths took place “at home.” Around the 17th century in Europe, women increasingly began giving birth in hospitals and other medical clinics. That transition, however, was associated with a vast increase in postpartum mortality; some clinics saw more than ten percent of mothers die of septic shock shortly after childbirth.
In the mid-1800s, a Hungarian physician named Ignaz Semmelweis worked at the Vienna General Hospital in the Austrian Empire, which housed two such maternity clinics. The first clinic had postpartum mortality rates of 10-15%; the second had rates of around 3-4%. Semmelweis observed that doctors at the first clinic would often arrive to deliver infants immediately after working with ill patients or conducting autopsies of deceased ones, whereas doctors in the second worked exclusively with childbirth, and he posited that this difference in practice caused the gap in mortality. He kept detailed records of which mothers survived or died in childbirth, which clinic they attended, and which physicians helped with the birth – essentially, one of the first difference-in-means studies of the scientific era.
Furthermore, he posited that doctors in the first clinic were the vector of transmission of sepsis from mother to mother (a remarkable insight before the widespread acceptance of the germ theory of disease). Since those doctors often worked with the blood of infected or deceased patients directly before assisting a childbirth, they might carry infection straight to the mothers. Semmelweis thus ordered doctors in the first clinic to adopt a fairly strict regimen of washing their hands between treating patients, and over the following years rates of postpartum mortality in the first clinic fell to 3-4% – the rates the second clinic had maintained all along. In so doing, Semmelweis 1) saved hundreds of lives directly and millions of lives indirectly, and 2) conducted one of the first “causal inference” studies in all of non-laboratory science.
In scientific enterprise, causal inference is a framework wherein we describe relationships between real-world phenomena with both technical and substantive precision. That is, we might say some phenomenon \(X\) causes \(Y\) if we can demonstrate both:

- the magnitude of the relationship between \(X\) and \(Y\) (technical precision), and
- the mechanism by which \(X\) brings about a change in \(Y\) (substantive precision).
For this class, we will primarily be referring to causal inference in the context of the Neyman-Rubin Causal Model, which utilizes the following notation:

- \(T_i \in \{0, 1\}\): the treatment indicator for unit \(i\) (1 if treated, 0 otherwise)
- \(Y_i(1)\), \(Y_i(0)\): the potential outcomes for unit \(i\) under treatment and under control
- \(Y_i = T_i Y_i(1) + (1 - T_i) Y_i(0)\): the observed outcome for unit \(i\)
Some key assumptions:

- Consistency: the observed outcome is the potential outcome under the treatment actually received, \(Y_i = Y_i(T_i)\)
- No interference (SUTVA): unit \(i\)'s potential outcomes do not depend on any other unit's treatment status
- Unconfoundedness: treatment assignment is orthogonal to the potential outcomes, \(\{Y_i(1), Y_i(0)\} \perp T_i\)
Big picture: we wish to recover the individual causal effect \(\tau_i = Y_i(1) - Y_i(0)\), but we only get to observe one potential outcome per unit. Maybe the solution is to think about average treatment effects (ATEs) instead. The population average treatment effect (PATE), \(\mathbb{E}\left[\tau_i\right]\), however, runs into the same problem, with the added complication that we don't get to observe the full population of interest.
As a next step, perhaps we use the sample analogue: the Sample Average Treatment Effect (SATE): \[\begin{equation*} \tau_{\text{SATE}} = \frac{1}{N}\displaystyle \sum_{i = 1}^{N} \left(Y_i(1) - Y_i(0)\right) \end{equation*}\]
…but, again, we only observe one potential outcome per unit.
If we cannot directly recover the SATE, how can we hope to extrapolate to the PATE or even \(\tau_i\)? It turns out that the solution (under certain assumptions) is to use difference-in-means:
\[\begin{equation*}
\hat{\tau} \equiv \frac{1}{N_1} \displaystyle \sum_{i = 1}^{N} T_i Y_i -
\frac{1}{N_0} \sum_{i = 1}^{N}(1 - T_i) Y_i
\end{equation*}\]
The difference-in-means estimator \(\hat{\tau}\) is, quite literally, the difference in the mean value of \(Y\) for treated units (i.e., \(T = 1\)) and untreated units (\(T = 0\)), where \(N_1 = \sum_{i} T_i\) and \(N_0 = N - N_1\) are the numbers of treated and untreated units. Let \(\mathcal{O} \equiv \left\{Y_i(0), Y_i(1) \right\}_{i = 1}^{N}\). It can be shown that \(\hat{\tau} \, | \, \mathcal{O}\) is unbiased for the SATE:
\[\begin{equation*} \begin{aligned} \mathbb{E}\left[\hat{\tau} \, | \, \mathcal{O} \right] & = \mathbb{E}\left[\frac{1}{N_1} \displaystyle \sum_{i = 1}^{N} T_i Y_i - \frac{1}{N_0} \sum_{i = 1}^{N}(1 - T_i) Y_i \, | \, \mathcal{O} \right] \, \text{by definition of} \, \hat{\tau}\\ & = \frac{1}{N_1} \displaystyle \sum_{i = 1}^{N} \mathbb{E}\left[T_i Y_i \, | \, \mathcal{O}\right] - \frac{1}{N_0} \sum_{i = 1}^{N} \mathbb{E}\left[(1 - T_i) Y_i \, | \, \mathcal{O}\right] \, \text{by linearity of} \, \mathbb{E}\\ & = \frac{1}{N_1} \displaystyle \sum_{i = 1}^{N} \mathbb{E}\left[T_i \, | \, \mathcal{O}\right] Y_i(1) - \frac{1}{N_0} \sum_{i = 1}^{N} \mathbb{E}\left[1 - T_i \, | \, \mathcal{O}\right] Y_i(0) \, \text{by consistency}\\ & = \frac{1}{N_1} \displaystyle \sum_{i = 1}^{N} \mathbb{P}(T_i = 1 \, | \, \mathcal{O}) Y_i(1) - \frac{1}{N_0} \sum_{i = 1}^{N} \left(1 - \mathbb{P}(T_i = 1 \, | \, \mathcal{O}) \right) Y_i(0) \, \text{because} \, \mathbb{E}\left[T_i \, | \, \mathcal{O}\right] = 1 \times \mathbb{P}(T_i = 1 \, | \, \mathcal{O}) + 0 \times \mathbb{P}(T_i = 0 \, | \, \mathcal{O})\\ & = \frac{1}{N_1} \displaystyle \sum_{i = 1}^{N} \mathbb{P}(T_i = 1) Y_i(1) - \frac{1}{N_0} \sum_{i = 1}^{N} \left(1 - \mathbb{P}(T_i = 1)\right)Y_i(0) \, \text{because} \, T_i \, \text{and} \, \mathcal{O} \, \text{are orthogonal (or unconfounded)}\\ & = \frac{1}{N_1} \displaystyle \sum_{i = 1}^{N} \frac{N_1}{N} Y_i(1) - \frac{1}{N_0} \sum_{i = 1}^{N} \frac{N_0}{N} Y_i(0) \, \text{because} \, \mathbb{P}(T_i = 1) = \frac{N_1}{N} \, \text{under complete randomization}\\ & = \frac{1}{N} \displaystyle \sum_{i = 1}^{N} \left[Y_i(1) - Y_i(0)\right] \, \text{by algebra}\\ & = \text{SATE} \, _{\square} \end{aligned} \end{equation*}\]
Note that we cannot directly identify \(\mathbb{V}(\hat{\tau})\) because it is a function of both potential outcomes. In practice, we typically use the more conservative estimator \(\hat{\mathbb{V}}^{*}(\hat{\tau}) = \frac{\hat{\sigma}_{1}^{2}}{N_1} + \frac{\hat{\sigma}_{0}^{2}}{N_0}\), where \(\hat{\sigma}_{1}^{2}\) and \(\hat{\sigma}_{0}^{2}\) are the sample variances of the observed outcomes among treated and untreated units, respectively.
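To make this concrete, here is a minimal simulation sketch in Python (the data-generating process, effect size, and seed are all invented for illustration): we draw potential outcomes, completely randomize treatment, and compute both the difference-in-means estimate and the conservative variance estimator.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 1_000

# Hypothetical potential outcomes: individual effects average about 2,
# so the true SATE is close to 2.
Y0 = rng.normal(loc=0.0, scale=1.0, size=N)
Y1 = Y0 + 2.0 + rng.normal(loc=0.0, scale=0.5, size=N)
sate = np.mean(Y1 - Y0)

# Completely randomize exactly N/2 units into treatment.
T = np.zeros(N, dtype=int)
T[rng.choice(N, size=N // 2, replace=False)] = 1

# Consistency: we observe only one potential outcome per unit.
Y = np.where(T == 1, Y1, Y0)

# Difference-in-means estimator.
N1, N0 = T.sum(), N - T.sum()
tau_hat = Y[T == 1].mean() - Y[T == 0].mean()

# Conservative variance estimator: sigma_1^2 / N_1 + sigma_0^2 / N_0.
var_hat = Y[T == 1].var(ddof=1) / N1 + Y[T == 0].var(ddof=1) / N0

print(f"SATE: {sate:.3f}  tau_hat: {tau_hat:.3f}  SE: {np.sqrt(var_hat):.3f}")
```

Re-running the randomization step many times and averaging `tau_hat` across draws would also verify the unbiasedness result proven above.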
It turns out that, if we are careful, we may recover treatment effects using the linear model. Let \(Y_i(T_i) = \alpha + \beta \times T_i + \epsilon_i\) with \(\mathbb{E}\left[\epsilon_i \right] = 0\). By consistency, the observed outcome is \(Y_i = \alpha + \beta \times T_i + \epsilon_i\), and unconfoundedness guarantees \(T_i\) is independent of \(\epsilon_i\). Here \(T_i\) is the treatment indicator, \(\alpha = \mathbb{E}\left[Y_i(0) \right]\) is the average outcome for untreated units, and \(\beta = \mathbb{E}\left[Y_i(1) - Y_i(0)\right]\) is the expected difference in outcomes between treated and untreated units – in other words, the average treatment effect.
In an experimental setting, we are free to utilize either regression or difference-in-means as long as treatment was randomly assigned. In observational studies (e.g., natural experiments), our framework becomes \(Y_i(T_i) = \alpha + \beta \times T_i + X_{i}^{\top}\gamma + \epsilon_i\), where \(X_i\) is a vector of pre-treatment confounders. In this framework, specifying \(X_i\) correctly gives us conditional ignorability. Be careful: \(\gamma\) doesn't have any sort of causal interpretation, and \(X_i\) mustn't include any post-treatment variables, lest we block part of the very causal path from \(T_i\) to \(Y_i\) that we are trying to measure.
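A short sketch of the regression route, assuming `statsmodels` is available and using a made-up data-generating process: with randomized treatment, the OLS coefficient on \(T_i\) is numerically identical to difference-in-means, and pre-treatment covariates can be added as in the observational-style specification above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
N = 2_000

# Invented DGP: one pre-treatment covariate, randomized treatment,
# true treatment effect of 2.
X = rng.normal(size=N)
T = rng.binomial(1, 0.5, size=N)
Y = 1.0 + 2.0 * T + 0.5 * X + rng.normal(size=N)

# Regressing Y on T alone: the slope equals difference-in-means.
simple = sm.OLS(Y, sm.add_constant(T.astype(float))).fit()
diff_in_means = Y[T == 1].mean() - Y[T == 0].mean()
print(simple.params[1], diff_in_means)  # numerically identical

# Adding the pre-treatment covariate (the observational-style model);
# under conditional ignorability, the coefficient on T is still the ATE.
adjusted = sm.OLS(Y, sm.add_constant(np.column_stack([T, X]))).fit()
print(adjusted.params[1])
```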
Let \(Y = \textbf{X}\beta + \epsilon\) and let \(\textbf{Z}\) be a battery of instruments satisfying our typical instrumental variables assumptions (relevance, exogeneity, and the exclusion restriction). In the simple, just-identified case, we may recover: \[\begin{equation*} \beta = \frac{\text{Cov}(Y,Z)}{\text{Cov}(X,Z)} = \frac{\frac{\text{Cov}(Y, Z)}{\mathbb{V}(Z)}}{\frac{\text{Cov}(X,Z)}{\mathbb{V}(Z)}} \end{equation*}\] that is, the ratio of the reduced-form coefficient (of \(Y\) on \(Z\)) to the first-stage coefficient (of \(X\) on \(Z\)).
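The sketch below (again with an invented data-generating process) shows why the instrument matters: with an unobserved confounder \(U\), the naive OLS slope of \(Y\) on \(X\) is biased, while the ratio \(\text{Cov}(Y,Z)/\text{Cov}(X,Z)\) recovers \(\beta\).

```python
import numpy as np

rng = np.random.default_rng(7)
N = 100_000
beta = 1.5  # true causal effect (known here only because we simulate)

# Invented DGP: unobserved confounder U drives both X and Y; the
# instrument Z shifts X but enters Y only through X (exclusion).
U = rng.normal(size=N)
Z = rng.binomial(1, 0.5, size=N)
X = 0.8 * Z + U + rng.normal(size=N)
Y = beta * X + 2.0 * U + rng.normal(size=N)

# Naive OLS slope of Y on X is biased upward by U.
ols_slope = np.cov(Y, X)[0, 1] / np.var(X, ddof=1)

# Just-identified IV estimator: Cov(Y, Z) / Cov(X, Z).
iv_slope = np.cov(Y, Z)[0, 1] / np.cov(X, Z)[0, 1]

print(f"OLS (biased): {ols_slope:.3f}  IV: {iv_slope:.3f}  truth: {beta}")
```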
A more involved approach, two-stage-least-squares (2SLS), can recover \(\beta\) as well. Why revisit this?
In experimental designs (natural or lab-based), we often cannot force units to follow the randomized treatment assignment. If we are clever (and careful), we may utilize the instrumental variables approach to recover estimates of \(\beta\) despite this problem. We'll shift our thinking to what is typically called an encouragement design, with the intention-to-treat (ITT) effect as the estimand of interest.
Let \(Z_i \in \{0,1\}\) be the randomized encouragement, i.e., something of a “nudge” for units to take or abstain from treatment. In the potential outcomes framework, we now have:

- \(T_i(Z_i) \in \{0, 1\}\): the treatment unit \(i\) actually takes given encouragement status \(Z_i\)
- \(Y_i(Z_i, T_i(Z_i))\): the potential outcome under encouragement \(Z_i\) and the treatment actually taken
Thus, our potential outcomes \(Y_i\) are a function of both encouragement and actual treatment, or \(Y_i = Y_i\left(Z_i, T_i(Z_i)\right)\).
Some necessary assumptions:

- No interference, for both \(T_i(Z_i)\) and \(Y_i(Z_i, T_i)\)
- Randomization of encouragement: \(\{Y_i(1), Y_i(0), T_i(1), T_i(0)\} \perp Z_i\)
- Note, however, that potential outcomes are not orthogonal to treatment conditional on encouragement – \(\{Y_i(1), Y_i(0)\} \perp T_i \, | \, Z_i = z\) does not hold in general, since take-up is self-selected; this is precisely why we need the instrument
Principal Stratification
There are four types of units in this approach whose behavior may muddle the relationship between encouragement and treatment:

- Compliers, who take treatment if and only if encouraged: \(T_i(1) = 1\), \(T_i(0) = 0\)
- Always-takers, who take treatment regardless of encouragement: \(T_i(1) = T_i(0) = 1\)
- Never-takers, who never take treatment regardless of encouragement: \(T_i(1) = T_i(0) = 0\)
- Defiers, who do the opposite of their encouragement: \(T_i(1) = 0\), \(T_i(0) = 1\)
We can summarize this ambiguity – which strata are consistent with each observed combination of \(Z_i\) and \(T_i\) – in a two-way table:

|  | \(Z_i = 1\) | \(Z_i = 0\) |
|---|---|---|
| \(T_i = 1\) | Complier/Always-taker | Defier/Always-taker |
| \(T_i = 0\) | Defier/Never-taker | Complier/Never-taker |
For any observed combination of treatment \(T_i\) and encouragement \(Z_i\), we cannot pin down the stratum of a given unit. Thus we need further assumptions:

- Monotonicity (no defiers): \(T_i(1) \geq T_i(0)\) for all \(i\)
- Exclusion restriction: encouragement affects outcomes only through treatment, i.e., \(Y_i(1, t) = Y_i(0, t)\) for \(t \in \{0, 1\}\)
Given these assumptions, we can update our table:
|  | \(Z_i = 1\) | \(Z_i = 0\) |
|---|---|---|
| \(T_i = 1\) | Complier/Always-taker | Always-taker |
| \(T_i = 0\) | Never-taker | Complier/Never-taker |
Given these assumptions, we can isolate the probability that a unit is in a given stratum, and thus extract the ITT among compliers. From the updated table:

- under \(Z_i = 0\), only always-takers take treatment, so \(\mathbb{P}(\text{Always-taker}) = \mathbb{E}\left[T_i \, | \, Z_i = 0\right]\);
- under \(Z_i = 1\), treated units are compliers or always-takers, so \(\mathbb{P}(\text{Always-taker or Complier}) = \mathbb{E}\left[T_i \, | \, Z_i = 1\right]\).
Collectively, these imply: \[\begin{equation*} \mathbb{P}(\text{Complier}) = \mathbb{P}(\text{Always-taker or Complier}) - \mathbb{P}(\text{Always-Taker}) = \mathbb{E}\left[T_i \, | \, Z_i = 1 \right] - \mathbb{E}\left[T_i \, | \, Z_i = 0 \right] \end{equation*}\]
Next, we leverage the law of total expectation to isolate the ITT among compliers: \[\begin{equation*} \begin{aligned} \mathbb{E}\left[\text{ITT}\right] & = \mathbb{E}\left[\text{ITT} \, | \, \text{Complier}\right] \times \mathbb{P}(\text{Complier})\\ & \quad + \mathbb{E}\left[\text{ITT} \, | \, \text{Always-taker}\right] \times \mathbb{P}(\text{Always-taker})\\ & \quad + \mathbb{E}\left[\text{ITT} \, | \, \text{Never-taker}\right] \times \mathbb{P}(\text{Never-taker})\\ & = \mathbb{E}\left[\text{ITT} \, | \, \text{Complier}\right] \times \mathbb{P}(\text{Complier}) \, \text{since the latter two terms are zero by exclusion}\\ \implies \mathbb{E}\left[\text{ITT} \, | \, \text{Complier}\right] & = \frac{\text{ITT}}{\mathbb{P}(\text{Complier})}\\ & = \frac{\mathbb{E}\left[Y_i \, | \, Z_i = 1\right] - \mathbb{E}\left[Y_i \, | \, Z_i = 0 \right]}{\mathbb{E}\left[T_i \, | \, Z_i = 1\right] - \mathbb{E}\left[T_i \, | \, Z_i = 0 \right]}\\ & = \frac{\text{Cov}(Y_i, Z_i)}{\text{Cov}(T_i, Z_i)} \end{aligned} \end{equation*}\]
The sample analogue of \(\frac{\text{Cov}(Y_i, Z_i)}{\text{Cov}(T_i, Z_i)}\) is the Wald estimator, which is identical to the 2SLS estimator with encouragement \(Z\) as the instrument. It can be shown that the Wald estimator converges in probability to the true ITT\(_c\) (the ITT among compliers, i.e., the complier average causal effect). Chapter 4 of Angrist and Pischke's *Mostly Harmless Econometrics* covers this in great detail.
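As a final illustration, here is a hypothetical encouragement-design simulation (the strata shares, baselines, and effect size are all invented): we mix compliers, always-takers, and never-takers, let encouragement affect outcomes only through treatment, and verify that the Wald estimator recovers the complier effect while a naive as-treated comparison does not.

```python
import numpy as np

rng = np.random.default_rng(123)
N = 200_000
tau_complier = 2.0  # true effect of treatment (set by simulation)

# Principal strata: 60% compliers, 20% always-takers, 20% never-takers
# (no defiers, so monotonicity holds by construction).
stratum = rng.choice(["complier", "always", "never"], size=N, p=[0.6, 0.2, 0.2])

# Randomized encouragement.
Z = rng.binomial(1, 0.5, size=N)

# Treatment take-up by stratum: compliers follow Z exactly.
T = np.where(stratum == "always", 1, np.where(stratum == "never", 0, Z))

# Outcome depends on T only (exclusion restriction), with
# stratum-specific baselines that make naive comparisons misleading.
baseline = np.where(stratum == "always", 3.0,
                    np.where(stratum == "never", -1.0, 0.0))
Y = baseline + tau_complier * T + rng.normal(size=N)

# Wald estimator: ITT on Y divided by ITT on T (the first stage).
itt_y = Y[Z == 1].mean() - Y[Z == 0].mean()
itt_t = T[Z == 1].mean() - T[Z == 0].mean()  # ~ P(Complier) = 0.6
wald = itt_y / itt_t

naive = Y[T == 1].mean() - Y[T == 0].mean()  # confounded by strata
print(f"Wald: {wald:.3f} (truth {tau_complier})  naive as-treated: {naive:.3f}")
```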
Big picture: given that units do not always take or abstain from treatment when they are “supposed to,” we may utilize the instrumental variables framework to recover estimates of the causal effect of treatment among compliers – the units whose take-up responds to encouragement.