A Motivating Example #
Consider the following example from Athey (2019):
$\ldots$ Imagine that you have a data set that contains data about prices and occupancy rates of hotels. Prices are easy to obtain through price comparison sites, but occupancy rates are typically not made public by hotels. Imagine first that a hotel chain wishes to form an estimate of the occupancy rates of competitors, based on publicly available prices. This is a prediction problem: the goal is to get a good estimate of occupancy rates, where posted prices and other factors (such as events in the local area, weather, and so on) are used to predict occupancy. For such a model, you would expect to find that higher posted prices are predictive of higher occupancy rates, since hotels tend to raise their prices as they fill up (using yield management software).
$\ldots$ In contrast, imagine that a hotel chain wishes to estimate how occupancy would change if the hotel raised prices across the board (that is, if it reprogrammed the yield management software to shift prices up by 5 percent in every state of the world). This is a question of causal inference.
Potential Outcome Model #
Suppose that we are interested in the causal effect of a treatment variable $D_i\in\{0,1\}$ on an outcome variable $Y_i$ for individuals $i=1,\ldots, n$. The individuals with $D_i=1$ are in the treatment group, and others with $D_i=0$ are in the control group.
We assume that $Y_i$ and $D_i$ are associated via a potential outcome model $$Y_i=\begin{cases}Y_i^{(1)}& D_i=1\\Y_i^{(0)}& D_i=0\end{cases}$$ or, equivalently, $$Y_i=D_iY_i^{(1)}+(1-D_i)Y_i^{(0)}.$$ The treatment effect for individual $i$ is then given by $$\delta_i=Y_i^{(1)}-Y_i^{(0)}.$$ However, this effect is counterfactual because we only observe one state $Y_i^{(D_i)}\in\{Y_i^{(1)},Y_i^{(0)}\}$ but not both.
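The potential outcome model can be illustrated with a small simulation. The following is a minimal sketch on made-up data (the sample size, the distributions and the constant effect of 2 are assumptions chosen for illustration). It shows that the case form and the switching form of $Y_i$ agree, and that $\delta_i$ is available only because both potential outcomes were simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# Hypothetical potential outcomes; in real data only one per individual is seen.
y0 = rng.normal(loc=1.0, scale=1.0, size=n)  # Y_i^{(0)}
y1 = y0 + 2.0                                # Y_i^{(1)}, so delta_i = 2 for all i
d = rng.integers(0, 2, size=n)               # treatment indicator D_i in {0, 1}

# Switching form: Y_i = D_i Y_i^{(1)} + (1 - D_i) Y_i^{(0)}
y = d * y1 + (1 - d) * y0

# Case form: Y_i^{(1)} if D_i = 1, otherwise Y_i^{(0)}
y_cases = np.where(d == 1, y1, y0)

# The counterfactual is the potential outcome that was not realized.
y_counterfactual = np.where(d == 1, y0, y1)

# Individual effects are computable here only because the data are simulated.
delta = y1 - y0
```

With real data, only `y` and `d` would be available, so `delta` could not be formed.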
Average Treatment Effect #
The impossibility of observing both potential outcomes for the same individual is the fundamental problem of causal inference. The following table shows which potential outcome is counterfactual in each group.
Group | $Y^{(1)}$ | $Y^{(0)}$ |
---|---|---|
Treatment ($D=1$) | Observable as $Y$ | Counterfactual |
Control ($D=0$) | Counterfactual | Observable as $Y$ |
However, we may still estimate the average treatment effect (ATE) given by $$\tau=\frac{1}{n}\sum_{i=1}^{n}\tau_i$$ where $$\tau_i=\mathbb{E}\delta_i=\mathbb{E}Y_i^{(1)}-\mathbb{E}Y_i^{(0)}.$$ We assume that the individual treatment and potential outcome variables are independent and identically distributed, that is, $(Y_i^{(1)},Y_i^{(0)},D_i)\stackrel{iid}{\sim} (Y^{(1)},Y^{(0)},D)$ for all $i$. This means that $\tau_i=\tau$ is the same for all individuals and $$\tau=\mathbb{E}Y^{(1)}-\mathbb{E}Y^{(0)}.$$ Denote the index sets for the treatment and control groups respectively by $$\begin{align*}\mathcal{I}_1=&\{1\leq i\leq n:D_i=1\},\\\mathcal{I}_0=&\{1\leq i\leq n:D_i=0\}.\end{align*}$$
Replacing the expected values by the sample averages yields the estimator $$\begin{align*}\widehat{\tau}=&\frac{1}{|\mathcal{I}_1|}\sum_{i\in\mathcal{I}_1}Y_i-\frac{1}{|\mathcal{I}_0|}\sum_{i\in\mathcal{I}_0}Y_i\\=&\frac{1}{|\mathcal{I}_1|}\sum_{i\in\mathcal{I}_1}Y_i^{(1)}-\frac{1}{|\mathcal{I}_0|}\sum_{i\in\mathcal{I}_0}Y_i^{(0)}.\end{align*}$$ If the (conditional) law of large numbers holds, then, denoting convergence in probability by ‘$\xrightarrow{\mathbb{P}}$’, $$\widehat{\tau}\xrightarrow{\mathbb{P}}~\mathbb{E}[Y^{(1)}|D=1]-\mathbb{E}[Y^{(0)}|D=0].$$
In general, this limit is not $\tau$. It equals $\tau$ if $(Y^{(1)},Y^{(0)})$ is independent of $D$, since then $$\begin{align*}&\mathbb{E}[Y^{(1)}|D=1]-\mathbb{E}[Y^{(0)}|D=0]\\&\stackrel{\text{Independence}}{=}\mathbb{E}[Y^{(1)}]-\mathbb{E}[Y^{(0)}]=\tau.\end{align*}$$
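The difference-in-means estimator $\widehat{\tau}$ can be sketched in a few lines. The example below is an illustration under stated assumptions: the data are simulated, the true ATE is set to 1.5, and treatment is assigned by a fair coin flip so that $D$ is independent of the potential outcomes. The helper name `difference_in_means` is hypothetical.

```python
import numpy as np

def difference_in_means(y, d):
    """Mean of Y over the treated (I_1) minus mean of Y over the controls (I_0)."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d)
    return y[d == 1].mean() - y[d == 0].mean()

rng = np.random.default_rng(1)
n = 200_000

# Simulated potential outcomes with true ATE tau = 1.5 (an assumed value).
y0 = rng.normal(0.0, 1.0, size=n)
y1 = y0 + 1.5 + rng.normal(0.0, 1.0, size=n)

# Coin-flip assignment: D is independent of (Y^{(1)}, Y^{(0)}).
d = rng.binomial(1, 0.5, size=n)
y = d * y1 + (1 - d) * y0

tau_hat = difference_in_means(y, d)
```

Under this independent assignment, `tau_hat` should be close to 1.5 for a sample of this size.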
Assuming complete independence between the potential outcomes $(Y^{(1)},Y^{(0)})$ and the treatment $D$ is, however, not realistic, as there usually exist confounding factors (or confounders) that influence both $Y$ and $D$. For example, let $D$ be college education and $Y$ be income; both may be affected by parents’ income and education.
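To see how a confounder distorts the comparison of group means, consider the following sketch. Everything in it is an assumption made for illustration: a single binary confounder `x`, treatment probabilities of 0.8 and 0.2, and a constant treatment effect of 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical binary confounder X, e.g. a coarse family-background indicator.
x = rng.binomial(1, 0.5, size=n)

# X raises the probability of treatment ...
p_treat = np.where(x == 1, 0.8, 0.2)
d = rng.binomial(1, p_treat)

# ... and also raises the outcome, regardless of treatment.
y0 = 2.0 * x + rng.normal(0.0, 1.0, size=n)
y1 = y0 + 1.0                      # true ATE tau = 1
y = d * y1 + (1 - d) * y0

# Naive difference in group means.
naive = y[d == 1].mean() - y[d == 0].mean()

# Population limit in this design:
# E[Y^{(1)} | D=1] - E[Y^{(0)} | D=0] = (1 + 2 * 0.8) - (2 * 0.2) = 2.2
```

In this design the naive estimator converges to 2.2 rather than to the true effect of 1, because treated individuals are more likely to have `x == 1` and hence higher outcomes even without treatment.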
Heterogeneous Treatment Effects #
Confounding implies that treatment assignment is not random but self-selected based on characteristics. Suppose that we have an exhaustive collection of the confounders given by $X=(X_1,\ldots,X_d)$. Define the conditional average treatment effect (CATE) by $$\begin{align*}\tau(x)=&\mathbb{E}[Y^{(1)}-Y^{(0)}|X=x]\\=& \mathbb{E}[Y^{(1)}|X=x]-\mathbb{E}[Y^{(0)}|X=x]\\=:&\mu^{(1)}(x)-\mu^{(0)}(x)\end{align*}$$ where $\mu^{(1)}(x)$ and $\mu^{(0)}(x)$ are the regression functions for the treatment and control group respectively.
Then we may estimate $\tau(x)$ consistently under the unconfoundedness assumption.
Unconfoundedness Assumption #
$$(Y^{(1)},Y^{(0)})~\perp D |X=x,\quad x\in\mathcal{X}$$ where $A\perp B$ means ‘$A$ is independent of $B$’ and $\mathcal{X}$ is the domain of $X$.
In other words, given any fixed feature value $X=x$, the independence assumption holds. However, $(Y^{(1)},Y^{(0)})$ and $D$ can still be dependent unconditionally via the random features $X$.
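A short derivation, using only the definitions above, indicates why unconfoundedness makes $\tau(x)$ estimable from observed data. Provided that both $D=1$ and $D=0$ occur with positive probability given $X=x$, $$\begin{align*}\mu^{(1)}(x)=&\mathbb{E}[Y^{(1)}|X=x]\\=&\mathbb{E}[Y^{(1)}|D=1,X=x]\\=&\mathbb{E}[Y|D=1,X=x],\end{align*}$$ where the second equality uses unconfoundedness and the third uses $Y=Y^{(1)}$ when $D=1$. The same argument applied to $\mu^{(0)}(x)$ gives $$\tau(x)=\mathbb{E}[Y|D=1,X=x]-\mathbb{E}[Y|D=0,X=x],$$ so both terms are regression functions of observed quantities.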
By the law of iterated expectations, one can then recover the ATE from the CATE via
$$\tau=\mathbb{E}[\tau(X)].$$ The sample analogue is $$\widehat{\tau}=\frac{1}{n}\sum_{i=1}^{n}\widehat{\tau}(X_i),$$ where $\widehat{\tau}(x)$ denotes an estimator of the CATE $\tau(x)$.
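One possible way to build $\widehat{\tau}(x)$ is to estimate $\mu^{(1)}$ and $\mu^{(0)}$ by separate regressions on the treatment and control groups and take the difference. The sketch below uses ordinary least squares for both, which is an assumption of this example and not a method prescribed by the text. The data-generating process, with $\tau(x)=1+0.5x$ and hence $\tau=1$, and the helper names `fit_linear` and `predict_linear` are likewise hypothetical.

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares fit of y on an intercept and the columns of x."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict_linear(coef, x):
    """Fitted values for the linear model returned by fit_linear."""
    return np.column_stack([np.ones(len(x)), x]) @ coef

rng = np.random.default_rng(3)
n = 100_000

# One observed confounder X; treatment probability depends on X.
x = rng.uniform(-1.0, 1.0, size=(n, 1))
p = 1.0 / (1.0 + np.exp(-2.0 * x[:, 0]))   # P[D=1 | X=x], strictly inside (0, 1)
d = rng.binomial(1, p)

# Potential outcomes with tau(x) = 1 + 0.5 x, so tau = E[tau(X)] = 1.
y0 = x[:, 0] + rng.normal(0.0, 1.0, size=n)
y1 = y0 + 1.0 + 0.5 * x[:, 0]
y = d * y1 + (1 - d) * y0

# Separate regressions on the treated and control groups.
coef1 = fit_linear(x[d == 1], y[d == 1])   # estimate of mu^{(1)}
coef0 = fit_linear(x[d == 0], y[d == 0])   # estimate of mu^{(0)}

# CATE estimates at every X_i, then their sample average.
cate_hat = predict_linear(coef1, x) - predict_linear(coef0, x)
ate_hat = cate_hat.mean()
```

Here the linear models are correctly specified by construction, so `ate_hat` should be close to 1. With real data the regression functions are unknown and a more flexible estimator may be needed.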
To recover $\tau(x)$ for all $x\in\mathcal{X}$, we need the overlap condition, which ensures that both treated and control individuals can be observed at every feature value.
Overlap Condition #
For some lower bound $\varepsilon>0$ and for all $x\in\mathcal{X}$, $$\varepsilon<\mathbb{P}[D=1|X=x]<1-\varepsilon.$$ In other words, there are always non-trivial assignment probabilities to both treatment and control groups conditional on any realized value of confounders.
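When the confounder takes only a few distinct values, a rough empirical check of overlap is to compute the treated share within each value. The sketch below assumes a discrete confounder and a hand-made data set. The helper name `check_overlap` is hypothetical, and sample frequencies only approximate the population probabilities in the condition.

```python
import numpy as np

def check_overlap(d, x, eps):
    """Empirical P[D=1 | X=x] per distinct x, and whether all lie in (eps, 1 - eps)."""
    d = np.asarray(d)
    x = np.asarray(x)
    propensity = {int(v): float(d[x == v].mean()) for v in np.unique(x)}
    ok = all(eps < p < 1.0 - eps for p in propensity.values())
    return propensity, ok

# Hand-made data: the stratum x == 2 contains only treated individuals.
x = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])
d = np.array([1, 0, 0, 1, 1, 1, 0, 1, 1, 1])

propensity, ok = check_overlap(d, x, eps=0.05)
```

In this data set the stratum `x == 2` has a treated share of 1, so `ok` is `False`: there is no control individual from which $\mu^{(0)}(2)$ could be estimated.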