Influence Functions #
The regression methods only compare the outcomes across groups, while the classification methods focus on the treatment assignments. The two approaches thus emphasize different sources of information in the potential outcome model, and one may hope to combine the outcome and treatment information to improve estimation efficiency.
First let us introduce the (uncentered) influence functions given by
$$\begin{align*}&\psi_1(y,t,x)\\=&\phi_1(y,t,x)-\left(\frac{t}{p(x)}-1\right)\cdot \mu^{(1)}(x)\\=&\frac{t}{p(x)}\cdot y-\left(\frac{t}{p(x)}-1\right)\cdot \mathbb{E}[Y^{(1)}|X=x]\end{align*}$$ and, similarly, $$\begin{align*}&\psi_0(y,t,x)\\=&\phi_0(y,t,x)-\left(\frac{1-t}{1-p(x)}-1\right)\cdot \mu^{(0)}(x)\\=&\frac{1-t}{1-p(x)}\cdot y-\left(\frac{1-t}{1-p(x)}-1\right)\cdot \mathbb{E}[Y^{(0)}|X=x].\end{align*}$$
The functions $\phi_1$ and $\phi_0$ are the same as in the classification methods, and the correction terms involve the group-wise regression functions $\mu^{(1)}$ and $\mu^{(0)}$.
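A key property of these correction terms, used in the verification below, is that they have conditional mean zero: since $\mathbb{E}[D|X=x]=p(x)$ by the definition of the propensity score, $$\begin{align*}&\mathbb{E}\left[\left(\frac{D}{p(x)}-1\right)\cdot \mu^{(1)}(x)\,\middle|\, X=x\right]\\=&\left(\frac{p(x)}{p(x)}-1\right)\cdot \mu^{(1)}(x)=0,\end{align*}$$ and analogously for the correction term of $\psi_0$. Subtracting these terms therefore leaves the conditional expectations unchanged.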
Since the correction terms have conditional mean zero, we can verify that, under the unconfoundedness assumption, $$\begin{align*}&\mathbb{E}[\psi_1(Y,D,X)|X=x]\\=&\mathbb{E}[\phi_1(Y,D,X)|X=x]\\=&\mu^{(1)}(x)\end{align*}$$ and, similarly, $$\begin{align*}&\mathbb{E}[\psi_0(Y,D,X)|X=x]\\=&\mathbb{E}[\phi_0(Y,D,X)|X=x]\\=&\mu^{(0)}(x).\end{align*}$$ By the same arguments as for the classification methods, the CATE now equals $$\begin{align*}\tau(x)=&\mathbb{E}[\psi_1(Y,D,X)|X=x]\\&-\mathbb{E}[\psi_0(Y,D,X)|X=x]\end{align*}$$ and the ATE equals $$\begin{align*}\tau=&\mathbb{E}[\psi_1(Y,D,X)]\\&-\mathbb{E}[\psi_0(Y,D,X)].\end{align*}$$
Estimating the Influence Functions and the ATE #
Now we can combine the machine learning estimators of
- the regression functions $\widehat{\mu}^{(1)}(x)$ and $\widehat{\mu}^{(0)}(x)$; and
- the conditional propensity scores $\widehat{p}(x)\in (0,1)$
under the overlap condition to construct the sample influence functions $$\begin{align*}&\widehat{\psi}_1(y,t,x)\\=&\frac{t}{\widehat{p}(x)}\cdot y-\left(\frac{t}{\widehat{p}(x)}-1\right)\cdot \widehat{\mu}^{(1)}(x)\end{align*}$$ and $$\begin{align*}&\widehat{\psi}_0(y,t,x)\\=&\frac{1-t}{1-\widehat{p}(x)}\cdot y-\left(\frac{1-t}{1-\widehat{p}(x)}-1\right)\cdot \widehat{\mu}^{(0)}(x).\end{align*}$$
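To make this concrete, here is a minimal sketch (not from the source) of constructing the sample influence functions, assuming numpy arrays `Y`, `D`, `X` and using off-the-shelf scikit-learn learners as stand-ins for whichever machine learning estimators one prefers. The clipping threshold `eps` is an illustrative way to keep $\widehat{p}(x)\in(0,1)$, and for brevity the sketch fits and evaluates on the same sample (in practice, cross-fitting is common).

```python
# A minimal sketch: sample influence functions psi_hat_1 and psi_hat_0.
# The learners below are illustrative stand-ins; any regression /
# classification method with the same interface would do.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def fit_nuisances(Y, D, X):
    """Fit mu_hat^(1), mu_hat^(0), and the propensity score p_hat."""
    mu1 = RandomForestRegressor().fit(X[D == 1], Y[D == 1])  # E[Y | X, D=1]
    mu0 = RandomForestRegressor().fit(X[D == 0], Y[D == 0])  # E[Y | X, D=0]
    ps = RandomForestClassifier().fit(X, D)                  # P(D=1 | X)
    return mu1, mu0, ps

def influence_functions(Y, D, X, mu1, mu0, ps, eps=0.01):
    """Evaluate the sample influence functions at each observation."""
    # Clip p_hat away from 0 and 1, reflecting the overlap condition.
    p_hat = np.clip(ps.predict_proba(X)[:, 1], eps, 1 - eps)
    mu1_hat, mu0_hat = mu1.predict(X), mu0.predict(X)
    psi1 = D / p_hat * Y - (D / p_hat - 1) * mu1_hat
    psi0 = (1 - D) / (1 - p_hat) * Y - ((1 - D) / (1 - p_hat) - 1) * mu0_hat
    return psi1, psi0
```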
Then we can estimate the ATE via the sample analogue $$\begin{align*}\widehat{\tau}=&\frac{1}{n}\sum_{i=1}^{n}\widehat{\psi}_1(Y_i,D_i,X_i)\\&-\frac{1}{n}\sum_{i=1}^{n}\widehat{\psi}_0(Y_i,D_i,X_i).\end{align*}$$
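Continuing the hypothetical sketch above, the ATE estimate is the sample mean of $\widehat{\psi}_1-\widehat{\psi}_0$, and the empirical standard deviation of the same quantity gives one common plug-in standard error for inference:

```python
# ATE point estimate and a plug-in standard error from the
# influence-function representation (reuses the helpers above).
mu1, mu0, ps = fit_nuisances(Y, D, X)
psi1, psi0 = influence_functions(Y, D, X, mu1, mu0, ps)

diff = psi1 - psi0
tau_hat = diff.mean()                           # sample-analogue ATE
se_hat = diff.std(ddof=1) / np.sqrt(len(diff))  # plug-in standard error
print(f"ATE estimate: {tau_hat:.3f} (SE {se_hat:.3f})")
```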
We refer to Farrell et al. (2021) for an application using deep neural networks.