Suppose the true model in which we are interested is given by \(Y = \textbf{X}\beta + \epsilon\) with all of our usual assumptions about linearity. We say that there is endogeneity in the linear model if \(\beta\) is our parameter of interest and we have \(\mathbb{E}\left[\epsilon \,|\, \textbf{X} \right] \not = 0\).
Recall how we showed that \(\hat{\beta}\) is an unbiased estimator for \(\beta\):
\[\begin{equation*} \begin{aligned} \mathbb{E}\left[\hat{\beta}- \beta \,|\, \textbf{X} \right] & = \mathbb{E}\left[ (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}Y- \beta \,|\, \textbf{X} \right]\\ & = \mathbb{E}\left[ (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} (\textbf{X}\beta + \epsilon) - \beta \,|\, \textbf{X} \right]\\ & = \mathbb{E}\left[ (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \textbf{X}\beta + (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \epsilon - \beta \,|\, \textbf{X} \right]\\ & = \mathbb{E}\left[\beta + (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \epsilon - \beta \,|\, \textbf{X} \right]\\ & = \mathbb{E}\left[(\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \epsilon \,|\, \textbf{X} \right]\\ & = (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \mathbb{E}\left[\epsilon \,|\, \textbf{X} \right] \end{aligned} \end{equation*}\]
If we have endogeneity, or \(\mathbb{E}\left[\epsilon \,|\, \textbf{X} \right] \not = 0\), then \(\mathbb{E}\left[\hat{\beta}- \beta \,|\, \textbf{X}\right] \not = 0\) and \(\hat{\beta}\) is biased.
Generally, we categorize three sources of endogeneity:
Omitted Variables
Measurement Error
Simultaneity/Reverse Causality
Suppose the true model is given by \(Y = \textbf{X}\beta + \textbf{Z}\theta + \epsilon\), but we only estimate \(Y = \textbf{X}\gamma + \nu\). In this context, \(\textbf{Z}\) is an omitted variable because it is left out of the regression we actually run. Then, even if \(\mathbb{E}\left[\epsilon \,|\, \textbf{X},\, \textbf{Z}\right] = 0\), \(\hat{\gamma}\) will generally still be biased (unless \(\theta = 0\) or \(\textbf{X}\) and \(\textbf{Z}\) are orthogonal).
\[\begin{equation*} \begin{aligned} \mathbb{E}\left[\hat{\gamma} - \beta \,|\, \textbf{X},\, \textbf{Z} \right] & = \mathbb{E}\left[(\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}Y- \beta \,|\, \textbf{X},\, \textbf{Z} \right]\\ & = \mathbb{E}\left[(\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} (\textbf{X}\beta + \textbf{Z}\theta + \epsilon) - \beta \,|\, \textbf{X},\, \textbf{Z} \right]\\ & = \mathbb{E}\left[(\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}\textbf{X}\beta + (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}\textbf{Z}\theta + (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}\epsilon - \beta \,|\, \textbf{X},\, \textbf{Z} \right]\\ & = \mathbb{E}\left[\beta + (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}\textbf{Z}\theta + (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}\epsilon - \beta \,|\, \textbf{X}, \, \textbf{Z} \right]\\ & = \underbrace{(\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}\textbf{Z}\theta}_{\not = \, 0} + \underbrace{\mathbb{E}\left[(\textbf{X}^{\top}\textbf{X})^{-1} \textbf{X}^{\top}\epsilon \,|\, \textbf{X},\, \textbf{Z}\right]}_{= \, 0}\\ & \not = 0 \end{aligned} \end{equation*}\]
(A fuller breakdown of the multivariate case by Pischke can be found here)
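To see the omitted-variable result in action, here is a minimal simulated sketch (the variable names and parameter values are illustrative, not part of the original notes): when the included regressor is correlated with the omitted one, the short regression pulls the slope away from the truth.
## Sketch: omitted-variable bias in simulation
set.seed(123)
n_ovb  <- 5000
Z_omit <- rnorm(n_ovb)
X_ovb  <- 0.6 * Z_omit + rnorm(n_ovb)               ## X is correlated with the omitted Z
Y_ovb  <- 1 + 2 * X_ovb + 3 * Z_omit + rnorm(n_ovb) ## true slope on X is 2
coef(lm(Y_ovb ~ X_ovb))                             ## short regression: slope well above 2
coef(lm(Y_ovb ~ X_ovb + Z_omit))                    ## long regression: slope close to 2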
Suppose the true model is given by \(Y_i = \beta_0 + X_i\beta_1 + \epsilon_i\). However, problems with measurement error arise when at least one of \(Y_i\) or \(X_i\) is measured incorrectly. In other words, suppose that we observe \(Y_{i}^{*} = Y_i + \varepsilon_{Yi}\) or \(X_{i}^{*} = X_i + \varepsilon_{Xi}\), where \(\varepsilon\) characterizes the noise associated with observation \(i\).
In the case where \(Y_i\) is measured incorrectly, the model we actually estimate is \(Y_{i}^{*} = \gamma_0 + X_i\gamma_1 + \nu_i\). We showed in class that, in such a case, \(\hat{\gamma}_1 = \beta_1 + \frac{\text{Cov}(X,\varepsilon_{Yi})}{\mathbb{V}(X)}\). Then, \(\hat{\gamma}_1\) is an unbiased estimator for \(\beta_1\) if \(\text{Cov}(X, \varepsilon_{Yi}) = 0\), that is, if \(X\) is uncorrelated with the measurement error in \(Y_i\).
In the case where \(X_i\) is measured incorrectly, the model we actually estimate is \(Y_i = \gamma_0 + X_{i}^{*}\gamma_1 + \nu_i\). If \(\varepsilon_{Xi} \perp \!\!\! \perp X\), we can still characterize the bias; if not, we directly violate exogeneity. For the independent case, we showed in class that \(\hat{\gamma}_1 = \beta_1 \frac{\mathbb{V}(X)}{\mathbb{V}(X) + \mathbb{V}(\varepsilon_{Xi})}\), so our estimate is shrunk toward zero relative to the true \(\beta_1\). This is known as attenuation bias, and while we might not like it, it may or may not be a serious problem in practice.
Finally, in the case where \(Y_i\) and \(X_i\) are measured incorrectly, both issues arise.
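As a quick illustration of attenuation bias, here is a minimal simulated sketch (names and values are illustrative, not part of the original notes): with \(\mathbb{V}(X) = \mathbb{V}(\varepsilon_{Xi}) = 1\), the shrinkage factor is \(1/2\), so the estimated slope should land near half of the true \(\beta_1\).
## Sketch: classical measurement error in X and the resulting attenuation
set.seed(123)
n_me   <- 5000
X_true <- rnorm(n_me)                        ## true regressor, variance 1
Y_me   <- 0.3 + 0.5 * X_true + rnorm(n_me)   ## true slope is 0.5
X_star <- X_true + rnorm(n_me)               ## observed X with noise of variance 1
coef(lm(Y_me ~ X_true))   ## slope near 0.5
coef(lm(Y_me ~ X_star))   ## slope shrunk toward 0.25 = 0.5 * 1/(1 + 1)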
Suppose the true model is given by \(Y_i = \beta_0 + X_i \beta_1 + \epsilon_i\), that we satisfy our Gauss-Markov assumptions, and that we correctly specify our estimator and recover a statistically significant \(\hat{\beta}_1\) close to the true \(\beta_1\). We claim, “Aha! A unit change in \(X\) causes a \(\hat{\beta}_1\) change in \(Y\)!” Beyond the caveat that correlation does not equal causation, why might this be problematic?
This can be a problem if we fail to specify a direction for the causal mechanism. In other words, through what means might a unit change in \(X\) cause a \(\hat{\beta}_1\) change (on average) in \(Y\)? Are we sure the real quantity of interest isn’t \(\gamma_1\) from the reverse regression of \(X\) on \(Y\), which would imply that a unit change in \(Y\) causes a \(\gamma_1\) change in \(X\)?
This is a contrived example of reverse causation, in which we misspecify the model by getting the direction of the relationship of interest backwards.
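For intuition, here is a minimal simulated sketch (names and values are illustrative, not part of the original notes) in which the true mechanism runs from \(Y\) to \(X\): a naive regression of \(Y\) on \(X\) still returns a precisely estimated slope, even though \(X\) has no causal effect on \(Y\).
## Sketch: the true mechanism runs from Y to X
set.seed(123)
n_rc <- 5000
Y_rc <- rnorm(n_rc)                 ## Y is generated without any influence from X
X_rc <- 0.7 * Y_rc + rnorm(n_rc)    ## X is caused by Y
summary(lm(Y_rc ~ X_rc))$coef       ## the slope on X is highly "significant" anyway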
Given endogeneity, one (possible) solution is to use some instrumental variable \(Z\) that purges the endogeneity from our regressor \(X\). Suppose \(Y = \alpha + \beta X + \gamma U + \nu\), where \(U\) is unobserved. Then \(\text{Cov}(Y,Z) = \text{Cov}(\alpha + \beta X + \gamma U + \nu,\, Z) = \beta\,\text{Cov}(X,Z) + \gamma\, \text{Cov}(U,Z) + \text{Cov}(\nu, Z)\). We commonly say that \(Z\) is subject to two restrictions:
Relevance (or “strong instrument”): \(\text{Cov}(X,Z) \not= 0\), or that \(Z\) actually does affect \(X\).
Exclusion: \(\text{Cov}(U, Z) = \text{Cov}(\nu, Z) = 0\), or that \(Z\) is related to \(Y\) only through \(X\).
Under these two restrictions, \(\text{Cov}(Y,Z) = \beta\,\text{Cov}(X,Z)\), so in simple linear regression, where \(X\) is \(n \times 1\), we have \(\beta = \frac{\text{Cov}(Y,Z)}{\text{Cov}(X,Z)}\). In terms of estimation, we can use the plug-in estimator \[\hat{\beta}_{\text{IV}} = \frac{\widehat{\text{Cov}}(Y,Z)}{\widehat{\text{Cov}}(X,Z)} = \frac{\widehat{\text{Cov}}(Y,Z)\,/\,\widehat{\mathbb{V}}(Z)}{\widehat{\text{Cov}}(X,Z)\,/\,\widehat{\mathbb{V}}(Z)}\]
Recall that this is the ratio of two regression slopes: the numerator is the slope from regressing \(Y\) on \(Z\), and the denominator is the slope from regressing \(X\) on \(Z\).
In the more general case, we leverage the projection matrix for \(\textbf{Z}\) to purge endogeneity from the multivariate \(\textbf{X}\). Let \(Y = \textbf{X}\beta + \epsilon\), where \(\mathbb{E}\left[\epsilon\right] = \textbf{0}\) and \(\textbf{X}\) is \(N \times (K+1)\). Under endogeneity, we have \(\mathbb{E}\left[\epsilon \,|\, \textbf{X}\right] \not = 0\), so we leverage instrumental variable(s) \(\textbf{Z}\), an \(N \times L\) matrix where \(L \geq K + 1\) is the number of instruments. We need:
Exogeneity over \(\textbf{Z}\), or that \(\mathbb{E}\left[\epsilon \,|\, \textbf{Z}\right] = 0\)
Exclusion
Rank condition: \(\textbf{Z}^{\top}\textbf{X}\) and \(\textbf{Z}^{\top}\textbf{Z}\) are of full rank.
If we satisfy these conditions, we proceed. Stage 1 of two-stage least squares (2SLS) requires that we regress \(\textbf{X}\) on \(\textbf{Z}\) and obtain fitted values \(\hat{\textbf{X}}\). In matrix notation, we have the Projection Matrix \(\textbf{P}_{\textbf{Z}} = \textbf{Z}(\textbf{Z}^{\top}\textbf{Z})^{-1}\textbf{Z}^{\top}\) to do this for us, which results in \(\hat{\textbf{X}} = \textbf{P}_{\textbf{Z}}\textbf{X}\).
In Stage 2, we regress \(Y\) on \(\hat{\textbf{X}}\), which is mechanically just ordinary OLS. Following the exact same derivation as for OLS, it can be shown that \(\hat{\beta}_{\text{2SLS}} = (\hat{\textbf{X}}^{\top}\hat{\textbf{X}})^{-1} \hat{\textbf{X}}^{\top}Y\).
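Because \(\textbf{P}_{\textbf{Z}}\) is symmetric and idempotent (\(\textbf{P}_{\textbf{Z}}^{\top} = \textbf{P}_{\textbf{Z}}\) and \(\textbf{P}_{\textbf{Z}}\textbf{P}_{\textbf{Z}} = \textbf{P}_{\textbf{Z}}\)), the estimator can also be written entirely in terms of the original \(\textbf{X}\), which is the form used in the matrix code further below: \[\hat{\beta}_{\text{2SLS}} = (\hat{\textbf{X}}^{\top}\hat{\textbf{X}})^{-1} \hat{\textbf{X}}^{\top}Y = (\textbf{X}^{\top}\textbf{P}_{\textbf{Z}}^{\top}\textbf{P}_{\textbf{Z}}\textbf{X})^{-1}\textbf{X}^{\top}\textbf{P}_{\textbf{Z}}^{\top}Y = (\textbf{X}^{\top}\textbf{P}_{\textbf{Z}}\textbf{X})^{-1}\textbf{X}^{\top}\textbf{P}_{\textbf{Z}}Y\]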
(This section was taken directly from Jin Kim’s Lab exercise from Spring 2021)
We will focus our attention on a simulated data exercise. The appeal of doing this is that we know the true data-generating process, so we can actually assess how well (or how badly) OLS and IV perform.
Consider the following two equations: \[ Y = 0.3 + 0.5 X + u\]
and
\[ X = 0.9 + Z + v\]
where \(Z \sim \mathcal{N}(2, 1)\) and \((u, v)\) is distributed multivariate normal with means 0, variances 1, and correlation 0.8.
Because \(X\) contains the error term \(v\), and \(v\) is correlated with \(u\), \(X\) is by construction correlated with \(u\), which (as you may recall) is exactly what we assume away in vanilla OLS.
As always when we do anything random (like simulate data), we will set our seed.
## For replicability purposes:
set.seed(1986)
## Sample Size
n <- 3000
## Let's generate Z
Z <- rnorm(n, 2, 1)
head(Z)
## [1] 1.953749 2.280001 2.253171 1.035889 2.492227 1.301254
## Let's generate the error terms:
require('mvtnorm')
## Variance covariance Matrix for the errors: u and v
sigma <- diag(2)
sigma[2,1] <- sigma[1,2] <- 0.8
sigma
## [,1] [,2]
## [1,] 1.0 0.8
## [2,] 0.8 1.0
## Let's generate the errors:
Errors <- rmvnorm(n, mean= c(0,0), sigma=sigma)
head(Errors)
## [,1] [,2]
## [1,] 1.7641519 2.2511067
## [2,] 1.8354363 0.8997512
## [3,] -1.2042891 -0.6476522
## [4,] -0.2824119 -0.3789279
## [5,] -0.1687027 -0.3485712
## [6,] 0.1872254 0.4863826
## Checking to make sure the errors are correlated correctly
cor(Errors)
## [,1] [,2]
## [1,] 1.0000000 0.7958107
## [2,] 0.7958107 1.0000000
u <- Errors[,1]
v <- Errors[,2]
## Let's generate X and Y
X <- 0.9 + Z + v
Y <- 0.3 + 0.5*X + u
## We can easily check that Z and u are orthogonal
cor(Z,u)
## [1] -0.0155411
## Similarly, we can see that X by construction is
## Endogenous
cor(X,u)
## [1] 0.5484201
Fit a regular (naive, i.e., not taking endogeneity into account) OLS model of Y ~ X. How well does it do?
## OLS estimates: awfully biased
summary(lm(Y ~ X))
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4133 -0.5620 0.0178 0.5436 2.7395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.81471 0.03505 -23.24 <2e-16 ***
## X 0.88683 0.01077 82.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8282 on 2998 degrees of freedom
## Multiple R-squared: 0.6933, Adjusted R-squared: 0.6932
## F-statistic: 6778 on 1 and 2998 DF, p-value: < 2.2e-16
As we can see, the naive model is far from recovering the true intercept (0.3) and slope (0.5).
Let’s try to get better estimates of the coefficients.
Use what we talked about in class and fit two separate regressions on \(Z\).
First, calculate our estimate of the true \(\beta_1\) using the covariances. Then, recalculate it using the \(\hat{\beta}_1\)’s from the two different regressions.
Why does this ratio of coefficients work? (Think about what a coefficient is a ratio of.)
## Ratio of covariances, cov(Y,Z)/cov(X,Z), written as the slope of Y ~ Z divided by the slope of X ~ Z
cov(Y,Z)/var(Z)*var(Z)/cov(X, Z)
## [1] 0.4845842
## Running 2 regressions:
fit_y_z <- lm(Y ~ Z)
fit_x_z <- lm(X ~ Z)
## Recovering the truth?
beta1 <- fit_y_z$coef[2]/fit_x_z$coef[2]
beta1
## Z
## 0.4845842
beta0 <- fit_y_z$coef[1] - beta1*fit_x_z$coef[1]
beta0
## (Intercept)
## 0.3661613
That’s a lot closer. Now, what about 2SLS? First fit a regression of \(X \sim Z\), get the fitted values \(\hat{X}\) from that regression, and then regress \(Y \sim \hat{X}\).
## Let's project X onto Z - first stage
first_stage <- lm(X ~ Z)
x_hat <- predict(first_stage)
## Let's run the regression of Y on x_hat
summary(lm(Y ~ x_hat))
##
## Call:
## lm(formula = Y ~ x_hat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9374 -0.9726 0.0040 0.9612 4.9025
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.36616 0.08026 4.562 5.26e-06 ***
## x_hat 0.48458 0.02588 18.722 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.415 on 2998 degrees of freedom
## Multiple R-squared: 0.1047, Adjusted R-squared: 0.1044
## F-statistic: 350.5 on 1 and 2998 DF, p-value: < 2.2e-16
Both methods recover the same point estimates, which are closer to the true parameters.
We can also try this using matrix algebra. Recreate \(\widehat{\beta}_{2SLS}\) using the matrices. NOTE: You will have to turn both \(Z\) and \(X\) into design matrices (cbind(1, X) and cbind(1, Z)) because the \(\mathbf{X}\) and \(\mathbf{Z}\) on slide 23 represent the design matrices of the two regressions (one in each stage).
X_mat <- cbind(1, X)
Z_mat <- cbind(1, Z)
P_z <- Z_mat %*% solve(t(Z_mat) %*% Z_mat) %*% t(Z_mat)
betas_2SLS <- solve(t(X_mat) %*% P_z %*%
X_mat) %*% (t(X_mat) %*% P_z %*% Y)
betas_2SLS
## [,1]
## 0.3661613
## X 0.4845842
Let’s compare these results with the canned function for 2SLS:
## This is one of the many packages that have a function to estimate 2SLS
library(AER)
## Let's run the model
summary(ivreg(Y ~ X | Z))
##
## Call:
## ivreg(formula = Y ~ X | Z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.645080 -0.686737 0.004101 0.703585 3.417332
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.36616 0.05686 6.44 1.39e-10 ***
## X 0.48458 0.01834 26.43 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.002 on 2998 degrees of freedom
## Multiple R-Squared: 0.5507, Adjusted R-squared: 0.5505
## Wald test: 698.4 on 1 and 2998 DF, p-value: < 2.2e-16
Note that these standard errors are quite different from the ones reported by the second-stage lm() when we ran 2SLS by hand. As noted in class, if we calculate 2SLS by hand we need to adjust the standard errors ourselves. Refer to slides pp. 14-16.
## First calculate sigma squared hat, using residuals built from the original X (not x_hat)
sigma2_hat <- sum((Y - X_mat %*% betas_2SLS)^2) /
(n - length(betas_2SLS))
## Now the standard errors
se_2SLS <- sqrt(diag(sigma2_hat * solve(t(X_mat) %*% P_z %*% X_mat)))
cbind(se_2SLS, summary(ivreg(Y ~ X | Z))$coef[,2])
## se_2SLS
## 0.05685624 0.05685624
## X 0.01833639 0.01833639