Instrumental Variables

Assume a linear model of the following form:

\[\begin{equation*} Y = \textbf{E}\theta + \textbf{W}\gamma + \epsilon \end{equation*}\]

where \(\textbf{E}\) represents an \(N \times R\) matrix of endogenous covariates and \(\textbf{W}\) an \(N \times (C + 1)\) matrix of exogenous covariates (the \(+1\) accounting for the intercept).

Let \(\textbf{X} = \left[\textbf{E} \, \textbf{W}\right]\) be a matrix of size \(N \times (K + 1)\), so that \(R + C + 1 = K + 1\). In other words, \(\textbf{X}\) includes both our endogenous and exogenous covariates. Thus, we have

\[\begin{equation*} \begin{aligned} Y = \textbf{X}\beta + \epsilon \, \text{where} \, \mathbb{E}\left[\epsilon \right] = \textbf{0} \, \text{and} \, \textbf{X} \, \text{is} \, N \times (K + 1) \end{aligned} \end{equation*}\]

This model suffers from endogeneity: \(\mathbb{E}\left[\epsilon \, | \, \textbf{X}\right] \not= \textbf{0}\) because of the endogenous block \(\textbf{E}\).

The matrix of instruments \(\textbf{Z}\) is \(N \times L\), where \(L\) is the number of instruments and \(L \geq R\) (at least as many instruments as endogenous covariates). We need:

  1. Exogeneity: \(\mathbb{E}\left[\epsilon \, | \, \textbf{Z}\right] = 0\)
  2. Exclusion: \(\textbf{Z}\) affects \(Y\) only through \(\textbf{X}\) (the instruments do not appear in the structural equation)
  3. Rank condition: \(\textbf{Z}^{\top}\textbf{X}\) and \(\textbf{Z}^{\top}\textbf{Z}\) are of full rank

If we satisfy these conditions, we proceed. Stage 1 of two-stage least squares (2SLS) requires that we regress \(\textbf{X}\) on \(\textbf{Z}\) and obtain fitted values \(\hat{\textbf{X}}\). In matrix notation, we have the Projection Matrix \(\textbf{P}_{\textbf{Z}} = \textbf{Z}(\textbf{Z}^{\top}\textbf{Z})^{-1}\textbf{Z}^{\top}\) to do this for us, which results in \(\hat{\textbf{X}} = \textbf{P}_{\textbf{Z}}\textbf{X}\).

In Stage 2, we regress \(Y\) on \(\hat{\textbf{X}}\), which is mechanically identical to normal OLS. By the exact same process as normal OLS, it can be shown that \(\hat{\beta}_{\text{2SLS}} = (\hat{\textbf{X}}^{\top}\hat{\textbf{X}})^{-1} \hat{\textbf{X}}^{\top}Y\).
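
Because \(\textbf{P}_{\textbf{Z}}\) is symmetric and idempotent, the same estimator can be written entirely in terms of the original matrices, which is the form used in the matrix-algebra code below:

\[\begin{equation*} \hat{\beta}_{\text{2SLS}} = (\hat{\textbf{X}}^{\top}\hat{\textbf{X}})^{-1} \hat{\textbf{X}}^{\top}Y = (\textbf{X}^{\top}\textbf{P}_{\textbf{Z}}\textbf{X})^{-1}\textbf{X}^{\top}\textbf{P}_{\textbf{Z}}Y \end{equation*}\]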

Manual 2SLS

(This section was taken directly from Jin Kim’s Lab exercise from Spring 2021)

We will focus our attention on a simulated data exercise. The appeal of doing this is that we know the true data generating process. Moreover, we can actually assess how well or poorly OLS and IV perform.

Consider the following two equations: \[ Y = 0.3 + 0.5 X + u\]

and

\[ X = 0.9 + Z + v\]

where \(Z \sim \mathcal{N}(2, 1)\) and \((u, v)\) are distributed multivariate normal with means 0, variances 1, and correlation equal to 0.8.

Because \(X\) contains the error term \(v\), and \(v\) is correlated with \(u\), \(X\) is correlated with \(u\) - which, as you may recall, is exactly what we assume away in vanilla OLS.

As always when we do anything random (like simulate data), we will set our seed.

## For replicability purposes:
set.seed(1986)

## Sample Size
n <- 3000

## Let's generate Z
Z <- rnorm(n, 2, 1)
head(Z)
## [1] 1.953749 2.280001 2.253171 1.035889 2.492227 1.301254
## Let's generate the error terms:
require('mvtnorm')

## Variance covariance Matrix for the errors: u and v
sigma <- diag(2)
sigma[2,1] <- sigma[1,2] <- 0.8
sigma
##      [,1] [,2]
## [1,]  1.0  0.8
## [2,]  0.8  1.0
## Let's generate the errors:
Errors <- rmvnorm(n, mean= c(0,0), sigma=sigma)
head(Errors)
##            [,1]       [,2]
## [1,]  1.7641519  2.2511067
## [2,]  1.8354363  0.8997512
## [3,] -1.2042891 -0.6476522
## [4,] -0.2824119 -0.3789279
## [5,] -0.1687027 -0.3485712
## [6,]  0.1872254  0.4863826
#checking to make sure errors are correlated correctly
cor(Errors)
##           [,1]      [,2]
## [1,] 1.0000000 0.7958107
## [2,] 0.7958107 1.0000000
u <- Errors[,1]
v <- Errors[,2]

## Let's generate X and Y
X <- 0.9 + Z + v
Y <- 0.3 + 0.5*X + u

## We can easily check that Z and u are (approximately) uncorrelated
cor(Z,u)
## [1] -0.0155411
## Similarly, we can see that X by construction is
## Endogenous
cor(X,u)
## [1] 0.5484201

Fit a regular (naive, aka not taking endogeneity into account) OLS model of Y ~ X. How well does it do?

## OLS estimates: awfully biased
summary(lm(Y ~ X))
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4133 -0.5620  0.0178  0.5436  2.7395 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.81471    0.03505  -23.24   <2e-16 ***
## X            0.88683    0.01077   82.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8282 on 2998 degrees of freedom
## Multiple R-squared:  0.6933, Adjusted R-squared:  0.6932 
## F-statistic:  6778 on 1 and 2998 DF,  p-value: < 2.2e-16

As we can see, the estimates are far from the true values of 0.3 (intercept) and 0.5 (slope).

Let’s try to get better estimates of the coefficients.

Use what we talked about in class and fit two separate regressions on \(Z\).

First, calculate our estimate of the true \(\beta_1\) using the covariances. Then, recalculate it using the \(\hat{\beta}_1\)’s from the two different regressions.

Why does this ratio of coefficients work? (Think about what a coefficient is a ratio of.)
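
One way to see it: each of the two slope coefficients is a sample covariance with \(Z\) divided by the sample variance of \(Z\), so the \(\widehat{\text{var}}(Z)\) terms cancel in the ratio:

\[ \frac{\hat{\beta}_1^{\,Y \text{ on } Z}}{\hat{\beta}_1^{\,X \text{ on } Z}} = \frac{\widehat{\text{cov}}(Y, Z) / \widehat{\text{var}}(Z)}{\widehat{\text{cov}}(X, Z) / \widehat{\text{var}}(Z)} = \frac{\widehat{\text{cov}}(Y, Z)}{\widehat{\text{cov}}(X, Z)} \]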

# ratio of covariances
cov(Y,Z)/var(Z)*var(Z)/cov(X, Z)
## [1] 0.4845842
## Running 2 regressions:
fit_y_z <- lm(Y ~ Z)
fit_x_z <- lm(X ~ Z)

## Recovering the truth?
beta1 <- fit_y_z$coef[2]/fit_x_z$coef[2]
beta1
##         Z 
## 0.4845842
beta0 <- fit_y_z$coef[1] - beta1*fit_x_z$coef[1]
beta0
## (Intercept) 
##   0.3661613

That’s a lot closer. Now, what about 2SLS? First fit a regression of \(X \sim Z\), get the fitted values \(\hat{X}\) from that regression, and then regress \(Y \sim \hat{X}\).

## Let's project X onto Z - first stage
first_stage <- lm(X ~ Z)
x_hat <- predict(first_stage)

## Let's run the regression of Y on x_hat
summary(lm(Y ~ x_hat))
## 
## Call:
## lm(formula = Y ~ x_hat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9374 -0.9726  0.0040  0.9612  4.9025 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.36616    0.08026   4.562 5.26e-06 ***
## x_hat        0.48458    0.02588  18.722  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.415 on 2998 degrees of freedom
## Multiple R-squared:  0.1047, Adjusted R-squared:  0.1044 
## F-statistic: 350.5 on 1 and 2998 DF,  p-value: < 2.2e-16

Both methods recover the same point estimates, which are much closer to the true parameters than the naive OLS estimates.

We can also try this using matrix algebra. Recreate \(\widehat{\beta}_{2SLS}\) using the matrices. NOTE: You will have to make both \(Z\) and \(X\) into matrices (cbind(1, X) and cbind(1, Z)) because the \(\mathbf{X}\) and \(\mathbf{Z}\) on slide 23 represent the design matrices of the two regressions (in each stage).

X_mat <- cbind(1, X)
Z_mat <- cbind(1, Z)

P_z <- Z_mat %*% solve(t(Z_mat) %*% Z_mat) %*% t(Z_mat)

betas_2SLS <- solve(t(X_mat) %*% P_z %*%
       X_mat) %*% (t(X_mat) %*% P_z %*% Y)
betas_2SLS
##        [,1]
##   0.3661613
## X 0.4845842

Let’s compare these results with the canned function for 2SLS:

## This is one of the many packages that have a function to estimate 2SLS
library(AER)

## Let's run the model
summary(ivreg(Y  ~ X | Z))
## 
## Call:
## ivreg(formula = Y ~ X | Z)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -3.645080 -0.686737  0.004101  0.703585  3.417332 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.36616    0.05686    6.44 1.39e-10 ***
## X            0.48458    0.01834   26.43  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.002 on 2998 degrees of freedom
## Multiple R-Squared: 0.5507,  Adjusted R-squared: 0.5505 
## Wald test: 698.4 on 1 and 2998 DF,  p-value: < 2.2e-16

Note that the standard errors are quite different from those obtained by running 2SLS by hand (the second-stage lm() above). As noted in class, we need to adjust the standard errors when calculating 2SLS by hand: the residuals should be computed with the original \(X\), not the fitted \(\hat{X}\). Refer to slides p. 14-16.

#first calculate sigma squared hat
sigma2_hat <- sum((Y - X_mat %*% betas_2SLS)^2) / 
  (n - length(betas_2SLS))

#now standard errors
se_2SLS <- sqrt(diag(sigma2_hat * solve(t(X_mat) %*% P_z %*% X_mat)))

cbind(se_2SLS, summary(ivreg(Y  ~ X | Z))$coef[,2])
##      se_2SLS           
##   0.05685624 0.05685624
## X 0.01833639 0.01833639

Inference for the Linear Model

Big picture: we wish to make some claim about a relationship in the world. We can characterize this relationship between outcome \(Y\) and explanatory factor(s) \(X\) in a linear framework, wherein \(\beta\) is the parameter of interest. \(\beta\) itself is unknown, but we use data to estimate some \(\hat{\beta}\) that is informative of the true \(\beta\). In other words, \(\hat{\beta}\) allows us to make inferences about the nature of \(\beta\).

There are two approaches to inference when obtaining OLS parameters:
- Finite-sample inference: relies on the necessary OLS assumptions and the normality of errors
- Large-sample (asymptotic) inference: relies on linearity and the asymptotic properties of OLS

Finite-Sample Inference

Assume a linear model \(Y = \textbf{X}\beta + \epsilon\), where we estimate \(\hat{\beta}= (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}Y\). We might assume that \(\epsilon\) follows a normal distribution, or \(\epsilon \sim \mathcal{N}(\textbf{0}, \, \sigma^2 \textbf{I}_{N})\). Since \(\hat{\beta}- \beta = (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}\epsilon\), we have:

\[\begin{equation*} \begin{aligned} \epsilon & \sim \mathcal{N}(\textbf{0}, \, \sigma^2 \textbf{I}_{N})\\ \implies (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}\epsilon & \sim \mathcal{N}(\textbf{0},\, (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}\sigma^2 \textbf{X} (\textbf{X}^{\top}\textbf{X})^{-1})\\ \implies (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}\epsilon & \sim \mathcal{N}(\textbf{0},\, (\textbf{X}^{\top}\textbf{X})^{-1}\sigma^2)\\ \implies \frac{\hat{\beta}_k - \beta_k}{\text{s.e.}(\hat{\beta}_{k})} & \sim \mathcal{N}(0, \, 1) \end{aligned} \end{equation*}\]

Given that \(\hat{\sigma}^2 = \frac{\hat{\epsilon}^{\top}\hat{\epsilon}}{N - K - 1}\) and \(\hat{\text{s.e.}}(\hat{\beta}_{k}) = \hat{\sigma}\sqrt{[(\textbf{X}^{\top}\textbf{X})^{-1}]_{kk}}\), we showed in class that \(\frac{\hat{\beta}_k - \beta_k}{\hat{\text{s.e.}}(\hat{\beta}_{k})} \sim t_{N - K - 1}\). We can use this distribution to construct confidence intervals around some coefficient of interest \(\beta_k\).
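
As a minimal sketch (reusing the simulated Y, X, and the design matrix X_mat from the 2SLS exercise above), we can reproduce the standard errors that summary(lm()) reports directly from \(\hat{\sigma}^2 (\textbf{X}^{\top}\textbf{X})^{-1}\):

## Reproducing lm()'s standard errors by hand, using Y, X, X_mat, and n from above
fit_ols <- lm(Y ~ X)
sigma2_ols <- sum(residuals(fit_ols)^2) / (n - ncol(X_mat))   # hat sigma^2
se_manual <- sqrt(diag(sigma2_ols * solve(t(X_mat) %*% X_mat)))
cbind(se_manual, summary(fit_ols)$coef[, 2])                  # the two columns should match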

Confidence Intervals, Revisited

Definition — Confidence Interval: a confidence interval is an interval that will contain the true value of interest in some fixed proportion of repeated samples. When estimating \(\beta_k\), we can construct a \(100 \, (1 - \alpha) \%\) confidence interval for \(\beta_k\) defined as \(\left[\hat{\beta}_k \, \pm \, t_{\frac{\alpha}{2}, N - K - 1} \, \times \, \hat{\text{s.e.}}(\hat{\beta}_k)\right]\), where \(t_{\frac{\alpha}{2}, N - K - 1}\) is the \(1 - \frac{\alpha}{2}\) quantile of the \(t_{N - K - 1}\) distribution.

Some notes:

  1. The probability that the true value of interest is within a particular confidence interval is either 0 or 1
  2. Confidence intervals are random, while the truth is fixed
  3. Interpretation: over repeated samples, the probability that the confidence interval contains the true value of interest is exactly \((1 - \alpha) \times 100\%\)
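
To make the mechanics concrete, here is a minimal sketch (using the betas_2SLS, se_2SLS, and n objects computed by hand above) that builds 95% confidence intervals for the 2SLS coefficients and compares them to the canned confint(); with \(N = 3000\) the \(t\) and normal critical values are nearly identical, so the two should agree closely, and both intervals should contain the true values 0.3 and 0.5:

## 95% confidence intervals for the 2SLS coefficients, by hand
alpha <- 0.05
t_crit <- qt(1 - alpha/2, df = n - length(betas_2SLS))
b_hat <- drop(betas_2SLS)   # coerce the 2 x 1 matrix to a plain vector
cbind(lower = b_hat - t_crit * se_2SLS,
      upper = b_hat + t_crit * se_2SLS)

## Canned equivalent, via confint() on the ivreg fit
confint(ivreg(Y ~ X | Z), level = 0.95)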

Hypothesis Testing

All of statistical hypothesis testing is a thought experiment: given some claim or conjecture, how might we disprove or falsify it?

In practice, this consists of four steps:
1) Generating a null hypothesis (commonly annotated \(H_0\)) and its complementary alternative hypothesis (commonly annotated \(H_A\) or \(H_1\))
2) Calculating an appropriate test statistic
3) Using the test statistic to calculate a probability called a p-value
4) Using the p-value to decide whether or not to reject the null hypothesis

Definition — Alternative Hypothesis: the conjecture or statement about the population parameter of interest that we seek to affirm or empirically support. In the OLS context, this is usually a claim about \(\beta_k\) (e.g., \(\beta_k \neq 0\)), for which the estimate \(\hat{\beta}_k\) provides evidence.

Definition — Null Hypothesis: a conjecture or statement about the population parameter of interest, typically constructed by assuming that the claim we are interested in is wrong. In other words, it is the benchmark we attempt to reject in order to support the alternative hypothesis. In the OLS context, this is usually given by a hypothesized value \(\beta_{k}^{0}\) (often \(\beta_{k}^{0} = 0\)).

The approach commonly used is a probabilistic proof by contradiction: assuming that the null hypothesis is true, under what circumstances might our estimate be inconsistent with that truth? In other words, how far must our estimate be from the null value before we doubt the validity of the null?

The test statistic for such a hypothesis test, for a hypothesized value \(\beta_{k}^{0}\), is given by \[\begin{equation*} t_0 = \frac{\hat{\beta}_k - \beta_{k}^{0}}{\hat{\text{s.e.}}(\hat{\beta}_k)} \end{equation*}\]

In turn, this statistic is evaluated with its relevant \(t\)-distribution to construct a p-value, which we use to decide whether or not to reject the null hypothesis.
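
As a minimal sketch (testing the conventional null \(H_0: \beta_1 = 0\) for the 2SLS slope, again reusing betas_2SLS, se_2SLS, and n from above):

## t statistic and two-sided p-value for H_0: beta_1 = 0 (2SLS slope from above)
beta_null <- 0
t0 <- (drop(betas_2SLS)[2] - beta_null) / se_2SLS[2]
df_resid <- n - length(betas_2SLS)
p_value <- 2 * pt(-abs(t0), df = df_resid)   # area in both tails of the t distribution
c(t0 = unname(t0), p_value = unname(p_value))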

Definition — p-value: the p-value of a statistical test is, over repeated samples and assuming the null hypothesis is true, the probability of observing a test statistic at least as extreme as our observed \(t_0\). The threshold \(\alpha\) is the level below which we reject the null hypothesis. Common conventions treat \(p > 0.10\) as not statistically significant and \(p \leq 0.05\) as statistically significant. The substantive meaning of these thresholds is admittedly arbitrary, but useful within the enterprise of social science.

An important distinction: the p-value is not the probability that the null hypothesis is true, or equivalently, not the probability that the alternative hypothesis is false.