Assume a linear model of the following form:
\[\begin{equation*} Y = \textbf{E}\theta + \textbf{W}\gamma + \epsilon \end{equation*}\]
where \(\textbf{E}\) represents an \(N \times R\) matrix of endogenous covariates and \(\textbf{W}\) an \(N \times (C + 1)\) matrix of exogenous covariates (its \(C\) covariates plus a column of ones for the intercept).
Let \(\textbf{X} = \left[\textbf{E} \, \textbf{W}\right]\) be a matrix of size \(N \times (K + 1)\). This means that \(R + C + 1 = K + 1\). In other words, \(\textbf{X}\) includes both our endogenous and exogenous covariates. Thus, we have
\[\begin{equation*} \begin{aligned} Y = \textbf{X}\beta + \epsilon, \quad \text{where} \quad \mathbb{E}\left[\epsilon \right] = \textbf{0} \quad \text{and} \quad \textbf{X} \, \text{is} \, N \times (K + 1) \end{aligned} \end{equation*}\]
This model is endogenous: \(\mathbb{E}\left[\epsilon \, | \, \textbf{X}\right] \not= \textbf{0}\) because of the endogenous block \(\textbf{E}\).
The matrix of instruments \(\textbf{Z}\) we use is \(N \times L\), with \(L\) the number of instruments such that \(L \geq R\). We need:
- Relevance: \(\textbf{Z}\) is correlated with the endogenous covariates \(\textbf{E}\)
- Exogeneity (exclusion): \(\mathbb{E}\left[\epsilon \, | \, \textbf{Z}\right] = \textbf{0}\), so the instruments affect \(Y\) only through \(\textbf{X}\)
If we satisfy these conditions, we proceed. Stage 1 of
two-stage least squares (2SLS) requires that we regress \(\textbf{X}\) on \(\textbf{Z}\) and obtain fitted values \(\hat{\textbf{X}}\). In matrix notation, we
have the Projection Matrix \(\textbf{P}_{\textbf{Z}} =
\textbf{Z}(\textbf{Z}^{\top}\textbf{Z})^{-1}\textbf{Z}^{\top}\)
to do this for us, which results in \(\hat{\textbf{X}} =
\textbf{P}_{\textbf{Z}}\textbf{X}\).
In Stage 2, we regress \(Y\) on \(\hat{\textbf{X}}\), which is mechanically
identical to normal OLS. By the exact same process as normal OLS, it can
be shown that \(\hat{\beta}_{\text{2SLS}} =
(\hat{\textbf{X}}^{\top}\hat{\textbf{X}})^{-1}
\hat{\textbf{X}}^{\top}Y\).
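Since \(\textbf{P}_{\textbf{Z}}\) is symmetric and idempotent (\(\textbf{P}_{\textbf{Z}}^{\top} = \textbf{P}_{\textbf{Z}}\) and \(\textbf{P}_{\textbf{Z}}\textbf{P}_{\textbf{Z}} = \textbf{P}_{\textbf{Z}}\)), the same estimator can be written directly in terms of \(\textbf{X}\), \(\textbf{Z}\), and \(Y\), which is the form we will use in the matrix-algebra code below:
\[\begin{equation*} \begin{aligned} \hat{\beta}_{\text{2SLS}} = (\hat{\textbf{X}}^{\top}\hat{\textbf{X}})^{-1}\hat{\textbf{X}}^{\top}Y = (\textbf{X}^{\top}\textbf{P}_{\textbf{Z}}^{\top}\textbf{P}_{\textbf{Z}}\textbf{X})^{-1}\textbf{X}^{\top}\textbf{P}_{\textbf{Z}}^{\top}Y = (\textbf{X}^{\top}\textbf{P}_{\textbf{Z}}\textbf{X})^{-1}\textbf{X}^{\top}\textbf{P}_{\textbf{Z}}Y \end{aligned} \end{equation*}\]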
(This section was taken directly from Jin Kim’s Lab exercise from Spring 2021)
We will focus our attention on a simulated data exercise. The appeal of doing this is that we know the true data generating process, so we can actually assess how well or poorly OLS and IV perform.
Consider the following two equations: \[ Y = 0.3 + 0.5 X + u\]
and
\[ X = 0.9 + Z + v\]
where \(Z \sim \mathcal{N}(2, 1)\) and \((u, v)\) are distributed multivariate normal with means 0, variances 1, and correlation 0.8.
Because \(X\) contains the error term \(v\), and \(v\) is correlated with \(u\), \(X\) is correlated with \(u\) - which, as you may recall, is exactly what we assume away in vanilla OLS.
As always when we do anything random (like simulate data), we will set our seed.
## For replicability purposes:
set.seed(1986)
## Sample Size
n <- 3000
## Let's generate Z
Z <- rnorm(n, 2, 1)
head(Z)
## [1] 1.953749 2.280001 2.253171 1.035889 2.492227 1.301254
## Let's generate the error terms:
require('mvtnorm')
## Variance covariance Matrix for the errors: u and v
sigma <- diag(2)
sigma[2,1] <- sigma[1,2] <- 0.8
sigma
## [,1] [,2]
## [1,] 1.0 0.8
## [2,] 0.8 1.0
## Let's generate the errors:
Errors <- rmvnorm(n, mean= c(0,0), sigma=sigma)
head(Errors)
## [,1] [,2]
## [1,] 1.7641519 2.2511067
## [2,] 1.8354363 0.8997512
## [3,] -1.2042891 -0.6476522
## [4,] -0.2824119 -0.3789279
## [5,] -0.1687027 -0.3485712
## [6,] 0.1872254 0.4863826
#checking to make sure errors are correlated correctly
cor(Errors)
## [,1] [,2]
## [1,] 1.0000000 0.7958107
## [2,] 0.7958107 1.0000000
u <- Errors[,1]
v <- Errors[,2]
## Let's generate X and Y
X <- 0.9 + Z + v
Y <- 0.3 + 0.5*X + u
## We can check that Z and u are (approximately) uncorrelated
cor(Z,u)
## [1] -0.0155411
## Similarly, we can see that X by construction is
## Endogenous
cor(X,u)
## [1] 0.5484201
Fit a regular (naive, aka not taking endogeneity into account) OLS model of Y ~ X. How well does it do?
## OLS estimates: awfully biased
summary(lm(Y ~ X))
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4133 -0.5620 0.0178 0.5436 2.7395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.81471 0.03505 -23.24 <2e-16 ***
## X 0.88683 0.01077 82.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8282 on 2998 degrees of freedom
## Multiple R-squared: 0.6933, Adjusted R-squared: 0.6932
## F-statistic: 6778 on 1 and 2998 DF, p-value: < 2.2e-16
As we can see, the estimates are far from the true values of 0.3 (intercept) and 0.5 (slope).
Let’s try to get better estimates of the coefficients.
Use what we talked about in class and fit two separate regressions on \(Z\).
First, calculate our estimate of the true \(\beta_1\) using the covariances. Then, recalculate it using the \(\hat{\beta}_1\)’s from the two different regressions.
Why does this ratio of coefficients work? (Think about what a coefficient is a ratio of.)
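One way to see why the ratio recovers \(\beta_1\): because \(Z\) is uncorrelated with \(u\), taking covariances with \(Z\) on both sides of \(Y = \beta_0 + \beta_1 X + u\) gives
\[\begin{equation*} \begin{aligned} \text{Cov}(Y, Z) & = \beta_1 \text{Cov}(X, Z) + \text{Cov}(u, Z) = \beta_1 \text{Cov}(X, Z)\\ \implies \beta_1 & = \frac{\text{Cov}(Y, Z)}{\text{Cov}(X, Z)} = \frac{\text{Cov}(Y, Z) / \text{Var}(Z)}{\text{Cov}(X, Z) / \text{Var}(Z)}, \end{aligned} \end{equation*}\]
which is the slope from regressing \(Y\) on \(Z\) divided by the slope from regressing \(X\) on \(Z\).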
# ratio of covariances
cov(Y,Z)/var(Z)*var(Z)/cov(X, Z)
## [1] 0.4845842
## Running 2 regressions:
fit_y_z <- lm(Y ~ Z)
fit_x_z <- lm(X ~ Z)
## Recovering the truth?
beta1 <- fit_y_z$coef[2]/fit_x_z$coef[2]
beta1
## Z
## 0.4845842
beta0 <- fit_y_z$coef[1] - beta1*fit_x_z$coef[1]
beta0
## (Intercept)
## 0.3661613
That’s a lot closer. Now, what about 2SLS? First fit a regression of \(X \sim Z\), get the fitted values \(\hat{X}\) from that regression, and then regress \(Y \sim \hat{X}\).
## Let's project X onto Z - first stage
first_stage <- lm(X ~ Z)
x_hat <- predict(first_stage)
## Let's run the regression of Y on x_hat
summary(lm(Y ~ x_hat))
##
## Call:
## lm(formula = Y ~ x_hat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9374 -0.9726 0.0040 0.9612 4.9025
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.36616 0.08026 4.562 5.26e-06 ***
## x_hat 0.48458 0.02588 18.722 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.415 on 2998 degrees of freedom
## Multiple R-squared: 0.1047, Adjusted R-squared: 0.1044
## F-statistic: 350.5 on 1 and 2998 DF, p-value: < 2.2e-16
Both methods recover the same point estimates, which are closer to the true parameters.
We can also try this using matrix algebra. Recreate \(\widehat{\beta}_{2SLS}\) using the matrices. NOTE: You will have to turn both \(X\) and \(Z\) into design matrices (cbind(1, X) and cbind(1, Z)) because the \(\mathbf{X}\) and \(\mathbf{Z}\) on slide 23 represent the design matrices of the two regressions (one in each stage).
X_mat <- cbind(1, X)
Z_mat <- cbind(1, Z)
P_z <- Z_mat %*% solve(t(Z_mat) %*% Z_mat) %*% t(Z_mat)
betas_2SLS <- solve(t(X_mat) %*% P_z %*%
X_mat) %*% (t(X_mat) %*% P_z %*% Y)
betas_2SLS
## [,1]
## 0.3661613
## X 0.4845842
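As a quick sanity check (a minimal sketch using only objects already created above), the second column of \(\textbf{P}_{\textbf{Z}}\textbf{X}\) reproduces the first-stage fitted values from predict():
## Sanity check: projection-matrix fitted values match predict() from stage 1
x_hat_matrix <- (P_z %*% X_mat)[, 2]  ## second column: projection of X onto [1, Z]
all.equal(unname(x_hat_matrix), unname(x_hat))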
Let’s compare these results with the canned function for 2SLS:
## This is one of the many packages that have a function to estimate 2SLS
library(AER)
## Let's run the model
summary(ivreg(Y ~ X | Z))
##
## Call:
## ivreg(formula = Y ~ X | Z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.645080 -0.686737 0.004101 0.703585 3.417332
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.36616 0.05686 6.44 1.39e-10 ***
## X 0.48458 0.01834 26.43 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.002 on 2998 degrees of freedom
## Multiple R-Squared: 0.5507, Adjusted R-squared: 0.5505
## Wald test: 698.4 on 1 and 2998 DF, p-value: < 2.2e-16
Note that the std. errors are quite different from those we got when running 2SLS by hand. As noted in class, we need to adjust the std. errors if we are calculating 2SLS by hand: the second-stage regression estimates \(\sigma^2\) from the residuals \(Y - \hat{\textbf{X}}\hat{\beta}_{\text{2SLS}}\), whereas the correct residuals use the original covariates, \(Y - \textbf{X}\hat{\beta}_{\text{2SLS}}\). Refer to slides p. 14-16.
#first calculate sigma squared hat
sigma2_hat <- sum((Y - X_mat %*% betas_2SLS)^2) /
(n - length(betas_2SLS))
#now standard errors
se_2SLS <- sqrt(diag(sigma2_hat * solve(t(X_mat) %*% P_z %*% X_mat)))
cbind(se_2SLS, summary(ivreg(Y ~ X | Z))$coef[,2])
## se_2SLS
## 0.05685624 0.05685624
## X 0.01833639 0.01833639
Big picture: we wish to make some claim about a relationship in the
world. We can characterize this relationship between outcome \(Y\) and explanatory factor(s) \(X\) in a linear framework, wherein \(\beta\) is the parameter of interest. \(\beta\) itself is unknown, but we use data
to estimate some \(\hat{\beta}\) that
is informative of the true \(\beta\).
In other words, \(\hat{\beta}\) allows us to make inferences about the nature of \(\beta\).
There are two approaches to inference when obtaining OLS
parameters:
- Finite-sample inference: relies on the necessary OLS assumptions and
the normality of errors
- Large-sample (asymptotic) inference: relies on linearity and the
asymptotic properties of OLS
Assume a linear model \(Y = \textbf{X}\beta
+ \epsilon\), where we estimate \(\hat{\beta}=
(\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}Y\). We might
assume that \(\epsilon\) follows a
normal distribution, or \(\epsilon \sim
\mathcal{N}(\textbf{0}, \, \sigma^2 \textbf{I}_{N})\). Since
\(\hat{\beta}- \beta =
(\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}\epsilon\), we
have:
\[\begin{equation*} \begin{aligned} \epsilon & \sim \mathcal{N}(\textbf{0}, \, \sigma^2 \textbf{I}_{N})\\ \implies \hat{\beta} - \beta = (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}\epsilon & \sim \mathcal{N}(\textbf{0},\, (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}\sigma^2 \textbf{I}_{N} \textbf{X} (\textbf{X}^{\top}\textbf{X})^{-1})\\ \implies \hat{\beta} - \beta & \sim \mathcal{N}(\textbf{0},\, \sigma^2 (\textbf{X}^{\top}\textbf{X})^{-1})\\ \implies \frac{\hat{\beta}_k - \beta_k}{\text{s.e.}(\hat{\beta}_{k})} & \sim \mathcal{N}(0, \, 1) \end{aligned} \end{equation*}\]
Given that \(\hat{\sigma}^2 = \frac{\hat{\epsilon}^{\top}\hat{\epsilon}}{N - K - 1}\), we showed in class that \(\frac{\hat{\beta}_k - \beta_k}{\widehat{\text{s.e.}}(\hat{\beta}_{k})} \sim t_{N - K - 1}\).
We can use this value to construct confidence intervals around
some coefficient of interest \(\beta_k\).
Definition — Confidence Interval: a confidence interval is an interval that will contain the true value of interest in some fixed proportion of repeated samples. When estimating \(\beta_k\), we can construct a \(100 \, (1 - \alpha) \%\) confidence interval for \(\beta_k\) defined as \(\left[\hat{\beta}_k \, \pm \, t_{\frac{\alpha}{2}, N - K - 1} \, \times \, \widehat{\text{s.e.}}(\hat{\beta}_k)\right]\), where \(t_{\frac{\alpha}{2}, N - K - 1}\) is the \(1 - \frac{\alpha}{2}\) quantile of the \(t_{N - K - 1}\) distribution.
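As a minimal sketch (re-fitting the naive OLS model of Y on X from the simulation above purely for illustration), we can build a 95% interval by hand with qt() and compare it to R's confint():
## A minimal sketch: 95% CI for the slope by hand vs. confint()
fit  <- lm(Y ~ X)                               ## naive OLS fit from above
b1   <- unname(coef(fit)["X"])                  ## estimated slope
se1  <- summary(fit)$coef["X", 2]               ## its standard error
crit <- qt(1 - 0.05 / 2, df = fit$df.residual)  ## t critical value
c(lower = b1 - crit * se1, upper = b1 + crit * se1)  ## CI by hand
confint(fit)["X", ]                             ## built-in CI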
Some notes:
All of statistical hypothesis testing is a thought experiment: given
some claim or conjecture, how might we disprove or
falsify it?
In practice, this consists of four steps:
1) Generating a null hypothesis (commonly annotated \(H_0\)) and its complementary alternative hypothesis (commonly annotated \(H_A\) or \(H_1\))
2) Calculating an appropriate test statistic
3) Using the test statistic to calculate a probability called a
p-value
4) Using the p-value to decide whether or not to reject the null
hypothesis
Definition — Alternative Hypothesis: the conjecture or statement about the population parameter of interest that we seek to affirm or empirically support. In the OLS context, this is usually given by \(\hat{\beta}_k\).
Definition — Null Hypothesis: a conjecture or statement about the population parameter of interest. This is often constructed by assuming that the claim we are interested in is wrong or incorrect. In other words, it is the framework through which we will attempt to falsify the alternative hypothesis. In the OLS context, this is usually given by \(\beta_{k}^{0}\).
The approach commonly used is a probabilistic proof by contradiction: assuming that the null hypothesis is true, under what circumstances would our estimate be inconsistent with that truth? In other words, how much evidence in favor of \(H_A\) is sufficient to cast doubt on the validity of the null?
The test statistic for such a hypothesis test, for a hypothesized value \(\beta_{k}^{0}\), is given by \[\begin{equation*} t_0 = \frac{\hat{\beta}_k - \beta_{k}^{0}}{\widehat{\text{s.e.}}(\hat{\beta}_k)} \end{equation*}\]
In turn, this statistic is evaluated with its relevant \(t\)-distribution to construct a p-value,
which we use to decide whether or not to reject the null
hypothesis.
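As a minimal sketch (using the simulated data and the 2SLS fit from above, and testing the purely illustrative null \(H_0: \beta_1 = 0.5\), the true slope in our data generating process), we can compute \(t_0\) and its p-value by hand:
## A minimal sketch: by-hand t test of H0: beta_1 = 0.5 for the 2SLS slope
iv_fit <- ivreg(Y ~ X | Z)              ## 2SLS fit from above (AER)
b1     <- unname(coef(iv_fit)["X"])     ## estimated slope
se1    <- summary(iv_fit)$coef["X", 2]  ## its standard error
df_res <- n - length(coef(iv_fit))      ## residual degrees of freedom
t0     <- (b1 - 0.5) / se1              ## test statistic under H0
2 * pt(-abs(t0), df = df_res)           ## two-sided p-value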
Definition — p-value: the p-value of a statistical test is, over repeated samples and assuming the null hypothesis is true, the probability of observing a value as extreme as our observed \(t_0\). The threshold \(\alpha\) is the value below which we reject the null hypothesis. Common conventions include treating \(p > 0.10\) as "not statistically significant" and \(p \leq 0.05\) as "statistically significant." The substantive meaning of these thresholds is admittedly arbitrary, but useful within the enterprise of social science.
An important distinction: the p-value is not the probability that the null hypothesis is true, or equivalently, not the probability that the alternative hypothesis is false.