We can show that \(\hat{\beta}\) is an unbiased estimator of \(\beta\). First, we substitute in \(\hat{\beta}= (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}Y\):
\[\begin{equation*} \begin{aligned} \mathbb{E}\left[\hat{\beta}- \beta \,|\, \textbf{X} \right] & = \mathbb{E}\left[ (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}Y- \beta \,|\, \textbf{X} \right] \end{aligned} \end{equation*}\]
Next, substitute in \(Y = \textbf{X}\beta + \epsilon\):
\[\begin{equation*} \begin{aligned} & = \mathbb{E}\left[ (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} (\textbf{X}\beta + \epsilon) - \beta \,|\, \textbf{X} \right] \end{aligned} \end{equation*}\]
Now we distribute. Since \((\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}\textbf{X} = \textbf{I}\) and, by exogeneity, \(\mathbb{E}\left[\epsilon \,|\, \textbf{X}\right] = 0\), the error term drops out in expectation:
\[\begin{equation*} \begin{aligned} & = \mathbb{E}\left[ (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \textbf{X}\beta + (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \epsilon - \beta \,|\, \textbf{X} \right]\\ & = \beta + (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \mathbb{E}\left[\epsilon \,|\, \textbf{X} \right] - \beta\\ & = 0_{K + 1} \end{aligned} \end{equation*}\]
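Before moving on, a quick Monte Carlo check can make this concrete. The sketch below uses made-up values for \(\beta\), the sample size, and the error variance; the only point is that the average of \(\hat{\beta}\) across many simulated datasets should land close to the true \(\beta\).

# A small simulation check of unbiasedness (all values here are illustrative)
set.seed(123)
n_sim     <- 200
beta_true <- c(1, 2, -0.5)                         # "true" coefficients: intercept, x1, x2
X_sim     <- cbind(1, rnorm(n_sim), rnorm(n_sim))  # fixed design matrix with an intercept column
beta_hats <- replicate(2000, {
  y_sim <- X_sim %*% beta_true + rnorm(n_sim)      # mean-zero, homoskedastic errors
  drop(solve(t(X_sim) %*% X_sim) %*% t(X_sim) %*% y_sim)  # OLS estimate for this draw
})
rowMeans(beta_hats)                                # should be close to c(1, 2, -0.5)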
We can derive the variance of \(\hat{\beta}\) in a similar manner:
\[\begin{equation*} \begin{aligned} \mathbb{V}\left(\hat{\beta}\,|\, \textbf{X} \right) & = \mathbb{V}\left((\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}Y\,|\, \textbf{X} \right)\\ & = \mathbb{V}\left( (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \textbf{X}\beta + (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \epsilon \,|\, \textbf{X} \right) \end{aligned} \end{equation*}\]
Let’s break this down. The first term, \((\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}\textbf{X}\beta = \beta\), is a constant given \(\textbf{X}\), so it contributes nothing to the variance. For the second term, we use the fact that for a constant (conformable) matrix \(A\) and a random vector \(B\), \(\mathbb{V}\left(AB \,|\, A \right) = A \mathbb{V}\left( B \,|\, A \right) A^{\top}\). Applying this with \(A = (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}\) and \(B = \epsilon\), and noting that \(\left((\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}\right)^{\top} = \textbf{X}(\textbf{X}^{\top}\textbf{X})^{-1}\), we are left with \((\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \mathbb{V}\left(\epsilon \,|\, \textbf{X}\right) \textbf{X} (\textbf{X}^{\top}\textbf{X})^{-1}\).
Thus, \[\begin{equation*} \begin{aligned} \mathbb{V}\left(\hat{\beta}| \textbf{X} \right) & = (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \mathbb{V}\left(\epsilon \,|\, \textbf{X}\right) \textbf{X} (\textbf{X}^{\top}\textbf{X})^{-1} \end{aligned} \end{equation*}\]
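As a sanity check on the \(\mathbb{V}\left(AB \,|\, A\right) = A \mathbb{V}\left(B \,|\, A\right) A^{\top}\) rule used above, here is a small simulation sketch. The matrices \(A\) and \(\Sigma\) are made up for illustration, and it assumes the MASS package is available for drawing multivariate normal vectors.

# Numerical check that V(AB) = A V(B) A' when A is a constant matrix
set.seed(42)
A     <- matrix(c(1, 2, 0, 1), nrow = 2)      # constant matrix (illustrative)
Sigma <- matrix(c(2, 0.5, 0.5, 1), nrow = 2)  # V(B) (illustrative)
B     <- MASS::mvrnorm(1e5, mu = c(0, 0), Sigma = Sigma)  # draws of B, one per row
AB    <- B %*% t(A)                           # each row is (A b)'
cov(AB)                                       # empirical variance of AB
A %*% Sigma %*% t(A)                          # theoretical A V(B) A'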
From here, our result changes depending on our assumptions.
Assume homoskedastic errors, or \(\mathbb{V}\left(\epsilon \,|\, \textbf{X} \right) = \sigma^2 \textbf{I}\): \[\begin{equation*} \begin{aligned} \mathbb{V}\left(\hat{\beta}| \textbf{X} \right) & = (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \mathbb{V}\left(\epsilon \,|\, \textbf{X}\right) \textbf{X} (\textbf{X}^{\top}\textbf{X})^{-1}\\ & = (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \sigma^2 \textbf{I} \textbf{X} (\textbf{X}^{\top}\textbf{X})^{-1}\\ & = (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \textbf{X} (\textbf{X}^{\top}\textbf{X})^{-1} \sigma^2\\ & = (\textbf{X}^{\top}\textbf{X})^{-1} \sigma^2 \end{aligned} \end{equation*}\]
Assume heteroskedastic errors, or \(\mathbb{V}\left(\epsilon \,|\, \textbf{X} \right) = \textbf{D}\) such that \(\textbf{D}_{ii} = \mathbb{E}\left[\epsilon_{i}^{2} \,|\, \textbf{X} \right]\) along the diagonal: \[\begin{equation*} \begin{aligned} \mathbb{V}\left(\hat{\beta}\,|\, \textbf{X} \right) & = (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \textbf{D} \textbf{X} (\textbf{X}^{\top}\textbf{X})^{-1} \end{aligned} \end{equation*}\]
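Once we have the design matrix and the residuals (both are constructed as X and resids in the R walkthrough later in this section), this "sandwich" can be computed directly. The sketch below is the HC0 flavor, which simply plugs squared residuals in for the unknown \(\mathbb{E}\left[\epsilon_{i}^{2} \,|\, \textbf{X} \right]\); the sandwich package's vcovHC() implements this along with several small-sample corrections.

# Heteroskedasticity-robust (HC0) variance, assuming X and resids exist
# as they are built in the walkthrough below
bread <- solve(t(X) %*% X)                         # (X'X)^-1
meat  <- t(X) %*% diag(as.vector(resids)^2) %*% X  # X'DX with D_ii = e_i^2
robust_vcov <- bread %*% meat %*% bread            # the sandwich estimator
sqrt(diag(robust_vcov))                            # robust standard errors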
Theorem: Let \(Y =
\textbf{X}\beta + \epsilon\). The ordinary least squares
(OLS) estimator is given by \(\hat{\beta}=
(\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}Y\). Among the
class of linear and unbiased estimators, \(\hat{\beta}\) has the minimum
variance.
We commonly say that the OLS estimator is BLUE, or the
Best Linear
Unbiased Estimator,
where “best” means it has the minimum variance.
Proof: Let \(\tilde{\beta} = \textbf{C}Y\), with \(\textbf{CX} = \textbf{I}\) and \(\textbf{C} = (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} + \textbf{D}\), such that \(\textbf{DX} = 0_{K+1 \times K+1}\). We will compare this estimator to \(\hat{\beta}\). First, we need \(\tilde{\beta}\) to be unbiased:
\[\begin{equation*} \begin{aligned} \mathbb{E}\left[\tilde{\beta} - \beta \,|\, \textbf{X} \right] & = \mathbb{E}\left[ \left((\textbf{X}^{\top}\textbf{X})^{-1} \textbf{X}^{\top} + \textbf{D}\right)Y - \beta \,|\, \textbf{X} \right] \end{aligned} \end{equation*}\]
Substituting in \(Y = \textbf{X}\beta + \epsilon\) and distributing:
\[\begin{equation*} \begin{aligned} \mathbb{E}\left[\tilde{\beta} - \beta \,|\, \textbf{X} \right] & = \mathbb{E}\left[ \left((\textbf{X}^{\top}\textbf{X})^{-1} \textbf{X}^{\top} + \textbf{D}\right) (\textbf{X}\beta + \epsilon) - \beta \,|\, \textbf{X} \right]\\ & = \mathbb{E}\left[(\textbf{X}^{\top}\textbf{X})^{-1} \textbf{X}^{\top}\textbf{X} \beta + \textbf{DX}\beta + (\textbf{X}^{\top}\textbf{X})^{-1} \textbf{X}^{\top}\epsilon + \textbf{D}\epsilon - \beta \,|\, \textbf{X} \right]\\ & = \mathbb{E}\left[\textbf{I}\beta + \textbf{DX}\beta + (\textbf{X}^{\top}\textbf{X})^{-1} \textbf{X}^{\top}\epsilon + \textbf{D}\epsilon - \beta \,|\, \textbf{X} \right]\\ & = \mathbb{E}\left[\textbf{DX}\beta + (\textbf{X}^{\top}\textbf{X})^{-1} \textbf{X}^{\top}\epsilon + \textbf{D}\epsilon \,|\, \textbf{X} \right] \end{aligned} \end{equation*}\]
Since we are assuming a) \(\textbf{DX} = 0\) and b) conditionally mean-zero errors, or \(\mathbb{E}\left[\epsilon \,|\, \textbf{X}\right] = 0\), this becomes \[\begin{equation*} \begin{aligned} \mathbb{E}\left[\tilde{\beta} - \beta \,|\, \textbf{X} \right] & = \mathbb{E}\left[0 \,|\, \textbf{X} \right] + \mathbb{E}\left[ (\textbf{X}^{\top}\textbf{X})^{-1} \textbf{X}^{\top}\epsilon \,|\, \textbf{X} \right] + \mathbb{E}\left[\textbf{D}\epsilon \,|\, \textbf{X} \right]\\ & = 0 + 0 + 0\\ & = 0 \end{aligned} \end{equation*}\]
So \(\tilde{\beta}\) is unbiased! Now, we need to show that its variance is not better than the variance of \(\hat{\beta}\): \[\begin{equation*} \begin{aligned} \mathbb{V}\left(\tilde{\beta} \,|\, \textbf{X} \right) & = \mathbb{V}\left(\textbf{C}Y \,|\, \textbf{X} \right)\\ & = \textbf{C} \mathbb{V}\left(Y \,| \, \textbf{X} \right) \textbf{C}^{\top} \end{aligned} \end{equation*}\]
Under homoskedasticity, \(\mathbb{V}\left(Y \,|\, \textbf{X} \right) = \mathbb{V}\left(\epsilon \,|\, \textbf{X} \right) = \sigma^2 \textbf{I}_{N}\), so this becomes: \[\begin{equation*} \begin{aligned} \mathbb{V}\left(\tilde{\beta} \,|\, \textbf{X} \right) & = \sigma^2 \textbf{CC}^{\top} \end{aligned} \end{equation*}\]
Let’s consider \(\textbf{CC}^{\top}\), remembering that \(\textbf{C} = (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} + \textbf{D}\): \[\begin{equation*} \begin{aligned} \textbf{CC}^{\top} & = \left( (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} + \textbf{D} \right) \left( (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} + \textbf{D} \right)^{\top}\\ & = \left( (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} + \textbf{D} \right) \left(\textbf{X}(\textbf{X}^{\top}\textbf{X})^{-1} + \textbf{D}^{\top} \right)\\ & = (\textbf{X}^{\top}\textbf{X})^{-1} \textbf{X}^{\top}\textbf{X} (\textbf{X}^{\top}\textbf{X})^{-1} + \textbf{DX} (\textbf{X}^{\top} \textbf{X})^{-1} + (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}\textbf{D}^{\top} + \textbf{DD}^{\top} \end{aligned} \end{equation*}\]
Since \(\textbf{DX} = 0\) and \(\textbf{X}^{\top}\textbf{D}^{\top} = 0\), we are left with: \[\begin{equation*} \begin{aligned} \textbf{CC}^{\top} & = (\textbf{X}^{\top}\textbf{X})^{-1} + \textbf{DD}^{\top} \end{aligned} \end{equation*}\]
Now, substitute this back into our variance: \[\begin{equation*} \begin{aligned} \mathbb{V}\left(\tilde{\beta} \,|\, \textbf{X} \right) & = \sigma^2 \left[(\textbf{X}^{\top}\textbf{X})^{-1} + \textbf{DD}^{\top} \right] \geq \sigma^2 (\textbf{X}^{\top}\textbf{X})^{-1} = \mathbb{V}\left(\hat{\beta}\,|\, \textbf{X} \right) \end{aligned} \end{equation*}\]
Since \(\textbf{DD}^{\top}\) is positive semidefinite, \(\mathbb{V}\left(\tilde{\beta} \,|\, \textbf{X} \right) \geq \mathbb{V}\left(\hat{\beta}\,|\, \textbf{X} \right)\) in the matrix (positive semidefinite) sense, so no other linear unbiased estimator can have a smaller variance than \(\hat{\beta}\). Thus, \(\hat{\beta}\) has the minimum variance among linear unbiased estimators: it is BLUE.
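To see the Gauss-Markov result in action, here is a small simulation sketch comparing OLS with another linear unbiased estimator, \(\tilde{\beta} = (\textbf{X}^{\top}\textbf{W}\textbf{X})^{-1}\textbf{X}^{\top}\textbf{W}Y\) for an arbitrary fixed positive diagonal \(\textbf{W}\) (all numbers are illustrative). Both estimators should be centered on \(\beta\), but the alternative's sampling variance should be at least as large.

# Comparing OLS with another linear unbiased estimator (illustrative values)
set.seed(321)
n_sim     <- 200
beta_true <- c(1, 2)
X_sim     <- cbind(1, rnorm(n_sim))        # intercept plus one regressor, held fixed
W         <- diag(runif(n_sim, 0.1, 2))    # arbitrary fixed positive weights
draws <- replicate(2000, {
  y_sim <- X_sim %*% beta_true + rnorm(n_sim)
  ols   <- drop(solve(t(X_sim) %*% X_sim) %*% t(X_sim) %*% y_sim)
  alt   <- drop(solve(t(X_sim) %*% W %*% X_sim) %*% t(X_sim) %*% W %*% y_sim)  # also linear and unbiased
  c(ols = ols[2], alt = alt[2])            # keep the slope from each estimator
})
rowMeans(draws)       # both means should be close to the true slope of 2
apply(draws, 1, var)  # the OLS slope has the (weakly) smaller sampling variance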
Recall that the matrix algebra solution for OLS coefficients can be expressed as \[\begin{equation*} \hat{\beta}= (X^{\top}X)^{-1}X^{\top}y \end{equation*}\]
We also have the variance-covariance matrix of those estimators, \(\hat{\sigma}^2(X^{\top}X)^{-1}\), where the error variance is estimated by \[\begin{equation*} \hat{\sigma}^2 = \frac{e^{\top}e}{n - k - 1} \end{equation*}\]
We’ll estimate the regression manually. A few of the commands we will use are listed below, with a quick demonstration on a toy matrix right after the list:

* `%*%` is the matrix multiplication operator
* `t()` transposes a matrix
* `solve()` inverts a square matrix
* `diag()` extracts the diagonal of a matrix
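Here is that quick demonstration of the four commands on a made-up \(2 \times 2\) matrix:

# A toy 2x2 matrix to illustrate the commands above
A <- matrix(c(2, 1, 1, 3), nrow = 2)
t(A)          # the transpose of A
A %*% t(A)    # matrix multiplication
solve(A)      # the inverse; A %*% solve(A) returns the identity matrix
diag(A)       # the diagonal elements, c(2, 3)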
Start with deriving the coefficients. Remember to add a column of 1’s for the intercept.
# This is graduate program admissions data from UCLA
my_data <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
# for more information about the lm() function, use `?lm`
my_model <- lm(data = my_data,                      # This tells lm() where the data is coming from
               formula = admit ~ gre + gpa + rank)  # This specifies the form of the model
# We can view a summary of the model using summary()
summary(my_model)
##
## Call:
## lm(formula = admit ~ gre + gpa + rank, data = my_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.6617 -0.3417 -0.1947  0.5061  0.9556
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1824127  0.2169695  -0.841   0.4010
## gre          0.0004424  0.0002101   2.106   0.0358 *
## gpa          0.1510402  0.0633854   2.383   0.0176 *
## rank        -0.1095019  0.0237617  -4.608 5.48e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4448 on 396 degrees of freedom
## Multiple R-squared: 0.09601, Adjusted R-squared: 0.08916
## F-statistic: 14.02 on 3 and 396 DF, p-value: 1.054e-08
# We can view other components of the model using the following:
# Coefficients
coef(my_model)
##   (Intercept)           gre           gpa          rank
## -0.1824126752  0.0004424258  0.1510402328 -0.1095019242
# Fitted values and residuals -- be careful, these will return a LOT of outputs
# predict(my_model)
# residuals(my_model)
# Residual standard error -- a couple of ways to do this
sigma(my_model)
## [1] 0.4448224
summary(my_model)$sigma
## [1] 0.4448224
typeof(my_data)
## [1] "list"
# Our data is stored as a list (a data.frame) -- we want it as a numeric matrix
my_matrix <- as.matrix(my_data)
# Let's create X and Y, where X are columns 2, 3, and 4
# and Y is column 1 from our data
# cbind() binds columns into a single object. In this case, we are creating a column
# of 1's and combining them with columns 2, 3, and 4 of my_matrix
X <- cbind(1, my_matrix[,2:4])
# Y is just the first column of my_matrix
y <- my_matrix[,1]
# Computing beta-hat. We'll break this down into two components,
# (XTX)^-1 and XTy
# Computing (XTX)^-1 in two steps: the inner part and the outer part
first_half_inner <- t(X) %*% X
first_half <- solve(first_half_inner)
# Computing XTy
second_half <- t(X) %*% y
# Now beta-hat, which is their product
beta_hat <- first_half %*% second_half
What type of object is beta_hat? How does it compare to the results from summary(lm())?
# In one command, we're going to use cbind() to combine our manual
# calculation for beta_hat with the results from my_model
cbind(beta_hat, coef(my_model))
##               [,1]          [,2]
##      -0.1824126752 -0.1824126752
## gre   0.0004424258  0.0004424258
## gpa   0.1510402328  0.1510402328
## rank -0.1095019242 -0.1095019242
Now let’s calculate the variance. We’ll start by finding \(\hat{\sigma}^2\), which requires the residuals. Using the coefficients we just computed, we’ll generate predicted values and then take the residuals as the difference between the observed and predicted outcomes.
# The predicted values are just our data times their respective beta_hat estimates
preds <- X %*% beta_hat
# Our residuals are then the difference between the observed y and the predicted values
resids <- y - preds
# Use the definition of sigma_hat. We'll do this in two parts, although you can do it in one
sigma_hat_numerator <- t(resids) %*% resids
sigma_hat_denominator <- nrow(my_data) - ncol(X)  # n - k - 1, since ncol(X) counts the intercept plus the k predictors
sigma_hat <- sigma_hat_numerator / sigma_hat_denominator
# The variance-covariance matrix is estimated using sigma_hat and (XTX)^-1,
# which from earlier we called `first_half`
var_covar <- as.numeric(sigma_hat) * first_half
# And the standard errors are the square root of the diagonal of that matrix
manual_std_errors <- sqrt(diag(var_covar))
Finally, let’s compare results. Notice that the standard errors themselves aren’t stored in the lm object, but the variance-covariance matrix can be recovered from it. You can extract that matrix using summary(), or using vcov().
# Using vcov()
my_model_std_errors <- sqrt(diag(vcov(my_model)))
# Reporting them side-by-side
cbind(my_model_std_errors, manual_std_errors)
##             my_model_std_errors manual_std_errors
## (Intercept)        0.2169694971      0.2169694971
## gre                0.0002100836      0.0002100836
## gpa                0.0633853549      0.0633853549
## rank               0.0237616794      0.0237616794
They’re the same!