Unbiasedness of \(\hat{\beta}\)

We can show that \(\hat{\beta}\) is an unbiased estimator of \(\beta\). First, we substitute in \(\hat{\beta}= (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}Y\):

\[\begin{equation*} \begin{aligned} \mathbb{E}\left[\hat{\beta}- \beta \,|\, \textbf{X} \right] & = \mathbb{E}\left[ (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}Y- \beta \,|\, \textbf{X} \right] \end{aligned} \end{equation*}\]

Next, substitute in \(Y = \textbf{X}\beta + \epsilon\):

\[\begin{equation*} \begin{aligned} & = \mathbb{E}\left[ (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} (\textbf{X}\beta + \epsilon) - \beta \,|\, \textbf{X} \right] \end{aligned} \end{equation*}\]

Now we distribute. By exogeneity, we assume \(\mathbb{E}\left[\epsilon \,|\, \textbf{X}\right] = 0\), so the term involving \(\epsilon\) has conditional expectation zero.

\[\begin{equation*} \begin{aligned} & = \mathbb{E}\left[ (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \textbf{X}\beta + (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \epsilon - \beta \,|\, \textbf{X} \right]\\ & = \beta + (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \mathbb{E}\left[\epsilon \,|\, \textbf{X} \right] - \beta\\ & = 0_{K + 1} \end{aligned} \end{equation*}\]
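As a quick numerical check, the algebra above can be confirmed by simulation. The following is a minimal sketch (the sample size, true coefficients, and number of replications are arbitrary choices for illustration, not part of the data used later): the average of \(\hat{\beta}\) across many simulated datasets should be close to the true \(\beta\).

# Simulate many datasets from y = X beta + epsilon with mean-zero errors,
# estimate beta-hat in each, and average the estimates
set.seed(123)

n <- 200                      # sample size (arbitrary)
beta_true <- c(1, 2, -0.5)    # true coefficients: intercept and two slopes

one_beta_hat <- function() {
  X <- cbind(1, rnorm(n), rnorm(n))              # design matrix with an intercept
  y <- X %*% beta_true + rnorm(n)                # mean-zero, homoskedastic errors
  as.vector(solve(t(X) %*% X) %*% (t(X) %*% y))  # (X'X)^-1 X'y
}

# The averages should be approximately (1, 2, -0.5)
rowMeans(replicate(1000, one_beta_hat()))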


Variance of \(\hat{\beta}\)

We can derive the variance of \(\hat{\beta}\) in a similar manner, noting that subtracting the constant \(\beta\) does not change the conditional variance:

\[\begin{equation*} \begin{aligned} \mathbb{V}\left(\hat{\beta}| \textbf{X} \right) & = \mathbb{V}\left((\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}Y\,|\, \textbf{X} \right)\\ & = \mathbb{V}\left( (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \textbf{X}\beta + (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \epsilon - \beta \,|\, \textbf{X} \right) \end{aligned} \end{equation*}\]

Let’s break this down. First, \(\mathbb{V}\left( (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \textbf{X}\beta \,|\, \textbf{X}\right) = 0\) since \((\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \textbf{X} = \textbf{I}\) and \(\beta\) is a constant. Then, we are left with \((\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \mathbb{V}\left(\epsilon \,|\, \textbf{X}\right) \textbf{X} (\textbf{X}^{\top}\textbf{X})^{-1}\), which we get by leveraging the fact that, for a matrix \(A\) and a random vector \(B\), \(\mathbb{V}\left(AB \,|\, A \right) = A \mathbb{V}\left( B \,|\, A \right) A^{\top}\).

Thus, \[\begin{equation*} \begin{aligned} \mathbb{V}\left(\hat{\beta}| \textbf{X} \right) & = (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \mathbb{V}\left(\epsilon \,|\, \textbf{X}\right) \textbf{X} (\textbf{X}^{\top}\textbf{X})^{-1} \end{aligned} \end{equation*}\]

From here, our result changes depending on our assumptions; a short simulation check of the first case follows the two cases below.

  1. Assume homoskedastic errors, or \(\mathbb{V}\left(\epsilon \,|\, \textbf{X} \right) = \sigma^2 \textbf{I}\): \[\begin{equation*} \begin{aligned} \mathbb{V}\left(\hat{\beta}| \textbf{X} \right) & = (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \mathbb{V}\left(\epsilon \,|\, \textbf{X}\right) \textbf{X} (\textbf{X}^{\top}\textbf{X})^{-1}\\ & = (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \sigma^2 \textbf{I} \textbf{X} (\textbf{X}^{\top}\textbf{X})^{-1}\\ & = (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \textbf{X} (\textbf{X}^{\top}\textbf{X})^{-1} \sigma^2\\ & = (\textbf{X}^{\top}\textbf{X})^{-1} \sigma^2 \end{aligned} \end{equation*}\]

  2. Assume heteroskedastic errors, or \(\mathbb{V}\left(\epsilon \,|\, \textbf{X} \right) = \textbf{D}\) such that \(\textbf{D}_{ii} = \mathbb{E}\left[\epsilon_{i}^{2} \,|\, \textbf{X} \right]\) along the diagonal: \[\begin{equation*} \begin{aligned} \mathbb{V}\left(\hat{\beta}\,|\, \textbf{X} \right) & = (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} \textbf{D} \textbf{X} (\textbf{X}^{\top}\textbf{X})^{-1} \end{aligned} \end{equation*}\]
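As with unbiasedness, the homoskedastic formula can be checked by simulation. The following is a minimal sketch (again with an arbitrary sample size, coefficients, and error variance) that holds \(\textbf{X}\) fixed, re-draws the errors many times, and compares the empirical variance of each coefficient to the diagonal of \(\sigma^2 (\textbf{X}^{\top}\textbf{X})^{-1}\):

# Hold X fixed, redraw homoskedastic errors, and compare the simulated
# variance of beta-hat to sigma^2 (X'X)^-1
set.seed(456)

n <- 200
X <- cbind(1, rnorm(n), rnorm(n))   # fixed design matrix with an intercept
beta_true <- c(1, 2, -0.5)
sigma <- 1.5                        # error standard deviation

one_beta_hat <- function() {
  y <- X %*% beta_true + rnorm(n, sd = sigma)
  as.vector(solve(t(X) %*% X) %*% (t(X) %*% y))
}

draws <- t(replicate(2000, one_beta_hat()))   # 2000 x 3 matrix of estimates

# Empirical vs. theoretical variances, coefficient by coefficient
cbind(empirical   = apply(draws, 2, var),
      theoretical = diag(sigma^2 * solve(t(X) %*% X)))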


The Gauss-Markov Theorem

Theorem: Let \(Y = \textbf{X}\beta + \epsilon\), where \(\mathbb{E}\left[\epsilon \,|\, \textbf{X}\right] = 0\) and \(\mathbb{V}\left(\epsilon \,|\, \textbf{X}\right) = \sigma^2 \textbf{I}\). The ordinary least squares (OLS) estimator is given by \(\hat{\beta}= (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}Y\). Among the class of linear and unbiased estimators, \(\hat{\beta}\) has the minimum variance.

We commonly say that the OLS estimator is BLUE, or the Best Linear Unbiased Estimator, where “best” means it has the minimum variance.

Proof: Let \(\tilde{\beta} = \textbf{C}Y\) be another linear estimator, where \(\textbf{C} = (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} + \textbf{D}\) and \(\textbf{DX} = 0_{(K+1) \times (K+1)}\) (equivalently, \(\textbf{CX} = \textbf{I}\)). We will compare this estimator to \(\hat{\beta}\). First, we need \(\tilde{\beta}\) to be unbiased:

\[\begin{equation*} \begin{aligned} \mathbb{E}\left[\tilde{\beta} - \beta \,|\, \textbf{X} \right] & = \mathbb{E}\left[ \left( (\textbf{X}^{\top}\textbf{X})^{-1} \textbf{X}^{\top} + \textbf{D}\right)Y - \beta \,|\, \textbf{X} \right] \end{aligned} \end{equation*}\]

Substituting in \(Y = \textbf{X}\beta + \epsilon\) and distributing:

\[\begin{equation*} \begin{aligned} \mathbb{E}\left[\tilde{\beta} - \beta \,|\, \textbf{X} \right] & = \mathbb{E}\left[ \left( (\textbf{X}^{\top}\textbf{X})^{-1} \textbf{X}^{\top} + \textbf{D}\right) (\textbf{X}\beta + \epsilon) - \beta \,|\, \textbf{X} \right]\\ & = \mathbb{E}\left[(\textbf{X}^{\top}\textbf{X})^{-1} \textbf{X}^{\top}\textbf{X} \beta + \textbf{DX}\beta + (\textbf{X}^{\top}\textbf{X})^{-1} \textbf{X}^{\top}\epsilon + \textbf{D}\epsilon - \beta \,|\, \textbf{X} \right]\\ & = \mathbb{E}\left[\textbf{I}\beta + \textbf{DX}\beta + (\textbf{X}^{\top}\textbf{X})^{-1} \textbf{X}^{\top}\epsilon + \textbf{D}\epsilon - \beta \,|\, \textbf{X} \right]\\ & = \mathbb{E}\left[\textbf{DX}\beta + (\textbf{X}^{\top}\textbf{X})^{-1} \textbf{X}^{\top}\epsilon + \textbf{D}\epsilon \,|\, \textbf{X} \right] \end{aligned} \end{equation*}\]

Since we are assuming a) \(\textbf{DX} = 0\) and b) mean-zero errors, or \(\mathbb{E}\left[\epsilon \,|\, \textbf{X}\right] = 0\), this becomes \[\begin{equation*} \begin{aligned} \mathbb{E}\left[\tilde{\beta} - \beta \,|\, \textbf{X} \right] & = \mathbb{E}\left[0 \,|\, \textbf{X} \right] + \mathbb{E}\left[ (\textbf{X}^{\top}\textbf{X})^{-1} \textbf{X}^{\top}\epsilon \,|\, \textbf{X} \right] + \mathbb{E}\left[\textbf{D}\epsilon \,|\, \textbf{X} \right]\\ & = 0 + 0 + 0\\ & = 0 \end{aligned} \end{equation*}\]

So \(\tilde{\beta}\) is unbiased! Now, we need to show that its variance is not better than the variance of \(\hat{\beta}\): \[\begin{equation*} \begin{aligned} \mathbb{V}\left(\tilde{\beta} \,|\, \textbf{X} \right) & = \mathbb{V}\left(\textbf{C}Y \,|\, \textbf{X} \right)\\ & = \textbf{C} \mathbb{V}\left(Y \,| \, \textbf{X} \right) \textbf{C}^{\top} \end{aligned} \end{equation*}\]

Under homoskedasticity, \(\mathbb{V}\left(Y \,|\, \textbf{X} \right) = \mathbb{V}\left(\epsilon \,|\, \textbf{X} \right) = \sigma^2 \textbf{I}_{N}\), so this becomes: \[\begin{equation*} \begin{aligned} \mathbb{V}\left(\tilde{\beta} \,|\, \textbf{X} \right) & = \sigma^2 \textbf{CC}^{\top} \end{aligned} \end{equation*}\]

Let’s consider \(\textbf{CC}^{\top}\), remembering that \(\textbf{C} = (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} + \textbf{D}\): \[\begin{equation*} \begin{aligned} \textbf{CC}^{\top} & = \left( (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} + \textbf{D} \right) \left( (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} + \textbf{D} \right)^{\top}\\ & = \left( (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top} + \textbf{D} \right) \left(\textbf{X}(\textbf{X}^{\top}\textbf{X})^{-1} + \textbf{D}^{\top} \right)\\ & = (\textbf{X}^{\top}\textbf{X})^{-1} \textbf{X}^{\top}\textbf{X} (\textbf{X}^{\top}\textbf{X})^{-1} + \textbf{DX} (\textbf{X}^{\top} \textbf{X})^{-1} + (\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}\textbf{D}^{\top} + \textbf{DD}^{\top} \end{aligned} \end{equation*}\]

Since \(\textbf{DX} = 0\) and \(\textbf{X}^{\top}\textbf{D}^{\top} = 0\), we are left with: \[\begin{equation*} \begin{aligned} \textbf{CC}^{\top} & = (\textbf{X}^{\top}\textbf{X})^{-1} + \textbf{DD}^{\top} \end{aligned} \end{equation*}\]

Now, substitute this back into our variance: \[\begin{equation*} \begin{aligned} \mathbb{V}\left(\tilde{\beta} \,|\, \textbf{X} \right) & = \sigma^2 \left[(\textbf{X}^{\top}\textbf{X})^{-1} + \textbf{DD}^{\top} \right] \geq \sigma^2 (\textbf{X}^{\top}\textbf{X})^{-1} = \mathbb{V}\left(\hat{\beta}\,|\, \textbf{X} \right) \end{aligned} \end{equation*}\]

Since \(\textbf{DD}^{\top}\) is positive semi-definite, \(\mathbb{V}\left(\tilde{\beta} \,|\, \textbf{X} \right) \geq \mathbb{V}\left(\hat{\beta}\,|\, \textbf{X} \right)\) in the matrix sense, so no linear unbiased estimator has a smaller variance than \(\hat{\beta}\). Thus, \(\hat{\beta}\) is the best linear unbiased estimator: it attains the minimum variance within this class.
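To make the proof concrete, here is a minimal sketch in R (with a simulated design and an arbitrary choice of \(\textbf{D}\), not anything defined elsewhere in these notes): we build an alternative linear unbiased estimator \(\tilde{\beta} = \textbf{C}Y\) with \(\textbf{DX} = 0\) and confirm that each coefficient’s variance is at least as large as under OLS.

# Construct C = (X'X)^-1 X' + D with DX = 0 and compare variances to OLS
set.seed(789)

n <- 100
X <- cbind(1, rnorm(n), rnorm(n))   # simulated design with an intercept
sigma2 <- 1                          # homoskedastic error variance

XtX_inv <- solve(t(X) %*% X)

# Any D with DX = 0 can be built by post-multiplying an arbitrary matrix by
# the annihilator M = I - X (X'X)^-1 X', since MX = 0
M <- diag(n) - X %*% XtX_inv %*% t(X)
A <- 0.01 * matrix(rnorm(3 * n), nrow = 3)   # arbitrary (K+1) x n matrix
D <- A %*% M                                 # so DX is (numerically) zero

C <- XtX_inv %*% t(X) + D

var_ols   <- sigma2 * XtX_inv        # variance of beta-hat
var_tilde <- sigma2 * C %*% t(C)     # variance of beta-tilde

# These differences equal (up to numerical error) the diagonal of
# sigma^2 DD', so they are non-negative
diag(var_tilde) - diag(var_ols)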


Deriving the Variance in R

Recall that the matrix algebra solution for OLS coefficients can be expressed as \[\begin{equation*} \hat{\beta}= (X^{\top}X)^{-1}X^{\top}y \end{equation*}\]

We also have the variance-covariance matrix for those estimators, \(\sigma^2(X^{\top}X)^{-1}\), where \(\sigma^2\) is estimated using the residuals \(e\): \[\begin{equation*} \hat{\sigma}^2 = \frac{e^{\top}e}{n - k - 1} \end{equation*}\]

We’ll estimate the regression manually. A few of the commands we will use are listed here, with a quick toy example after the list:
* %*% is the matrix multiplication operator
* t() will transpose a matrix
* solve() inverts a square matrix
* diag() gets the diagonal of a matrix
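A small toy example of these four commands (purely illustrative; this matrix is not part of the admissions data used below):

A <- matrix(c(2, 0, 1, 3), nrow = 2)   # a small 2 x 2 matrix
t(A)                # transpose
A %*% t(A)          # matrix multiplication
solve(A)            # inverse (A must be square and invertible)
diag(A %*% t(A))    # diagonal entries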

Start with deriving the coefficients. Remember to add a column of 1’s for the intercept.

# This is graduate program admissions data from UCLA
my_data <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

# for more information about the lm() function, use `?lm`

my_model <- lm(data = my_data, # This tells lm() where the data is coming from
               
               # This specifies the form of the model
               formula = admit ~ gre + gpa + rank
               )

# We can view a summary of the model using summary()
summary(my_model)
## 
## Call:
## lm(formula = admit ~ gre + gpa + rank, data = my_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6617 -0.3417 -0.1947  0.5061  0.9556 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.1824127  0.2169695  -0.841   0.4010    
## gre          0.0004424  0.0002101   2.106   0.0358 *  
## gpa          0.1510402  0.0633854   2.383   0.0176 *  
## rank        -0.1095019  0.0237617  -4.608 5.48e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4448 on 396 degrees of freedom
## Multiple R-squared:  0.09601,    Adjusted R-squared:  0.08916 
## F-statistic: 14.02 on 3 and 396 DF,  p-value: 1.054e-08
# We can view other components of the model using the following:

# Coefficients
coef(my_model)
##   (Intercept)           gre           gpa          rank 
## -0.1824126752  0.0004424258  0.1510402328 -0.1095019242
# Fitted values and residuals -- be careful, these will return a LOT of outputs
# predict(my_model)
# residuals(my_model)

# Residual standard error -- a couple of ways to do this
sigma(my_model)
## [1] 0.4448224
summary(my_model)$sigma
## [1] 0.4448224
typeof(my_data)
## [1] "list"
# Our data is a data frame (stored as a list) -- we want it as a numeric matrix
my_matrix <- as.matrix(my_data)

# Let's create X and y, where X is columns 2, 3, and 4
# and y is column 1 from our data

# cbind() binds columns into a single object. In this case, we are creating a column
# of 1's and combining them with columns 2, 3, and 4 of my_matrix
X <- cbind(1, my_matrix[,2:4])

# y is just the first column of my_matrix
y <- my_matrix[,1]

# Computing beta-hat. We'll break this down into two components,
# (XTX)^-1 and XTy

# Computing (XTX)^-1 in two steps: the inner part and the outer part
first_half_inner <- t(X) %*% X
first_half <- solve(first_half_inner)

# Computing XTy
second_half <- t(X) %*% y

# Now beta-hat, which is their product
beta_hat <- first_half %*% second_half

What type of object is beta_hat? How does it compare to the results from summary(lm())?
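One quick way to check, before comparing (a short sketch using base R inspection functions):

# beta_hat is a matrix with one row per coefficient and a single column
class(beta_hat)
dim(beta_hat)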

# In one command, we're going to use cbind() to combine our manual
# calculation for beta_hat with the results from my_model

cbind(beta_hat, coef(my_model))
##               [,1]          [,2]
##      -0.1824126752 -0.1824126752
## gre   0.0004424258  0.0004424258
## gpa   0.1510402328  0.1510402328
## rank -0.1095019242 -0.1095019242

Now let’s calculate the variance. We’ll start by finding \(\hat{\sigma}^2\), which requires the residuals: we use the coefficients we just computed to get predicted values, then take the difference between the observed and predicted values.

# The predicted values are just our data times their respective beta_hat estimates
preds <- X %*% beta_hat

# Our residuals are then the difference between the observed y and the predicted values
resids <- y - preds

# Use the definition of sigma_hat. We'll do this in two parts, although you can do it in one
sigma_hat_numerator <- t(resids) %*% resids
sigma_hat_denominator <- nrow(my_data) - ncol(X)

sigma_hat <- sigma_hat_numerator / sigma_hat_denominator

# The variance-covariance matrix is estimated using sigma_hat and (XTX)^-1,
# which from earlier we called `first_half`

var_covar <- as.numeric(sigma_hat) * first_half

# And the standard errors are the square root of the diagonal of that matrix
manual_std_errors <- sqrt(diag(var_covar))

Finally, let’s compare results. Notice that the standard errors themselves aren’t stored in the lm object, but the variance-covariance matrix is. You can extract that matrix using summary(), or using vcov().

# Using vcov()
my_model_std_errors <- sqrt(diag(vcov(my_model)))

# Reporting them side-by-side
cbind(my_model_std_errors, manual_std_errors)
##             my_model_std_errors manual_std_errors
## (Intercept)        0.2169694971      0.2169694971
## gre                0.0002100836      0.0002100836
## gpa                0.0633853549      0.0633853549
## rank               0.0237616794      0.0237616794

They’re the same!
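As an extension, the heteroskedastic case (case 2 from the variance derivation above) can be computed with the same objects: put the squared residuals on the diagonal of \(\textbf{D}\) and apply the sandwich formula. The following is a minimal sketch, not part of the original walkthrough; this particular plug-in (squared residuals with no small-sample correction) is commonly called the HC0 estimator.

# D has the squared residuals along its diagonal
D <- diag(as.vector(resids)^2)

# Sandwich formula: (X'X)^-1 X' D X (X'X)^-1, reusing `first_half` = (X'X)^-1
robust_vcov <- first_half %*% t(X) %*% D %*% X %*% first_half

# Heteroskedasticity-robust standard errors
sqrt(diag(robust_vcov))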