A crucial property of the OLS estimator is that it minimizes the
residual sum of squares (RSS). We’re going to use the matrix algebra
representation of the linear model to derive \(\hat{\beta}\). Let \(Y = \textbf{X}\beta + \epsilon\), where
\(Y\) is \(n \times 1\), \(\textbf{X}\) is \(n \times (k + 1)\), and \(\beta\) is \((k + 1) \times 1\);
here \(n\) is the number of observations and \(k\) is
the number of covariates in the model.
By definition, the RSS is \(\displaystyle \sum_{i = 1}^{n} \epsilon_{i}^{2}\). We can expand this expression as follows: \[\begin{equation*} \begin{aligned} RSS & = \displaystyle \sum_{i = 1}^{n} \epsilon_{i}^{2}\\ & = \epsilon^{\top}\epsilon\\ & = (Y - \textbf{X}\beta)^{\top}(Y -\textbf{X}\beta)\\ & = (Y^{\top} - \beta^{\top} \textbf{X}^{\top}) (Y - \textbf{X}\beta)\\ & = Y^{\top}Y - Y^{\top}\textbf{X}\beta - \beta^{\top}\textbf{X}^{\top}Y + \beta^{\top}\textbf{X}^{\top}\textbf{X}\beta \end{aligned} \end{equation*}\]
A key property we will use is that \(Y^{\top}\textbf{X}\beta =
\beta^{\top}\textbf{X}^{\top}Y\). Why is this true? Start
with \(\beta^{\top}\textbf{X}^{\top}Y\). We can
use the property \(\textbf{B}^{\top}\textbf{A}^{\top} =
(\textbf{AB})^{\top}\) to rewrite it as \((\textbf{X}\beta)^{\top}Y\).
Checking dimensions, \(\beta^{\top}\textbf{X}^{\top}Y\) is the product of a
\(1 \times (k + 1)\), a \((k + 1) \times n\), and an \(n \times 1\) matrix,
so it is simply \(1 \times 1\): a scalar. Notice that \(Y^{\top}\textbf{X}\beta\) is also \(1 \times 1\). A convenient fact is that the
transpose of any \(1 \times 1\) matrix
is itself, meaning that \(Y^{\top}\textbf{X}\beta =
(Y^{\top}\textbf{X}\beta)^{\top} =
\beta^{\top}\textbf{X}^{\top}Y\). The two expressions are the same scalar: they
multiply exactly the same pieces, just in reversed (transposed) order, which
helps ground the intuition.
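To make this concrete, here is a quick numerical illustration (a toy example, not part of the derivation): with randomly generated \(Y\), \(\textbf{X}\), and \(\beta\) of conformable dimensions, the two expressions return the same \(1 \times 1\) result.
set.seed(1)
n <- 5; k <- 2
Y_toy <- matrix(rnorm(n), nrow = n, ncol = 1)
X_toy <- matrix(rnorm(n * (k + 1)), nrow = n, ncol = k + 1)
beta_toy <- matrix(rnorm(k + 1), nrow = k + 1, ncol = 1)
# Both products are 1 x 1 and identical (up to floating-point error)
t(Y_toy) %*% X_toy %*% beta_toy
t(beta_toy) %*% t(X_toy) %*% Y_toy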
Because of this, we may continue as such: \[\begin{equation*} \begin{aligned} RSS & = Y^{\top}Y - Y^{\top}\textbf{X}\beta - \beta^{\top}\textbf{X}^{\top}Y + \beta^{\top}\textbf{X}^{\top}\textbf{X}\beta\\ & = Y^{\top}Y - 2Y^{\top}\textbf{X}\beta + \beta^{\top}\textbf{X}^{\top}\textbf{X}\beta \end{aligned} \end{equation*}\]
Now, we take the first-order condition \(\frac{\partial RSS}{\partial \beta} = 0\): \[\begin{equation*} \begin{aligned} \frac{\partial RSS}{\partial \beta} & = \frac{\partial}{\partial \beta} \left[ Y^{\top}Y - 2Y^{\top}\textbf{X}\beta + \beta^{\top}\textbf{X}^{\top}\textbf{X}\beta \right]\\ & = 0 - (2Y^{\top}\textbf{X})^{\top} + 2(\textbf{X}^{\top}\textbf{X})\beta\\ & = -2(\textbf{X}^{\top}Y) + 2(\textbf{X}^{\top}\textbf{X})\beta \end{aligned} \end{equation*}\]
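The differentiation above uses two standard matrix-calculus identities (stated here for reference): for a conformable column vector \(a\) and a symmetric matrix \(\textbf{A}\), \[\begin{equation*} \frac{\partial\, a^{\top}\beta}{\partial \beta} = a \qquad \text{and} \qquad \frac{\partial\, \beta^{\top}\textbf{A}\beta}{\partial \beta} = 2\textbf{A}\beta. \end{equation*}\] Here the middle term has \(a^{\top} = -2Y^{\top}\textbf{X}\), so its derivative is \(-2\textbf{X}^{\top}Y\), and the quadratic term has \(\textbf{A} = \textbf{X}^{\top}\textbf{X}\), which is symmetric.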
Setting that quantity equal to zero: \[\begin{equation*} \begin{aligned} 0 & = -2(\textbf{X}^{\top}Y) + 2(\textbf{X}^{\top}\textbf{X})\beta \end{aligned} \end{equation*}\]
Dividing both sides by \(-2\), and noting that \(\textbf{X}^{\top}\textbf{X}\) is invertible whenever \(\textbf{X}\) has full column rank, we have:
\[\begin{equation*} \begin{aligned} 0 & = - (\textbf{X}^{\top}Y) + (\textbf{X}^{\top}\textbf{X})\beta \\ \implies (\textbf{X}^{\top}\textbf{X})\beta & = (\textbf{X}^{\top}Y)\\ \implies (\textbf{X}^{\top}\textbf{X})^{-1} (\textbf{X}^{\top}\textbf{X})\beta & = (\textbf{X}^{\top}\textbf{X})^{-1} (\textbf{X}^{\top}Y)\\ \implies \beta & = (\textbf{X}^{\top}\textbf{X})^{-1} (\textbf{X}^{\top}Y) \end{aligned} \end{equation*}\]
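Strictly speaking, the first-order condition only tells us this is a stationary point. A brief, standard justification that it is in fact the minimum: the Hessian is \(\frac{\partial^{2} RSS}{\partial \beta \partial \beta^{\top}} = 2\textbf{X}^{\top}\textbf{X}\), and for any nonzero vector \(v\) we have \(v^{\top}(2\textbf{X}^{\top}\textbf{X})v = 2\lVert \textbf{X}v \rVert^{2} > 0\) when \(\textbf{X}\) has full column rank, so the Hessian is positive definite and the solution above is the unique minimizer. That minimizer is what we call \(\hat{\beta}\).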
Definition: Projection Matrix
The Projection Matrix (or Hat Matrix) is defined as
the matrix that maps \(Y\) onto \(\hat{Y}\), or \(Y\) onto its fitted values. We have: \[\begin{equation*}
\begin{aligned}
\hat{Y} & = \textbf{X} \hat{\beta}\\
& = \textbf{X} (\textbf{X}^{\top} \textbf{X})^{-1}
\textbf{X}^{\top}Y\\
& = \textbf{P}Y
\end{aligned}
\end{equation*}\]
where \(\textbf{P} = \textbf{X}
(\textbf{X}^{\top} \textbf{X})^{-1} \textbf{X}^{\top}\).
\(\textbf{P}\) has two important
properties: it is symmetric, meaning \(\textbf{P} = \textbf{P}^{\top}\), and
idempotent, meaning \(\textbf{PP} =
\textbf{P}\).
Showing symmetry: \[\begin{equation*} \begin{aligned} \textbf{P}^{\top} & = (\textbf{X} (\textbf{X}^{\top} \textbf{X})^{-1} \textbf{X}^{\top})^{\top}\\ & = (\textbf{X}^{\top})^{\top} \left( (\textbf{X}^{\top}\textbf{X})^{-1} \right)^{\top} \textbf{X}^{\top}\\ & = \textbf{X} \left((\textbf{X}^{\top}\textbf{X})^{\top} \right)^{-1} \textbf{X}^{\top}\\ & = \textbf{X} (\textbf{X}^{\top} \textbf{X})^{-1} \textbf{X}^{\top}\\ & = \textbf{P} \end{aligned} \end{equation*}\]
Showing idempotency: \[\begin{equation*} \begin{aligned} \textbf{PP} & = \textbf{X} (\textbf{X}^{\top} \textbf{X})^{-1} \textbf{X}^{\top} \textbf{X} (\textbf{X}^{\top} \textbf{X})^{-1} \textbf{X}^{\top}\\ & = \textbf{X} \underbrace{(\textbf{X}^{\top} \textbf{X})^{-1} \textbf{X}^{\top} \textbf{X}}_{\textbf{I}} (\textbf{X}^{\top} \textbf{X})^{-1} \textbf{X}^{\top}\\ & = \textbf{X} (\textbf{X}^{\top} \textbf{X})^{-1} \textbf{X}^{\top}\\ & = \textbf{P} \end{aligned} \end{equation*}\]
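We can also check both properties numerically. A minimal sketch, using a small simulated design matrix (the values are arbitrary and assumed purely for illustration):
set.seed(2)
X_sim <- cbind(1, matrix(rnorm(20), nrow = 10, ncol = 2))
P <- X_sim %*% solve(t(X_sim) %*% X_sim) %*% t(X_sim)
# Symmetry: P equals its own transpose
all.equal(P, t(P))
# Idempotency: multiplying P by itself returns P
all.equal(P %*% P, P)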
There are four assumptions we need to make about \(\epsilon\) in order for us to generate unbiased and efficient regression coefficient estimates:
* We have the correct model, \(Y = \textbf{X}\beta + \epsilon\)
* Our model is exogenous: the errors are random with respect to \(\textbf{X}\), i.e. \(\text{Cov}(\textbf{X}, \epsilon) = 0\), or equivalently \(\mathbb{E}[\textbf{X}^{\top}\epsilon] = 0\)
* Our errors are homoskedastic and mean-zero: \(\mathbb{V}(\epsilon_{i} \mid X_{i}) = \sigma^2\) and \(\mathbb{E}[\epsilon] = 0\)
* Our observations are independent, such that \(\text{Cov}(\epsilon_{i}, \epsilon_{j}) = 0\) for \(i \neq j\)
Intuitively, what does each of these assumptions mean?
We are going to deconstruct the linear model using R. Load in the following data:
# This is graduate program admissions data from UCLA
my_data <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
# What are the variables? We use colnames() to see
colnames(my_data)
## [1] "admit" "gre" "gpa" "rank"
# It looks like `admit` is a binary variable indicating whether that applicant was admitted
# `gre` is their score on the graduate record examination
# `gpa` is their undergraduate grade point average
# `rank` is the rank of the applicant's undergraduate institution (1 = most prestigious, 4 = least)
# Let's summarize this data. I'm going to use `dplyr` to do so,
# but all of the `aggregate()` commands we used earlier will
# return the same result
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# dplyr uses what is called a "piping operator" %>%.
# Intuitively, it takes our data and passes it through
# some specified set of filters to get a desired output
# First, `group_by()` groups the data by some variable.
# In this case, we are grouping by whether applicants were admitted
my_data %>% group_by(admit) %>%
  # `summarise()` lets us customise our outputs. In this case, we are taking
  # the average of `gre` and `gpa`, grouped by `admit`
  summarise(mean_gre = mean(gre, na.rm = TRUE),
            mean_gpa = mean(gpa, na.rm = TRUE))
## # A tibble: 2 x 3
## admit mean_gre mean_gpa
## <int> <dbl> <dbl>
## 1 0 573. 3.34
## 2 1 619. 3.49
# We can see that the average gre score for successful applicants was
# almost 50 points higher than unsuccessful applicants, while the average
# gpa was about .14 higher.
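As a point of comparison (an optional aside), the same summary can be produced with base R's aggregate(), mentioned in the comment above:
# Group means of gre and gpa by admission status, using base R
aggregate(cbind(gre, gpa) ~ admit, data = my_data, FUN = mean)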
Suppose that we are interested in admission status as a function of GRE score, GPA, and rank. We could represent this as a linear model of the following form:
\[\begin{equation*} \text{admit} = \beta_0 + \beta_1 \times \text{gre} + \beta_2 \times \text{gpa} + \beta_3 \times \text{rank} + \epsilon \end{equation*}\]
R can estimate this model using the lm() function.
# for more information about the lm() function, use `?lm`
my_model <- lm(data = my_data, # This tells lm() where the data is coming from
               # This specifies the form of the model
               formula = admit ~ gre + gpa + rank
               )
# We can view a summary of the model using summary()
summary(my_model)
##
## Call:
## lm(formula = admit ~ gre + gpa + rank, data = my_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6617 -0.3417 -0.1947 0.5061 0.9556
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1824127 0.2169695 -0.841 0.4010
## gre 0.0004424 0.0002101 2.106 0.0358 *
## gpa 0.1510402 0.0633854 2.383 0.0176 *
## rank -0.1095019 0.0237617 -4.608 5.48e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4448 on 396 degrees of freedom
## Multiple R-squared: 0.09601, Adjusted R-squared: 0.08916
## F-statistic: 14.02 on 3 and 396 DF, p-value: 1.054e-08
# We can view other components of the model using the following:
# Coefficients
coef(my_model)
## (Intercept) gre gpa rank
## -0.1824126752 0.0004424258 0.1510402328 -0.1095019242
# Fitted values and residuals -- be careful, these will return a LOT of outputs
# predict(my_model)
# residuals(my_model)
# Residual standard error -- a couple of ways to do this
sigma(my_model)
## [1] 0.4448224
summary(my_model)$sigma
## [1] 0.4448224
But what is lm() actually doing? We can deconstruct the function using linear algebra. Recall that the matrix algebra solution for OLS coefficients can be expressed as \[\begin{equation*} \hat{\beta}= (X^{\top}X)^{-1}X^{\top}y \end{equation*}\]
We will also need the variance-covariance matrix of those estimators, \(\hat{\mathbb{V}}(\hat{\beta}) = \hat{\sigma}^2 (X^{\top}X)^{-1}\), where the residual variance is estimated as \[\begin{equation*} \hat{\sigma}^2 = \frac{e^{\top}e}{n - k - 1} \end{equation*}\] with \(e\) denoting the vector of residuals.
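Before building this by hand, one convenient reference point (an optional aside): lm() keeps the design matrix it used, and model.matrix() returns it, intercept column included. It is the same matrix we are about to construct manually with cbind().
# The first few rows of the design matrix that lm() constructed
head(model.matrix(my_model))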
We’ll estimate the regression manually. A few of the commands we will
use are listed below (a brief demonstration of each follows the list):
* %*% is the matrix multiplication operator
* t() will transpose a matrix
* solve() inverts a square matrix
* diag() extracts the diagonal of a matrix
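Here is a minimal sketch of each command on a small made-up matrix (the values are arbitrary, chosen only to show the operations):
A <- matrix(c(2, 1, 1, 3), nrow = 2, ncol = 2)
b <- c(1, 2)
A %*% b        # matrix multiplication: a 2 x 1 result
t(A)           # transpose (A is symmetric, so this equals A)
solve(A)       # the inverse of A
diag(A)        # the diagonal entries: 2 and 3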
Start with deriving the coefficients. Remember to add a column of 1’s for the intercept.
typeof(my_data)
## [1] "list"
# Our data is stored as a list (a data frame) -- we want a numeric matrix instead
my_matrix <- as.matrix(my_data)
# Let's create X and Y, where X are columns 2, 3, and 4
# and Y is column 1 from our data
# cbind() binds columns into a single object. In this case, we are creating a column
# of 1's and combining them with columns 2, 3, and 4 of my_matrix
X <- cbind(1, my_matrix[,2:4])
# Y is just the first column of my_matrix
y <- my_matrix[,1]
# Computing beta-hat. We'll break this down into two components,
# (XTX)^-1 and XTy
# Computing (XTX)^-1 in two steps: the inner part and the outer part
first_half_inner <- t(X) %*% X
first_half <- solve(first_half_inner)
# Computing XTy
second_half <- t(X) %*% y
# Now beta-hat, which is their product
beta_hat <- first_half %*% second_half
What type of object is beta_hat? How does it compare to the results from summary(lm())?
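One quick way to answer the first question (a small aside; the comparison to lm() comes next): beta_hat comes back from %*% as a matrix with a single column.
# beta_hat is a (k + 1) x 1 matrix -- here, 4 rows and 1 column
class(beta_hat)
dim(beta_hat)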
# In one command, we're going to use cbind() to combine our manual
# calculation for beta_hat with the results from my_model
cbind(beta_hat, coef(my_model))
## [,1] [,2]
## -0.1824126752 -0.1824126752
## gre 0.0004424258 0.0004424258
## gpa 0.1510402328 0.1510402328
## rank -0.1095019242 -0.1095019242
Now let’s calculate the variance. We’ll start by finding \(\hat{\sigma}^2\), which requires the residuals: we’ll use the coefficients we just computed to generate predicted values, and then take the difference between the observed and predicted outcomes.
# The predicted values are just our data times their respective beta_hat estimates
preds <- X %*% beta_hat
# Our residuals, then, are the difference between the observed y and the predicted values
resids <- y - preds
# Use the definition of sigma_hat. We'll do this in two parts, although you can do it in one
sigma_hat_numerator <- t(resids) %*% resids
sigma_hat_denominator <- nrow(my_data) - ncol(X)
sigma_hat <- sigma_hat_numerator / sigma_hat_denominator
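# As a quick sanity check (an extra step, not strictly needed here):
# the square root of sigma_hat should match the residual standard error
# that sigma(my_model) reported earlier (0.4448224)
sqrt(sigma_hat)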
# The variance-covariance matrix is estimated using sigma_hat and (XTX)^-1,
# which from earlier we called `first_half`
var_covar <- as.numeric(sigma_hat) * first_half
# And the standard errors are the square root of the diagonal of that matrix
manual_std_errors <- sqrt(diag(var_covar))
Finally, let’s compare results. Notice that the standard errors themselves aren’t stored directly in the lm object, but the variance-covariance matrix is easy to recover: you can extract it from summary(), or more simply with vcov().
# Using vcov()
my_model_std_errors <- sqrt(diag(vcov(my_model)))
# Reporting them side-by-side
cbind(my_model_std_errors, manual_std_errors)
## my_model_std_errors manual_std_errors
## (Intercept) 0.2169694971 0.2169694971
## gre 0.0002100836 0.0002100836
## gpa 0.0633853549 0.0633853549
## rank 0.0237616794 0.0237616794
They’re the same!
There you have it! lm() isn’t doing anything complicated under the hood – we can do the same in roughly a dozen lines of code.
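As one last aside (an extra check, not part of the walkthrough above): the normal equations \(\textbf{X}^{\top}\textbf{X}\hat{\beta} = \textbf{X}^{\top}Y\) imply that the residuals are orthogonal to every column of \(\textbf{X}\), including the column of 1’s, so they also sum to zero. We can confirm this numerically with the objects we just built.
# Each entry should be zero up to floating-point error
t(X) %*% resids
# Equivalently, the residuals are mean-zero by construction
mean(resids)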