A crucial property of the OLS estimator is that it minimizes the
residual sum of squares (RSS). We’re going to use the matrix algebra
representation of the linear model to derive \(\hat{\beta}\). Let \(Y = \textbf{X}\beta + \epsilon\), where
\(Y\) is \(n \times 1\), \(\textbf{X}\) is \(n \times (k + 1)\), and \(\beta\) is \((k + 1) \times 1\);
here \(n\) is the number of observations and \(k\) is
the number of covariates in the model.
By definition, the RSS is \(\displaystyle \sum_{i = 1}^{n} \epsilon_{i}^{2}\). We can expand this expression as follows: \[\begin{equation*} \begin{aligned} RSS & = \displaystyle \sum_{i = 1}^{n} \epsilon_{i}^{2}\\ & = \epsilon^{\top}\epsilon\\ & = (Y - \textbf{X}\beta)^{\top}(Y -\textbf{X}\beta)\\ & = (Y^{\top} - \beta^{\top} \textbf{X}^{\top}) (Y - \textbf{X}\beta)\\ & = Y^{\top}Y - Y^{\top}\textbf{X}\beta - \beta^{\top}\textbf{X}^{\top}Y + \beta^{\top}\textbf{X}^{\top}\textbf{X}\beta \end{aligned} \end{equation*}\]
A key property we will use is that \(Y^{\top}\textbf{X}\beta =
\beta^{\top}\textbf{X}^{\top}Y\). Why is this true? Start
with \(\beta^{\top}\textbf{X}^{\top}Y\). We can
use the property \(\textbf{B}^{\top}\textbf{A}^{\top} =
(\textbf{AB})^{\top}\) to rewrite it as \((\textbf{X}\beta)^{\top}Y\).
Checking dimensions, \(\beta^{\top}\textbf{X}^{\top}Y\) is the product of a
\(1 \times (k + 1)\), a \((k + 1) \times n\), and an \(n \times 1\) matrix,
so it is simply \(1 \times 1\): a scalar. Notice that \(Y^{\top}\textbf{X}\beta\) is also \(1 \times 1\). A convenient fact is that the
transpose of any \(1 \times 1\) matrix
is itself, meaning that \(Y^{\top}\textbf{X}\beta =
(Y^{\top}\textbf{X}\beta)^{\top} =
\beta^{\top}\textbf{X}^{\top}Y\). The two expressions are the same scalar: they
multiply exactly the same pieces, just in reversed (transposed) order, which
helps ground the intuition.
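To make this concrete, here is a quick numerical illustration (a toy example, not part of the derivation): with randomly generated \(Y\), \(\textbf{X}\), and \(\beta\) of conformable dimensions, the two expressions return the same \(1 \times 1\) result.
set.seed(1)
n <- 5; k <- 2
Y_toy <- matrix(rnorm(n), nrow = n, ncol = 1)
X_toy <- matrix(rnorm(n * (k + 1)), nrow = n, ncol = k + 1)
beta_toy <- matrix(rnorm(k + 1), nrow = k + 1, ncol = 1)
# Both products are 1 x 1 and identical (up to floating-point error)
t(Y_toy) %*% X_toy %*% beta_toy
t(beta_toy) %*% t(X_toy) %*% Y_toy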
Because of this, we may continue as such: \[\begin{equation*} \begin{aligned} RSS & = Y^{\top}Y - Y^{\top}\textbf{X}\beta - \beta^{\top}\textbf{X}^{\top}Y + \beta^{\top}\textbf{X}^{\top}\textbf{X}\beta\\ & = Y^{\top}Y - 2Y^{\top}\textbf{X}\beta + \beta^{\top}\textbf{X}^{\top}\textbf{X}\beta \end{aligned} \end{equation*}\]
Now, we take the first-order condition \(\frac{\partial RSS}{\partial \beta} = 0\): \[\begin{equation*} \begin{aligned} \frac{\partial RSS}{\partial \beta} & = \frac{\partial}{\partial \beta} \left[ Y^{\top}Y - 2Y^{\top}\textbf{X}\beta + \beta^{\top}\textbf{X}^{\top}\textbf{X}\beta \right]\\ & = 0 - (2Y^{\top}\textbf{X})^{\top} + 2(\textbf{X}^{\top}\textbf{X})\beta\\ & = -2(\textbf{X}^{\top}Y) + 2(\textbf{X}^{\top}\textbf{X})\beta \end{aligned} \end{equation*}\]
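The differentiation above uses two standard matrix-calculus identities (stated here for reference): for a conformable column vector \(a\) and a symmetric matrix \(\textbf{A}\), \[\begin{equation*} \frac{\partial\, a^{\top}\beta}{\partial \beta} = a \qquad \text{and} \qquad \frac{\partial\, \beta^{\top}\textbf{A}\beta}{\partial \beta} = 2\textbf{A}\beta. \end{equation*}\] Here the middle term has \(a^{\top} = -2Y^{\top}\textbf{X}\), so its derivative is \(-2\textbf{X}^{\top}Y\), and the quadratic term has \(\textbf{A} = \textbf{X}^{\top}\textbf{X}\), which is symmetric.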
Setting that quantity equal to zero: \[\begin{equation*} \begin{aligned} 0 & = -2(\textbf{X}^{\top}Y) + 2(\textbf{X}^{\top}\textbf{X})\beta \end{aligned} \end{equation*}\]
Dividing both sides by \(-2\), and noting that \(\textbf{X}^{\top}\textbf{X}\) is invertible whenever \(\textbf{X}\) has full column rank, we have:
\[\begin{equation*} \begin{aligned} 0 & = - (\textbf{X}^{\top}Y) + (\textbf{X}^{\top}\textbf{X})\beta \\ \implies (\textbf{X}^{\top}\textbf{X})\beta & = (\textbf{X}^{\top}Y)\\ \implies (\textbf{X}^{\top}\textbf{X})^{-1} (\textbf{X}^{\top}\textbf{X})\beta & = (\textbf{X}^{\top}\textbf{X})^{-1} (\textbf{X}^{\top}Y)\\ \implies \beta & = (\textbf{X}^{\top}\textbf{X})^{-1} (\textbf{X}^{\top}Y) \end{aligned} \end{equation*}\]
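Strictly speaking, the first-order condition only tells us this is a stationary point. A brief, standard justification that it is in fact the minimum: the Hessian is \(\frac{\partial^{2} RSS}{\partial \beta \partial \beta^{\top}} = 2\textbf{X}^{\top}\textbf{X}\), and for any nonzero vector \(v\) we have \(v^{\top}(2\textbf{X}^{\top}\textbf{X})v = 2\lVert \textbf{X}v \rVert^{2} > 0\) when \(\textbf{X}\) has full column rank, so the Hessian is positive definite and the solution above is the unique minimizer. That minimizer is what we call \(\hat{\beta}\).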
Definition: Projection Matrix
The Projection Matrix (or Hat Matrix) is defined as
the matrix that maps \(Y\) onto \(\hat{Y}\), or \(Y\) onto its fitted values. We have: \[\begin{equation*}
\begin{aligned}
\hat{Y} & = \textbf{X} \hat{\beta}\\
& = \textbf{X} (\textbf{X}^{\top} \textbf{X})^{-1}
\textbf{X}^{\top}Y\\
& = \textbf{P}Y
\end{aligned}
\end{equation*}\]
where \(\textbf{P} = \textbf{X}
(\textbf{X}^{\top} \textbf{X})^{-1} \textbf{X}^{\top}\).
\(\textbf{P}\) has two important
properties: it is symmetric, meaning \(\textbf{P} = \textbf{P}^{\top}\), and
idempotent, meaning \(\textbf{PP} =
\textbf{P}\).
Showing symmetry: \[\begin{equation*} \begin{aligned} \textbf{P}^{\top} & = (\textbf{X} (\textbf{X}^{\top} \textbf{X})^{-1} \textbf{X}^{\top})^{\top}\\ & = (\textbf{X}^{\top})^{\top} \left( (\textbf{X}^{\top}\textbf{X})^{-1} \right)^{\top} \textbf{X}^{\top}\\ & = \textbf{X} \left((\textbf{X}^{\top}\textbf{X})^{\top} \right)^{-1} \textbf{X}^{\top}\\ & = \textbf{X} (\textbf{X}^{\top} \textbf{X})^{-1} \textbf{X}^{\top}\\ & = \textbf{P} \end{aligned} \end{equation*}\]
Showing idempotency: \[\begin{equation*} \begin{aligned} \textbf{PP} & = \textbf{X} (\textbf{X}^{\top} \textbf{X})^{-1} \textbf{X}^{\top} \textbf{X} (\textbf{X}^{\top} \textbf{X})^{-1} \textbf{X}^{\top}\\ & = \textbf{X} \underbrace{(\textbf{X}^{\top} \textbf{X})^{-1} \textbf{X}^{\top} \textbf{X}}_{\textbf{I}} (\textbf{X}^{\top} \textbf{X})^{-1} \textbf{X}^{\top}\\ & = \textbf{X} (\textbf{X}^{\top} \textbf{X})^{-1} \textbf{X}^{\top}\\ & = \textbf{P} \end{aligned} \end{equation*}\]
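We can also check both properties numerically. A minimal sketch, using a small simulated design matrix (the values are arbitrary and assumed purely for illustration):
set.seed(2)
X_sim <- cbind(1, matrix(rnorm(20), nrow = 10, ncol = 2))
P <- X_sim %*% solve(t(X_sim) %*% X_sim) %*% t(X_sim)
# Symmetry: P equals its own transpose
all.equal(P, t(P))
# Idempotency: multiplying P by itself returns P
all.equal(P %*% P, P)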
There are four assumptions we need to make about \(\epsilon\) in order for us to generate unbiased and efficient regression coefficient estimates:
* We have the correct model, \(Y = \textbf{X}\beta + \epsilon\)
* Our model is exogenous: the errors are random with respect to \(\textbf{X}\), i.e. \(\text{Cov}(\textbf{X}, \epsilon) = 0\), or equivalently \(\mathbb{E}[\textbf{X}^{\top}\epsilon] = 0\)
* Our errors are homoskedastic and mean-zero: \(\mathbb{V}(\epsilon_{i} \mid X_{i}) = \sigma^2\) and \(\mathbb{E}[\epsilon] = 0\)
* Our observations are independent, such that \(\text{Cov}(\epsilon_{i}, \epsilon_{j}) = 0\) for \(i \neq j\)
Intuitively, what does each of these assumptions mean?
We are going to deconstruct the linear model using R. Load in the following data:
# This is graduate program admissions data from UCLA
my_data <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
# What are the variables? We use colnames() to see
colnames(my_data)
## [1] "admit" "gre" "gpa" "rank"
# It looks like `admit` is a binary variable indicating whether that applicant was admitted
# `gre` is their score on the graduate record examination
# `gpa` is their undergraduate grade point average
# `rank` is the rank of the applicant's undergraduate institution (1 = most prestigious, 4 = least)
# Let's summarize this data. I'm going to use `dplyr` to do so,
# but all of the `aggregate()` commands we used earlier will
# return the same result
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# dplyr uses what is called a "piping operator" %>%.
# Intuitively, it takes our data and passes it through
# some specified set of filters to get a desired output
# First, `group_by()` groups the data by some variable.
# In this case, we are grouping by whether applicants were admitted
my_data %>% group_by(admit) %>%
  # `summarise()` lets us customise our outputs. In this case, we are taking
  # the average of `gre` and `gpa`, grouped by `admit`
  summarise(mean_gre = mean(gre, na.rm = TRUE),
            mean_gpa = mean(gpa, na.rm = TRUE))
## # A tibble: 2 x 3
## admit mean_gre mean_gpa
## <int> <dbl> <dbl>
## 1 0 573. 3.34
## 2 1 619. 3.49
# We can see that the average gre score for successful applicants was
# almost 50 points higher than unsuccessful applicants, while the average
# gpa was about .14 higher.
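As a point of comparison (an optional aside), the same summary can be produced with base R's aggregate(), mentioned in the comment above:
# Group means of gre and gpa by admission status, using base R
aggregate(cbind(gre, gpa) ~ admit, data = my_data, FUN = mean)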
Suppose that we are interested in admission status as a function of GRE score, GPA, and rank. We could represent this as a linear model of the following form:
\[\begin{equation*} \text{admit} = \beta_0 + \beta_1 \times \text{gre} + \beta_2 \times \text{gpa} + \beta_3 \times \text{rank} + \epsilon \end{equation*}\]
R can estimate this model using the lm() function.
# for more information about the lm() function, use `?lm`
my_model <- lm(data = my_data, # This tells lm() where the data is coming from
               # This specifies the form of the model
               formula = admit ~ gre + gpa + rank
               )
# We can view a summary of the model using summary()
summary(my_model)
##
## Call:
## lm(formula = admit ~ gre + gpa + rank, data = my_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6617 -0.3417 -0.1947 0.5061 0.9556
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1824127 0.2169695 -0.841 0.4010
## gre 0.0004424 0.0002101 2.106 0.0358 *
## gpa 0.1510402 0.0633854 2.383 0.0176 *
## rank -0.1095019 0.0237617 -4.608 5.48e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4448 on 396 degrees of freedom
## Multiple R-squared: 0.09601, Adjusted R-squared: 0.08916
## F-statistic: 14.02 on 3 and 396 DF, p-value: 1.054e-08
# We can view other components of the model using the following:
# Coefficients
coef(my_model)
## (Intercept) gre gpa rank
## -0.1824126752 0.0004424258 0.1510402328 -0.1095019242
# Fitted values and residuals -- be careful, these will return a LOT of outputs
# predict(my_model)
# residuals(my_model)
# Residual standard error -- a couple of ways to do this
sigma(my_model)
## [1] 0.4448224
summary(my_model)$sigma
## [1] 0.4448224
But what is lm() actually doing? We can deconstruct the function using linear algebra. Recall that the matrix algebra solution for OLS coefficients can be expressed as \[\begin{equation*} \hat{\beta}= (X^{\top}X)^{-1}X^{\top}y \end{equation*}\]
We will also need the variance-covariance matrix of those estimators, \(\hat{\mathbb{V}}(\hat{\beta}) = \hat{\sigma}^2 (X^{\top}X)^{-1}\), where the residual variance is estimated as \[\begin{equation*} \hat{\sigma}^2 = \frac{e^{\top}e}{n - k - 1} \end{equation*}\] with \(e\) denoting the vector of residuals.
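Before building this by hand, one convenient reference point (an optional aside): lm() keeps the design matrix it used, and model.matrix() returns it, intercept column included. It is the same matrix we are about to construct manually with cbind().
# The first few rows of the design matrix that lm() constructed
head(model.matrix(my_model))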
We’ll estimate the regression manually. A few of the commands we will
use are listed below (a brief demonstration of each follows the list):
* %*% is the matrix multiplication operator
* t() will transpose a matrix
* solve() inverts a square matrix
* diag() extracts the diagonal of a matrix
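Here is a minimal sketch of each command on a small made-up matrix (the values are arbitrary, chosen only to show the operations):
A <- matrix(c(2, 1, 1, 3), nrow = 2, ncol = 2)
b <- c(1, 2)
A %*% b        # matrix multiplication: a 2 x 1 result
t(A)           # transpose (A is symmetric, so this equals A)
solve(A)       # the inverse of A
diag(A)        # the diagonal entries: 2 and 3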
Start with deriving the coefficients. Remember to add a column of 1’s for the intercept.
typeof(my_data)
## [1] "list"
# Our data is stored as a list (a data frame) -- we want a numeric matrix instead
my_matrix <- as.matrix(my_data)
# Let's create X and Y, where X are columns 2, 3, and 4
# and Y is column 1 from our data
# cbind() binds columns into a single object. In this case, we are creating a column
# of 1's and combining them with columns 2, 3, and 4 of my_matrix
X <- cbind(1, my_matrix[,2:4])
# Y is just the first column of my_matrix
y <- my_matrix[,1]
# Computing beta-hat. We'll break this down into two components,
# (XTX)^-1 and XTy
# Computing (XTX)^-1 in two steps: the inner part and the outer part
first_half_inner <- t(X) %*% X
first_half <- solve(first_half_inner)
# Computing XTy
second_half <- t(X) %*% y
# Now beta-hat, which is their product
beta_hat <- first_half %*% second_half
What type of object is beta_hat? How does it compare to the results from summary(lm())?
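One quick way to answer the first question (a small aside; the comparison to lm() comes next): beta_hat comes back from %*% as a matrix with a single column.
# beta_hat is a (k + 1) x 1 matrix -- here, 4 rows and 1 column
class(beta_hat)
dim(beta_hat)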
# In one command, we're going to use cbind() to combine our manual
# calculation for beta_hat with the results from my_model
cbind(beta_hat, coef(my_model))
## [,1] [,2]
## -0.1824126752 -0.1824126752
## gre 0.0004424258 0.0004424258
## gpa 0.1510402328 0.1510402328
## rank -0.1095019242 -0.1095019242
Now let’s calculate the variance. We’ll start by finding \(\hat{\sigma}^2\), which requires the residuals: we’ll use the coefficients we just computed to generate predicted values, and then take the difference between the observed and predicted outcomes.
# The predicted values are just our data times their respective beta_hat estimates
preds <- X %*% beta_hat
# Our residuals, then, are the difference between the observed y and the predicted values
resids <- y - preds
# Use the definition of sigma_hat. We'll do this in two parts, although you can do it in one
sigma_hat_numerator <- t(resids) %*% resids
sigma_hat_denominator <- nrow(my_data) - ncol(X)
sigma_hat <- sigma_hat_numerator / sigma_hat_denominator
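# As a quick sanity check (an extra step, not strictly needed here):
# the square root of sigma_hat should match the residual standard error
# that sigma(my_model) reported earlier (0.4448224)
sqrt(sigma_hat)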
# The variance-covariance matrix is estimated using sigma_hat and (XTX)^-1,
# which from earlier we called `first_half`
var_covar <- as.numeric(sigma_hat) * first_half
# And the standard errors are the square root of the diagonal of that matrix
manual_std_errors <- sqrt(diag(var_covar))
Finally, let’s compare results. Notice that the standard errors themselves aren’t stored directly in the lm object, but the variance-covariance matrix is easy to recover: you can extract it from summary(), or more simply with vcov().
# Using vcov()
my_model_std_errors <- sqrt(diag(vcov(my_model)))
# Reporting them side-by-side
cbind(my_model_std_errors, manual_std_errors)
## my_model_std_errors manual_std_errors
## (Intercept) 0.2169694971 0.2169694971
## gre 0.0002100836 0.0002100836
## gpa 0.0633853549 0.0633853549
## rank 0.0237616794 0.0237616794
They’re the same!
There you have it! lm() isn’t doing anything complicated under the hood – we can do the same in roughly a dozen lines of code.
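As one last aside (an extra check, not part of the walkthrough above): the normal equations \(\textbf{X}^{\top}\textbf{X}\hat{\beta} = \textbf{X}^{\top}Y\) imply that the residuals are orthogonal to every column of \(\textbf{X}\), including the column of 1’s, so they also sum to zero. We can confirm this numerically with the objects we just built.
# Each entry should be zero up to floating-point error
t(X) %*% resids
# Equivalently, the residuals are mean-zero by construction
mean(resids)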