Introducing the Linear Model

When working with the linear model, there are (generally) three assumptions we make about the relationship between our variables of interest:

  1. Linearity: \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\)
  2. Exogeneity: \(\mathbb{E}[\epsilon_i | X] = \mathbb{E}[\epsilon_i] = 0\)
  3. Homoskedasticity: \(\mathbb{V}(\epsilon_i | X) = \mathbb{V}(\epsilon_i) = \sigma^2\)

Intuitively, what does each of these assumptions mean?
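
To make them concrete, here is a small simulation sketch; the parameter values (an intercept of 1, a slope of 2, a noise standard deviation of 1.5) are made up purely for illustration. The outcome is a linear function of \(X\), and the errors are drawn with mean zero and constant variance, independently of \(X\).

# A small simulation that satisfies all three assumptions
# (the parameter values are arbitrary, chosen only for illustration)
set.seed(123)
n <- 500
X <- runif(n, min = 0, max = 10)
epsilon <- rnorm(n, mean = 0, sd = 1.5) # mean-zero, constant variance, drawn independently of X
Y <- 1 + 2 * X + epsilon                # Y is a linear function of X plus error

# The simulated errors should have a sample mean near zero and
# essentially no correlation with X
mean(epsilon)
cor(X, epsilon)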

Correlation
Let \(X\) and \(Y\) be random variables observed in pairs \((X_i, Y_i)\) for \(i = 1, \dots, n\). The (sample) correlation between \(X\) and \(Y\) is given by \[\begin{equation*} \begin{aligned} \hat{\rho}_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_{X} \sigma_{Y}} = \frac{\displaystyle \sum_{i = 1}^{n}(Y_i - \bar{Y}) \times (X_i - \bar{X})}{\sqrt{\displaystyle \sum_{i = 1}^{n} (X_i - \bar{X})^2} \sqrt{\displaystyle \sum_{i = 1}^{n} (Y_i - \bar{Y})^2}} \end{aligned} \end{equation*}\]

\(\hat{\rho}_{XY}\) always lies between -1 and 1, which makes it comparable across different kinds of variables. What are examples of random variables we would expect to have correlations close to 1, 0, and -1?
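
As a quick sketch (using simulated values, not real data), we can compute the correlation by hand using the formula above and check it against R's built-in cor() function:

# Simulated example: Y is a noisy increasing function of X, so we expect a
# correlation fairly close to 1 (the numbers here are arbitrary)
set.seed(42)
X <- rnorm(100)
Y <- 2 * X + rnorm(100)

# Manual calculation, following the formula above
sum((Y - mean(Y)) * (X - mean(X))) /
  (sqrt(sum((X - mean(X))^2)) * sqrt(sum((Y - mean(Y))^2)))

# Built-in calculation -- the two numbers should match
cor(X, Y)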

Sums of Squares

Connecting the Sums of Squares

When \(\hat{Y}_i\) is the fitted value from a linear regression, the total variation to be explained equals the variation the model explains plus the variation left over. Specifically, the total sum of squares (TSS) decomposes into the explained sum of squares (ESS) plus the residual sum of squares (RSS):

\[\begin{equation*} \begin{aligned} \displaystyle \underbrace{\sum_{i = 1}^{n} (Y_i - \bar{Y})^2}_{TSS} & = \underbrace{\sum_{i = 1}^{n} (\hat{Y}_i - \bar{Y})^2}_{ESS} + \underbrace{\sum_{i = 1}^{n} (Y_i - \hat{Y}_i)^2}_{RSS} \end{aligned} \end{equation*}\]
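
Why does this identity hold? Writing \(Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)\) and expanding the square gives

\[\begin{equation*} \begin{aligned} \sum_{i = 1}^{n} (Y_i - \bar{Y})^2 & = \sum_{i = 1}^{n} \left[ (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i) \right]^2 \\ & = \sum_{i = 1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i = 1}^{n} (Y_i - \hat{Y}_i)^2 + 2 \sum_{i = 1}^{n} (\hat{Y}_i - \bar{Y})(Y_i - \hat{Y}_i) \end{aligned} \end{equation*}\]

and, when the regression includes an intercept, the final cross term is zero because the least squares residuals sum to zero and are uncorrelated with the fitted values.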

\(R^2\): the coefficient of determination
What proportion of the total variation in \(Y_i\) are we explaining with \(\hat{Y}_i\)? We can use \(R^2\), a quantity computed entirely from the data, to describe this.

\[\begin{equation*} R^2 = \frac{\displaystyle \sum_{i = 1}^{n} (\hat{Y}_i - \bar{Y})^2}{\displaystyle \sum_{i = 1}^{n} (Y_i - \bar{Y})^2} = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS} \end{equation*}\]

Some properties:
- \(R^2 \in [0,1]\)
- If \(\hat{\beta}_1 = 0\), then \(R^2 = 0\) (why is this true?)
- If \(Y_i = \hat{Y}_i\) for all \(i\), then \(R^2 = 1\) (why is this true?)


Using the Linear Model in R

We are going to explore the linear model using R. Load in the following data:

# This is graduate program admissions data from UCLA
my_data <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

# What are the variables? We use colnames() to see
colnames(my_data)
## [1] "admit" "gre"   "gpa"   "rank"
# It looks like `admit` is a binary variable indicating whether that applicant was admitted
# `gre` is their score on the graduate record examination
# `gpa` is their undergraduate grade point average
# `rank` is the rank of... something
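
# To see the actual values behind these column names, we could peek at the
# first few rows (commented out here to keep the output short)
# head(my_data)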

# Let's summarize this data. I'm going to use `dplyr` to do so,
# but all of the `aggregate()` commands we used earlier will
# return the same result

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# dplyr uses what is called the "pipe operator", %>%.
# Intuitively, it takes our data and passes it along as the input
# to the next function in the chain, so we build up the result step by step

# First, `group_by()` groups the data by some variable.
# In this case, we are grouping by whether applicants were admitted
my_data %>% group_by(admit) %>% 
  
# `summarise()` lets us customise our outputs. In this case, we are taking
# the average of `gre` and `gpa` within each `admit` group
  summarise(mean_gre = mean(gre, na.rm = TRUE),
            mean_gpa = mean(gpa, na.rm = TRUE))
## # A tibble: 2 x 3
##   admit mean_gre mean_gpa
##   <int>    <dbl>    <dbl>
## 1     0     573.     3.34
## 2     1     619.     3.49
# We can see that the average gre score for successful applicants was
# almost 50 points higher than for unsuccessful applicants, while the
# average gpa was about .14 higher.
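
As noted in the comments above, we can get the same table with base R's aggregate() function. One way to do it (a sketch, using the formula interface) looks like this:

# Base R equivalent of the dplyr chain above: group means of gre and gpa by admit
aggregate(cbind(gre, gpa) ~ admit, data = my_data, FUN = mean)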

Suppose that we are interested in admission status as a function of GRE score, GPA, and rank. We could represent this as a linear model of the following form:

\[\begin{equation*} \text{admit} = \beta_0 + \beta_1 \times \text{gre} + \beta_2 \times \text{gpa} + \beta_3 \times \text{rank} + \epsilon \end{equation*}\]

R can estimate this model using the lm() function.

# for more information about the lm() function, use `?lm`

my_model <- lm(data = my_data, # This tells lm() where the data is coming from
               
               # This specifies the form of the model
               formula = admit ~ gre + gpa + rank
               )

# We can view a summary of the model using summary()
summary(my_model)
## 
## Call:
## lm(formula = admit ~ gre + gpa + rank, data = my_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6617 -0.3417 -0.1947  0.5061  0.9556 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.1824127  0.2169695  -0.841   0.4010    
## gre          0.0004424  0.0002101   2.106   0.0358 *  
## gpa          0.1510402  0.0633854   2.383   0.0176 *  
## rank        -0.1095019  0.0237617  -4.608 5.48e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4448 on 396 degrees of freedom
## Multiple R-squared:  0.09601,    Adjusted R-squared:  0.08916 
## F-statistic: 14.02 on 3 and 396 DF,  p-value: 1.054e-08
# We can view other components of the model using the following:

# Coefficients
coef(my_model)
##   (Intercept)           gre           gpa          rank 
## -0.1824126752  0.0004424258  0.1510402328 -0.1095019242
# Fitted values and residuals -- be careful, these will return a LOT of outputs
# predict(my_model)
# residuals(my_model)
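# A sketch of a way to peek at just the first few of each instead
# (also commented out; uncomment to run)
# head(predict(my_model))
# head(residuals(my_model))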

# Residual standard error -- a couple of ways to do this
sigma(my_model)
## [1] 0.4448224
summary(my_model)$sigma
## [1] 0.4448224

Let’s calculate the TSS, ESS, and RSS of our model.

# Manual TSS
TSS <- sum( (my_data$admit - mean(my_data$admit))^2 )

# Manual ESS
ESS <- sum( (predict(my_model) - mean(my_data$admit))^2 )

# Manual RSS
RSS <- sum( (my_data$admit - predict(my_model))^2 )

# Show TSS = ESS + RSS
TSS - ESS - RSS
## [1] -1.421085e-14
# That's basically zero, right?
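# A sturdier check than eyeballing the difference is base R's all.equal(),
# which allows for floating-point error; it should return TRUE here
all.equal(TSS, ESS + RSS)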

Now let’s calculate the \(R^2\).

# Remember R^2 = ESS / TSS = 1 - RSS / TSS
ESS / TSS
## [1] 0.09601333
1 - (RSS / TSS)
## [1] 0.09601333
# Checking with the actual result.
summary(my_model)$r.squared
## [1] 0.09601333
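
The summary() output above also reports an adjusted R-squared, which penalizes the fit for the number of predictors. As a sketch, we can reproduce it from the same pieces, where n is the number of observations and k = 3 is the number of predictors in our model:

# Adjusted R^2 = 1 - (RSS / (n - k - 1)) / (TSS / (n - 1))
n <- nrow(my_data)
k <- 3
1 - (RSS / (n - k - 1)) / (TSS / (n - 1))

# Compare with the value stored in the model summary
summary(my_model)$adj.r.squared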