When working with data, there are (generally) three assumptions we make about the relationship between our variables of interest:
Intuitively, what does each of these assumptions mean?
Correlation
Let \(X\) and \(Y\) be random variables. The
correlation between \(X\) and
\(Y\) is given by \[\begin{equation*}
\begin{aligned}
\hat{\rho}_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_{X} \sigma_{Y}} =
\frac{\displaystyle \sum_{i = 1}^{n}(Y_i - \bar{Y}) \times (X_i -
\bar{X})}{\sqrt{\displaystyle \sum_{i = 1}^{n} (X_i - \bar{X})^2}
\sqrt{\displaystyle \sum_{i = 1}^{n} (Y_i - \bar{Y})^2}}
\end{aligned}
\end{equation*}\]
\(\hat{\rho}_{XY}\) always lies between -1 and 1 and is unit-free, so it is comparable across different kinds of variables. What are examples of random variables we would expect to have correlations close to 1, 0, and -1?
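As a quick illustration of the formula, here is a sketch on simulated data (`x`, `y`, and the seed below are made up for this example and are not part of the admissions data used later). We compute the sample correlation by hand and compare it to R's built-in `cor()`:
# Simulate two positively related variables
set.seed(1234)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
# Sample correlation "by hand", following the formula above
rho_hat <- sum((y - mean(y)) * (x - mean(x))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))
# The two should match
rho_hat
cor(x, y)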
Sums of Squares
Total Sum of Squares (TSS): How much variance in \(Y_i\) is there to explain?
\[\begin{equation*}
\displaystyle \sum_{i = 1}^{n} (Y_i - \bar{Y})^2
\end{equation*}\]
Explained Sum of Squares (ESS): How much of this variance do we
explain?
\[\begin{equation*}
\displaystyle \sum_{i = 1}^{n} (\hat{Y}_i - \bar{Y})^2
\end{equation*}\]
Residual Sum of Squares (RSS): How much variance is
unexplained?
\[\begin{equation*}
\displaystyle \sum_{i = 1}^{n} (Y_i - \hat{Y}_i)^2
\end{equation*}\]
Connecting the Sums of Squares
When \(\hat{Y}_i\) is the fitted value from a linear regression (with an intercept), the total variance to explain equals the explained variance plus the unexplained variance. Specifically, we can express the total sum of squares as the sum of the explained sum of squares and the residual sum of squares, or:
\[\begin{equation*} \begin{aligned} \displaystyle \underbrace{\sum_{i = 1}^{n} (Y_i - \bar{Y})^2}_{TSS} & = \underbrace{\sum_{i = 1}^{n} (\hat{Y}_i - \bar{Y})^2}_{ESS} + \underbrace{\sum_{i = 1}^{n} (Y_i - \hat{Y}_i)^2}_{RSS} \end{aligned} \end{equation*}\]
\(R^2\): the coefficient of
determination
What proportion of the total variation in \(Y_i\) are we explaining with \(\hat{Y}_i\)? We can use the \(R^2\), a statistic constructed from the data, to describe this.
\[\begin{equation*} R^2 = \frac{\displaystyle \sum_{i = 1}^{n} (\hat{Y}_i - \bar{Y})^2}{\displaystyle \sum_{i = 1}^{n} (Y_i - \bar{Y})^2} = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS} \end{equation*}\]
Some properties:
- \(R^2 \in [0,1]\)
- If \(\hat{\beta}_1 = 0\), then \(R^2 = 0\) (why is this
true?)
- If \(Y_i = \hat{Y}_i\) for all \(i\), then \(R^2 =
1\) (why is this true?)
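To see the last two properties in action, here is a small sketch on simulated data (`x`, `y_noise`, `y_exact`, and the seed below are illustrative choices, not part of the admissions example that follows):
set.seed(42)
x <- rnorm(100)
# Case 1: the outcome is pure noise, unrelated to x, so the estimated
# slope is near zero and the R^2 is close to 0
y_noise <- rnorm(100)
summary(lm(y_noise ~ x))$r.squared
# Case 2: the outcome is an exact linear function of x, so Y_i = Y_hat_i
# for every observation and the R^2 is 1
# (R may warn that the fit is "essentially perfect" -- that is the point)
y_exact <- 3 + 2 * x
summary(lm(y_exact ~ x))$r.squared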
We are going to explore the linear model using R. Load in the following data:
# This is graduate program admissions data from UCLA
my_data <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
# What are the variables? We use colnames() to see
colnames(my_data)
## [1] "admit" "gre" "gpa" "rank"
# It looks like `admit` is a binary variable indicating whether that applicant was admitted
# `gre` is their score on the graduate record examination
# `gpa` is their undergraduate grade point average
# `rank` is the prestige rank of the applicant's undergraduate institution (1 = highest, 4 = lowest)
# Let's summarize this data. I'm going to use `dplyr` to do so,
# but all of the `aggregate()` commands we used earlier will
# return the same result
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# dplyr uses what is called a "piping operator" %>%.
# Intuitively, it takes our data and passes it through
# some specified set of filters to get a desired output
# First, `group_by()` groups the data by some variable.
# In this case, we are grouping by whether applicants were admitted
my_data %>% group_by(admit) %>%
  # `summarise()` lets us customise our outputs. In this case, we are taking
  # the average of `gre` and `gpa` within each level of `admit`
  summarise(mean_gre = mean(gre, na.rm = TRUE),
            mean_gpa = mean(gpa, na.rm = TRUE))
## # A tibble: 2 x 3
## admit mean_gre mean_gpa
## <int> <dbl> <dbl>
## 1 0 573. 3.34
## 2 1 619. 3.49
# We can see that the average gre score for successful applicants was
# almost 50 points higher than unsuccessful applicants, while the average
# gpa was about .14 higher.
Suppose that we are interested in admission status as a function of GRE score, GPA, and rank. We could represent this as a linear model of the following form:
\[\begin{equation*} \text{admit} = \beta_0 + \beta_1 \times \text{gre} + \beta_2 \times \text{gpa} + \beta_3 \times \text{rank} + \epsilon \end{equation*}\]
R can estimate this model using the lm() function.
# for more information about the lm() function, use `?lm`
my_model <- lm(data = my_data, # This tells lm() where the data is coming from
               # This specifies the form of the model
               formula = admit ~ gre + gpa + rank
               )
# We can view a summary of the model using summary()
summary(my_model)
##
## Call:
## lm(formula = admit ~ gre + gpa + rank, data = my_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6617 -0.3417 -0.1947 0.5061 0.9556
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1824127 0.2169695 -0.841 0.4010
## gre 0.0004424 0.0002101 2.106 0.0358 *
## gpa 0.1510402 0.0633854 2.383 0.0176 *
## rank -0.1095019 0.0237617 -4.608 5.48e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4448 on 396 degrees of freedom
## Multiple R-squared: 0.09601, Adjusted R-squared: 0.08916
## F-statistic: 14.02 on 3 and 396 DF, p-value: 1.054e-08
# We can view other components of the model using the following:
# Coefficients
coef(my_model)
## (Intercept) gre gpa rank
## -0.1824126752 0.0004424258 0.1510402328 -0.1095019242
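# As a sketch, a fitted value is just these coefficients plugged into the
# model formula: beta_0 + beta_1*gre + beta_2*gpa + beta_3*rank.
# The applicant profile below (gre = 600, gpa = 3.5, rank = 2) is made up
# purely for illustration.
sum(coef(my_model) * c(1, 600, 3.5, 2))
# The same number via predict()
predict(my_model, newdata = data.frame(gre = 600, gpa = 3.5, rank = 2))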
# Fitted values and residuals -- be careful, these will return a LOT of outputs
# predict(my_model)
# residuals(my_model)
# Residual standard error -- a couple of ways to do this
sigma(my_model)
## [1] 0.4448224
summary(my_model)$sigma
## [1] 0.4448224
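We can also connect the residual standard error back to the residuals themselves. A quick sketch, using the standard formula (the square root of the residual sum of squares divided by the residual degrees of freedom):
# Residual standard error by hand: sqrt( sum of squared residuals / residual df )
sqrt(sum(residuals(my_model)^2) / df.residual(my_model))
# This should match sigma(my_model) above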
Let’s calculate the TSS, ESS, and RSS of our model.
# Manual TSS
TSS <- sum( (my_data$admit - mean(my_data$admit))^2 )
# Manual ESS
ESS <- sum( (predict(my_model) - mean(my_data$admit))^2 )
# Manual RSS
RSS <- sum( (my_data$admit - predict(my_model))^2 )
# Show TSS = ESS + RSS
TSS - ESS - RSS
## [1] -1.421085e-14
# That's basically zero, right?
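Rather than eyeballing that difference, we can let R check the equality up to floating-point tolerance with `all.equal()`:
# TRUE if TSS and ESS + RSS agree up to numerical tolerance
all.equal(TSS, ESS + RSS)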
Now let’s calculate the \(R^2\).
# Remember R^2 = ESS / TSS = 1 - RSS / TSS
ESS / TSS
## [1] 0.09601333
1 - (RSS / TSS)
## [1] 0.09601333
# Checking with the actual result.
summary(my_model)$r.squared
## [1] 0.09601333
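As a final check that ties back to the correlation section, for a linear regression with an intercept the \(R^2\) also equals the squared correlation between the observed outcome and the fitted values:
# Squared correlation between the observed outcome and the fitted values
cor(my_data$admit, predict(my_model))^2
# This should match summary(my_model)$r.squared above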