OLS provides a lot of guarantees about the asymptotic properties of
our coefficient estimates, \(\hat{\beta}\), but
it doesn’t tell us whether we have a good model.
In particular, it tells us nothing about how well our model
will perform on data we haven’t seen before.
The bias-variance tradeoff of model selection is a
well-known problem in statistics and machine learning. Intuitively, it
has two components: bias, the error that comes from a model that is too
simple (or misspecified) to capture the true relationship, and variance,
the degree to which our estimates change from sample to sample.
Think about how these two are related. Generally, we try to reduce
bias by introducing predictors that capture things we think
are important, or that confound the relationship we are interested in.
But doing so increases the variance: as we introduce more
predictors (that is, as \(k\) grows large),
the fitted regression responds more and more to changes in those inputs.
In the most extreme case, we would have a predictor for every single
point in our data: our in-sample predictions would be perfect, but the
associated variance would be extraordinarily high. If you then feed new
data into this model, it will respond wildly to even small changes in
the predictors; in other words, it will be extremely noisy.
Another way to state this tradeoff is that low-bias (more flexible) models tend to predict well in-sample but poorly out-of-sample, while low-variance (simpler) models tend to predict worse in-sample but better out-of-sample. This presents a difficult problem for researchers. We generally want our models to have as little bias as possible, but if we overfit, our coefficient estimates may reflect meaningless noise. It also means that a good fit in-sample does not necessarily indicate a good model.
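To make this concrete, here is a minimal simulated sketch (not part of the lab data; all object names are my own) that compares a deliberately overfit polynomial to a simple linear fit. The flexible model always wins in-sample, but it typically loses on fresh data drawn from the same process.
# Toy illustration of the bias-variance tradeoff with simulated data
set.seed(1)
x <- runif(50)
y <- 2 * x + rnorm(50, sd = 0.5)           # training sample
x_new <- runif(50)
y_new <- 2 * x_new + rnorm(50, sd = 0.5)   # "new" data from the same process
flexible <- lm(y ~ poly(x, 15))  # low bias, high variance
simple   <- lm(y ~ x)            # higher bias, low variance
# In-sample MSE favors the flexible model...
c(flexible = mean(residuals(flexible)^2), simple = mean(residuals(simple)^2))
# ...but out-of-sample MSE typically favors the simple one
c(flexible = mean((y_new - predict(flexible, data.frame(x = x_new)))^2),
  simple   = mean((y_new - predict(simple, data.frame(x = x_new)))^2))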
Cross-validation refers to a set of methods for measuring the
performance of a given predictive model on new test data. The basic
idea of these techniques is to divide our data into two sets: a
training set, which we use to fit the model, and a test set, which we
use to evaluate its predictions.
The statistic we use to evaluate predictive performance is the mean squared error, or MSE:
\[\begin{equation*} \text{MSE} = \frac{1}{N} \displaystyle \sum_{i = 1}^{N} \left(Y_i- \hat{Y}_{i}^{-\kappa(i)} \right)^2 \end{equation*}\]
where \(\hat{Y}_{i}^{-\kappa(i)}\) denotes the prediction for observation \(i\) from a model fit without the fold \(\kappa(i)\) that contains it. In general, within this framework we wish to minimize the MSE
across different model specifications.
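As a quick aside, the MSE is easy to compute directly from a vector of observed outcomes and a vector of predictions. Here is a minimal helper implementing the formula above (the function name is my own choice); we could reuse something like it in the examples below.
# A small helper implementing the MSE formula above
mse <- function(observed, predicted) {
  mean((observed - predicted)^2)
}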
For this approach, we randomly divide the data into \(K\) equally-sized partitions (often called
“folds”); \(K = 5\) and \(K = 10\) are common choices. For each fold,
we treat that fold as the test set, fit each candidate model on the
remaining \(K - 1\) folds, generate predictions for the held-out fold,
and compute the MSE of those predictions.
Then, we average the MSE across the folds. At the end, we compare the
average MSE for each model, and (in general) the model with the smallest
MSE is “best.”
An applied example: suppose we are interested in the effects of
different socioeconomic indicators on fertility (this is an example I
inherited from previous labs, so bear with me). The swiss
dataset in base R reports measures of these concepts for 47
French-speaking provinces of Switzerland around the year 1888.
# Fertility is the outcome of interest
# Agriculture, Examination, Education, Catholic, and Infant Mortality are
# predictors (I'm not going to theorize why)
# Summarizing this data
summary(swiss)[4,]
## Fertility Agriculture Examination Education
## "Mean :70.14 " "Mean :50.66 " "Mean :16.49 " "Mean :10.98 "
## Catholic Infant.Mortality
## "Mean : 41.144 " "Mean :19.94 "
# This data has 47 observations across those six variables
dim(swiss)
## [1] 47 6
Before we even begin to think about model selection, we want to partition this data into \(K\) folds. In this case, we’ll choose \(K = 3\) (not a standard choice, just one inherited from previous labs).
# Setting seed for replicability
set.seed(525)
# This line of code creates designations for each of our three folds.
# That is, each of the 47 observations is randomly assigned to
# "fold 1", "fold 2", or "fold 3".
folds <- sample(1:3, nrow(swiss), replace = TRUE)
head(folds)
## [1] 2 2 1 3 2 2
# This indicates that observations 1 and 2 are in fold 2, observation 3 is in
# fold 1, observation 4 is in fold 3, and so on
# Notice that in this example, with small N, our folds are not evenly-sized!
# That does not matter too much for this exercise, but as N grows large
# the folds will asymptotically approach equal size.
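As an aside, if you wanted fold sizes that are as equal as possible, one common alternative (shown only for reference; the lab keeps the folds object created above) is to shuffle a balanced vector of labels.
# Alternative: shuffle a label vector with (nearly) equal counts of 1, 2, 3
folds_equal <- sample(rep(1:3, length.out = nrow(swiss)))
table(folds_equal)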
Now, given these fold assignments, we’ll partition our training and test data. For this exercise, we will use folds 2 and 3 as the training data, and fold 1 as the test data.
training_data <- swiss[folds != 1,]
test_data <- swiss[folds == 1,]
Suppose we have two models to compare: one where fertility is a function of agriculture, education, and Catholic, and another where it is a function of a different set of covariates: agriculture, examination, and infant mortality. That is, \[\begin{equation*} \begin{aligned} \text{Model 1:}\, & Y = \beta_0 + \beta_1 \times \text{agriculture} + \beta_2 \times \text{education} + \beta_3 \times \text{catholic} + \epsilon \\ \text{Model 2:}\, & Y = \beta_0 + \beta_1 \times \text{agriculture} + \beta_2 \times \text{examination} + \beta_3 \times \text{infant mortality} + \epsilon \end{aligned} \end{equation*}\]
We will fit these models using our training data (don’t use the full data set!):
# Training models
model_1 <- lm(data = training_data,
Fertility ~ Agriculture + Education + Catholic)
model_2 <- lm(data = training_data,
Fertility ~ Agriculture + Examination + Infant.Mortality)
Next, we fit those models to our test data and calculate the residuals.
# Fitting models to test data
test_fitted_model_1 <- predict(object = model_1,
test_data)
test_fitted_model_2 <- predict(object = model_2,
test_data)
# Calculating and storing the residuals
resids_model_1 <- (test_data$Fertility - test_fitted_model_1)
resids_model_2 <- (test_data$Fertility - test_fitted_model_2)
# Reporting the MSE for each model
cbind("Model 1 MSE" = mean(resids_model_1^2),
      "Model 2 MSE" = mean(resids_model_2^2))
## Model 1 MSE Model 2 MSE
## [1,] 55.66296 70.95336
We’ve done this process for a single fold, but we want to do it for every fold. A for-loop is a convenient way to iterate over all of the folds.
# Creating vectors to store the MSE from each fold
model_1_MSE_Kfold <- NULL
model_2_MSE_Kfold <- NULL
# Note that results computed inside the loop are overwritten on each
# iteration unless we save them into objects created beforehand
# (here, the two MSE vectors above)
# We're iterating over each of our 3 folds, so this loop
# will be for i in 1:3, or i = 1, 2, 3
for (i in 1:3) {
  # Classifying training and test data
  training_data <- swiss[folds != i,]
  test_data <- swiss[folds == i,]
  # Fitting models using the training data
  model_1 <- lm(data = training_data,
                Fertility ~ Agriculture + Education + Catholic)
  model_2 <- lm(data = training_data,
                Fertility ~ Agriculture + Examination + Infant.Mortality)
  # Finding fitted values using the test data
  test_fitted_model_1 <- predict(object = model_1, test_data)
  test_fitted_model_2 <- predict(object = model_2, test_data)
  # Calculating the residuals, then storing each fold's MSE
  model_1_resid_Kfold <- (test_data$Fertility - test_fitted_model_1)
  model_2_resid_Kfold <- (test_data$Fertility - test_fitted_model_2)
  model_1_MSE_Kfold[i] <- mean(model_1_resid_Kfold^2)
  model_2_MSE_Kfold[i] <- mean(model_2_resid_Kfold^2)
  # You can do these final steps in one line, but I've separated them
  # to show 1) how we calculate each residual and 2) how to store the
  # fold-specific MSE in the vectors created before the loop
}
Lastly, we compare the averaged MSE for each of these: whichever is lower is “better” in this context.
# How do we interpret these results?
cbind("Model 1 Average MSE:" = mean(model_1_MSE_Kfold),
"Model 2 Average MSE:" = mean(model_2_MSE_Kfold)
)
## Model 1 Average MSE: Model 2 Average MSE:
## [1,] 63.87491 92.18196
Notice that model_1_MSE_Kfold and model_2_MSE_Kfold are both three elements long – why is this the case?
length(model_1_MSE_Kfold)
## [1] 3
length(model_2_MSE_Kfold)
## [1] 3
Leave-one-out cross-validation (LOOCV) is a special form of \(K\)-fold cross-validation where \(K = N\): each observation takes a turn as the test set, and the models are trained on the remaining \(N - 1\) observations. The process looks the same as before, except that we no longer need to worry about fold assignments.
# Creating vectors to store the MSE from each iteration
model_1_MSE_Nfold <- NULL
model_2_MSE_Nfold <- NULL
# As before, results computed inside the loop are overwritten on each
# iteration unless we save them into the vectors created beforehand
# We're iterating over each of our observations now
for (i in 1:nrow(swiss)) {
  # Classifying training and test data: leave out observation i
  training_data <- swiss[-i,]
  test_data <- swiss[i,]
  # Fitting models using the training data
  model_1 <- lm(data = training_data,
                Fertility ~ Agriculture + Education + Catholic)
  model_2 <- lm(data = training_data,
                Fertility ~ Agriculture + Examination + Infant.Mortality)
  # Finding fitted values using the test data (a single observation)
  test_fitted_model_1 <- predict(object = model_1, test_data)
  test_fitted_model_2 <- predict(object = model_2, test_data)
  # Calculating the squared error for the held-out observation
  model_1_resid_Nfold <- (test_data$Fertility - test_fitted_model_1)
  model_2_resid_Nfold <- (test_data$Fertility - test_fitted_model_2)
  model_1_MSE_Nfold[i] <- mean(model_1_resid_Nfold^2)
  model_2_MSE_Nfold[i] <- mean(model_2_resid_Nfold^2)
}
# Why are these vectors 47 elements long?
cbind("Model 1 MSE Length" = length(model_1_MSE_Nfold),
"Model 2 MSE Length" = length(model_2_MSE_Nfold))
## Model 1 MSE Length Model 2 MSE Length
## [1,] 47 47
# Reporting results
cbind("Model 1 Average MSE:" = mean(model_1_MSE_Nfold),
"Model 2 Average MSE:" = mean(model_2_MSE_Nfold)
)
## Model 1 Average MSE: Model 2 Average MSE:
## [1,] 65.02934 85.89542
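As an aside, for linear models fit by OLS there is a well-known shortcut: the leave-one-out residual for observation \(i\) equals the ordinary residual divided by \(1 - h_i\), where \(h_i\) is that observation’s hat value, so LOOCV does not actually require refitting the model \(N\) times. A minimal sketch (the object name is my own) that should reproduce the Model 1 average MSE reported above, up to floating-point rounding:
# LOOCV MSE for Model 1 without a loop, via the hat-value identity
full_model_1 <- lm(Fertility ~ Agriculture + Education + Catholic, data = swiss)
mean((residuals(full_model_1) / (1 - hatvalues(full_model_1)))^2)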
A word of caution: all we’ve done here is compare mean squared errors across two different models. This is useful for thinking about model selection and generalizability, but remember that we haven’t actually reported any regression coefficients or made any substantive interpretations. This exercise is about validating whether a model performs well on new data, not about whether that model says anything informative about the world.
We’ve discussed the problem of multicollinearity in previous
labs; namely, that perfectly multicollinear (linearly dependent)
variables result in \(\textbf{X}^{\top}\textbf{X}\) being
non-invertible, so that \((\textbf{X}^{\top}\textbf{X})^{-1}\) (and
by extension, \(\hat{\beta}\)) does not
exist. One tool to mitigate this problem is
ridge regression, also known as Tikhonov regularization. This
method accepts some bias in our estimator in exchange for
improved efficiency.
In lecture, we demonstrated that the Ridge estimator \(\hat{\beta}_{R}\) is given by \[\begin{equation*} \hat{\beta}_{R} = (\textbf{X}^{\top}\textbf{X} + \lambda \textbf{I})^{-1}\textbf{X}^{\top}Y \end{equation*}\]
where \(\lambda \geq 0\) is called
the ridge parameter. Notice that as \(\lambda \to 0\), \(\hat{\beta}_{R} \to \hat{\beta}\).
Likewise, as \(\lambda \to \infty\),
\(\hat{\beta}_{R} \to 0\). Recall that
\(\hat{\beta}_{R}\) is a
biased estimate of \(\beta\),
so when using ridge regression we are conceding some bias to make gains
in efficiency.
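Before turning to software, it may help to see this formula computed directly. The sketch below (object names and the choice of \(\lambda\) are my own) applies the ridge expression to a few standardized columns of the swiss data; note that glmnet, which we use later, scales its penalty differently, so its coefficients will not match this manual calculation for the same numerical \(\lambda\).
# Computing the ridge estimator by hand for an arbitrary lambda
X_demo <- scale(as.matrix(swiss[, c("Agriculture", "Education", "Catholic")]))
Y_demo <- as.numeric(scale(swiss$Fertility))
lambda_demo <- 0.5  # arbitrary illustrative value
beta_ridge <- solve(t(X_demo) %*% X_demo + lambda_demo * diag(ncol(X_demo))) %*%
  t(X_demo) %*% Y_demo
beta_ridge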
We’ll revisit the swiss dataset as an applied example.
To artificially induce multicollinearity, we are going to define a new
variable Infant.Survival that is simply 100 - Infant.Mortality, and
we’ll also add multicollinear variables breaking down (fictional) types
of infant mortality. Running a naive regression of Fertility on all of
the covariates, we run into a problem.
sw <- swiss # a modified `swiss` dataset
# Made-up multicollinear variables
sw$Infant.Survival <- 100 - swiss$Infant.Mortality
sw$Infant.Mortality.Disease <- 0.65 * swiss$Infant.Mortality
sw$Infant.Mortality.Starvation <- 0.30 * swiss$Infant.Mortality
sw$Infant.Mortality.Heresy <- 0.05 * swiss$Infant.Mortality
# If we run a naive regression of fertility on all covariates,
# we get a problem with multicollinearity
summary(lm(data = sw,
Fertility ~ .))
##
## Call:
## lm(formula = Fertility ~ ., data = sw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2743 -5.2617 0.5032 4.1198 15.3213
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
## Agriculture -0.17211 0.07030 -2.448 0.01873 *
## Examination -0.25801 0.25388 -1.016 0.31546
## Education -0.87094 0.18303 -4.758 2.43e-05 ***
## Catholic 0.10412 0.03526 2.953 0.00519 **
## Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
## Infant.Survival NA NA NA NA
## Infant.Mortality.Disease NA NA NA NA
## Infant.Mortality.Starvation NA NA NA NA
## Infant.Mortality.Heresy NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
## F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
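The four NA rows arise because the added variables are exact linear combinations of Infant.Mortality (and the intercept), so the design matrix is rank-deficient. A quick way to see this (a sketch; the object name is my own):
# The design matrix has 10 columns but lower rank, hence the singularities
X_naive <- model.matrix(Fertility ~ ., data = sw)
c(columns = ncol(X_naive), rank = qr(X_naive)$rank)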
We’ll use ridge regression with parameter \(\lambda\) to circumvent this problem. We can actually use \(K\)-fold cross-validation to determine the optimal lambda that minimizes MSE.
# Creating a vector of lambdas to test
lambdas <- seq(0, 1,
by = 0.001)
# Preparing `sw` for ridge regression
# Remember, we need all these variables to be normalized!
sw <- apply(sw, 2, scale)
sw <- as.data.frame(sw)
# Recreating the naive model as a comparison tool
model_base <- lm(data = sw,
Fertility ~ .)
# Breaking apart our data into separate matrices
X <- as.matrix(sw[, 2:10])
Y <- as.numeric(sw[,1])
# We'll use the `glmnet` library to do the optimization for us
library(glmnet)
# Doing the cross-validation to optimize lambda and
# setting the seed for reproducibility.
# Note that since cross-validation has a random element,
# our ridge results can be quite sensitive to the random
# fold assignment with small n!
set.seed(525)
cv_ridge <- cv.glmnet(
x = X,
y = Y,
alpha = 0,
lambda = lambdas,
nfolds = 3
)
# that object can call `lambda.min` to report the optimal lambda
cv_ridge$lambda.min
## [1] 0.134
# we can plot the cross-validated MSE as a function of lambda
# (cv_ridge$lambda holds the lambda sequence actually used, in decreasing order)
plot(x = cv_ridge$lambda,
     y = cv_ridge$cvm,
     type = "l",
     xlab = "lambda",
     ylab = "MSE",
     main = "Ridge")
abline(v = cv_ridge$lambda.min, col = "red")
So the optimal lambda is \(\lambda =
0.134\). Now we can use glmnet()
with this lambda as
an input to execute the ridge regression.
model_ridge <- glmnet(x = X,
y = Y,
alpha = 0, # ridge is the special case with alpha = 0
lambda = cv_ridge$lambda.min)
results <- cbind(coef(model_base),
coef(model_ridge))
colnames(results) <- c("OLS", "Ridge")
results
## 10 x 2 sparse Matrix of class "dgCMatrix"
## OLS Ridge
## (Intercept) 5.161631e-16 6.614458e-16
## Agriculture -3.129213e-01 -1.670633e-01
## Examination -1.647782e-01 -2.164305e-01
## Education -6.704008e-01 -4.872326e-01
## Catholic 3.476000e-01 2.486391e-01
## Infant.Mortality 2.511360e-01 5.738066e-02
## Infant.Survival NA -5.523551e-02
## Infant.Mortality.Disease NA 5.470183e-02
## Infant.Mortality.Starvation NA 5.642946e-02
## Infant.Mortality.Heresy NA 5.807007e-02
So our ridge estimates are biased, but we now have reportable
coefficient estimates for Infant.Survival and the submeasures of
Infant.Mortality.
Another difficult task in linear modeling is that of variable selection – in other words, which covariates are the most appropriate to use? In empirical social science, we should always have some theoretical motivation for variable selection, but even theory can be imperfect by itself. We can also utilize regularization methods to exploit the full breadth of our data in making these choices. A popular tool is the Least Absolute Shrinkage and Selection Operator, or LASSO. We can execute a similar process to ridge regression by 1) determining an optimal lambda and 2) running the LASSO model with that input.
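For reference, whereas ridge penalizes the sum of squared coefficients, the LASSO estimator solves (a standard formulation, written in the same matrix notation as the ridge expression above)
\[\begin{equation*} \hat{\beta}_{L} = \underset{\beta}{\arg\min} \left\{ (Y - \textbf{X}\beta)^{\top}(Y - \textbf{X}\beta) + \lambda \sum_{j = 1}^{k} |\beta_j| \right\} \end{equation*}\]
The absolute-value penalty is what allows individual coefficients to be shrunk exactly to zero, which is why LASSO is useful for variable selection.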
set.seed(525)
cv_lasso <- cv.glmnet(
x = X,
y = Y,
alpha = 1, # lasso is the case where alpha = 1
lambda = lambdas,
nfolds = 3
)
# that object can call `lambda.min` to report the optimal lambda
cv_lasso$lambda.min
## [1] 0.001
# we can plot the cross-validated MSE as a function of lambda
# (cv_lasso$lambda holds the lambda sequence actually used, in decreasing order)
plot(x = cv_lasso$lambda,
     y = cv_lasso$cvm,
     type = "l",
     xlab = "lambda",
     ylab = "MSE",
     main = "LASSO")
abline(v = cv_lasso$lambda.min, col = "red")
In this case, the optimal \(\lambda_{\text{LASSO}} = 0.001\), which is essentially the smallest value in the grid we specified (with a different seed, this could come out differently). Nonetheless, we persist, fit the LASSO regression with this value, and then compare results across models.
model_lasso <- glmnet(x = X,
y = Y,
alpha = 1, # lasso is the special case with alpha = 1
lambda = cv_lasso$lambda.min)
results <- cbind(coef(model_base),
coef(model_ridge),
coef(model_lasso))
colnames(results) <- c("OLS", "Ridge", "LASSO")
results
## 10 x 3 sparse Matrix of class "dgCMatrix"
## OLS Ridge LASSO
## (Intercept) 5.161631e-16 6.614458e-16 6.331524e-16
## Agriculture -3.129213e-01 -1.670633e-01 -3.090373e-01
## Examination -1.647782e-01 -2.164305e-01 -1.636513e-01
## Education -6.704008e-01 -4.872326e-01 -6.679378e-01
## Catholic 3.476000e-01 2.486391e-01 3.460835e-01
## Infant.Mortality 2.511360e-01 5.738066e-02 2.510008e-01
## Infant.Survival NA -5.523551e-02 .
## Infant.Mortality.Disease NA 5.470183e-02 .
## Infant.Mortality.Starvation NA 5.642946e-02 .
## Infant.Mortality.Heresy NA 5.807007e-02 6.114900e-17
No coefficients are shown for Infant.Survival,
Infant.Mortality.Disease, or Infant.Mortality.Starvation because LASSO
shrank those coefficients all the way to zero; in other words, they
were dropped from the model entirely because they weren’t influential
enough.
Note that this illustrates a key difference between ridge and LASSO regression: ridge shrinks all coefficients towards zero, whereas LASSO can remove predictors from the model by shrinking the coefficients completely to zero.