Bias-Variance Tradeoff

OLS provides a number of guarantees about the asymptotic properties of our coefficient estimates, \(\hat{\beta}\), but it tells us little about whether we have a good model. In particular, it tells us absolutely nothing about how well our model will perform on data we haven’t seen before.

The bias-variance tradeoff of model selection is a well-known problem in statistics and machine learning. Intuitively, it consists of two components:

  1. Bias: error from systematically missing part of the true relationship, for example by omitting relevant predictors or imposing an overly simple functional form
  2. Variance: error from the model’s sensitivity to the particular sample it was fit on; a high-variance model changes substantially when re-estimated on new data

Think about how these two are related. We generally try to reduce bias by adding predictors that capture things we think are important, or that confound the relationship we are interested in. But doing so increases the variance: as we introduce more predictors (as \(k\) grows large), the fitted regression surface responds more and more to changes in those inputs. In the most extreme case, we would have a parameter for every single observation in our data: our in-sample predictions would be perfect, but the associated variance would be extraordinarily high. If we then fed new data into this model, its predictions would swing wildly with even small changes in the predictors; in other words, it would be extremely noisy.

Another way to see this tradeoff: flexible, low-bias models tend to predict well in-sample but poorly out-of-sample, while more constrained, low-variance models tend to predict worse in-sample but better out-of-sample. This presents a difficult problem for researchers. We generally want our models to have as little bias as possible, but if we overfit, our coefficient estimates may reflect meaningless noise. It also means that a good in-sample fit does not necessarily imply a good model.
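
A small simulation makes this concrete. The sketch below (entirely made-up data and model choices, just for illustration) fits a parsimonious model and a deliberately over-flexible polynomial to the same training sample, then compares their in-sample and out-of-sample errors:

set.seed(123)

# Simulated data: a simple linear relationship plus noise
x <- runif(100, -2, 2)
y <- 1 + 2 * x + rnorm(100, sd = 2)
train <- data.frame(x = x[1:50],   y = y[1:50])
test  <- data.frame(x = x[51:100], y = y[51:100])

# A parsimonious (low-variance) model and an over-flexible
# (low-bias, high-variance) polynomial model
fit_simple   <- lm(y ~ x, data = train)
fit_flexible <- lm(y ~ poly(x, 15), data = train)

# The flexible model always fits at least as well in-sample (the models
# are nested) but typically does worse out-of-sample
c(in_simple    = mean((train$y - predict(fit_simple))^2),
  in_flexible  = mean((train$y - predict(fit_flexible))^2),
  out_simple   = mean((test$y - predict(fit_simple, newdata = test))^2),
  out_flexible = mean((test$y - predict(fit_flexible, newdata = test))^2))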


Cross-Validation

Cross-validation refers to a set of methods for measuring the performance of a given predictive model on new test data sets. The basic idea of these techniques consists of dividing our data into two sets:

  1. The training set, which is used to build the model
  2. The testing set, which is used to validate the model by estimating prediction error

The statistic we use to evaluate prediction error is the mean squared error, or MSE:

\[\begin{equation*} \text{MSE} = \frac{1}{N} \displaystyle \sum_{i = 1}^{N} \left(Y_i- \hat{Y}_{i}^{-\kappa(i)} \right)^2 \end{equation*}\]

Here \(\kappa(i)\) denotes the fold that contains observation \(i\), so \(\hat{Y}_{i}^{-\kappa(i)}\) is the prediction for observation \(i\) from a model estimated without that fold. Within this framework, we wish to minimize the MSE across different model specifications.
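
In R this is a one-liner; a small helper function (my own naming, not part of base R) keeps the later comparisons readable:

# Mean squared prediction error: average squared gap between the observed
# outcomes and the model's predictions (smaller is better)
mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}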

K-Fold Cross-Validation

For this approach, we randomly divide the data into \(K\) equally-sized partitions (often called “folds”) — \(K = 5\) and \(K = 10\) are common choices. For each fold, we do the following:

  1. Hold out one fold, referred to as the “test” set
  2. Combine the other \(K - 1\) folds into a single “training” set
  3. Estimate each model using the training set
  4. Calculate the corresponding MSE using the testing set

Then, we average the MSE over each fold. At the end, we compare the MSE for each model and (in general) the model that has the smallest MSE is “best.”
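
For reference, the entire procedure can be wrapped in one short function. This is only a sketch (the name kfold_mse and its arguments are my own); the rest of this section builds up exactly the same logic step by step so you can see each piece.

# Sketch: average test-set MSE across K folds for one model formula
kfold_mse <- function(formula, data, folds) {
  fold_mse <- sapply(sort(unique(folds)), function(k) {
    fit   <- lm(formula, data = data[folds != k, ])       # train on K - 1 folds
    preds <- predict(fit, newdata = data[folds == k, ])   # predict held-out fold
    mean((data[folds == k, all.vars(formula)[1]] - preds)^2)
  })
  mean(fold_mse)
}
# Usage (once `folds` is defined below):
# kfold_mse(Fertility ~ Agriculture + Education + Catholic, swiss, folds)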

An applied example: suppose we are interested in the effects of different socioeconomic indicators on fertility (this is an example I inherited from previous labs, so bear with me). The swiss dataset in base R reports measures of these concepts for 47 French-speaking provinces of Switzerland around the year 1888.

# Fertility is the outcome of interest
# Agriculture, Examination, Education, Catholic, and Infant Mortality are
# predictors (I'm not going to theorize why)
# Summarizing this data
summary(swiss)[4,]
##           Fertility         Agriculture         Examination           Education 
##   "Mean   :70.14  "   "Mean   :50.66  "   "Mean   :16.49  "   "Mean   :10.98  " 
##            Catholic    Infant.Mortality 
## "Mean   : 41.144  "   "Mean   :19.94  "
# This data has 47 observations across those six variables
dim(swiss)
## [1] 47  6

Before we even begin to think about model selection, we want to partition this data into \(K\) folds. In this case, we’ll choose \(K = 3\) (not a traditional choice, but inherited from previous labs).

# Setting seed for replicability
set.seed(525)

# This line of code creates designations for each of our three folds.
# That is, each of the 47 observations is randomly assigned to
# "fold 1", "fold 2", or "fold 3".
folds <- sample(1:3, nrow(swiss), replace = TRUE)

head(folds)
## [1] 2 2 1 3 2 2
# This indicates that observations 1 and 2 are in fold 2, observation 3 is in
# fold 1, observation 4 is in fold 3, and so on
# Notice that in this example, with small N, our folds are not evenly-sized!
# That does not matter too much for this exercise, but as N grows large
# the folds will asymptotically approach equal size.
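
A quick way to see the actual fold sizes is to tabulate the assignments (the exact counts depend on the seed):

# Number of observations assigned to each of the three folds
table(folds)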

Now, given these fold assignments, we’ll partition our training and test data. For this exercise, we will use folds 2 and 3 as the training data, and fold 1 as the test data.

training_data <- swiss[folds != 1,]
test_data <- swiss[folds == 1,]

Suppose we have two models to compare: one where fertility is a function of agriculture, education, and Catholic, and another where it is a function of different covariates, agriculture, examination, and infant mortality. That is, \[\begin{equation*} \begin{aligned} \text{Model 1:}\, & Y = \beta_0 + \beta_1 \times \text{agriculture} + \beta_2 \times \text{education} + \beta_3 \times \text{catholic} + \epsilon \\ \text{Model 2:}\, & Y = \beta_0 + \beta_1 \times \text{agriculture} + \beta_2 \times \text{examination} + \beta_3 \times \text{infant mortality} + \epsilon \end{aligned} \end{equation*}\]

We will fit these models using our training data (don’t use the full data set!):

# Training models
model_1 <- lm(data = training_data,
              Fertility ~ Agriculture + Education + Catholic)

model_2 <- lm(data = training_data,
              Fertility ~ Agriculture + Examination + Infant.Mortality)

Next, we use those models to generate predictions for the test data and calculate the residuals.

# Fitting models to test data
test_fitted_model_1 <- predict(object = model_1,
                              test_data)

test_fitted_model_2 <- predict(object = model_2,
                               test_data)

# Calculating, storing, and reporting residuals
resids_model_1 <- (test_data$Fertility - test_fitted_model_1)
resids_model_2 <- (test_data$Fertility - test_fitted_model_2)

# Reporting those
cbind("Model 1 MSE" = mean(resids_model_1^2),
      "Model 2 MSE" = mean(resids_model_2^2))
##      Model 1 MSE Model 2 MSE
## [1,]    55.66296    70.95336

We’ve done this process for a single fold, but we wish to do so for every fold. A for loop is useful for iterating through this process.

# Creating vectors to store the residuals from each iteration
model_1_MSE_Kfold <- NULL
model_2_MSE_Kfold <- NULL

# Indexed assignment like model_1_MSE_Kfold[i] only works if the vector
# already exists, which is why we created these (empty) vectors above;
# the loop then fills in one element per fold

# We're iterating for each of our 3 folds, so this loop
# will be for i in 1:3, or i = 1, 2, 3
for (i in 1:3) {
  # Classifying training and test data
  training_data <- swiss[folds != i,]
  test_data <- swiss[folds == i,]
  
  # Fitting models using the training data
  model_1 <- lm(data = training_data,
              Fertility ~ Agriculture + Education + Catholic)
  
  model_2 <- lm(data = training_data,
              Fertility ~ Agriculture + Examination + Infant.Mortality)
  
  # Finding fitted values using the test data
  test_fitted_model_1 <- predict(object = model_1,
                              test_data)
  test_fitted_model_2 <- predict(object = model_2,
                               test_data)
  
  # Calculating residuals and storing them in the global environment
  model_1_resid_Kfold <- (test_data$Fertility - test_fitted_model_1)
  model_2_resid_Kfold <- (test_data$Fertility - test_fitted_model_2)
  
  model_1_MSE_Kfold[i] <- mean(model_1_resid_Kfold^2)
  model_2_MSE_Kfold[i] <- mean(model_2_resid_Kfold^2)
  # You can do these final steps in one line, but I've 
  # separated them out into two to show 1) how we calculate
  # each residual and then 2) how to iteratively store them 
  # in a global vector
}

Lastly, we compare the averaged MSE for each of these: whichever is lower is “better” in this context.

# How do we interpret these results?
cbind("Model 1 Average MSE:" = mean(model_1_MSE_Kfold),
      "Model 2 Average MSE:" = mean(model_2_MSE_Kfold)
      )
##      Model 1 Average MSE: Model 2 Average MSE:
## [1,]             63.87491             92.18196

Notice that model_1_MSE_Kfold and model_2_MSE_Kfold are both three elements long – why is this the case?

length(model_1_MSE_Kfold)
## [1] 3
length(model_2_MSE_Kfold)
## [1] 3

Leave-one-out Cross-Validation (K = N)

This is a special form of \(K\)-fold cross-validation where \(K = N\): we hold out a single observation as the test set and use the remaining \(N - 1\) observations as the training set. The procedure looks the same as before, but now we don’t need to worry about fold designations.

# Creating vectors to store the residuals from each iteration
model_1_MSE_Nfold <- NULL
model_2_MSE_Nfold <- NULL

# As in the K-fold example, we created these (empty) vectors up front so
# the loop can fill in one element per left-out observation

# We're iterating for each of our observations now
for (i in 1:nrow(swiss)) {
  # Classifying training and test data
  training_data <- swiss[-i,]
  test_data <- swiss[i,]
  
  # Fitting models using the training data
  model_1 <- lm(data = training_data,
              Fertility ~ Agriculture + Education + Catholic)
  
  model_2 <- lm(data = training_data,
              Fertility ~ Agriculture + Examination + Infant.Mortality)
  
  # Finding fitted values using the test data
  test_fitted_model_1 <- predict(object = model_1,
                              test_data)
  test_fitted_model_2 <- predict(object = model_2,
                               test_data)
  
  # Calculating residuals and storing them in the global environment
  model_1_resid_Nfold <- (test_data$Fertility - test_fitted_model_1)
  model_2_resid_Nfold <- (test_data$Fertility - test_fitted_model_2)
  
  model_1_MSE_Nfold[i] <- mean(model_1_resid_Nfold^2)
  model_2_MSE_Nfold[i] <- mean(model_2_resid_Nfold^2)
}

# Why are these vectors 47 elements long?
cbind("Model 1 MSE Length" = length(model_1_MSE_Nfold),
      "Model 2 MSE Length" = length(model_2_MSE_Nfold))
##      Model 1 MSE Length Model 2 MSE Length
## [1,]                 47                 47
# Reporting results
cbind("Model 1 Average MSE:" = mean(model_1_MSE_Nfold),
      "Model 2 Average MSE:" = mean(model_2_MSE_Nfold)
      )
##      Model 1 Average MSE: Model 2 Average MSE:
## [1,]             65.02934             85.89542
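
As an aside: for models fit with lm(), the LOOCV error has a closed-form shortcut based on the residuals and hat values of the full-data fit, so the loop above is not strictly necessary. A sketch (this should reproduce the looped averages up to floating-point error):

# LOOCV shortcut for OLS: average of (residual / (1 - leverage))^2
full_1 <- lm(Fertility ~ Agriculture + Education + Catholic, data = swiss)
full_2 <- lm(Fertility ~ Agriculture + Examination + Infant.Mortality, data = swiss)

cbind("Model 1 LOOCV MSE" = mean((resid(full_1) / (1 - hatvalues(full_1)))^2),
      "Model 2 LOOCV MSE" = mean((resid(full_2) / (1 - hatvalues(full_2)))^2))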

A word of caution: all we’ve done here is compare mean squared errors across two different models. This is useful for thinking about model selection and generalizability, but remember that we haven’t actually reported any regression coefficients or made any substantive interpretations. This exercise is about validating whether a model performs well on new data, not about whether that model says anything informative about the world.


Ridge Regression

We’ve discussed the problem of multicollinearity in previous labs; namely, that perfectly multicollinear variables make \(\textbf{X}^{\top}\textbf{X}\) non-invertible, so that \((\textbf{X}^{\top}\textbf{X})^{-1}\) (and by extension, \(\hat{\beta}\)) does not exist. One tool to mitigate this problem is ridge (or Tikhonov) regression. This method concedes some bias in our estimator in exchange for improved efficiency.

In lecture, we demonstrated that the Ridge estimator \(\hat{\beta}_{R}\) is given by \[\begin{equation*} \hat{\beta}_{R} = (\textbf{X}^{\top}\textbf{X} + \lambda \textbf{I})^{-1}\textbf{X}^{\top}Y \end{equation*}\]

where \(\lambda \geq 0\) is called the ridge parameter. Notice that as \(\lambda \to 0\), \(\hat{\beta}_{R} \to \hat{\beta}\). Likewise, as \(\lambda \to \infty\), \(\hat{\beta}_{R} \to 0\). Recall that \(\hat{\beta}_{R}\) is a biased estimate of \(\beta\), so when using ridge regression we are conceding some bias to make gains in efficiency.
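
To make the formula concrete, here is a minimal matrix-algebra sketch on simulated data (illustrative only; glmnet, which we use below, standardizes the variables and scales the penalty differently, so its coefficients will not match this textbook formula exactly):

set.seed(1)

# Simulated design matrix and outcome (made up purely for illustration)
n_sim <- 100
X_sim <- matrix(rnorm(n_sim * 3), nrow = n_sim)
Y_sim <- X_sim %*% c(1, -2, 0.5) + rnorm(n_sim)

# Textbook ridge estimator: (X'X + lambda * I)^{-1} X'Y
ridge_beta <- function(X, Y, lambda) {
  drop(solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% Y))
}

# lambda = 0 recovers the (no-intercept) OLS estimate;
# larger lambdas shrink the estimates toward zero
cbind(lambda_0    = ridge_beta(X_sim, Y_sim, 0),
      lambda_10   = ridge_beta(X_sim, Y_sim, 10),
      lambda_1000 = ridge_beta(X_sim, Y_sim, 1000))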

We’ll revisit the swiss dataset as an applied example. To artificially induce multicollinearity, we are going to define a new variable Infant.Survival that is simply 100 - Infant.Mortality, and we’ll also add multicollinear variables breaking down (fictional) types of infant mortality. Running a naive regression of Fertility on all other covariates, we run into a problem.

sw <- swiss # a modified `swiss` dataset

# Made-up multicollinear variables
sw$Infant.Survival <- 100 - swiss$Infant.Mortality
sw$Infant.Mortality.Disease <- 0.65 * swiss$Infant.Mortality
sw$Infant.Mortality.Starvation <- 0.30 * swiss$Infant.Mortality
sw$Infant.Mortality.Heresy <- 0.05 * swiss$Infant.Mortality


# If we run a naive regression of fertility on all covariates,
# we get a problem with multicollinearity
summary(lm(data = sw,
           Fertility ~ .))
## 
## Call:
## lm(formula = Fertility ~ ., data = sw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2743  -5.2617   0.5032   4.1198  15.3213 
## 
## Coefficients: (4 not defined because of singularities)
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 66.91518   10.70604   6.250 1.91e-07 ***
## Agriculture                 -0.17211    0.07030  -2.448  0.01873 *  
## Examination                 -0.25801    0.25388  -1.016  0.31546    
## Education                   -0.87094    0.18303  -4.758 2.43e-05 ***
## Catholic                     0.10412    0.03526   2.953  0.00519 ** 
## Infant.Mortality             1.07705    0.38172   2.822  0.00734 ** 
## Infant.Survival                   NA         NA      NA       NA    
## Infant.Mortality.Disease          NA         NA      NA       NA    
## Infant.Mortality.Starvation       NA         NA      NA       NA    
## Infant.Mortality.Heresy           NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared:  0.7067, Adjusted R-squared:  0.671 
## F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10
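
As a diagnostic aside (not required for what follows), base R’s alias() function shows exactly which predictors are linear combinations of the others:

# Reports which coefficients are aliased, i.e., exact linear
# combinations of the other predictors
alias(lm(Fertility ~ ., data = sw))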

We’ll use ridge regression with parameter \(\lambda\) to circumvent this problem. We can actually use \(K\)-fold cross-validation to determine the optimal lambda that minimizes MSE.

# Creating a vector of lambdas to test
lambdas <- seq(0, 1,
               by = 0.001)

# Preparing `sw` for ridge regression
# Remember, we want all of these variables to be standardized!
sw <- apply(sw, 2, scale)
sw <- as.data.frame(sw)

# Recreating the naive model as a comparison tool
model_base <- lm(data = sw,
                 Fertility ~ .)

# Breaking apart our data into separate matrices
X <- as.matrix(sw[, 2:10])
Y <- as.numeric(sw[,1])

# We'll use the `glmnet` library to do the optimization for us
library(glmnet)

# Doing the cross-validation to optimize lambda and
# setting the seed for reproducibility.
# Note that since cross-validation has a random element,
# our ridge results can be quite sensitive to fold assignment
# with small n!
set.seed(525)
cv_ridge <- cv.glmnet(
  x = X,
  y = Y,
  alpha = 0,
  lambda = lambdas,
  nfolds = 3
)

# the `lambda.min` element of that object reports the optimal lambda
cv_ridge$lambda.min
## [1] 0.134
# we can plot this as a function of lambdas
plot(x = rev(lambdas),
     y = cv_ridge$cvm,
     type = "l",
     xlab = "lambda",
     ylab = "MSE",
     main = "Ridge")
abline(v = cv_ridge$lambda.min, col = "red")

So the optimal lambda is \(\lambda = 0.134\). Now we can use glmnet() with this lambda as an input to execute the ridge regression.

model_ridge <- glmnet(x = X,
                      y = Y,
                      alpha = 0, # ridge is the special case with alpha = 0
                      lambda = cv_ridge$lambda.min)

results <- cbind(coef(model_base),
                 coef(model_ridge))

colnames(results) <- c("OLS", "Ridge")
results
## 10 x 2 sparse Matrix of class "dgCMatrix"
##                                       OLS         Ridge
## (Intercept)                  5.161631e-16  6.614458e-16
## Agriculture                 -3.129213e-01 -1.670633e-01
## Examination                 -1.647782e-01 -2.164305e-01
## Education                   -6.704008e-01 -4.872326e-01
## Catholic                     3.476000e-01  2.486391e-01
## Infant.Mortality             2.511360e-01  5.738066e-02
## Infant.Survival                        NA -5.523551e-02
## Infant.Mortality.Disease               NA  5.470183e-02
## Infant.Mortality.Starvation            NA  5.642946e-02
## Infant.Mortality.Heresy                NA  5.807007e-02

So our ridge estimates are biased, but we now have reportable coefficient estimates for Infant.Survival and the submeasures of Infant.Mortality, which OLS could not provide at all.
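
If we want fitted values from the ridge model, for instance to compare in-sample fit against OLS, predict() works on glmnet objects as well. A quick sketch:

# In-sample predictions from each model
fitted_ols   <- predict(model_base)
fitted_ridge <- predict(model_ridge, newx = X)

# OLS minimizes in-sample error by construction, so ridge will fit
# no better in-sample; what we buy is lower variance
cbind("OLS in-sample MSE"   = mean((Y - fitted_ols)^2),
      "Ridge in-sample MSE" = mean((Y - fitted_ridge)^2))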


LASSO

Another difficult task in linear modeling is that of variable selection – in other words, which covariates are the most appropriate to use? In empirical social science, we should always have some theoretical motivation for variable selection, but even theory can be imperfect by itself. We can also utilize regularization methods to exploit the full breadth of our data in making these choices. A popular tool is the Least Absolute Shrinkage and Selection Operator, or LASSO. We can execute a similar process to ridge regression by 1) determining an optimal lambda and 2) running the LASSO model with that input.

set.seed(525)
cv_lasso <- cv.glmnet(
  x = X,
  y = Y,
  alpha = 1, # lasso is the case where alpha = 1
  lambda = lambdas,
  nfolds = 3
)

# the `lambda.min` element of that object reports the optimal lambda
cv_lasso$lambda.min
## [1] 0.001
# we can plot this as a function of lambdas
plot(x = rev(lambdas),
     y = cv_lasso$cvm,
     type = "l",
     xlab = "lambda",
     ylab = "MSE",
     main = "LASSO")
abline(v = cv_lasso$lambda.min, col = "red")

In this case, the optimal \(\lambda_{\text{LASSO}} = 0.001\), essentially the smallest value in our grid (a consequence of the seed). Nonetheless, we persist: we fit the LASSO regression with this value and then compare results across models.

model_lasso <- glmnet(x = X,
                      y = Y,
                      alpha = 1, # lasso is the special case with alpha = 1
                      lambda = cv_lasso$lambda.min)

results <- cbind(coef(model_base),
                 coef(model_ridge),
                 coef(model_lasso))

colnames(results) <- c("OLS", "Ridge", "LASSO")
results
## 10 x 3 sparse Matrix of class "dgCMatrix"
##                                       OLS         Ridge         LASSO
## (Intercept)                  5.161631e-16  6.614458e-16  6.331524e-16
## Agriculture                 -3.129213e-01 -1.670633e-01 -3.090373e-01
## Examination                 -1.647782e-01 -2.164305e-01 -1.636513e-01
## Education                   -6.704008e-01 -4.872326e-01 -6.679378e-01
## Catholic                     3.476000e-01  2.486391e-01  3.460835e-01
## Infant.Mortality             2.511360e-01  5.738066e-02  2.510008e-01
## Infant.Survival                        NA -5.523551e-02  .           
## Infant.Mortality.Disease               NA  5.470183e-02  .           
## Infant.Mortality.Starvation            NA  5.642946e-02  .           
## Infant.Mortality.Heresy                NA  5.807007e-02  6.114900e-17

No coefficients are shown for Infant.Survival, Infant.Mortality.Disease, or Infant.Mortality.Starvation because LASSO shrunk those coefficients all the way to zero – in other words, they were dropped from the model completely because they weren’t influential enough.

Note that this illustrates a key difference between ridge and LASSO regression: ridge shrinks all coefficients towards zero, whereas LASSO can remove predictors from the model by shrinking the coefficients completely to zero.
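
One way to visualize this difference is to plot the coefficient paths from glmnet over a range of lambdas: the ridge paths shrink smoothly toward zero, while the LASSO paths hit exactly zero one by one.

# Coefficient paths as a function of log(lambda); each line is one predictor
par(mfrow = c(1, 2))
plot(glmnet(x = X, y = Y, alpha = 0), xvar = "lambda")  # ridge
plot(glmnet(x = X, y = Y, alpha = 1), xvar = "lambda")  # LASSO
par(mfrow = c(1, 1))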