This assignment is designed to review the material covered in the lab. Be sure to comment your code to clarify what you are doing. Not only does this help with grading, but it will also help you when you revisit the code in the future. Please use either RMarkdown or knitr to turn in your assignment. These are fully compatible with R and LaTeX. If your code uses random number generation, please use set.seed(12345) for replicability. Please post any questions on Piazza.


1) Checking Intuition

  1. Explain the logic and intuition of \(K\)-fold cross-validation techniques, using minimal math or algebra.
  2. Similarly, explain the logic and intuition of ridge regression.
  3. Similarly, explain the logic and intuition of LASSO.
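
As a reference point for the first question, the mechanics of \(K\)-fold cross-validation can be sketched in base R. This is a minimal illustration on simulated data, not a required approach: the data frame sim, the linear model y ~ x, and the names fold_id, fold_mse, and cv_mse are all assumptions made for the example.

```r
# Minimal K-fold cross-validation sketch on simulated data (base R only).
set.seed(12345)

n <- 100
K <- 5

# Simulated data: y depends linearly on x plus noise (illustrative assumption).
sim <- data.frame(x = rnorm(n))
sim$y <- 1 + 2 * sim$x + rnorm(n)

# Randomly assign each observation to one of K folds of equal size.
fold_id <- sample(rep(1:K, length.out = n))

# For each fold: fit on the other K - 1 folds, predict the held-out fold,
# and record the mean squared prediction error.
fold_mse <- sapply(1:K, function(k) {
  train <- sim[fold_id != k, ]
  test  <- sim[fold_id == k, ]
  fit   <- lm(y ~ x, data = train)
  mean((test$y - predict(fit, newdata = test))^2)
})

# The cross-validation estimate averages the K held-out errors.
cv_mse <- mean(fold_mse)
cv_mse
```

Every observation is used for testing exactly once and for training \(K - 1\) times, which is why the averaged error is a more honest guide to out-of-sample performance than the in-sample fit.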

2) Coding in R

Open the newhamp dataset from the faraway R package. This dataset contains vote counts and other demographic information from 276 wards (i.e., voting districts) in the 2008 Democratic Party presidential primary in the U.S. state of New Hampshire.

  1. Create two linear models. In both, pObama is the dependent variable. Choose three other explanatory variables in your first model, and three different explanatory variables in your second. Report and interpret summaries of each model.
  2. Perform leave-one-out cross-validation on each of these two models and compare performance. Which model is preferable? Why?
  3. Create a vector of lambdas. Run \(K\)-fold cross-validation with \(K = 5\) folds for LASSO, using pObama as the outcome variable and every other variable except votesys as explanatory variables. Find and report the \(\lambda\) that maximizes model performance (i.e., minimizes the cross-validated error). Don't forget to standardize your data!
  4. Run LASSO with the optimal \(\lambda\) and report coefficients. Which variables remain?
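
The workflow in the last two items (a grid of lambdas, 5-fold cross-validation, then a refit at the best \(\lambda\)) can be sketched as follows. This is one possible approach on simulated data, not the newhamp data, and it assumes the glmnet package, which the assignment does not name. The objects x, y, lambdas, cv_fit, best_lambda, and lasso_fit are hypothetical names chosen for the example.

```r
library(glmnet)  # assumption: glmnet is installed; other LASSO packages exist

set.seed(12345)

# Simulated data: 200 observations, 6 predictors, only the first two matter.
n <- 200
p <- 6
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(x) <- paste0("x", 1:p)
y <- 3 * x[, 1] - 2 * x[, 2] + rnorm(n)

# Candidate penalties on a log scale, from large to small.
lambdas <- 10^seq(1, -3, length.out = 100)

# 5-fold cross-validated LASSO (alpha = 1). standardize = TRUE (the default)
# rescales the predictors internally before the penalty is applied.
cv_fit <- cv.glmnet(x, y, alpha = 1, lambda = lambdas, nfolds = 5,
                    standardize = TRUE)

# Lambda with the smallest mean cross-validated error.
best_lambda <- cv_fit$lambda.min

# Refit at that lambda and inspect which coefficients are nonzero.
lasso_fit <- glmnet(x, y, alpha = 1, lambda = best_lambda, standardize = TRUE)
coef(lasso_fit)
```

glmnet expects a numeric predictor matrix rather than a data frame, so real data usually needs a conversion step first. Coefficients printed as "." have been shrunk exactly to zero, meaning those variables dropped out of the model.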