Let’s talk about (statistical) leverage! Points that are unusual with respect to the predictors (that is, outliers in \(\textbf{X}\)) are said to have high leverage because they can strongly influence the coefficient estimates of our regression models. Let’s look at a toy example.
library(tidyverse)
# Setting seed for replicability
set.seed(1950)
# Generating some sample data
X <- cbind(rep(1, 10), # our vector of 1s
           c(1:9, 20)  # some other data
           )
# and some noisy outcome
y1 <- X %*% c(1, 0.5) + rnorm(10)
# Plotting the two
ggplot(mapping = aes(x = X[, 2], # We just want the non-vector-of-1s column
                     y = y1)     # and our simulated outcome
       ) +
  geom_point(color = "blue") +   # this plots each element as a point
  geom_smooth(method = "lm",     # This plots a single regression line
              formula = y ~ x,   # using the formula passed from aes()
              se = FALSE         # and omitting the standard error intervals
              ) +
  # And this is just to make things look nice
  xlab("X") +
  ylab("Y") +
  theme_bw()
Here, things look pretty nice! But what if that point where \(X = 20\) were different – say, much further from the line?
# copying over our y1 vector
y2 <- y1
# and then changing the relevant y
y2[10] <- y1[10] - 10
# Plotting the old and new data together
ggplot(mapping = aes(x = X[, 2],
                     y = y1)
       ) +
  geom_point(aes(y = y2), color = "red") +
  geom_smooth(method = "lm",
              formula = y ~ x,
              se = FALSE
              ) +
  geom_point(color = "blue") +
  geom_smooth(aes(y = y2),
              method = "lm",
              formula = y ~ x,
              se = FALSE,
              color = "red") +
  xlab("X") +
  ylab("Y") +
  theme_bw()
Suddenly, our regression line is quite different! In particular, it matters that the changed \(y\) value belongs to an observation that is an outlier with respect to \(\textbf{X}\). If we were to make a similar change at a point closer to the rest of the data, the fitted line would move much less.
# Creating a third y which is not so different
y3 <- y1
y3[5] <- y1[5] - 5
# and plotting all three
ggplot(mapping = aes(x = X[, 2],
                     y = y1)) +
  geom_point(aes(y = y3),
             color = "green") +
  geom_point(aes(y = y2),
             color = "red") +
  geom_smooth(method = "lm",
              formula = y ~ x,
              se = FALSE) +
  geom_point(color = "blue") +
  geom_smooth(aes(y = y2),
              method = "lm",
              formula = y ~ x,
              se = FALSE,
              color = "red") +
  geom_smooth(aes(y = y3),
              method = "lm",
              formula = y ~ x,
              se = FALSE,
              color = "green") +
  xlab("X") +
  ylab("Y") +
  theme_bw()
This is harder to visualize in a multivariate setting, but the idea extends naturally via the projection (hat) matrix \(\textbf{P} = \textbf{X}(\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}\). The projection matrix is a function of \(\textbf{X}\) alone, so its entries do not depend on the outcome or on how well the model fits. In other words, leverage is purely a characteristic of \(\textbf{X}\): it tells us whether a given observation could potentially exert a large influence on our model. In practice, we take the diagonal entries of \(\textbf{P}\) as our leverage values.
Let’s look at the toy case again to think about \(\textbf{P}\):
# the design matrix
X
## [,1] [,2]
## [1,] 1 1
## [2,] 1 2
## [3,] 1 3
## [4,] 1 4
## [5,] 1 5
## [6,] 1 6
## [7,] 1 7
## [8,] 1 8
## [9,] 1 9
## [10,] 1 20
# X-transpose times X (the cross-product matrix)
t(X) %*% X
## [,1] [,2]
## [1,] 10 65
## [2,] 65 685
# the inverse of X-transpose X
solve(t(X) %*% X)
## [,1] [,2]
## [1,] 0.2609524 -0.024761905
## [2,] -0.0247619 0.003809524
# The full projection (or hat) matrix
hat <- X %*% solve(t(X) %*% X) %*% t(X)
hat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.21523810 0.19428571 0.17333333 0.15238095 0.13142857 0.11047619
## [2,] 0.19428571 0.17714286 0.16000000 0.14285714 0.12571429 0.10857143
## [3,] 0.17333333 0.16000000 0.14666667 0.13333333 0.12000000 0.10666667
## [4,] 0.15238095 0.14285714 0.13333333 0.12380952 0.11428571 0.10476190
## [5,] 0.13142857 0.12571429 0.12000000 0.11428571 0.10857143 0.10285714
## [6,] 0.11047619 0.10857143 0.10666667 0.10476190 0.10285714 0.10095238
## [7,] 0.08952381 0.09142857 0.09333333 0.09523810 0.09714286 0.09904762
## [8,] 0.06857143 0.07428571 0.08000000 0.08571429 0.09142857 0.09714286
## [9,] 0.04761905 0.05714286 0.06666667 0.07619048 0.08571429 0.09523810
## [10,] -0.18285714 -0.13142857 -0.08000000 -0.02857143 0.02285714 0.07428571
## [,7] [,8] [,9] [,10]
## [1,] 0.08952381 0.06857143 0.04761905 -0.18285714
## [2,] 0.09142857 0.07428571 0.05714286 -0.13142857
## [3,] 0.09333333 0.08000000 0.06666667 -0.08000000
## [4,] 0.09523810 0.08571429 0.07619048 -0.02857143
## [5,] 0.09714286 0.09142857 0.08571429 0.02285714
## [6,] 0.09904762 0.09714286 0.09523810 0.07428571
## [7,] 0.10095238 0.10285714 0.10476190 0.12571429
## [8,] 0.10285714 0.10857143 0.11428571 0.17714286
## [9,] 0.10476190 0.11428571 0.12380952 0.22857143
## [10,] 0.12571429 0.17714286 0.22857143 0.79428571
# the diagonal, which contains the leverage
diag(hat)
## [1] 0.2152381 0.1771429 0.1466667 0.1238095 0.1085714 0.1009524 0.1009524
## [8] 0.1085714 0.1238095 0.7942857
As we can see, the tenth observation has a much higher leverage (\(\ell_{10} \approx 0.794\)) than, say, the fifth (\(\ell_{5} \approx 0.109\)). This is a useful diagnostic: on average, these values should equal \(\frac{K}{n} = \frac{2}{10} = 0.2\), where \(K\) is the number of columns of \(\textbf{X}\). Any values above this threshold (here \(\ell_{1}\) and \(\ell_{10}\)) are ones we might want to examine more closely.
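As a quick check on the toy data (reusing the hat matrix computed above), the leverages always sum to the number of columns of \(\textbf{X}\), so their mean is exactly \(K/n\):
# The trace of the projection matrix equals the number of columns of X (K = 2),
# so the average leverage is K / n = 0.2
mean(diag(hat))
# Observations whose leverage exceeds that average
which(diag(hat) > 2 / 10)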
Finally, for any fitted model my_model created with lm(), we can get these leverage values directly with hatvalues(my_model).
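For instance, on the toy data above (a quick sketch that assumes X and y1 are still in the workspace; toy_model is a name introduced here just for illustration):
# Fitting the toy model and extracting its leverages
toy_model <- lm(y1 ~ X[, 2])
hatvalues(toy_model) # these match diag(hat) computed above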
Of course, an observation with high leverage isn’t necessarily going to affect our model. Problems arise when an observation both has high leverage and is an outlier with respect to \(y\). One way to formalize this is using a measure called Cook’s Distance:
\[\begin{equation*} D_i = \frac{(\hat{Y} - \tilde{Y}_{-i})^{\top} (\hat{Y} - \tilde{Y}_{-i})}{\hat{\sigma}^2 \times (K + 1)} = \frac{\hat{\epsilon}_{i}^{2} \times \textbf{P}_{ii}}{\hat{\sigma}^2 \times (K + 1) \times (1 - \textbf{P}_{ii})^2} \end{equation*}\]
where \(\tilde{Y}_{-i} = \textbf{X}\hat{\beta}_{-i}\) denotes the fitted values obtained when observation \(i\) is left out of the estimation.
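As a sketch of what this formula is doing, we can compute it by hand for the toy model above (reusing toy_model from the earlier illustration, which has two estimated parameters: the intercept and one slope) and compare against R's built-in cooks.distance():
# Hand-rolled Cook's distance for the toy model (2 estimated parameters)
h      <- hatvalues(toy_model)
e      <- residuals(toy_model)
sigma2 <- summary(toy_model)$sigma^2
D_manual <- (e^2 * h) / (sigma2 * 2 * (1 - h)^2)
all.equal(D_manual, cooks.distance(toy_model))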
We might also use a metric called DFBETAS: \[\begin{equation*} \texttt{DFBETAS}_{ik} = \frac{\hat{\beta}_k - \tilde{\beta}_{k \, (-i)}}{\text{s.e.}(\tilde{\beta}_{k \, (-i)})} \end{equation*}\]
where \(\tilde{\beta}_{-i}\) is the coefficient vector estimated with observation \(i\) left out, which satisfies \(\hat{\beta} - \tilde{\beta}_{-i} = \frac{(\textbf{X}^{\top}\textbf{X})^{-1}X_i\hat{\epsilon}_{i}}{1 - \textbf{P}_{ii}}\).
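We can verify that leave-one-out identity by brute force on the toy data, dropping the high-leverage tenth observation and refitting (a sketch that reuses toy_model, X, y1, and hat from above):
# Coefficients with and without the tenth (high-leverage) observation
beta_full  <- coef(toy_model)
beta_minus <- coef(lm(y1[-10] ~ X[-10, 2]))
# Closed-form change in the coefficients implied by the identity above
e10   <- residuals(toy_model)[10]
h10   <- diag(hat)[10]
delta <- solve(t(X) %*% X) %*% X[10, ] * e10 / (1 - h10)
cbind(brute_force = beta_full - beta_minus, closed_form = as.vector(delta))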
Below, we’ll find and plot Cook’s Distance, DFBETAS, and other diagnostics:
# Arbitrarily subsetting to low-mpg cars
cars <- mtcars %>% filter(mpg < 20)
# Adding JDW's car to the data
jdw_camry <- data.frame(mpg = 52,
                        cyl = 4,
                        disp = 152.6,
                        hp = 208,
                        drat = 3.29,
                        wt = 3.472,
                        qsec = 15.9,
                        vs = 1,
                        am = 0,
                        gear = 8,
                        carb = NA
                        )
rownames(jdw_camry) <- "JDW Camry"
cars <- cars %>%
  rbind(jdw_camry) %>%
  mutate(hybrid = case_when(mpg > 50 ~ 1,
                            mpg < 50 ~ 0))
# The relationship in question: miles per gallon as a function of weight and displacement
cars_model <- lm(data = cars,
                 mpg ~ wt + disp)
# Using Cook's distance
cooks.distance(cars_model)
## Hornet Sportabout Valiant Duster 360 Merc 280
## 3.209123e-02 7.926039e-03 1.319276e-05 5.193316e-02
## Merc 280C Merc 450SE Merc 450SL Merc 450SLC
## 8.022510e-02 8.447856e-03 1.531776e-03 6.942170e-03
## Cadillac Fleetwood Lincoln Continental Chrysler Imperial Dodge Challenger
## 1.038148e-03 1.049107e-02 9.962559e-03 5.689739e-04
## AMC Javelin Camaro Z28 Pontiac Firebird Ford Pantera L
## 2.173365e-03 1.928634e-03 5.565569e-02 8.519039e-03
## Ferrari Dino Maserati Bora JDW Camry
## 4.975589e-02 2.767982e-03 1.478342e+00
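One common rule of thumb (a convention, not something built into R's output) flags observations whose Cook's distance exceeds \(4/n\):
# Flagging observations whose Cook's distance exceeds the common 4 / n cutoff
n <- nrow(cars)
which(cooks.distance(cars_model) > 4 / n)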
# and DFBETAs
# Here, |DFBETAS| > 2 / sqrt(n) is our threshold
dfbetas(cars_model)
## (Intercept) wt disp
## Hornet Sportabout 0.170540297 -0.235567604 0.236869947
## Valiant -0.059731757 -0.015902022 0.075654019
## Duster 360 0.002993177 -0.004264272 0.004565158
## Merc 280 -0.074140762 -0.152460677 0.310226521
## Merc 280C -0.093311459 -0.191882682 0.390442297
## Merc 450SE 0.040081535 -0.097782128 0.095372924
## Merc 450SL -0.014343624 -0.008683554 0.019046236
## Merc 450SLC -0.019078248 -0.031999339 0.050230817
## Cadillac Fleetwood 0.040033382 -0.027121479 -0.007283065
## Lincoln Continental 0.140280892 -0.113573960 0.009739256
## Chrysler Imperial -0.136883860 0.118503194 -0.022347427
## Dodge Challenger -0.023079913 0.024137997 -0.020515571
## AMC Javelin -0.050870749 0.048369995 -0.035032032
## Camaro Z28 -0.018363342 0.027963608 -0.039652508
## Pontiac Firebird 0.109249439 -0.236875204 0.329808074
## Ford Pantera L 0.103213309 -0.133375488 0.120697691
## Ferrari Dino -0.275821052 0.081325995 0.154933016
## Maserati Bora -0.046940409 0.037714243 -0.025375578
## JDW Camry 1.026616523 3.960938882 -7.241925116
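Using the 2 / sqrt(n) threshold mentioned in the comment above, we can flag the (observation, coefficient) pairs that cross it:
# Which (observation, coefficient) pairs exceed the 2 / sqrt(n) threshold?
n <- nrow(cars)
which(abs(dfbetas(cars_model)) > 2 / sqrt(n), arr.ind = TRUE)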
# and influence.measures(), which reports a bunch of these
influence.measures(cars_model)
## Influence measures of
## lm(formula = mpg ~ wt + disp, data = cars) :
##
## dfb.1_ dfb.wt dfb.disp dffit cov.r cook.d
## Hornet Sportabout 0.17054 -0.23557 0.23687 0.30492 1.338589 3.21e-02
## Valiant -0.05973 -0.01590 0.07565 -0.15046 1.271218 7.93e-03
## Duster 360 0.00299 -0.00426 0.00457 0.00609 1.402787 1.32e-05
## Merc 280 -0.07414 -0.15246 0.31023 -0.39077 1.299960 5.19e-02
## Merc 280C -0.09331 -0.19188 0.39044 -0.49181 1.205755 8.02e-02
## Merc 450SE 0.04008 -0.09778 0.09537 -0.15532 1.281215 8.45e-03
## Merc 450SL -0.01434 -0.00868 0.01905 -0.06579 1.271009 1.53e-03
## Merc 450SLC -0.01908 -0.03200 0.05023 -0.14117 1.214885 6.94e-03
## Cadillac Fleetwood 0.04003 -0.02712 -0.00728 -0.05405 1.662068 1.04e-03
## Lincoln Continental 0.14028 -0.11357 0.00974 -0.17213 1.768241 1.05e-02
## Chrysler Imperial -0.13688 0.11850 -0.02235 0.16776 1.712860 9.96e-03
## Dodge Challenger -0.02308 0.02414 -0.02052 -0.04003 1.320418 5.69e-04
## AMC Javelin -0.05087 0.04837 -0.03503 -0.07835 1.309081 2.17e-03
## Camaro Z28 -0.01836 0.02796 -0.03965 -0.07382 1.292970 1.93e-03
## Pontiac Firebird 0.10925 -0.23688 0.32981 0.40763 1.197304 5.57e-02
## Ford Pantera L 0.10321 -0.13338 0.12070 0.15517 1.583041 8.52e-03
## Ferrari Dino -0.27582 0.08133 0.15493 -0.38124 1.355189 4.98e-02
## Maserati Bora -0.04694 0.03771 -0.02538 -0.08857 1.268069 2.77e-03
## JDW Camry 1.02662 3.96094 -7.24193 8.57797 0.000283 1.48e+00
## hat inf
## Hornet Sportabout 0.1706
## Valiant 0.0885
## Duster 360 0.1349
## Merc 280 0.1830
## Merc 280C 0.1830
## Merc 450SE 0.0950
## Merc 450SL 0.0584
## Merc 450SLC 0.0604
## Cadillac Fleetwood 0.2710 *
## Lincoln Continental 0.3221 *
## Chrysler Imperial 0.3007 *
## Dodge Challenger 0.0841
## AMC Javelin 0.0851
## Camaro Z28 0.0740
## Pontiac Firebird 0.1526
## Ford Pantera L 0.2446 *
## Ferrari Dino 0.2006
## Maserati Bora 0.0644
## JDW Camry 0.2271 *
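If we only care about the flagged cases, calling summary() on the influence.measures() object prints just the observations R marks as potentially influential:
# Printing only the observations flagged as potentially influential
summary(influence.measures(cars_model))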
In R, we can call plot(my_model)
to get four
standard diagnostic plots:
# Residuals versus Fitted
plot(cars_model, 1) # I'm calling them one at a time instead of all at once
# You can get all four at once using plot(my_model) with no other inputs
# Normal Q-Q
plot(cars_model, 2)
# Scale-Location
plot(cars_model, 3)
# Residuals vs Leverage (shown fourth by default)
plot(cars_model, 5)
# Cook's distance (not shown by default), with a reference line added
plot(cars_model, 4)
abline(h = 0.25,
       col = "red",
       lty = "dotted")
OLS provides a lot of guarantees about the asymptotic properties of our coefficient estimates, \(\hat{\beta}\), but it doesn't tell us whether we have a good model. In particular, it tells us nothing about how well our model will perform on data we haven't seen before. The bias-variance tradeoff of model selection is a well-known problem in statistics and machine learning. Intuitively, it consists of two components: bias, the systematic error that comes from a model too simple to capture the true relationship, and variance, the sensitivity of our estimates to the particular sample we happened to draw.
Think about how these two are related. Generally, we try to reduce bias by introducing predictors that capture things we think are important or that confound the relationship we are interested in. But doing so increases the variance: as we introduce more predictors (that is, as \(k\) grows large), the fitted regression surface responds more and more to changes in those inputs. In the most extreme case, we would have a predictor for every single point in our data: our in-sample predictions would be perfect, but the associated variance would be extraordinarily high. If we then applied this model to new data, its predictions would swing wildly in response to even small changes in the predictors; in other words, it would be extremely noisy.
Another way to state this tradeoff is that low-bias (high-variance) models tend to predict well in-sample but poorly out-of-sample, while low-variance (higher-bias) models tend to fit worse in-sample but generalize better. This presents a difficult problem for researchers: we generally want our models to have as little bias as possible, but if we overfit, our coefficient estimates may reflect little more than noise. It also means that a good fit in-sample does not necessarily imply a good model.
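To make the in-sample versus out-of-sample contrast concrete, here is a small simulated sketch (everything below, including the variable names and the degree-15 polynomial, is made up purely for illustration): a very flexible model always fits the training data better, but typically does worse on fresh data from the same process.
# Simulating a training set and a test set from the same simple process
set.seed(2024)
n <- 30
x_train <- runif(n)
y_train <- 2 * x_train + rnorm(n, sd = 0.5)
x_test  <- runif(n)
y_test  <- 2 * x_test + rnorm(n, sd = 0.5)
# A simple model and a deliberately over-flexible one
fit_simple   <- lm(y_train ~ x_train)
fit_flexible <- lm(y_train ~ poly(x_train, 15))
# In-sample RMSE: the flexible model always looks better
sqrt(mean(residuals(fit_simple)^2))
sqrt(mean(residuals(fit_flexible)^2))
# Out-of-sample RMSE: the flexible model typically does worse
sqrt(mean((y_test - predict(fit_simple, newdata = data.frame(x_train = x_test)))^2))
sqrt(mean((y_test - predict(fit_flexible, newdata = data.frame(x_train = x_test)))^2))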