Regression Diagnostics

Let’s talk about (statistical) leverage! Points that are unusual (that is, outliers) with respect to the predictors (all of \(\textbf{X}\)) are said to have high leverage because they can disproportionately impact the coefficient estimates of our regression models. Let’s look at a toy example.

library(tidyverse)

# Setting seed for replicability
set.seed(1950)

# Generating some sample data
X <- cbind(rep(1, 10), # our vector of 1s
           c(1:9, 20)  # some other data
           )

# and some noisy outcome
y1 <- X %*% c(1, 0.5) + rnorm(10)

# Plotting the two
ggplot(mapping = aes(x = X[,2], # We just want the column that isn't the vector of 1s
                     y = y1)    # and our simulated outcome
       ) +
  
  geom_point(color = "blue") + # this plots each element as a point
  
  geom_smooth(method = "lm",   # This plots a single regression line
              formula = y ~ x, # using the formula passed from aes()
              se = FALSE       # and omitting the standard error intervals
              ) +
  # And this is just to make things look nice
  xlab("X") +
  ylab("Y") +
  theme_bw()

Here, things look pretty nice! But what if that point where \(X = 20\) were different – say, much further from the line?

# copying over our y1 vector
y2 <- y1

# and then changing the relevant y
y2[10] <- y1[10] - 10

# Plotting the old and new data together
ggplot(mapping = aes(x = X[,2],
                     y = y1)
       ) +
  geom_point(aes(y = y2), color = "red") +
  geom_smooth(method = "lm",
              formula = y ~ x,
              se = FALSE
              ) +
  geom_point(color = "blue") +
  geom_smooth(aes(y = y2),
              method = "lm",
              formula = y ~ x,
              se = FALSE,
              color = "red") +
  xlab("X") +
  ylab("Y") +
  theme_bw()

Suddenly, our regression line is quite different! In particular, it matters that the changed \(y\) value belongs to an observation that is an outlier with respect to \(\textbf{X}\). If we made a similar change at a point closer to the rest of the data, the line would move much less.

# Creating a third y which is not so different
y3 <- y1
y3[5] <- y1[5] - 5

# and plotting all three
ggplot(mapping = aes(x = X[,2],
                     y = y1)) +
  geom_point(aes(y = y3),
             color = "green") +
  geom_point(aes(y = y2),
             color = "red") +
  geom_smooth(method = "lm",
              formula = y ~ x,
              se = FALSE) +
  geom_point(color = "blue") +
  geom_smooth(aes(y = y2),
              method = "lm",
              formula = y ~ x,
              se = FALSE,
              color = "red") +
  geom_smooth(aes(y = y3),
              method = "lm",
              formula = y ~ x,
              se = FALSE,
              color = "green") +
  xlab("X") +
  ylab("Y") +
  theme_bw()

It’s more difficult to visualize this in a multivariate setting, but the idea extends well because we can use the projection (or “hat”) matrix \(\textbf{P} = \textbf{X}(\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}^{\top}\). The projection matrix is a function of \(\textbf{X}\) alone, so the values it contains aren’t affected by the fit of the model. In other words, leverage is purely a characteristic of \(\textbf{X}\), and it tells us whether a given observation could potentially impact our model. In practice, we look at the diagonal entries of \(\textbf{P}\) to get our leverage values.

Let’s look at the toy case again to think about \(\textbf{P}\):

# the design matrix
X
##       [,1] [,2]
##  [1,]    1    1
##  [2,]    1    2
##  [3,]    1    3
##  [4,]    1    4
##  [5,]    1    5
##  [6,]    1    6
##  [7,]    1    7
##  [8,]    1    8
##  [9,]    1    9
## [10,]    1   20
# X - transpose - X
t(X) %*% X
##      [,1] [,2]
## [1,]   10   65
## [2,]   65  685
# (X - transpose - X) inverse
solve(t(X) %*% X)
##            [,1]         [,2]
## [1,]  0.2609524 -0.024761905
## [2,] -0.0247619  0.003809524
# The full projection (or hat) matrix
hat <- X %*% solve(t(X) %*% X) %*% t(X)
hat
##              [,1]        [,2]        [,3]        [,4]       [,5]       [,6]
##  [1,]  0.21523810  0.19428571  0.17333333  0.15238095 0.13142857 0.11047619
##  [2,]  0.19428571  0.17714286  0.16000000  0.14285714 0.12571429 0.10857143
##  [3,]  0.17333333  0.16000000  0.14666667  0.13333333 0.12000000 0.10666667
##  [4,]  0.15238095  0.14285714  0.13333333  0.12380952 0.11428571 0.10476190
##  [5,]  0.13142857  0.12571429  0.12000000  0.11428571 0.10857143 0.10285714
##  [6,]  0.11047619  0.10857143  0.10666667  0.10476190 0.10285714 0.10095238
##  [7,]  0.08952381  0.09142857  0.09333333  0.09523810 0.09714286 0.09904762
##  [8,]  0.06857143  0.07428571  0.08000000  0.08571429 0.09142857 0.09714286
##  [9,]  0.04761905  0.05714286  0.06666667  0.07619048 0.08571429 0.09523810
## [10,] -0.18285714 -0.13142857 -0.08000000 -0.02857143 0.02285714 0.07428571
##             [,7]       [,8]       [,9]       [,10]
##  [1,] 0.08952381 0.06857143 0.04761905 -0.18285714
##  [2,] 0.09142857 0.07428571 0.05714286 -0.13142857
##  [3,] 0.09333333 0.08000000 0.06666667 -0.08000000
##  [4,] 0.09523810 0.08571429 0.07619048 -0.02857143
##  [5,] 0.09714286 0.09142857 0.08571429  0.02285714
##  [6,] 0.09904762 0.09714286 0.09523810  0.07428571
##  [7,] 0.10095238 0.10285714 0.10476190  0.12571429
##  [8,] 0.10285714 0.10857143 0.11428571  0.17714286
##  [9,] 0.10476190 0.11428571 0.12380952  0.22857143
## [10,] 0.12571429 0.17714286 0.22857143  0.79428571
# the diagonal, which contains the leverage
diag(hat)
##  [1] 0.2152381 0.1771429 0.1466667 0.1238095 0.1085714 0.1009524 0.1009524
##  [8] 0.1085714 0.1238095 0.7942857

As we can see, the tenth observation has a much higher leverage (\(\ell_{10} \approx 0.794\)) than the fifth (\(\ell_{5} \approx 0.109\)). This is a useful diagnostic: because the leverage values sum to \(K\), on average they should be \(\frac{K}{n} = \frac{2}{10} = 0.2\). Any values above this threshold, here \(\ell_{1}\) and \(\ell_{10}\), are ones we might need to worry about.

Finally, for a fitted model my_model <- lm(...), we can get these values directly with hatvalues(my_model).
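
To see this in practice, here is a minimal check (the object names toy and toy_model are just placeholders for this example): we refit the toy regression with lm() and confirm that hatvalues() reproduces the diagonal of the hat matrix we built by hand.

# Refitting the toy regression with lm()
toy <- data.frame(x = X[, 2], y = as.numeric(y1))
toy_model <- lm(y ~ x, data = toy)

# hatvalues() should reproduce diag(hat) from above
hatvalues(toy_model)

# and the leverages should average to K / n = 2 / 10 = 0.2
mean(hatvalues(toy_model))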


Cook’s Distance and DFBETAS

Of course, an observation with high leverage isn’t necessarily going to affect our model. Problems arise when an observation both has high leverage and is an outlier with respect to \(y\). One way to formalize this is using a measure called Cook’s Distance:

\[\begin{equation*} D_i = \frac{(\hat{Y} - \tilde{Y}_{-i})^{\top} (\hat{Y} - \tilde{Y}_{-i})}{\hat{\sigma}^2 \times (K + 1)} = \frac{\hat{\epsilon}_{i}^{2} \times \textbf{P}_{ii}}{\hat{\sigma}^2 \times (K + 1) \times (1 - \textbf{P}_{ii})^2} \end{equation*}\]

where \(\tilde{Y}_{-i} = \textbf{X}\tilde{\beta}_{-i}\) denotes the fitted values we would obtain if observation \(i\) were left out of the estimation.
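
As a quick sanity check of the right-hand expression (just a sketch, reusing the toy_model object fitted above; for that model \(K + 1 = 2\) estimated parameters), we can compute it by hand and compare it to R’s built-in cooks.distance():

# Computing Cook's Distance manually for the toy model
e  <- resid(toy_model)              # residuals
h  <- hatvalues(toy_model)          # leverages (the diagonal of P)
s2 <- summary(toy_model)$sigma^2    # estimate of sigma-squared

d_manual <- e^2 * h / (s2 * 2 * (1 - h)^2)

# This should match R's built-in calculation
all.equal(unname(d_manual), unname(cooks.distance(toy_model)))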

We might also use a metric called DFBETAS: \[\begin{equation*} \texttt{DFBETAS}_{ik} = \frac{\hat{\beta}_k - \tilde{\beta}_{k \, (-i)}}{\text{s.e.}(\tilde{\beta}_{k \, (-i)})} \end{equation*}\]

where the change in the coefficients from dropping observation \(i\) can be computed without refitting: \(\hat{\beta} - \tilde{\beta}_{-i} = \frac{(\textbf{X}^{\top}\textbf{X})^{-1}\textbf{X}_{i}^{\top}\hat{\epsilon}_{i}}{1 - \textbf{P}_{ii}}\), with \(\textbf{X}_{i}\) denoting the \(i\)-th row of \(\textbf{X}\).
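
To see that this leave-one-out shortcut works (again just a sketch using the toy objects from above, with i = 10 chosen because it is the high-leverage point), we can refit the toy model without the tenth observation and compare the brute-force change in the coefficients to the formula:

# Dropping the high-leverage observation and refitting
i <- 10
refit <- lm(y ~ x, data = toy[-i, ])

# The brute-force change in the coefficients...
coef(toy_model) - coef(refit)

# ...should match the shortcut based only on the full-sample fit
solve(t(X) %*% X) %*% X[i, ] * resid(toy_model)[i] / (1 - hatvalues(toy_model)[i])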

Below, we’ll find and plot Cook’s Distance, DFBETAS, and other diagnostics:

# Arbitrarily subsetting to low mpg cars
cars <- mtcars %>% filter(mpg < 20)

# Adding JDW's car to the data
jdw_camry <- data.frame(mpg = 52,
                        cyl = 4,
                        disp = 152.6,
                        hp = 208,
                        drat = 3.29,
                        wt = 3.472,
                        qsec = 15.9,
                        vs = 1,
                        am = 0,
                        gear = 8,
                        carb = NA
                          )

rownames(jdw_camry) <- "JDW Camry"

cars <- cars %>%
  rbind(jdw_camry) %>% 
  mutate(hybrid = case_when(mpg > 50 ~ 1,
                            mpg < 50 ~ 0))

# The relationship in question: miles per gallon as a function of weight and displacement
cars_model <- lm(data = cars,
                 mpg ~ wt + disp)


# Using Cook's distance
cooks.distance(cars_model)
##   Hornet Sportabout             Valiant          Duster 360            Merc 280 
##        3.209123e-02        7.926039e-03        1.319276e-05        5.193316e-02 
##           Merc 280C          Merc 450SE          Merc 450SL         Merc 450SLC 
##        8.022510e-02        8.447856e-03        1.531776e-03        6.942170e-03 
##  Cadillac Fleetwood Lincoln Continental   Chrysler Imperial    Dodge Challenger 
##        1.038148e-03        1.049107e-02        9.962559e-03        5.689739e-04 
##         AMC Javelin          Camaro Z28    Pontiac Firebird      Ford Pantera L 
##        2.173365e-03        1.928634e-03        5.565569e-02        8.519039e-03 
##        Ferrari Dino       Maserati Bora           JDW Camry 
##        4.975589e-02        2.767982e-03        1.478342e+00
# and DFBETAs
# Here, |DFBETAS| > 2 / sqrt(n) is our rule-of-thumb threshold
dfbetas(cars_model)
##                      (Intercept)           wt         disp
## Hornet Sportabout    0.170540297 -0.235567604  0.236869947
## Valiant             -0.059731757 -0.015902022  0.075654019
## Duster 360           0.002993177 -0.004264272  0.004565158
## Merc 280            -0.074140762 -0.152460677  0.310226521
## Merc 280C           -0.093311459 -0.191882682  0.390442297
## Merc 450SE           0.040081535 -0.097782128  0.095372924
## Merc 450SL          -0.014343624 -0.008683554  0.019046236
## Merc 450SLC         -0.019078248 -0.031999339  0.050230817
## Cadillac Fleetwood   0.040033382 -0.027121479 -0.007283065
## Lincoln Continental  0.140280892 -0.113573960  0.009739256
## Chrysler Imperial   -0.136883860  0.118503194 -0.022347427
## Dodge Challenger    -0.023079913  0.024137997 -0.020515571
## AMC Javelin         -0.050870749  0.048369995 -0.035032032
## Camaro Z28          -0.018363342  0.027963608 -0.039652508
## Pontiac Firebird     0.109249439 -0.236875204  0.329808074
## Ford Pantera L       0.103213309 -0.133375488  0.120697691
## Ferrari Dino        -0.275821052  0.081325995  0.154933016
## Maserati Bora       -0.046940409  0.037714243 -0.025375578
## JDW Camry            1.026616523  3.960938882 -7.241925116
# and influence.measures(), which reports a bunch of these
influence.measures(cars_model)
## Influence measures of
##   lm(formula = mpg ~ wt + disp, data = cars) :
## 
##                       dfb.1_   dfb.wt dfb.disp    dffit    cov.r   cook.d
## Hornet Sportabout    0.17054 -0.23557  0.23687  0.30492 1.338589 3.21e-02
## Valiant             -0.05973 -0.01590  0.07565 -0.15046 1.271218 7.93e-03
## Duster 360           0.00299 -0.00426  0.00457  0.00609 1.402787 1.32e-05
## Merc 280            -0.07414 -0.15246  0.31023 -0.39077 1.299960 5.19e-02
## Merc 280C           -0.09331 -0.19188  0.39044 -0.49181 1.205755 8.02e-02
## Merc 450SE           0.04008 -0.09778  0.09537 -0.15532 1.281215 8.45e-03
## Merc 450SL          -0.01434 -0.00868  0.01905 -0.06579 1.271009 1.53e-03
## Merc 450SLC         -0.01908 -0.03200  0.05023 -0.14117 1.214885 6.94e-03
## Cadillac Fleetwood   0.04003 -0.02712 -0.00728 -0.05405 1.662068 1.04e-03
## Lincoln Continental  0.14028 -0.11357  0.00974 -0.17213 1.768241 1.05e-02
## Chrysler Imperial   -0.13688  0.11850 -0.02235  0.16776 1.712860 9.96e-03
## Dodge Challenger    -0.02308  0.02414 -0.02052 -0.04003 1.320418 5.69e-04
## AMC Javelin         -0.05087  0.04837 -0.03503 -0.07835 1.309081 2.17e-03
## Camaro Z28          -0.01836  0.02796 -0.03965 -0.07382 1.292970 1.93e-03
## Pontiac Firebird     0.10925 -0.23688  0.32981  0.40763 1.197304 5.57e-02
## Ford Pantera L       0.10321 -0.13338  0.12070  0.15517 1.583041 8.52e-03
## Ferrari Dino        -0.27582  0.08133  0.15493 -0.38124 1.355189 4.98e-02
## Maserati Bora       -0.04694  0.03771 -0.02538 -0.08857 1.268069 2.77e-03
## JDW Camry            1.02662  3.96094 -7.24193  8.57797 0.000283 1.48e+00
##                        hat inf
## Hornet Sportabout   0.1706    
## Valiant             0.0885    
## Duster 360          0.1349    
## Merc 280            0.1830    
## Merc 280C           0.1830    
## Merc 450SE          0.0950    
## Merc 450SL          0.0584    
## Merc 450SLC         0.0604    
## Cadillac Fleetwood  0.2710   *
## Lincoln Continental 0.3221   *
## Chrysler Imperial   0.3007   *
## Dodge Challenger    0.0841    
## AMC Javelin         0.0851    
## Camaro Z28          0.0740    
## Pontiac Firebird    0.1526    
## Ford Pantera L      0.2446   *
## Ferrari Dino        0.2006    
## Maserati Bora       0.0644    
## JDW Camry           0.2271   *
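
We can also apply the DFBETAS rule of thumb from the comment above in code (a small sketch using the objects already defined; the cutoff is 2 / sqrt(n)):

# Observations with at least one |DFBETAS| above 2 / sqrt(n)
n <- nrow(cars)
which(apply(abs(dfbetas(cars_model)) > 2 / sqrt(n), 1, any))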

In R, we can call plot(my_model) to get four standard diagnostic plots:

# Residuals versus Fitted
plot(cars_model, 1) # I'm calling them one at a time instead of all at once

                    # You can get all four at once by calling plot(my_model) with no other inputs
  1. The Residuals vs Fitted plot can help with two things. First, it can help us see whether the constant variance assumption is appropriate (most often, if it isn’t, we will see a “funnel” opening in one direction or the other); second, it can tell us whether we might need to transform the outcome variable in some way (if the line of best fit isn’t reasonably straight). A roughly horizontal line without distinct patterns is an indication of a linear relationship, which is what we are looking for.
# Normal Q-Q
plot(cars_model, 2)

  1. The Normal Q-Q plot helps assess whether our residuals are Normally distributed. If they are, the points should fall along the straight dotted line. If most of the points are on the line but the points at the ends are not, the residuals’ tails are heavier than they would be under a Normal distribution. If the points form a “wavy” line, the distribution of residuals is skewed, suggesting that our outcome variable may need to be transformed if we wish to have Normally distributed errors.
# Scale-Location
plot(cars_model, 3)

  1. The Scale-Location plot is basically the Residuals vs Fitted plot folded in half and compressed. It does not tell us much more, but it is commonly used to check for homoskedasticity: a horizontal line with evenly spread points is a good indication that our errors are homoskedastic.
# Residuals vs Leverage
plot(cars_model, 5) # although by default this is the fourth plot

  1. The Residuals vs Leverage plot shows each observation’s standardized residual against its leverage. Remember, we only need to worry when both are high. The dotted lines are contours of Cook’s Distance.
# Cook's Distance
plot(cars_model, 4)
abline(h = 0.25,       # our threshold (see below)
       col = "red",
       lty = "dotted")

  1. The Cook’s Distance plot displays (not surprisingly) Cook’s Distance for each observation. In general, we are concerned if \(D_i > \frac{4}{N - K - 1}\). In this example, our threshold is \(D = \frac{4}{19 - 2 - 1} = \frac{4}{16} = 0.25\), which my Camry clearly exceeds; we can also check this programmatically below.
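
We can flag the offending observations directly (a quick sketch using the objects defined above; the cutoff is the same \(\frac{4}{N - K - 1}\) rule of thumb):

# Observations whose Cook's Distance exceeds 4 / (N - K - 1)
d <- cooks.distance(cars_model)
threshold <- 4 / (nrow(cars) - 2 - 1)  # N = 19 observations, K = 2 predictors
d[d > threshold]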

Bias-Variance Tradeoff

OLS provides a lot of guarantees about the asymptotic properties of our coefficient estimates, \(\hat{\beta}\), but it doesn’t tell us whether we have a good model. In particular, it tells us absolutely nothing about how well our model will perform on data we haven’t seen before.

The bias-variance tradeoff is a well-known problem of model selection in statistics and machine learning. Intuitively, it involves two components:

  1. Bias: error that arises when the model is systematically wrong, for example because it is too simple or omits important predictors, so its predictions miss the truth on average.
  2. Variance: error that arises because the model is sensitive to the particular sample it was fit on, so its estimates and predictions change substantially from one dataset to the next.

Think about how these two are related. Generally, we try to reduce bias by introducing predictors that capture things we think are important, or that confound the relationship in which we are interested. But doing so increases the variance: as we introduce more predictors (that is, as \(K\) grows large), the fitted regression surface responds more and more to changes in those inputs. In the most extreme case, we would have a parameter for every single point in our data: our in-sample predictions would be perfect, but the associated variance would be extraordinarily high. If we then ran new data through this model, its predictions would respond wildly to even small changes in the predictors; in other words, they would be extremely noisy.

A related consequence of this tradeoff is that low-bias (highly flexible) models tend to predict well in-sample but poorly out-of-sample, while low-variance (simpler) models tend to predict worse in-sample but better out-of-sample. This presents a difficult problem for researchers: we generally want our models to have as little bias as possible, but if we overfit, our coefficient estimates may reflect meaningless noise. It also means that a good fit in-sample does not necessarily indicate a good model.
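
As a small illustration of that last point (purely a sketch: the simulated data, the degree-10 polynomial, and all object names here are invented for this example), we can simulate a linear relationship, fit both a simple and a very flexible model, and compare in-sample fit to out-of-sample prediction error:

# Simulating a simple linear relationship, with separate training and test samples
set.seed(2024)
train <- tibble(x = runif(20, 0, 10))   %>% mutate(y = 1 + 0.5 * x + rnorm(n()))
test  <- tibble(x = runif(1000, 0, 10)) %>% mutate(y = 1 + 0.5 * x + rnorm(n()))

# A simple (low-variance) model and a flexible (low-bias, high-variance) one
simple   <- lm(y ~ x, data = train)
flexible <- lm(y ~ poly(x, 10), data = train)

# In-sample, the flexible model always fits at least as well...
c(simple = summary(simple)$r.squared, flexible = summary(flexible)$r.squared)

# ...but out-of-sample, its predictions are typically much worse
c(simple   = mean((test$y - predict(simple, test))^2),
  flexible = mean((test$y - predict(flexible, test))^2))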