Inference for the Linear Model

Big picture: we wish to make some claim about a relationship in the world. We can characterize this relationship between an outcome \(Y\) and explanatory factor(s) \(X\) in a linear framework, wherein \(\beta\) is the parameter of interest. \(\beta\) itself is unknown, but we use data to estimate some \(\hat{\beta}\) that is informative about the true \(\beta\). In other words, \(\hat{\beta}\) allows us to make inferences about the nature of \(\beta\).

Omnibus F-Testing

The phrase “omnibus F-testing” is an excellent vocal warm-up for those of you who have to do some public speaking. Try it out, and be sure to over-emphasize each syllable.

More practically, omnibus F-testing refers to the practice of evaluating a battery of null hypotheses within a single linear regression framework. In general, we specify:

\[\begin{equation*} H_0: \textbf{A}\beta = \textbf{b} \end{equation*}\]

where \(\textbf{A}\) is of full row rank. This framework accommodates any linear restriction on \(\beta\), as well as any number of linearly independent restrictions. In class, we demonstrated two cases:

  1. For \(H_0: \displaystyle \sum_{k = 0}^{K} \beta_{k} = 0\), \(\textbf{A} = \left(1, \dots, 1\right)\). In other words, if our null specifies that the sum of all betas is zero, then \(\textbf{A}\) is a single row vector in which each entry is 1 (and \(\textbf{b} = 0\)).
  2. For \(H_0: \beta_k = 0 \, \forall \, k\), \(\textbf{A} = \textbf{I}\). In other words, if our null specifies that every beta is zero, it must be the case that \(\textbf{A}\) is the \(k \times k\) identity matrix.

To summarize: for the omnibus F-test, we specify some form of joint null hypothesis, such as 1) the sum of all betas is zero; or 2) every beta is zero. The second is the more common case. Against it, we specify the alternative hypothesis \(H_A\):

\[\begin{equation*} H_A: \exists \, i \in \{1, \dots, k\} \ \text{such that} \ \beta_i \neq 0 \end{equation*}\]

That is, among our vector of \(k\) betas, there exists at least 1 \(\beta_i \not = 0\), which falsifies \(H_0\). The way we evaluate these hypotheses is by constructing a suitable test statistic, \(F\), which evaluates the cumulative distance from \(H_0\) we observe via \(\hat{\beta}\). Specifically, define:

\[\begin{equation*} F \equiv \frac{(\textbf{A} \hat{\beta}- \textbf{b})^{\top} \left(\textbf{A}(\textbf{X}^{\top}\textbf{X})^{-1}\textbf{A}^{\top} \right)^{-1} (\textbf{A}\hat{\beta}- \textbf{b}) / \sigma^2}{\hat{\epsilon}^{\top}\hat{\epsilon} / \sigma^2} \times \frac{N - K - 1}{\text{rank}(\textbf{A})} \end{equation*}\]

The numerator and denominator are known to follow \(\chi^{2}_{\text{rank}(\textbf{A})}\) and \(\chi^{2}_{N - K - 1}\) distributions, respectively, and the unknown \(\sigma^2\) cancels. Dividing each by its degrees of freedom, \(F\) follows a known distribution: \(F \sim F_{\text{rank}(\textbf{A}), N - K - 1}\). If our test statistic \(F\) is larger than the critical value associated with some specified \(\alpha\), we reject the null hypothesis (just as in the univariate case).

Consider the following example:

library(faraway)

# A simple multivariate model of demographic/economic variables
# on propensity for gambling
summary(lm(data = teengamb,
           gamble ~ income + sex + status))
## 
## Call:
## lm(formula = gamble ~ income + sex + status, data = teengamb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.682 -12.169  -0.268   9.161  97.728 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13.0315    15.8676   0.821  0.41603    
## income        4.9280     1.0352   4.760 2.21e-05 ***
## sex         -24.3393     8.1274  -2.995  0.00454 ** 
## status       -0.1496     0.2413  -0.620  0.53856    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.92 on 43 degrees of freedom
## Multiple R-squared:  0.5058, Adjusted R-squared:  0.4713 
## F-statistic: 14.67 on 3 and 43 DF,  p-value: 1.014e-06
# And the F-test statistic.
summary(lm(data = teengamb,
           gamble ~ income + sex + status))$fstatistic[1]
##    value 
## 14.67009

Under this specification, our \(F\)-test statistic is 14.67, with a p-value of roughly \(1 \times 10^{-6}\). In other words, we reject the null hypothesis that every \(\beta\) in this regression is 0. Notice that we fail to reject the null hypothesis that the individual \(\beta\) associated with status is 0, yet we still reject the joint null that every \(\beta\) is 0.
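
To connect this back to the formula above, here is a minimal sketch that reproduces the same statistic by hand for the standard omnibus null that all three slope coefficients are zero. The restriction matrix A, the vector b, and the intermediate object names below are our own illustration, not part of the original output.

library(faraway)

# Design matrix and outcome
X <- model.matrix(~ income + sex + status, data = teengamb)
y <- teengamb$gamble

# OLS estimates and residuals by hand
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
e_hat    <- y - X %*% beta_hat

N <- nrow(X)
K <- ncol(X) - 1                # number of slope coefficients

# Restriction: each slope equals zero; the intercept is left unrestricted
A <- cbind(0, diag(K))
b <- rep(0, K)

# F statistic from the matrix formula (sigma^2 cancels out)
d     <- A %*% beta_hat - b
F_man <- (t(d) %*% solve(A %*% solve(t(X) %*% X) %*% t(A)) %*% d /
            sum(e_hat^2)) * (N - K - 1) / qr(A)$rank
F_man                           # matches the 14.67 reported above

# Critical value for alpha = 0.05
qf(p = 0.95, df1 = qr(A)$rank, df2 = N - K - 1)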

Interaction Terms

Let \(\textbf{X}\) be a matrix of \(k\) variables. In the OLS framework, an interaction term \(X_{\ell}\) is a variable where \(X_{\ell} = X_{m} \times X_n\); that is, \(X_{\ell}\) is itself a product of two other variables (\(X_m\) and \(X_n\) are analogous to \(X_i\) and \(Z_i\) in Ted’s slides). The model in such a scenario is:

\[\begin{equation*} Y_i = \beta_0 + \beta_1 X_{i,m} + \beta_2 X_{i, n} + \beta_3 X_{i, \ell} + \epsilon_i \end{equation*}\]

Using an interaction term allows the slope of one variable to be a function of another, and allows us to measure variation within a subgroup of our data. Usually, this works most effectively when one of \(X_m\) or \(X_n\) is a binary variable.
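
Concretely, since \(X_{i,\ell} = X_{i,m} \times X_{i,n}\), the marginal effect of \(X_m\) on \(Y\) in the model above is:

\[\begin{equation*} \frac{\partial Y_i}{\partial X_{i,m}} = \beta_1 + \beta_3 X_{i,n} \end{equation*}\]

So the slope on \(X_m\) is \(\beta_1\) when \(X_n = 0\) and \(\beta_1 + \beta_3\) when \(X_n = 1\): the interaction lets the two groups have different slopes.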

Let’s revisit the teengamb data for an application. Let \(Y\) be gamble, and our two \(X\) variables be income and sex.

# Plotting the data without regard for `sex`
library(ggplot2)
ggplot(data = teengamb,
       aes(x = income,
           y = gamble))+
  geom_point()+
  geom_smooth(method = lm)+
  theme_bw()+
    xlab("Income")+
  ylab("Gambling Propensity")+
  theme(legend.background = element_rect(color = "grey80"),
        legend.position = "right",
        legend.key.size = unit(2, "cm"))+
  labs(title = "Income and Gambling Propensity")

We see a pretty clear positive relationship: as income increases, so does propensity for gambling. However, this approach masks the role of gender (setting aside the normative implications of treating gender as exclusively binary).

# Now plotting with distinctions for sex
ggplot(data = teengamb,
       aes(x = income,
           y = gamble,
           color = as.factor(sex)))+
  geom_point()+
  geom_smooth(method = lm)+
  theme_bw()+
  scale_color_manual(values = c("0" = "blue",
                                "1" = "orange"),
                     labels = c("Men",
                                "Women"))+
  xlab("Income")+
  ylab("Gambling Propensity")+
  theme(legend.background = element_rect(color = "grey80"),
        legend.position = "right",
        legend.key.size = unit(2, "cm"))+
  labs(title = "Income and Gambling Propensity by Sex",
       color = "Sex")

In a regression framework, we can interact income and sex to recover these differences.

summary(lm(data = teengamb,
           gamble ~ income + sex + (income * sex)))
## 
## Call:
## lm(formula = gamble ~ income + sex + (income * sex), data = teengamb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.522  -4.860  -1.790   6.273  93.478 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.6596     6.3164  -0.421  0.67580    
## income        6.5181     0.9881   6.597 4.95e-08 ***
## sex           5.7996    11.2003   0.518  0.60724    
## income:sex   -6.3432     2.1446  -2.958  0.00502 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.98 on 43 degrees of freedom
## Multiple R-squared:  0.5857, Adjusted R-squared:  0.5568 
## F-statistic: 20.26 on 3 and 43 DF,  p-value: 2.451e-08
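
An aside on syntax: in R’s formula language, income * sex already expands to the two main effects plus their interaction, so the more compact call below fits the identical model.

# Equivalent, more idiomatic specification of the same model
lm(data = teengamb,
   gamble ~ income * sex)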

Here, the slope of interest for women is plausibly zero, whereas for men it is clearly positive. A note: be very careful about interpreting the results of a regression table versus a plot in this setting! \(\hat{\beta}_1 = 6.52\) indicates that, among men (sex = 0), a unit increase in income is associated with a 6.52-unit increase in gambling propensity. The plot reflects this. However, it is different for women (an example of sexism in my lab composition practices, perhaps).

The plot suggests that, for women, the relationship between income and gambling propensity is close to zero. However, the coefficient \(\hat{\beta}_3\) in the regression table should be interpreted as: among women, a unit increase in income is associated with a change in gambling propensity 6.34 units lower than the corresponding change for men. In other words, \(\beta_3\) captures how the slope on \(X_m\) (income) changes when \(X_n = 1\) (sex). The implied slope for women is therefore \(\hat{\beta}_1 + \hat{\beta}_3 \approx 6.52 - 6.34 = 0.17\), which is what the nearly flat line in the plot shows.
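
We can verify this directly by adding the two coefficients from the fitted model (the object name int_mod below is our own):

# Implied income slope for women (sex = 1): beta_1 + beta_3
int_mod <- lm(data = teengamb,
              gamble ~ income + sex + (income * sex))
coef(int_mod)["income"] + coef(int_mod)["income:sex"]  # roughly 0.17, essentially flat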

A final caution: for interaction terms to work, we need enough common support (variation in \(X_m\) throughout the range of possible values of \(X_n\)). Referring to the plot, there is very clearly variation in gambling propensity for men, whereas the variation for women is virtually nonexistent. A lesson: visualizing relationships while conditioning on variables is always a good complement to any regression exercise.

Bootstrap

Remember that, in nearly any setting, we only get one attempt at estimating \(\beta\) and other parameters of interest. That is, we only get to conduct a single draw from our population of interest, and must use that draw to make any and all inferences. The bootstrap is a statistical technique that allows us to exploit variation within our single draw: since we cannot resample from the population, we instead resample from the sample. The procedure, in general, is:

  1. Draw a random sample from the data, with replacement, of the same size as the data
  2. Using this new draw, estimate parameters of interest
  3. Repeat this process \(M\) times, creating an empirical distribution of parameter estimates
  4. Recover properties of that empirical distribution – usually the mean and variance

If we do this many times, we can approximate the act of sampling from the full population, which allows us to estimate the sampling distribution of any statistic.

An application using teengamb:

# Setting seed for replicability
set.seed(12345)

### Bootstrap parameters of interest
# Each bootstrap sample will contain N observations
N <- nrow(teengamb)

# And iterate M times
M <- 10000

### "True" model
# What happens if we run this through lm()?
my_model <- lm(data = teengamb,
               gamble ~ income + sex + status)
summary(my_model)
## 
## Call:
## lm(formula = gamble ~ income + sex + status, data = teengamb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.682 -12.169  -0.268   9.161  97.728 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13.0315    15.8676   0.821  0.41603    
## income        4.9280     1.0352   4.760 2.21e-05 ***
## sex         -24.3393     8.1274  -2.995  0.00454 ** 
## status       -0.1496     0.2413  -0.620  0.53856    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.92 on 43 degrees of freedom
## Multiple R-squared:  0.5058, Adjusted R-squared:  0.4713 
## F-statistic: 14.67 on 3 and 43 DF,  p-value: 1.014e-06
# We're going to bind these results later
model_betas <- my_model$coefficients
names(model_betas) <- c("Intercept", "Income", "Sex", "Status")

model_errors <- sqrt(diag(vcov(my_model)))
names(model_errors) <- c("Intercept", "Income", "Sex", "Status")

### Now Bootstrap
# There are a bajillion different ways to bootstrap, including many
# canned functions in various R packages, but we'll do so manually
# using a for-loop. We do it this way to illustrate exactly what
# it is we are doing.

# Creating empty matrices in which to store coefficient estimates
# Since we are capturing four parameters over M iterations,
# this will be M x 4 (one row per iteration, one column per parameter)
boot_betas <- data.frame(matrix(ncol = 4,
                              nrow = M))
names(boot_betas) <- c("Intercept", "Income", "Sex", "Status")

boot_errors <- data.frame(matrix(ncol = 4,
                              nrow = M))
names(boot_errors) <- c("Intercept", "Income", "Sex", "Status")

# Constructing the for-loop
for (i in 1:M) {    # We wish to iterate M times
  # print(i)        # Printing the iteration number is good practice when
                    # running loops, but it'll explode our document here,
                    # so we comment it out
  
  # Here, we create an index of rows from which to sample
  rows <- sample(x = N,           # a single number N here means: sample from 1:N
                 size = N,        # draw N row indices
                 replace = TRUE)  # with replacement
  
  # Now we actually sample the observations from our data
  # for the ith iteration of the loop
  boot_data <- as.data.frame(teengamb[rows, ]) 
  
  # Estimating the linear model for the ith iteration
  results <- summary(lm(data = boot_data,
                        gamble ~ income + sex + status))
  
  # Recovering coefficient estimates for the ith iteration
  # and storing them in the global environment
  # There are many ways to do this: be creative! But make sure
  # your loop actually does what you think it is doing
  boot_betas[i,] <- as.data.frame(t(results$coefficients[,1]))
  
  # Same thing for the standard errors
  boot_errors[i,] <- as.data.frame(t(results$coefficients[,2]))
}

### Having collected our bootstrapped data, let's preview some results
# The empirical sampling distribution of each coefficient:
ggplot(data = boot_betas,
       aes(x = Intercept))+
  geom_density()+
  geom_vline(aes(xintercept = mean(Intercept)),
             color = "blue")+
  annotate("text",
           x = mean(boot_betas$Intercept),
           hjust = - 0.25,
           y = 0.025,
           label = paste("Mean =", round(mean(boot_betas$Intercept), digits = 2)),
           color = "blue")+
  theme_bw()+
  xlab("Coefficients: Intercept")+
  ylab("Density")+
  labs(title = "Empirical Distribution of Coefficients: Intercept")

# For Income
ggplot(data = boot_betas,
       aes(x = Income))+
  geom_density()+
  geom_vline(aes(xintercept = mean(Income)),
             color = "blue")+
  annotate("text",
           x = mean(boot_betas$Income),
           hjust = - 0.25,
           y = 0.025,
           label = paste("Mean =", round(mean(boot_betas$Income), digits = 2)),
           color = "blue")+
  theme_bw()+
  xlab("Coefficients: Income")+
  ylab("Density")+
  labs(title = "Empirical Distribution of Coefficients: Income")

# For Sex
ggplot(data = boot_betas,
       aes(x = Sex))+
  geom_density()+
  geom_vline(aes(xintercept = mean(Sex)),
             color = "blue")+
  annotate("text",
           x = mean(boot_betas$Sex),
           hjust = - 0.25,
           y = 0.025,
           label = paste("Mean =", round(mean(boot_betas$Sex), digits = 2)),
           color = "blue")+
  theme_bw()+
  xlab("Coefficients: Sex")+
  ylab("Density")+
  labs(title = "Empirical Distribution of Coefficients: Sex")

# For Status
ggplot(data = boot_betas,
       aes(x = Status))+
  geom_density()+
  geom_vline(aes(xintercept = mean(Status)),
             color = "blue")+
  annotate("text",
           x = mean(boot_betas$Status),
           hjust = - 0.25,
           y = 0.025,
           label = paste("Mean =", round(mean(boot_betas$Status), digits = 2)),
           color = "blue")+
  theme_bw()+
  xlab("Coefficient: Status")+
  ylab("Density")+
  labs(title = "Empirical Distribution of Coefficients: Status")

# We can construct similar plots for the standard errors
# Now to actually report results:
# recovering the averages from the bootstrapped samples
results_boot_betas <- apply(boot_betas,
                         2,
                         FUN = mean)

# and averaging the per-iteration standard errors
results_boot_errors <- apply(boot_errors,
                          2,
                          FUN = mean)

# Bias Correction
boot_betas_corrected <- (2 * model_betas) - results_boot_betas

boot_ci <- rbind(
# Intercept
cbind(boot_betas_corrected[1] - 1.96 * (results_boot_errors[1]) / sqrt(M),
      boot_betas_corrected[1] + 1.96 * (results_boot_errors[1]) / sqrt(M)),

# Income
cbind(boot_betas_corrected[2] - 1.96 * (results_boot_errors[2]) / sqrt(M),
      boot_betas_corrected[2] + 1.96 * (results_boot_errors[2]) / sqrt(M)),

# Sex
cbind(boot_betas_corrected[3] - 1.96 * (results_boot_errors[3]) / sqrt(M),
      boot_betas_corrected[3] + 1.96 * (results_boot_errors[3]) / sqrt(M)),

# Status
cbind(boot_betas_corrected[4] - 1.96 * (results_boot_errors[4]) / sqrt(M),
      boot_betas_corrected[4] + 1.96 * (results_boot_errors[4]) / sqrt(M))
)

# Reporting Results
results_boot <- cbind(model_betas,
                      boot_betas_corrected,
                      model_errors,
                      results_boot_errors,
                      boot_ci)

colnames(results_boot) <- c("Model Betas",
                         "Bootstrap Betas",
                         "Model Errors",
                         "Bootstrap Errors",
                         "Bootstrap Lower Bound",
                         "Bootstrap Upper Bound")
results_boot
##           Model Betas Bootstrap Betas Model Errors Bootstrap Errors
## Intercept  13.0314758      14.0404214   15.8676107        15.319307
## Income      4.9279770       4.9632841    1.0352454         1.012283
## Sex       -24.3393433     -24.8959198    8.1274125         7.753255
## Status     -0.1495854      -0.1658591    0.2412857         0.232462
##           Bootstrap Lower Bound Bootstrap Upper Bound
## Intercept            13.7401629            14.3406798
## Income                4.9434434             4.9831248
## Sex                 -25.0478836           -24.7439560
## Status               -0.1704153            -0.1613028
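
As an aside, the “canned functions” mentioned earlier can handle much of this bookkeeping for us. Below is a minimal sketch using the boot package; boot_fn and boot_out are our own names, and this is an illustration alongside the manual workflow above rather than a replacement for it.

library(boot)

# Statistic function: refit the model on the resampled rows and return the coefficients
boot_fn <- function(data, idx) {
  coef(lm(gamble ~ income + sex + status, data = data[idx, ]))
}

boot_out <- boot(data = teengamb, statistic = boot_fn, R = M)  # M = 10000, as above
boot_out                                 # original estimates, bias, and bootstrap SEs
# boot.ci(boot_out, type = "perc", index = 2)   # e.g., a percentile CI for income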

So what we’ve done here is recover estimates that are admittedly somewhat biased (mostly the intercept), but with greater certainty around their point estimates. We still need to be careful when making claims about population characteristics, but we now have a very good overall picture of the sampling distribution of \(\hat{\beta}\) in the teengamb data.