Big picture: we wish to make some claim about a relationship in the world. We can characterize this relationship between outcome \(Y\) and explanatory factor(s) \(X\) in a linear framework, wherein \(\beta\) is the parameter of interest. \(\beta\) itself is unknown, but we use data to estimate some \(\hat{\beta}\) that is informative of the true \(\beta\). In other words, \(\hat{\beta}\) allows us to make inferences about the nature of \(\beta\).
The phrase “omnibus F-testing” is an excellent vocal warm-up
for those of you who have to do some public speaking. Try it out, and be
sure to over-emphasize each syllable.
More practically, omnibus F-testing refers to the practice of evaluating a battery of null hypotheses within a single linear regression framework. In general, we specify:
\[\begin{equation*} H_0: \textbf{A}\beta = \textbf{b} \end{equation*}\]
where \(\textbf{A}\) is of full row rank. This framework is flexible to any sort of linear restriction on \(\beta\), as well as any number of linearly independent restrictions. In class, we demonstrated two cases:
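As a rough sketch of what such cases can look like (assuming an intercept \(\beta_0\) plus \(k\) slope coefficients, so \(\beta = (\beta_0, \beta_1, \dots, \beta_k)^{\top}\)), the corresponding restriction matrices might be written as:
\[\begin{equation*} \underbrace{\textbf{A} = \begin{pmatrix} 0 & 1 & 1 & \cdots & 1 \end{pmatrix}}_{\text{slopes sum to zero}} \qquad \text{or} \qquad \underbrace{\textbf{A} = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix}}_{\text{every slope is zero}}, \qquad \textbf{b} = \textbf{0} \end{equation*}\]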
To summarize: for the omnibus F-test, we specify some form of null hypothesis where 1) the sum of all betas is zero; or 2) every beta is zero. The second is more common and flexible. For that second case, the alternative hypothesis \(H_A\) is:
\[\begin{equation*} H_A: \exists \, i \in \{1, \dots, k\} \;\; \text{such that} \;\; \beta_i \neq 0 \end{equation*}\]
That is, among our vector of \(k\) betas, there exists at least one \(\beta_i \neq 0\), which falsifies \(H_0\).
The way we evaluate these hypotheses is by constructing a suitable test statistic, \(F\), which measures the cumulative distance from \(H_0\) that we observe via \(\hat{\beta}\). Specifically, define:
\[\begin{equation*} F \equiv \frac{(\textbf{A} \hat{\beta}- \textbf{b})^{\top} \left(\textbf{A}(\textbf{X}^{\top}\textbf{X})^{-1}\textbf{A}^{\top} \right)^{-1} (\textbf{A}\hat{\beta}- \textbf{b}) / \sigma^2}{\hat{\epsilon}^{\top}\hat{\epsilon} / \sigma^2} \times \frac{N - K - 1}{\text{rank}(\textbf{A})} \end{equation*}\]
The numerator and denominator are known to follow \(\chi^{2}_{\text{rank}(\textbf{A})}\) and
\(\chi^{2}_{N - K - 1}\) distributions,
respectively. Thus, \(F\) also follows
a known distribution: \(F \sim
F_{\text{rank}(\textbf{A}), N - K - 1}\). If our test statistic
\(F\) is larger than the critical value
associated with some specified \(\alpha\), we reject the null hypothesis
(similar to the univariate case).
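For instance (a sketch using R's built-in qf() and the degrees of freedom of the model below), the critical value can be read directly off the \(F\) distribution:
# 5% critical value for an F test with rank(A) = 3 restrictions and
# N - K - 1 = 43 residual degrees of freedom (as in the model below)
qf(p = 0.95, df1 = 3, df2 = 43) # roughly 2.8; reject H_0 if F exceeds this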
Consider the following example:
library(faraway)
# A simple multivariate model of demographic/economic variables
# on propensity for gambling
summary(lm(data = teengamb,
gamble ~ income + sex + status))
##
## Call:
## lm(formula = gamble ~ income + sex + status, data = teengamb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.682 -12.169 -0.268 9.161 97.728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.0315 15.8676 0.821 0.41603
## income 4.9280 1.0352 4.760 2.21e-05 ***
## sex -24.3393 8.1274 -2.995 0.00454 **
## status -0.1496 0.2413 -0.620 0.53856
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.92 on 43 degrees of freedom
## Multiple R-squared: 0.5058, Adjusted R-squared: 0.4713
## F-statistic: 14.67 on 3 and 43 DF, p-value: 1.014e-06
# And extracting just the F-test statistic.
summary(lm(data = teengamb,
gamble ~ income + sex + status))$fstatistic[1]
## value
## 14.67009
Under this specific set of parameters, our \(F\)-test statistic is 14.67, which corresponds to a p-value of roughly \(1 \times 10^{-6}\). In other words, we reject the null hypothesis that every \(\beta\) in this regression is 0. Notice that we fail to reject the null hypothesis that the individual \(\beta\) associated with status
is 0, but we do reject the null that every \(\beta\) is 0.
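To connect the formula to that output, here is a sketch of computing the omnibus \(F\) statistic by hand; \(\textbf{A}\) picks out the three slope coefficients, \(\textbf{b} = \textbf{0}\), and the object names below are my own:
# Manually reconstructing the omnibus F statistic from the formula above
fit <- lm(data = teengamb,
          gamble ~ income + sex + status)
X <- model.matrix(fit)  # N x (K + 1) design matrix, including the intercept
e <- residuals(fit)
N <- nrow(X)
K <- ncol(X) - 1        # number of slope coefficients
A <- cbind(0, diag(K))  # restricts every slope to zero
b <- rep(0, K)
sigma2_hat <- sum(e^2) / (N - K - 1)  # estimated error variance
dist <- A %*% coef(fit) - b           # distance of A * beta-hat from b
wald <- t(dist) %*% solve(A %*% solve(t(X) %*% X) %*% t(A)) %*% dist
F_manual <- wald / (qr(A)$rank * sigma2_hat)
F_manual  # should print ~14.67, matching the summary() output above
pf(F_manual, df1 = qr(A)$rank, df2 = N - K - 1, lower.tail = FALSE)  # p-value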
Let \(\textbf{X}\) be a matrix of \(k\) variables. In the OLS framework, an interaction term \(X_{\ell}\) is a variable where \(X_{\ell} = X_{m} \times X_n\); that is, \(X_{\ell}\) is itself a product of two other variables (\(X_m\) and \(X_n\) are analogous to \(X_i\) and \(Z_i\) in Ted’s slides). The model in such a scenario is:
\[\begin{equation*} Y_i = \beta_0 + \beta_1 X_{i,m} + \beta_2 X_{i, n} + \beta_3 X_{i, \ell} + \epsilon_i \end{equation*}\]
Using an interaction term allows the slope of one variable to be a function of another, and allows us to measure the variation within a subgroup of our data. Usually, this works most effectively when one of \(X_m\) or \(X_n\) is a binary variable.
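Concretely, differentiating the model above with respect to \(X_{i,m}\) shows why the slope becomes a function of \(X_{i,n}\):
\[\begin{equation*} \frac{\partial Y_i}{\partial X_{i,m}} = \beta_1 + \beta_3 X_{i,n} \end{equation*}\]
So when \(X_{i,n}\) is binary, the slope on \(X_{i,m}\) is \(\beta_1\) for the \(X_{i,n} = 0\) group and \(\beta_1 + \beta_3\) for the \(X_{i,n} = 1\) group.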
Let’s revisit the teengamb data for an application. Let \(Y\) be gamble, and our two \(X\) variables be income and sex.
# Plotting the data without regard for `sex`
library(ggplot2)
ggplot(data = teengamb,
aes(x = income,
y = gamble))+
geom_point()+
geom_smooth(method = lm)+
theme_bw()+
xlab("Income")+
ylab("Gambling Propensity")+
theme(legend.background = element_rect(color = "grey80"),
legend.position = "right",
legend.key.size = unit(2, "cm"))+
labs(title = "Income and Gambling Propensity")
We see a pretty clear positive relationship: as income increases, so does propensity for gambling. However, this approach masks the role of gender (setting aside the normative implications of treating gender as exclusively binary).
# Now plotting with distinctions for sex
ggplot(data = teengamb,
aes(x = income,
y = gamble,
color = as.factor(sex)))+
geom_point()+
geom_smooth(method = lm)+
theme_bw()+
scale_color_manual(values = c("0" = "blue",
"1" = "orange"),
labels = c("Men",
"Women"))+
xlab("Income")+
ylab("Gambling Propensity")+
theme(legend.background = element_rect(color = "grey80"),
legend.position = "right",
legend.key.size = unit(2, "cm"))+
labs(title = "Income and Gambling Propensity by Sex",
color = "Sex")
In a regression framework, we can interact income and sex to recover these differences.
summary(lm(data = teengamb,
gamble ~ income + sex + (income * sex)))
##
## Call:
## lm(formula = gamble ~ income + sex + (income * sex), data = teengamb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.522 -4.860 -1.790 6.273 93.478
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.6596 6.3164 -0.421 0.67580
## income 6.5181 0.9881 6.597 4.95e-08 ***
## sex 5.7996 11.2003 0.518 0.60724
## income:sex -6.3432 2.1446 -2.958 0.00502 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.98 on 43 degrees of freedom
## Multiple R-squared: 0.5857, Adjusted R-squared: 0.5568
## F-statistic: 20.26 on 3 and 43 DF, p-value: 2.451e-08
Here, the slope of interest for women is plausibly zero, whereas for
men it is clearly positive. A note: be very careful about interpreting
the results of a regression table versus a plot in this setting! \(\hat{\beta}_1 = 6.52\) indicates that, among men (for whom sex \(= 0\)), a
unit increase in income is associated with a 6.52-unit increase in gambling
propensity. The plot reflects this. However, it is different for women
(an example of sexism in my lab composition practices, perhaps).
The plot suggests that, for women, the relationship between income
and gambling propensity is close to zero. However, the regression table's
\(\hat{\beta}_3\) should be interpreted as: among women, a unit increase in income is associated with a 6.34-unit
smaller increase in gambling propensity relative to men. In other
words, \(\beta_3\) captures how the slope on \(X_m\) (income) changes when \(X_n = 1\) (sex).
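A quick sketch of this in R (the object names are my own): the implied slope for women is the income coefficient plus the interaction coefficient, which is also what a separate regression on the female subsample recovers.
# Recovering the group-specific slopes implied by the interaction model
fit_int <- lm(data = teengamb,
              gamble ~ income + sex + (income * sex))
coef(fit_int)["income"]                               # slope for men (sex = 0): ~6.52
coef(fit_int)["income"] + coef(fit_int)["income:sex"] # slope for women (sex = 1): ~0.17
# Equivalent check: regress gamble on income within the female subsample
coef(lm(gamble ~ income, data = subset(teengamb, sex == 1)))["income"]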
A final caution: for interaction terms to work, we need enough common support (variation in \(X_m\) throughout the range of possible values of \(X_n\)). Referring to the plot, there is very clearly variation in gambling propensity for men, whereas the variation for women is virtually nonexistent. A lesson: visualizing relationships while conditioning on variables is always a good complement to any regression exercise.
Remember that, in nearly any setting, we only get one attempt at estimating \(\beta\) and other parameters of interest. That is, we only get to conduct a single draw from our population of interest, and must use that draw to make any and all inferences. The bootstrap is a statistical technique which allows us to exploit variation within our single draw: since we cannot resample from the population, we instead resample from the sample. The procedure, in general, is: 1) draw \(N\) observations from our sample of size \(N\), with replacement; 2) estimate the statistic(s) of interest on that resampled data; and 3) store the results and repeat.
If we do this many times, we can approximate the act of
sampling from the full population, which allows us to estimate the
sampling distribution of any statistic.
An application using teengamb:
# Setting seed for replicability
set.seed(12345)
### Bootstrap parameters of interest
# We will sample N times per iteration
N <- nrow(teengamb)
# And iterate M times
M <- 10000
### "True" model
# What happens if we run this through lm()?
my_model <- lm(data = teengamb,
gamble ~ income + sex + status)
summary(my_model)
##
## Call:
## lm(formula = gamble ~ income + sex + status, data = teengamb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.682 -12.169 -0.268 9.161 97.728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.0315 15.8676 0.821 0.41603
## income 4.9280 1.0352 4.760 2.21e-05 ***
## sex -24.3393 8.1274 -2.995 0.00454 **
## status -0.1496 0.2413 -0.620 0.53856
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.92 on 43 degrees of freedom
## Multiple R-squared: 0.5058, Adjusted R-squared: 0.4713
## F-statistic: 14.67 on 3 and 43 DF, p-value: 1.014e-06
# We're going to bind these results later
model_betas <- my_model$coefficients
names(model_betas) <- c("Intercept", "Income", "Sex", "Status")
model_errors <- sqrt(diag(vcov(my_model)))
names(model_errors) <- c("Intercept", "Income", "Sex", "Status")
### Now Bootstrap
# There are a bajillion different ways to bootstrap, including many
# canned functions in various R packages, but we'll do so manually
# using a for-loop. We do it this way to illustrate exactly what
# it is we are doing.
# Creating empty matrices in which to store coefficient estimates
# Since we are capturing four parameters over M iterations,
# each of these will be M x 4
boot_betas <- data.frame(matrix(ncol = 4,
nrow = M))
names(boot_betas) <- c("Intercept", "Income", "Sex", "Status")
boot_errors <- data.frame(matrix(ncol = 4,
nrow = M))
names(boot_errors) <- c("Intercept", "Income", "Sex", "Status")
# Constructing the for-loop
for (i in 1:M) { # We wish to iterate M times
# print(i) # Printing the iteration number is good practice when
# running loops, but it'll explode our document here,
# so we comment it out
# Here, we create an index of rows from which to sample
rows <- sample(x = N, # the list of rows from which we will sample
size = N, # the number of times we will sample
replace = TRUE) # and replace each draw each time
# Now we actually sample the observations from our data
# for the i-th iteration of the loop
boot_data <- as.data.frame(teengamb[rows, ])
# Estimating the linear model for the i-th iteration
results <- summary(lm(data = boot_data,
gamble ~ income + sex + status))
# Recovering coefficient estimates for the i-th iteration
# and storing them in the global environment
# There are many ways to do this: be creative! But make sure
# your loop actually does what you think it is doing
boot_betas[i,] <- as.data.frame(t(results$coefficients[,1]))
# Same thing for the standard errors
boot_errors[i,] <- as.data.frame(t(results$coefficients[,2]))
}
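### Aside (a sketch): the same resampling can be done with the boot package
### (assuming it is installed); boot() handles the index bookkeeping for us.
### We keep the manual loop above as our main approach; this is only a
### point of comparison.
library(boot)
boot_fn <- function(data, idx) {
  # refit the model on the resampled rows and return the coefficients
  coef(lm(gamble ~ income + sex + status, data = data[idx, ]))
}
boot_out <- boot(data = teengamb, statistic = boot_fn, R = M)
# printing boot_out reports the original estimates, the bootstrap bias,
# and the bootstrap standard errors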
### Having collected our bootstrapped data, let's preview some results
# The empirical sampling distribution of each coefficient:
ggplot(data = boot_betas,
aes(x = Intercept))+
geom_density()+
geom_vline(aes(xintercept = mean(Intercept)),
color = "blue")+
annotate("text",
x = mean(boot_betas$Intercept),
hjust = - 0.25,
y = 0.025,
label = paste("Mean =", round(mean(boot_betas$Intercept), digits = 2)),
color = "blue")+
theme_bw()+
xlab("Coefficients: Intercept")+
ylab("Density")+
labs(title = "Empirical Distribution of Coefficients: Intercept")
# For Income
ggplot(data = boot_betas,
aes(x = Income))+
geom_density()+
geom_vline(aes(xintercept = mean(Income)),
color = "blue")+
annotate("text",
x = mean(boot_betas$Income),
hjust = - 0.25,
y = 0.025,
label = paste("Mean =", round(mean(boot_betas$Income), digits = 2)),
color = "blue")+
theme_bw()+
xlab("Coefficients: Income")+
ylab("Density")+
labs(title = "Empirical Distribution of Coefficients: Income")
# For Sex
ggplot(data = boot_betas,
aes(x = Sex))+
geom_density()+
geom_vline(aes(xintercept = mean(Sex)),
color = "blue")+
annotate("text",
x = mean(boot_betas$Sex),
hjust = - 0.25,
y = 0.025,
label = paste("Mean =", round(mean(boot_betas$Sex), digits = 2)),
color = "blue")+
theme_bw()+
xlab("Coefficients: Sex")+
ylab("Density")+
labs(title = "Empirical Distribution of Coefficients: Sex")
# For Status
ggplot(data = boot_betas,
aes(x = Status))+
geom_density()+
geom_vline(aes(xintercept = mean(Status)),
color = "blue")+
annotate("text",
x = mean(boot_betas$Status),
hjust = - 0.25,
y = 0.025,
label = paste("Mean =", round(mean(boot_betas$Status), digits = 2)),
color = "blue")+
theme_bw()+
xlab("Coefficient: Status")+
ylab("Density")+
labs(title = "Empirical Distribution of Coefficients: Status")
# We can construct similar plots for the standard errors
# Now to actually report results:
# recovering the averages from the bootstrapped samples
results_boot_betas <- apply(boot_betas,
2,
FUN = mean)
# and averaging the bootstrapped standard errors
results_boot_errors <- apply(boot_errors,
2,
FUN = mean)
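# Aside (a sketch): another common bootstrap standard error is simply the
# standard deviation of the bootstrapped coefficients themselves, e.g.
# apply(boot_betas, 2, FUN = sd)
# Here we stick with the averaged per-iteration standard errors above.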
# Bias correction: the bootstrap estimate of the bias in beta-hat is
# (mean of bootstrapped betas) - beta-hat, so the bias-corrected estimate
# is beta-hat - bias = 2 * beta-hat - (mean of bootstrapped betas)
boot_betas_corrected <- (2 * model_betas) - results_boot_betas
boot_ci <- rbind(
# Intercept
cbind(boot_betas_corrected[1] - 1.96 * (results_boot_errors[1]) / sqrt(M),
boot_betas_corrected[1] + 1.96 * (results_boot_errors[1]) / sqrt(M)),
# Income
cbind(boot_betas_corrected[2] - 1.96 * (results_boot_errors[2]) / sqrt(M),
boot_betas_corrected[2] + 1.96 * (results_boot_errors[2]) / sqrt(M)),
# Sex
cbind(boot_betas_corrected[3] - 1.96 * (results_boot_errors[3]) / sqrt(M),
boot_betas_corrected[3] + 1.96 * (results_boot_errors[3]) / sqrt(M)),
# Status
cbind(boot_betas_corrected[4] - 1.96 * (results_boot_errors[4]) / sqrt(M),
boot_betas_corrected[4] + 1.96 * (results_boot_errors[4]) / sqrt(M))
)
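# Aside (a sketch): a common alternative is the percentile bootstrap interval,
# taken directly from the empirical distribution of the bootstrapped betas, e.g.
# quantile(boot_betas$Income, probs = c(0.025, 0.975))
# We report the normal-approximation intervals constructed above instead.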
# Reporting Results
results_boot <- cbind(model_betas,
boot_betas_corrected,
model_errors,
results_boot_errors,
boot_ci)
colnames(results_boot) <- c("Model Betas",
"Bootstrap Betas",
"Model Errors",
"Bootstrap Errors",
"Bootstrap Lower Bound",
"Bootstrap Upper Bound")
results_boot
## Model Betas Bootstrap Betas Model Errors Bootstrap Errors
## Intercept 13.0314758 14.0404214 15.8676107 15.319307
## Income 4.9279770 4.9632841 1.0352454 1.012283
## Sex -24.3393433 -24.8959198 8.1274125 7.753255
## Status -0.1495854 -0.1658591 0.2412857 0.232462
## Bootstrap Lower Bound Bootstrap Upper Bound
## Intercept 13.7401629 14.3406798
## Income 4.9536675 4.9729008
## Sex -25.0478836 -24.7439560
## Status -0.1704153 -0.1613028
So what we’ve done here is recover estimates that are admittedly
biased (mostly the intercept), but have greater certainty around their
point estimates. We still need to be careful when making claims about
the population characteristics, but we have a very good overall picture
of the sampling distributions of the coefficients estimated from teengamb.