Model-Based Inference

Let \(X\) be a random variable. Here are a bunch of definitions for model-based inference we will use:
- \(X_i\) are indexed realizations of \(X\) for \(i = 1, \dots, n\), where \(n\) is the total number of observations.
- \(\mu\) is the population mean
- \(\sigma^2\) is the population variance
- \(\hat{\mu}\) or \(\bar{X}\) is the sample mean
- \(\hat{\sigma}^2\) is the sample variance
- The standard error, or estimated standard deviation of the sampling distribution, is \(\frac{\hat{\sigma}}{\sqrt{n}}\)

Assume that \(X_i\) are independent and identically-distributed (often abbreviated as “i.i.d.”). Then,
- \(\hat{\mu}= \frac{1}{n} \displaystyle \sum_{i=1}^{n}X_i\)
- \(\mathbb{E}[X_i] = \mu\)
- \(\mathbb{V}(X_i) = \sigma^2\)
- \(\mathbb{E}[\hat{\mu}] = \mathbb{E}\left[\frac{1}{n} \displaystyle \sum_{i=1}^{n}X_i \right] = \frac{1}{n} \cdot n \cdot \mathbb{E}[X_i] = \mu\)
- \(\mathbb{V}(\hat{\mu}) = \mathbb{V}\left(\frac{1}{n} \displaystyle \sum_{i=1}^{n}X_i \right) = \left(\frac{1}{n}\right)^2 \cdot n \cdot \mathbb{V}(X_i) = \frac{\sigma^2}{n}\)
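
As a quick sanity check on these last two results, here is a minimal simulation sketch in R (the population parameters, sample size, and number of replications below are arbitrary choices of ours, not from the notes):

# Simulation check of E[mu hat] = mu and V(mu hat) = sigma^2 / n
set.seed(42)
mu <- 5     # population mean (arbitrary)
sigma <- 2  # population standard deviation (arbitrary)
n <- 25     # sample size (arbitrary)

# Draw 10,000 samples of size n and compute each sample's mean
mu_hats <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

mean(mu_hats) # should be close to mu = 5
var(mu_hats)  # should be close to sigma^2 / n = 4 / 25 = 0.16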


Unbiasedness

Definition: Unbiased
Let \(\hat{\theta}\) be an estimator for some estimand \(\theta\). We say that \(\hat{\theta}\) is unbiased if \(\mathbb{E}[\hat{\theta}] - \theta = 0\).

Intuitively, what does this mean? It means that, averaged over repeated samples, \(\hat{\theta}\) neither systematically overshoots nor undershoots \(\theta\): it is correct on average.

Exercise: Deriving the Sample Mean
Let \(X\) be a random variable. Let \(\mu \equiv \mathbb{E}[X_i]\) be the population mean of \(X\) and \(\hat{\mu} = \frac{1}{n}\displaystyle \sum_{i = 1}^{n} X_i\) be the sample mean. Show that \(\hat{\mu}\) is an unbiased estimator of \(\mu\).

Solution: An estimator is unbiased if \(\mathbb{E}[\hat{\theta}] - \theta = 0\). In this case, we wish to show \(\mathbb{E}[\hat{\mu}] - \mu = 0\). Using the definitions of each and the linearity of expectation, we calculate:

\[\begin{equation*} \begin{aligned} \mathbb{E}[\hat{\mu}] - \mu & = \mathbb{E}\left[\frac{1}{n}\displaystyle \sum_{i = 1}^{n} X_i \right] - \mathbb{E}[X_i]\\ & = \frac{1}{n} \cdot \displaystyle \sum_{i=1}^{n} \mathbb{E}[X_i] - \mathbb{E}[X_i]\\ & = \frac{n}{n} \cdot \mathbb{E}[X_i] - \mathbb{E}[X_i]\\ & = 0 \end{aligned} \end{equation*}\]

Note: any proof that \(\hat{\theta}\) is an unbiased estimator of \(\theta\) in this course will follow a similar logic.
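
For contrast, here is a small simulation sketch of ours showing an estimator that is biased: estimating \(\sigma^2\) by dividing by \(n\) rather than \(n - 1\).

# The "divide by n" variance estimator is biased; "divide by n - 1" is not.
set.seed(1)
n <- 10
sigma2 <- 4 # true variance (arbitrary choice)

biased   <- replicate(10000, {x <- rnorm(n, sd = 2); sum((x - mean(x))^2) / n})
unbiased <- replicate(10000, {x <- rnorm(n, sd = 2); sum((x - mean(x))^2) / (n - 1)})

mean(biased) - sigma2   # close to -sigma2 / n = -0.4: biased downward
mean(unbiased) - sigma2 # close to 0: unbiased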


Asymptotic Properties of the Sample Mean

Definition: Weak Law of Large Numbers

If \(\{X_i\}_{i = 1}^{n}\) is a sequence of i.i.d. random variables with mean \(\mu\) and finite variance \(\sigma^2\), then:

\[\begin{equation*} \bar{X}_n \overset{p} \to \mu \end{equation*}\]

or \(\bar{X}_n\) converges in probability to \(\mu\).
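
One way to see the WLLN at work is to track the running sample mean of a single long i.i.d. sequence. The sketch below is our own illustration, using an exponential population with \(\mu = 1\):

set.seed(7)
x <- rexp(5000, rate = 1) # i.i.d. draws with population mean mu = 1

# The running mean after each new observation
running_mean <- cumsum(x) / seq_along(x)

plot(running_mean, type = "l",
     main = "Running Sample Mean (WLLN)",
     xlab = "n",
     ylab = "Sample mean")
abline(h = 1, col = "red", lty = "dashed") # the true mean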

Definition: Central Limit Theorem

If \(\{X_i\}_{i = 1}^{n}\) is a sequence of i.i.d. random variables with mean \(\mu\) and finite variance \(\sigma^2\), then:

\[\begin{equation*} \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \overset{d} \to \mathcal{N}(0,1) \end{equation*}\]

or the standardized sample mean converges in distribution to a standard normal.

Implications
WLLN implies consistency:

\[\begin{equation*} \hat{\mu}= \bar{X}_n \overset{p} \to \mu \end{equation*}\]

CLT implies asymptotic normality:

\[\begin{equation*} \begin{aligned} \sqrt{n} \cdot (\hat{\mu}- \mu) & \overset{d} \to \mathcal{N}(0, \sigma^2)\\ \implies \hat{\mu}& \overset{\text{approx}} \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) \end{aligned} \end{equation*}\]
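
To visualize the CLT and asymptotic normality, the sketch below (our own illustration) standardizes many sample means from a skewed exponential population and compares their histogram to the standard normal density:

set.seed(7)
n <- 50 # sample size (arbitrary)

# Exponential(1) has mu = 1 and sigma = 1, and is far from normal itself
z <- replicate(10000, (mean(rexp(n)) - 1) / (1 / sqrt(n)))

hist(z, breaks = 50, freq = FALSE,
     main = "Standardized Sample Means vs. N(0, 1)",
     xlab = "z")
curve(dnorm(x), add = TRUE, col = "red") # standard normal density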

Definition: Standard Error
The standard error is the estimated standard deviation of the sampling distribution. It is given by \[\begin{equation*} \text{s.e. of } \hat{\mu}= \frac{\hat{\sigma}}{\sqrt{n}} \end{equation*}\] where \(\hat{\sigma}^2\) is the sample variance, which is unbiased and consistent for \(\sigma^2\) (consistency follows from the WLLN).
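
In R, the standard error of the sample mean is one line, since sd() uses the \(n - 1\) denominator. A quick sketch with simulated data (our own example):

set.seed(3)
x <- rnorm(100, mean = 10, sd = 2) # an arbitrary i.i.d. sample

se_mu_hat <- sd(x) / sqrt(length(x)) # hat sigma / sqrt(n)
se_mu_hat                            # should be near 2 / sqrt(100) = 0.2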

Asymptotic Confidence Intervals
Using the property:

\[\begin{equation*} \frac{\hat{\mu}- \mu}{\hat{\sigma}/\sqrt{n}} \overset{d} \to \mathcal{N}(0,1) \end{equation*}\]

(where \(\hat{\sigma}\) replaces \(\sigma\), which is justified by Slutsky's theorem since \(\hat{\sigma}\) is consistent for \(\sigma\)), we can derive the 95% asymptotic confidence interval:

\[\begin{equation*} \begin{aligned} \mathbb{P}\left(- 1.96 \leq \frac{\hat{\mu}- \mu}{\hat{\sigma}/\sqrt{n}} \leq 1.96\right) & \to 0.95\\ \implies \mathbb{P}\left(\hat{\mu}- 1.96 \times \hat{\sigma}/ \sqrt{n} \leq \mu \leq \hat{\mu}+ 1.96 \times \hat{\sigma}/ \sqrt{n}\right) & \to 0.95 \end{aligned} \end{equation*}\]

Generally, for some value \(\alpha \in [0,1]\), we can express the \((1 - \alpha) \times 100\%\) asymptotic confidence interval as:

\[\begin{equation*} \text{CI}_{1-\alpha} = \left[\hat{\mu}- z_{\frac{\alpha}{2}} \times \text{s.e.},\, \hat{\mu}+ z_{\frac{\alpha}{2}} \times \text{s.e.}\right] \end{equation*}\]

where \(z_{\frac{\alpha}{2}}\) is the \(\left(1 - \frac{\alpha}{2}\right)\) quantile of the standard normal distribution (e.g., \(z_{0.025} \approx 1.96\)).
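
As a sketch, this formula can be wrapped in a small helper function (ci_mean is our own hypothetical name, not a built-in R function):

# Hypothetical helper: asymptotic (1 - alpha) confidence interval for a mean
ci_mean <- function(x, alpha = 0.05) {
  n  <- length(x)
  se <- sd(x) / sqrt(n)      # standard error of the sample mean
  z  <- qnorm(1 - alpha / 2) # z_{alpha/2}; about 1.96 when alpha = 0.05
  mean(x) + c(-1, 1) * z * se
}

set.seed(3)
ci_mean(rnorm(100, mean = 10, sd = 2)) # should cover 10 roughly 95% of the time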


A Half-Baked Example
I found this recipe for a quiche which calls for some proportion of spinach. We want to figure out how much spinach is in the quiche, but we don't get to eat the entire thing; we only get the piece on our plate. However, we can use the asymptotic properties of the sample mean to make an informed estimate of just how much spinach we have. Assume that our quiche is infinitely big and that it was properly stirred prior to baking (i.e., the spinach is "i.i.d.").

Suppose that our piece takes 10 bites to consume and 3 of those bites contained spinach. We might say that our estimate, \(\hat{\mu}\) (or \(\hat{p}\)), is \[\begin{equation*} \hat{\mu}= \hat{p} = \frac{1}{10} \displaystyle \sum_{i = 1}^{10} X_i = 0.3 \end{equation*}\]

The variance of our estimate is constructed using \(\frac{\hat{\sigma}^2}{n}\). Because each bite is fundamentally a Bernoulli trial, we can plug in the Bernoulli variance \(\hat{\sigma}^2 = \hat{p}(1 - \hat{p})\): \[\begin{equation*} \hat{\mathbb{V}}(\hat{p}) = \frac{\hat{\sigma}^2}{n} = \frac{\hat{p}(1 - \hat{p})}{n} = \frac{0.3(1 - 0.3)}{10} = 0.021 \end{equation*}\]

Because the standard error is the square root of the estimated variance, we have \(\text{s.e.} = \sqrt{0.021} \approx 0.145\). Then, constructing our 95% asymptotic confidence interval, we have:

\[\begin{equation*} 95\%\text{ CI} = \left[0.3 \pm 1.96 \times 0.145 \right] \approx [0.016, 0.584] \end{equation*}\]

What does this mean? It means that, over repeated samples (i.e., repeated pieces of this infinite quiche), 95% of the confidence intervals constructed this way will contain the true proportion of bites with spinach. The interval is what varies from sample to sample; the true proportion is fixed.
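
The spinach numbers are easy to reproduce in R. In the sketch below, the 0/1 vector is one arbitrary arrangement of 3 spinach bites out of 10:

bites <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0) # 1 = spinach, 0 = no spinach
n <- length(bites)

p_hat <- mean(bites)                # 0.3
se <- sqrt(p_hat * (1 - p_hat) / n) # about 0.145

p_hat + c(-1, 1) * 1.96 * se        # about [0.016, 0.584]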


Working with a Data Set in R

We will do some exercises using the mtcars dataset in R.

# mtcars is a part of base R, meaning we do not need any packages to use it.

# Loading the data. We'll assign the object "mtcars" to the name "data" using `<-`
data <- mtcars

# since mtcars is in base R, we can also just call it anytime using `mtcars`

# What are the objects in the dataset? Using help(mtcars), we can get information.
# In this case, each row is a different model of car.

# What variables are in this dataset? We can investigate using colnames()
colnames(data)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
# We can call any variables in our data using the "$" character. We do so by typing
# data$variable. If we want miles per gallon, or "mpg", we use "data$mpg"
data$mpg
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4

Suppose we care about three things: the miles per gallon of each car, their weights, and the relationship between the two.

# The variables we care about in this case are "mpg" and "wt".
# "mpg" is how many miles per gallon of fuel the car gets on average
# "wt" is the weight of the vehicle in thousands of pounds

To find the sample means of mpg and wt, we can use \(\hat{\mu}= \frac{1}{n} \displaystyle \sum_{i=1}^{n}X_i\):

# What is n? We can use nrow() to calculate it.
nrow(data)
## [1] 32
# We can store this value using "<-". That way, any time we need to do math with n,
# we can call that value automatically.
n <- nrow(data)

# Let's find the mean of "mpg". We'll call it "mu_mpg". Use our formula
# for mu hat.
mu_mpg <- (1 / n) * sum(data$mpg)
mu_mpg
## [1] 20.09062
# Similarly, we can find the mean of "wt".
mu_wt <- (1 / n) * sum(data$wt)
mu_wt
## [1] 3.21725
# Of course, R can calculate means automatically. Use mean().
mean(data$mpg)
## [1] 20.09062
mean(data$wt)
## [1] 3.21725
# We can visualize these using hist()
hist(data$mpg,
     
     # breaks suggests the number of bins for the histogram. Here, we want roughly
     # one bin per observation, so we set breaks equal to n
     breaks = n,
     main = "Histogram of Miles per Gallon",
     xlab = "Miles per Gallon")

# abline() places a line on top of our plot. Here, we're putting a vertical line
# indicating where the mean is.
abline(v = mu_mpg,
       col = "red",
       lty = "dashed")

To find the sample variances, we can use \(\hat{\sigma}^2 = \frac{1}{n - 1} \cdot \displaystyle \sum_{i = 1}^{n} (X_i - \hat{\mu})^2\):

# Let's find the variance of "mpg". We'll call it "var_mpg" (it is a variance,
# not a standard deviation).
var_mpg <- (1 / (n - 1)) * sum((data$mpg - mu_mpg)^2)
var_mpg
## [1] 36.3241
# And for weight
var_wt <- (1 / (n - 1)) * sum((data$wt - mu_wt)^2)
var_wt
## [1] 0.957379
# Of course, R can do this automatically. Use var().
var(data$mpg)
## [1] 36.3241
var(data$wt)
## [1] 0.957379
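
Tying this back to the section on confidence intervals, here is a quick sketch of the standard error and 95% asymptotic confidence interval for the population mean of mpg, using the quantities we just computed:

# Standard error of the sample mean of mpg: sqrt(variance / n)
se_mpg <- sqrt(var_mpg / n)
se_mpg

# 95% asymptotic confidence interval: mu hat plus/minus 1.96 standard errors
mu_mpg + c(-1, 1) * 1.96 * se_mpg # approximately [18.0, 22.2]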

We can also report these values aggregated by some group. In base R, we can do this with the aggregate() function:

# Let's group the weights by number of forward gears
# This is done with `aggregate()`

aggregate(x = data$wt, # The data we want to aggregate
          
          by = list(data$gear), # The data we're grouping by
          
          FUN = mean # the function we're applying
          )
##   Group.1        x
## 1       3 3.892600
## 2       4 2.616667
## 3       5 2.632600
# Does it make sense that cars with more forward gears seem to weigh less?
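
The same grouped summary can also be produced with base R's tapply(), shown here as an equivalent sketch:

# tapply() applies a function (here, mean) to data$wt within each level of data$gear
tapply(data$wt, data$gear, mean)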

Suppose we want to investigate the relationship between mpg and wt. We can use plot() in base R to visualize that relationship. We can also group the data in our plot.

plot(x = data$mpg, # This specifies what our x-axis is
     y = data$wt,  # and this specifies our y-axis
     
     col = factor(data$gear), # coloring the points by number of forward gears
     
     pch = 19, # solid points, to match the legend below
     
     # These just label the plot title and axes.
     main = "Miles per Gallon vs Weight",
     xlab = "Miles per Gallon",
     ylab = "Weight (1,000 lbs)")

# This creates a legend for our points
legend("topright", # where on the plot is the legend?
       
       legend = levels(factor(data$gear)), # What are we grouping the legend by?
       
       pch = 19, # what symbol do we use?
       
       col = factor(levels(factor(data$gear))), # mapping the colors to our legend
       
       title = "Number of Gears" # giving the legend a title
       )