Normal Distribution

Definition: a continuous random variable \(Z\) follows the standard normal distribution, \(Z \sim \mathcal{N}(0,1)\), if it has the following PDF: \[\begin{equation*} \phi(z) = \displaystyle \frac{1}{\sqrt{2 \pi}} \exp \left[\frac{-z^2}{2} \right] \end{equation*}\]
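
As a quick sanity check, we can code this formula by hand and compare it against R's built-in dnorm(), which implements the same density:

# Hand-coded standard normal PDF vs. R's built-in dnorm()
phi <- function(z) (1 / sqrt(2 * pi)) * exp(-z^2 / 2)
phi(0)
## [1] 0.3989423
dnorm(0)
## [1] 0.3989423
phi(1)
## [1] 0.2419707
dnorm(1)
## [1] 0.2419707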

Likewise, the CDF of a standard normal random variable is given by: \[\begin{equation*} \Phi(z) = \displaystyle \int_{-\infty}^{z} \frac{1}{\sqrt{2 \pi}} \exp \left[\frac{-t^2}{2} \right]\,dt \end{equation*}\]
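
This integral has no closed form, so R evaluates it numerically. As a sketch, we can reproduce pnorm() with base R's integrate():

# Integrating the PDF from -Inf to z should recover Phi(z) = pnorm(z)
integrate(dnorm, lower = -Inf, upper = 1)$value # numerical integral
pnorm(1)                                        # R's built-in CDF; both are ~0.8413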

The normal distribution is symmetric around its center, which gives us some useful properties. Let \(Z \sim \mathcal{N}(0,1)\):
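
We can verify the two symmetry identities, \(\phi(z) = \phi(-z)\) and \(\Phi(-z) = 1 - \Phi(z)\), numerically before plotting them:

# Symmetry of the PDF: phi(z) = phi(-z)
dnorm(1) == dnorm(-1)
## [1] TRUE
# Symmetry of the CDF: Phi(-z) = 1 - Phi(z), up to floating-point error
all.equal(pnorm(-1), 1 - pnorm(1))
## [1] TRUE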

# I'm just plotting z from -4 to 4. Remember that it actually traverses the real line.
# Use z to shift the bounds of the plot.
# This is really important! If you're looking at something centered around
# another number, you need to adjust the field of view accordingly!
z <- c(-4,4)

# I'm using the ggplot2 library to make some nicer visuals than base R.
# If you've never used ggplot2, you can install it by entering
# `install.packages("ggplot2")` into the console.
# You only need to install a package once, but need to load that library
# every new session of R. You do that by entering `library(*)` into the console.
library(ggplot2)
# We're passing `z` into ggplot() as a data.frame object. Don't worry about this too much
ggplot(data = data.frame(z), aes(z)) +
  
  # This inputs the standard normal distribution with mean 0 and standard deviation 1
  stat_function(fun = dnorm, n = 101, args = list(mean = 0, sd = 1)) +
  
  # Suppose z = 1. Then -z = -1. Then phi(z) = phi(1) = phi(-z) = phi(-1)
  # This code makes the point and dotted lines for phi(1)
  geom_point(aes(x = 1, y = dnorm(1)), color = "red")+
  geom_segment(aes(x = 1, xend = 1, y = 0, yend = dnorm(1)),
               color = "red",
               linetype = "dashed")+
  annotate("text", x = 1, y = dnorm(1), hjust = -0.25, label = "phi(1)", color = "red")+
  
  # and this does the same for phi(-1)
  geom_point(aes(x = -1, y = dnorm(-1)), color = "red")+
  geom_segment(aes(x = -1, xend = -1, y = 0, yend = dnorm(-1)),
               color = "red",
               linetype = "dashed")+
  annotate("text", x = -1, y = dnorm(-1), hjust = 1.25, label = "phi(-1)", color = "red")+
  
  # This horizontal line shows that the two are equal
  geom_segment(aes(x = -1, xend = 1, y = dnorm(-1), yend = dnorm(1)),
               color = "red",
               linetype = "dashed")+
  
  # This just makes it nicer to look at
  ylab("") +
  scale_y_continuous(breaks = NULL)+
  theme_bw()

# And this works for any value of z! For instance, z = +/-0.5 and z = +/-2:
# z = +/-0.5
ggplot(data = data.frame(z), aes(z)) +
  stat_function(fun = dnorm, n = 101, args = list(mean = 0, sd = 1)) + ylab("") +
  geom_point(aes(x = 0.5, y = dnorm(0.5)), color = "blue")+
  geom_segment(aes(x = 0.5, xend = 0.5, y = 0, yend = dnorm(0.5)),
               color = "blue",
               linetype = "dashed")+
  annotate("text", x = 0.5, y = dnorm(0.5), hjust = -0.25, label = "phi(0.5)", color = "blue")+
  geom_point(aes(x = -0.5, y = dnorm(-0.5)), color = "blue")+
  geom_segment(aes(x = -0.5, xend = -0.5, y = 0, yend = dnorm(-0.5)),
               color = "blue",
               linetype = "dashed")+
  annotate("text", x = -0.5, y = dnorm(-0.5), hjust = 1.25, label = "phi(-0.5)", color = "blue")+
  geom_segment(aes(x = -0.5, xend = 0.5, y = dnorm(-0.5), yend = dnorm(0.5)),
               color = "blue",
               linetype = "dashed")+
  scale_y_continuous(breaks = NULL)+
  theme_bw()

# and z = +/-2
ggplot(data = data.frame(z), aes(z)) +
  stat_function(fun = dnorm, n = 101, args = list(mean = 0, sd = 1)) + ylab("") +
  geom_point(aes(x = 2, y = dnorm(2)), color = "blue")+
  geom_segment(aes(x = 2, xend = 2, y = 0, yend = dnorm(2)),
               color = "blue",
               linetype = "dashed")+
  annotate("text", x = 2, y = dnorm(2), hjust = -0.25, label = "phi(2)", color = "blue")+
  geom_point(aes(x = -2, y = dnorm(-2)), color = "blue")+
  geom_segment(aes(x = -2, xend = -2, y = 0, yend = dnorm(-2)),
               color = "blue",
               linetype = "dashed")+
  annotate("text", x = -2, y = dnorm(-2), hjust = 1.25, label = "phi(-2)", color = "blue")+
  geom_segment(aes(x = -2, xend = 2, y = dnorm(-2), yend = dnorm(2)),
               color = "blue",
               linetype = "dashed")+
  scale_y_continuous(breaks = NULL)+
  theme_bw()


# Let's use the standard normal again
# Suppose Z = 0
ggplot(data = data.frame(z), aes(z))+
  
  # This just plots the normal PDF. You can change the mean and sd later
  stat_function(fun = dnorm, n = 101, args = list(mean = 0, sd = 1))+
  
  # This specifies area under the curve visually. Remember that this goes from -infinity to Z
  # `fill` just specifies color and `alpha` is transparency
  stat_function(fun = dnorm,
                xlim = c(-4, 0), # Remember, we want the area from the lower bound to Z = 0
                geom = "area",
                fill = "red",
                alpha = 0.5)+
  
  # And this is the right partition from Z to +infinity
  stat_function(fun = dnorm,
                xlim = c(0,4),
                geom = "area",
                fill = "blue",
                alpha = 0.5)+
  
  # And this is just to look nice
  scale_y_continuous(breaks = NULL)+
  theme_bw()

# Suppose, then, that Z = 1.
# The area up to Z = 1 is shaded red, and the area up to Z = -1 is overlaid
# in blue (so the overlap reads as purple)
ggplot(data = data.frame(z), aes(z))+
  stat_function(fun = dnorm)+
  stat_function(fun = dnorm,
                xlim = c(-4, 1),
                geom = "area",
                fill = "red",
                alpha = 0.5)+
  stat_function(fun = dnorm,
                xlim = c(-4, -1),
                geom = "area",
                fill = "blue",
                alpha = 0.5)+
  scale_y_continuous(breaks = NULL)+
  theme_bw()

# If we instead add the area below Z = 1 and the area above Z = 1,
# we get the whole sample space with area = 1
ggplot(data = data.frame(z), aes(z))+
  stat_function(fun = dnorm)+
  stat_function(fun = dnorm,
                xlim = c(-4, 1),
                geom = "area",
                fill = "red",
                alpha = 0.5)+
  stat_function(fun = dnorm,
                xlim = c(1, 4),
                geom = "area",
                fill = "blue",
                alpha = 0.5)+
  scale_y_continuous(breaks = NULL)+
  theme_bw()


Remember, if \(Z \sim \mathcal{N}(0,1)\) and \(X = \mu + \sigma Z\), then \(X \sim \mathcal{N}(\mu, \sigma^2)\); conversely, we can standardize any normal random variable by subtracting its mean and dividing by its standard deviation. A consequence is that we can derive information about any normal random variable in terms of the standard normal. To find the CDF of \(X \sim \mathcal{N}(\mu, \sigma^2)\):

  1. Note that \(X = \mu + \sigma Z \implies Z = \frac{X - \mu}{\sigma}\).
  2. Using the definition of CDF, we have: \[\begin{equation*} \begin{aligned} F_{X}(x) & = \mathbb{P}(X \leq x)\\ & = \mathbb{P}(\mu + \sigma Z \leq x)\\ & = \mathbb{P}(Z \leq \frac{x - \mu}{\sigma})\\ & = \Phi\left(\frac{x - \mu}{\sigma}\right) \end{aligned} \end{equation*}\]

Similarly, we can use this result to find the PDF: \[\begin{equation*} \begin{aligned} f_{X}(x) & = \frac{d}{dx}F_{X}(x)\\ & = \frac{d}{dx}\Phi\left(\frac{x - \mu}{\sigma}\right)\\ & = \frac{1}{\sigma}\frac{1}{\sqrt{2\pi}} \exp \left[-\frac{(x - \mu)^2}{2\sigma^2} \right]\\ \implies \mathbb{P}(a \leq X \leq b) & = \Phi\left(\frac{b - \mu}{\sigma}\right) - \Phi\left(\frac{a - \mu}{\sigma}\right) \end{aligned} \end{equation*}\]
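
We can confirm the CDF identity numerically. In this sketch, \(\mu = 2\) and \(\sigma = 3\) are arbitrary choices:

mu <- 2; sigma <- 3; x <- 1.3    # arbitrary values for illustration
pnorm(x, mean = mu, sd = sigma)  # F_X(x) directly
pnorm((x - mu) / sigma)          # Phi((x - mu)/sigma); both lines print the same probability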


Exercise: Using Properties of the Normal Distribution
Let \(X \sim \mathcal{N}(-5,4)\):
a) Find \(\mathbb{P}(X \leq 0)\)
b) Find \(\mathbb{P}(-7 \leq X \leq -3)\)

Solution:
a) Since \(\mu = -5\) and \(\sigma^2 = 4\), we can determine \(\mathbb{P}(X \leq 0)\) using these properties: \[\begin{equation*} \begin{aligned} \mathbb{P}(X \leq 0) & = F_{X}(x = 0)\\ & = \Phi\left(\frac{0 - \mu}{\sigma}\right)\\ & = \Phi\left(\frac{0 - (-5)}{\sqrt{4}}\right)\\ & = \Phi\left(\frac{5}{2}\right) \end{aligned} \end{equation*}\]

Then, we can use the pnorm() function in R to evaluate that quantity.

pnorm(q = 5/2, mean = 0, sd = 1) # why are we using this mean and standard deviation?
## [1] 0.9937903

And we can double-check by passing the mean and standard deviation of \(X\) to pnorm() directly:

pnorm(q = 0, mean = -5, sd = sqrt(4))
## [1] 0.9937903

If we were to plot the area under the curve, it would look something like this:

z <- c(-13, 3) # Shifting the plotting bounds to center on mu = -5

ggplot(data = data.frame(z), aes(z))+
  
  # Just drawing the black PDF curve
  stat_function(fun = dnorm,
                n = 101,
                args = list(mean = -5, sd = 2))+
  
  # Now the area under the curve
  stat_function(fun = dnorm,
                n = 101,
                args = list(mean = -5, sd = 2),
                xlim = c(-13, 0),
                geom = "area",
                fill = "red",
                alpha = 0.5)+
  ylab("")+
  theme_bw()

b) Simply plug and chug: \[\begin{equation*} \begin{aligned} \mathbb{P}(-7 \leq X \leq -3) & = \Phi\left(\frac{-3 - (-5)}{\sqrt{4}}\right) - \Phi\left(\frac{-7 - (-5)}{\sqrt{4}}\right)\\ & = \Phi\left(\frac{2}{2}\right) - \Phi\left(\frac{-2}{2}\right)\\ & = \Phi(1) - \Phi(-1) \end{aligned} \end{equation*}\]

From here, there are a number of ways to evaluate \(\Phi(1) - \Phi(-1)\). We can do it directly using pnorm():

pnorm(q = 1, mean = 0, sd = 1) - pnorm(q = -1, mean = 0, sd = 1)
## [1] 0.6826895

We can also use a property of \(\Phi(z)\): namely, that \(\Phi(z) = 1 - \Phi(-z)\), which implies \(\Phi(z) - \Phi(-z) = [1 - \Phi(-z)] - \Phi(-z) = 1 - 2\Phi(-z)\). With pnorm(), that is:

1 - 2 * pnorm(q = -1, mean = 0, sd = 1)
## [1] 0.6826895

Multiple Random Variables

Definition: Let \(X\) and \(Y\) be random variables. The joint distribution function of \(X\) and \(Y\) is the function \(F : \mathbb{R}^2 \to [0,1]\), defined by \(F(x,y) = \mathbb{P}(X \leq x, Y \leq y)\).
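
To make this concrete, here is a small simulation sketch: taking \(X\) and \(Y\) to be independent standard normals (an arbitrary choice for illustration), the joint distribution function at a point is just the share of simulated pairs that land below it in both coordinates.

set.seed(42)              # arbitrary seed for reproducibility
x <- rnorm(100000)
y <- rnorm(100000)
# Monte Carlo estimate of F(1, 0) = P(X <= 1, Y <= 0)
mean(x <= 1 & y <= 0)
# Since X and Y are independent here, this is close to pnorm(1) * pnorm(0) ~ 0.42
pnorm(1) * pnorm(0)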


Exercise: Using a Joint PDF
Let \(X\) and \(Y\) be two jointly continuous random variables with the following joint PDF: \[\begin{equation*} f_{XY}(x,y) = \begin{cases} x + cy^2 & \text{if } 0 \leq x \leq 1, 0 \leq y \leq 1\\ 0 & \text{otherwise} \end{cases} \end{equation*}\]

  1. Find the constant \(c\).
  2. Find \(\mathbb{P}(0 \leq X \leq \frac{1}{2}, 0 \leq Y \leq \frac{1}{2})\).

Solution:

  1. Remember that \(\displaystyle \int_{Y} \int_{X} f(x,y)\,dx\,dy = 1\). Since \(X\) and \(Y\) are each defined over \([0,1]\), we simply integrate the PDF \(f_{XY}(x,y) = x + cy^2\) over those bounds. \[\begin{equation*} \begin{aligned} 1 & = \displaystyle \int_{Y} \int_{X} f(x,y)\,dx\,dy\\ & = \int_{0}^{1} \int_{0}^{1} x + cy^2\,dx\,dy\\ & = \int_{0}^{1} [\frac{1}{2}x^2 + cy^2x]_{x=0}^{x=1}\,dy\\ & = \int_{0}^{1} \frac{1}{2} + cy^2\,dy\\ & = [\frac{1}{2}y + \frac{1}{3}cy^3]_{y=0}^{y=1}\\ & = \frac{1}{2} + \frac{1}{3}c\\ \implies c &= \frac{3}{2} \end{aligned} \end{equation*}\]

  2. Now, we need to change the bounds of integration to reflect the problem. Using our result from part 1 that \(c = \frac{3}{2}\), we evaluate: \[\begin{equation*} \begin{aligned} \mathbb{P}(0 \leq X \leq \frac{1}{2}, 0 \leq Y \leq \frac{1}{2}) & = \displaystyle \int_{0}^{1/2} \int_{0}^{1/2} (x + \frac{3}{2}y^2) \,dx\,dy\\ & = \int_{0}^{1/2} [\frac{1}{2}x^2 + \frac{3}{2}y^2 x]_{x=0}^{x=\frac{1}{2}}\,dy\\ & = \int_{0}^{1/2} (\frac{1}{8} + \frac{3}{4}y^2)\,dy\\ & = [\frac{1}{8}y + \frac{1}{4}y^3]_{y=0}^{y=\frac{1}{2}}\\ & = \frac{1}{8}(\frac{1}{2}) + \frac{1}{4}(\frac{1}{2})^3\\ & = \frac{3}{32} \end{aligned} \end{equation*}\]
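
Both answers are easy to double-check numerically. Here is a sketch using nested integrate() calls (the helper functions inner() and inner_half() are just for this check):

f <- function(x, y) x + (3/2) * y^2   # the joint PDF with c = 3/2
# Integrate over x for each fixed y, then over y
inner <- function(y) sapply(y, function(yy) integrate(function(x) f(x, yy), 0, 1)$value)
integrate(inner, 0, 1)$value          # total probability; should be 1
inner_half <- function(y) sapply(y, function(yy) integrate(function(x) f(x, yy), 0, 0.5)$value)
integrate(inner_half, 0, 0.5)$value   # should be 3/32 = 0.09375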


Definition: Multivariate Normal

An \(n\)-dimensional multivariate normal random vector \(\mathbf{x} = (x_1, \dots, x_n)\) has the following density function: \[\begin{equation*} f(\mathbf{x}) = \frac{1}{\sqrt{(2 \pi)^n |\Sigma|}} \exp \left[-\frac{1}{2} (\mathbf{x} - \mu)^{\top} \Sigma^{-1}(\mathbf{x} - \mu)\right] \end{equation*}\] where \(\mu\) is an \(n \times 1\) vector of means and \(\Sigma\) is an \(n \times n\) positive definite covariance matrix, whose diagonal entries are the variances \(\sigma_{x_1}^2, \dots, \sigma_{x_n}^2\).
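
The density is straightforward to evaluate in base R. Here is a minimal sketch for a hypothetical bivariate case (the mean vector, covariance matrix, and evaluation point below are all arbitrary choices):

mu <- c(0, 0)                          # arbitrary mean vector
Sigma <- matrix(c(1, 0.5,
                  0.5, 1), nrow = 2)   # arbitrary positive definite covariance
x <- c(1, -1)                          # point at which to evaluate the density
n <- length(x)
# Plugging straight into the formula above
drop(exp(-0.5 * t(x - mu) %*% solve(Sigma) %*% (x - mu)) / sqrt((2 * pi)^n * det(Sigma)))

For routine use, the dmvnorm() function in the mvtnorm package computes the same quantity.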


Definition: Conditional distributions

Let \(X\) and \(Y\) be random variables with marginal density functions \(f_{X}(x)\) and \(f_{Y}(y)\), respectively, and joint density function \(f(x,y)\). The conditional distribution of \(Y\) given \(X\) is defined, wherever \(f_{X}(x) > 0\), by: \[\begin{equation*} f(y|x) = \frac{f(x,y)}{f_{X}(x)} \end{equation*}\]


Definition: Independence
\(X\) and \(Y\) are said to be independent if \(f(x,y) = f_X(x) \cdot f_{Y}(y)\); that is, if the joint distribution is the product of the marginal distributions.


Definition: Marginal distribution

Let \(X\) and \(Y\) be random variables. Then, the marginal distribution of \(X\), \(f_{X}(x)\), is obtained by summing or integrating the joint distribution over \(Y\): \[\begin{equation*} f_{X}(x) = \begin{cases} \displaystyle \sum_{Y} f(x,y) & \text{if $Y$ is discrete}\\ \displaystyle \int_{Y} f(x,y)\, dy & \text{if $Y$ is continuous} \end{cases} \end{equation*}\]
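
For example, using the joint PDF from the exercise above, the marginal distribution of \(X\) is \[\begin{equation*} f_{X}(x) = \displaystyle \int_{0}^{1} \left(x + \frac{3}{2}y^2\right)\,dy = x + \frac{1}{2}, \quad 0 \leq x \leq 1 \end{equation*}\] and the conditional distribution of \(Y\) given \(X = x\) is then \(f(y|x) = \frac{x + \frac{3}{2}y^2}{x + \frac{1}{2}}\).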


Exercise: Using Joint, Marginal, and Conditional Distributions

Let \(X\) and \(Y\) be random variables. Prove that if \(X\) and \(Y\) are independent, then \(f(y|x) = f_{Y}(y)\).

Solution: By definition, \(f(y|x) = \frac{f(x,y)}{f_{X}(x)}\). Independence means that \(f(x,y) = f_X(x) \cdot f_{Y}(y)\), so we can substitute \(f(y|x) = \frac{f_X(x) \cdot f_{Y}(y)}{f_{X}(x)} = f_{Y}(y)\).


Summarizing Distributions

We can summarize most distributions with just a few numbers:
1) Central Tendency: where is the center of the distribution?
2) Spread: how spread out is the distribution around its center?

Definition: Expectation
Let \(X\) be a random variable with probability mass/density function \(f_{X}(x)\). Then the expectation of \(X\) is defined as:
\[\begin{equation*} \mathbb{E}[X] = \begin{cases} \displaystyle \sum_{X} x \cdot f(x) & \text{if $X$ is discrete}\\ \displaystyle \int_{X} x \cdot f(x)\, dx & \text{if $X$ is continuous} \end{cases} \end{equation*}\]
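
Both cases are easy to evaluate in R. As a sketch, take a fair six-sided die for the discrete case and the standard normal for the continuous case:

# Discrete: E[X] for a fair die
x <- 1:6
p <- rep(1/6, 6)
sum(x * p)
## [1] 3.5
# Continuous: E[Z] for Z ~ N(0,1), by numerical integration; should be (numerically) zero
integrate(function(z) z * dnorm(z), lower = -Inf, upper = Inf)$value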

Let \(X\) and \(Y\) be random variables, and \(a\) and \(b\) be constants. Then:
- \(\mathbb{E}[a] = a\)
- \(\mathbb{E}[a \cdot X] = a \cdot \mathbb{E}[X]\)
- \(\mathbb{E}[a \cdot X + b] = a \cdot \mathbb{E}[X] + b\)
- \(\mathbb{E}[a \cdot X + b \cdot Y] = a \cdot \mathbb{E}[X] + b \cdot \mathbb{E}[Y]\)
- If \(X\) and \(Y\) are independent, then \(\mathbb{E}[X \cdot Y] = \mathbb{E}[X] \cdot \mathbb{E}[Y]\)
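
The last two properties are easy to check by simulation; in this sketch, X and Y are drawn independently with arbitrarily chosen means 2 and 3:

set.seed(1)                  # arbitrary seed
x <- rnorm(100000, mean = 2)
y <- rnorm(100000, mean = 3)
mean(3 * x + 2 * y)          # linearity: should be close to 3*2 + 2*3 = 12
mean(x * y)                  # independence: close to E[X] * E[Y] = 2*3 = 6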

Law of Iterated Expectations
Let \(X\) and \(Y\) be random variables. Then, \(\mathbb{E}[X] = \mathbb{E}\left[\mathbb{E}[X | Y]\right]\).
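
A quick simulation sketch of this law, with the hypothetical setup \(Y \sim \mathcal{N}(0,1)\) and \(X | Y \sim \mathcal{N}(Y, 1)\), so that \(\mathbb{E}[X | Y] = Y\):

set.seed(2)
y <- rnorm(100000)            # draw Y first
x <- rnorm(100000, mean = y)  # then X | Y = y ~ N(y, 1)
mean(x)                       # estimates E[X]
mean(y)                       # estimates E[E[X|Y]] = E[Y]; the two agree (both ~0)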

Definition: Variance
Let \(X\) and \(Y\) be random variables, and \(a\) be a constant. The variance is defined as:
\[\begin{equation*} \mathbb{V}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 \end{equation*}\]

The variance has some important properties:
- \(\mathbb{V}(X + a) = \mathbb{V}(X)\)
- \(\mathbb{V}(a \cdot X) = a^2 \cdot \mathbb{V}(X)\)
- The covariance is \(\text{Cov}(X,Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[X \cdot Y] - \mathbb{E}[X] \cdot \mathbb{E}[Y]\)
- \(\mathbb{V}(X \pm Y) = \mathbb{V}(X) + \mathbb{V}(Y) \pm 2 \cdot \text{Cov}(X,Y)\)
- If \(X\) and \(Y\) are independent, \(\text{Cov}(X,Y) = 0\), meaning \(\mathbb{V}(X \pm Y) = \mathbb{V}(X) + \mathbb{V}(Y)\)
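
Again, a simulation sketch: var() and cov() compute sample versions that converge to the population quantities as the sample grows.

set.seed(3)
x <- rnorm(100000, sd = 2)
y <- rnorm(100000)
var(x + 10) - var(x)             # shifting by a constant changes nothing; ~0
var(3 * x) / var(x)              # scaling: should be 3^2 = 9
var(x + y) - (var(x) + var(y))   # ~0, since Cov(X,Y) = 0 under independence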

Exercise: Deriving a Variance
Let \(X \sim \text{Bern}(p)\). We showed in class that \(\mathbb{E}[X] = p\). Find \(\mathbb{V}(X)\).

Solution: \(\mathbb{V}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2\). We know \(\mathbb{E}[X] = p\), so \(\mathbb{E}[X]^2 = p^2\). We need to find \(\mathbb{E}[X^2]\): \[\begin{equation*} \begin{aligned} \mathbb{E}[X^2] & = \mathbb{P}(X = 1) (1)^2 + \mathbb{P}(X = 0) (0)^2\\ & = \mathbb{P}(X = 1)\\ & = p \end{aligned} \end{equation*}\]

Thus, \(\mathbb{V}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = p - p^2\), which can be factored to \(p \cdot (1-p)\).
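
A simulation sketch confirms this, here with the arbitrary choice p = 0.3:

set.seed(4)
p <- 0.3
x <- rbinom(100000, size = 1, prob = p)   # Bernoulli(p) draws
var(x)                                    # should be close to p * (1 - p) = 0.21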