Definition: a continuous random variable \(Z\) is standard normal, written \(Z \sim \mathcal{N}(0,1)\), if it has the
following PDF: \[\begin{equation*}
\phi(z) = \displaystyle \frac{1}{\sqrt{2 \pi}} \exp
\left[\frac{-z^2}{2} \right]
\end{equation*}\]
Likewise, the CDF of a standard normal random variable is given by: \[\begin{equation*} \Phi(z) = \displaystyle \int_{-\infty}^{z} \frac{1}{\sqrt{2 \pi}} \exp \left[\frac{-t^2}{2} \right]\,dt \end{equation*}\]
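This integral has no closed-form solution, so in practice we evaluate \(\Phi(z)\) numerically. In R, pnorm() computes \(\Phi\) directly; as a quick sketch (z = 1.96 is just an illustrative value), numerically integrating dnorm() should agree:
# pnorm() evaluates the standard normal CDF directly
pnorm(q = 1.96, mean = 0, sd = 1)
## [1] 0.9750021
# Integrating the PDF up to z should agree (up to numerical error)
integrate(dnorm, lower = -Inf, upper = 1.96)$value
## [1] 0.9750021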
The normal distribution is symmetric around its center, which gives us some useful properties. Let \(Z \sim \mathcal{N}(0,1)\):
# I'm just plotting z from -4 to 4. Remember that it actually traverses the real line.
# Use z to shift the bounds of the plot.
# This is really important! If you're looking at something centered around
# another number, you need to adjust the field of view accordingly!
z <- c(-4,4)
# I'm using the ggplot2 library to make some nicer visuals than base R.
# If you've never used ggplot2, you can install it by entering
# `install.packages("ggplot2")` into the console.
# You only need to install a package once, but you need to load the library
# in every new R session. You do that by entering `library(ggplot2)` into the console.
library(ggplot2)
# We're passing `z` into ggplot() as a data.frame object. Don't worry about this too much
ggplot(data = data.frame(z), aes(z)) +
# This inputs the standard normal distribution with mean 0 and standard deviation 1
stat_function(fun = dnorm, n = 101, args = list(mean = 0, sd = 1)) +
# Suppose z = 1. Then -z = -1. Then phi(z) = phi(1) = phi(-z) = phi(-1)
# This code makes the point and dotted lines for phi(1)
geom_point(aes(x = 1, y = dnorm(1)), color = "red")+
geom_segment(aes(x = 1, xend = 1, y = 0, yend = dnorm(1)),
color = "red",
linetype = "dashed")+
annotate("text", x = 1, y = dnorm(1), hjust = -0.25, label = "phi(1)", color = "red")+
# and this does the same for phi(-1)
geom_point(aes(x = -1, y = dnorm(-1)), color = "red")+
geom_segment(aes(x = -1, xend = -1, y = 0, yend = dnorm(-1)),
color = "red",
linetype = "dashed")+
annotate("text", x = -1, y = dnorm(-1), hjust = 1.25, label = "phi(-1)", color = "red")+
# This horizontal line shows that the two are equal
geom_segment(aes(x = -1, xend = 1, y = dnorm(-1), yend = dnorm(1)),
color = "red",
linetype = "dashed")+
# This just makes it nicer to look at
ylab("") +
scale_y_continuous(breaks = NULL)+
theme_bw()
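Since \((-z)^2 = z^2\), the density takes identical values at \(z\) and \(-z\), which we can confirm with dnorm():
# Numerical check of the symmetry property: phi(1) equals phi(-1)
dnorm(1) == dnorm(-1)
## [1] TRUE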
# And this works for any value of z! For instance, z = +/-0.5 and z = +/-2:
# z = +/-0.5
ggplot(data = data.frame(z), aes(z)) +
stat_function(fun = dnorm, n = 101, args = list(mean = 0, sd = 1)) + ylab("") +
geom_point(aes(x = 0.5, y = dnorm(0.5)), color = "blue")+
geom_segment(aes(x = 0.5, xend = 0.5, y = 0, yend = dnorm(0.5)),
color = "blue",
linetype = "dashed")+
annotate("text", x = 0.5, y = dnorm(0.5), hjust = -0.25, label = "phi(0.5)", color = "blue")+
geom_point(aes(x = -0.5, y = dnorm(-0.5)), color = "blue")+
geom_segment(aes(x = -0.5, xend = -0.5, y = 0, yend = dnorm(-0.5)),
color = "blue",
linetype = "dashed")+
annotate("text", x = -0.5, y = dnorm(-0.5), hjust = 1.25, label = "phi(-0.5)", color = "blue")+
geom_segment(aes(x = -0.5, xend = 0.5, y = dnorm(-0.5), yend = dnorm(0.5)),
color = "blue",
linetype = "dashed")+
scale_y_continuous(breaks = NULL)+
theme_bw()
# and z = +/-2
ggplot(data = data.frame(z), aes(z)) +
stat_function(fun = dnorm, n = 101, args = list(mean = 0, sd = 1)) + ylab("") +
geom_point(aes(x = 2, y = dnorm(2)), color = "blue")+
geom_segment(aes(x = 2, xend = 2, y = 0, yend = dnorm(2)),
color = "blue",
linetype = "dashed")+
annotate("text", x = 2, y = dnorm(2), hjust = -0.25, label = "phi(2)", color = "blue")+
geom_point(aes(x = -2, y = dnorm(-2)), color = "blue")+
geom_segment(aes(x = -2, xend = -2, y = 0, yend = dnorm(-2)),
color = "blue",
linetype = "dashed")+
annotate("text", x = -2, y = dnorm(-2), hjust = 1.25, label = "phi(-2)", color = "blue")+
geom_segment(aes(x = -2, xend = 2, y = dnorm(-2), yend = dnorm(2)),
color = "blue",
linetype = "dashed")+
scale_y_continuous(breaks = NULL)+
theme_bw()
# Let's use the standard normal again
# Suppose Z = 0
ggplot(data = data.frame(z), aes(z))+
# This just plots the normal PDF. You can change the mean and sd later
stat_function(fun = dnorm, n = 101, args = list(mean = 0, sd = 1))+
# This specifies area under the curve visually. Remember that this goes from -infinity to Z
# `fill` just specifies color and `alpha` is transparency
stat_function(fun = dnorm,
xlim = c(-4, 0), # Remember, we want the area from the lower bound to Z = 0
geom = "area",
fill = "red",
alpha = 0.5)+
# And this is the right partition from Z to +infinity
stat_function(fun = dnorm,
xlim = c(0,4),
geom = "area",
fill = "blue",
alpha = 0.5)+
# And this is just to look nice
scale_y_continuous(breaks = NULL)+
theme_bw()
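By symmetry, splitting at \(z = 0\) divides the total probability exactly in half, which pnorm() confirms:
pnorm(q = 0, mean = 0, sd = 1) # the red area, P(Z <= 0)
## [1] 0.5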
# Suppose, then, that Z = 1.
# The area below z = 1 is shaded red and the area below z = -1 is shaded blue;
# where the two overlap, the colors mix to purple.
ggplot(data = data.frame(z), aes(z))+
stat_function(fun = dnorm)+
stat_function(fun = dnorm,
xlim = c(-4, 1),
geom = "area",
fill = "red",
alpha = 0.5)+
stat_function(fun = dnorm,
xlim = c(-4, -1),
geom = "area",
fill = "blue",
alpha = 0.5)+
scale_y_continuous(breaks = NULL)+
theme_bw()
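The two shaded areas correspond to \(\Phi(1)\) and \(\Phi(-1)\), which pnorm() evaluates directly:
pnorm(q = 1, mean = 0, sd = 1)  # P(Z <= 1), the red area
## [1] 0.8413447
pnorm(q = -1, mean = 0, sd = 1) # P(Z <= -1), the blue region
## [1] 0.1586553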
# If we add those areas up, we get the whole sample space with area = 1
ggplot(data = data.frame(z), aes(z))+
stat_function(fun = dnorm)+
stat_function(fun = dnorm,
xlim = c(-4, 1),
geom = "area",
fill = "red",
alpha = 0.5)+
stat_function(fun = dnorm,
xlim = c(1, 4),
geom = "area",
fill = "blue",
alpha = 0.5)+
scale_y_continuous(breaks = NULL)+
theme_bw()
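And pnorm() confirms it: the lower area \(\mathbb{P}(Z \leq 1)\) and the upper area \(\mathbb{P}(Z > 1)\) sum to 1 (lower.tail = FALSE returns the upper tail):
pnorm(1) + pnorm(1, lower.tail = FALSE)
## [1] 1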
Remember, if \(Z \sim \mathcal{N}(0,1)\) and \(X = \mu + \sigma Z\), then \(X \sim \mathcal{N}(\mu, \sigma^2)\); conversely, we can standardize \(X\) via \(Z = \frac{X - \mu}{\sigma}\). As a consequence, we can derive information about any normal random variable in terms of the standard normal. To find the CDF of \(X \sim \mathcal{N}(\mu, \sigma^2)\): \[\begin{equation*}
\begin{aligned}
F_{X}(x) & = \mathbb{P}(X \leq x)\\
& = \mathbb{P}(\mu + \sigma Z \leq x)\\
& = \mathbb{P}\left(Z \leq \frac{x - \mu}{\sigma}\right)\\
& = \Phi\left(\frac{x - \mu}{\sigma}\right)
\end{aligned}
\end{equation*}\]
Similarly, we can use this result to find the PDF by differentiating, where the \(\frac{1}{\sigma}\) comes from the chain rule: \[\begin{equation*} \begin{aligned} f_{X}(x) & = \frac{d}{dx}F_{X}(x)\\ & = \frac{d}{dx}\Phi\left(\frac{x - \mu}{\sigma}\right)\\ & = \frac{1}{\sigma}\frac{1}{\sqrt{2\pi}} \exp \left[-\frac{(x - \mu)^2}{2\sigma^2} \right]\\ \implies \mathbb{P}(a \leq X \leq b) & = \Phi\left(\frac{b - \mu}{\sigma}\right) - \Phi\left(\frac{a - \mu}{\sigma}\right) \end{aligned} \end{equation*}\]
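As a quick sketch of the last identity (the values \(\mu = 2\), \(\sigma = 3\), \(a = 0\), \(b = 5\) are arbitrary choices for illustration), both sides should return the same probability:
mu <- 2; sigma <- 3; a <- 0; b <- 5 # assumed illustrative values
# Left side: P(a <= X <= b) from X's own CDF
pnorm(b, mean = mu, sd = sigma) - pnorm(a, mean = mu, sd = sigma)
# Right side: the same probability via the standard normal
pnorm((b - mu) / sigma) - pnorm((a - mu) / sigma)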
Exercise: Using Properties of the Normal
Distribution
Let \(X \sim \mathcal{N}(-5,4)\):
a) Find \(\mathbb{P}(X \leq 0)\)
b) Find \(\mathbb{P}(-7 \leq X \leq -3)\)
Solution:
a) Since \(\mu = -5\) and \(\sigma^2 = 4\), we can determine \(\mathbb{P}(X \leq 0)\) using these
properties: \[\begin{equation*}
\begin{aligned}
\mathbb{P}(X \leq 0) & = F_{X}(x = 0)\\
& = \Phi\left(\frac{0 - \mu}{\sigma}\right)\\
& = \Phi\left(\frac{0 - (-5)}{\sqrt{4}}\right)\\
& = \Phi\left(\frac{5}{2}\right)
\end{aligned}
\end{equation*}\]
Then, we can use the pnorm() function in R to evaluate that quantity.
pnorm(q = 5/2, mean = 0, sd = 1) # why are we using this mean and standard deviation?
## [1] 0.9937903
And we can double-check by passing the mean and standard deviation of \(X\) to pnorm() directly:
pnorm(q = 0, mean = -5, sd = sqrt(4))
## [1] 0.9937903
If we were to plot the area under the curve, it would look something like this:
z <- c(-13, 3) # Shifting the plotting window to cover a distribution centered at -5
ggplot(data = data.frame(z), aes(z))+
# Just drawing the black PDF curve
stat_function(fun = dnorm,
n = 101,
args = list(mean = -5, sd = 2))+
# Now the area under the curve
stat_function(fun = dnorm,
n = 101,
args = list(mean = -5, sd = 2),
xlim = c(-13, 0),
geom = "area",
fill = "red",
alpha = 0.5)+
ylab("")+
theme_bw()
b) Using the same standardization, \(\mathbb{P}(-7 \leq X \leq -3) = \Phi\left(\frac{-3 - (-5)}{2}\right) - \Phi\left(\frac{-7 - (-5)}{2}\right) = \Phi(1) - \Phi(-1)\). From here, there are a number of ways to evaluate \(\Phi(1) - \Phi(-1)\). We can do it directly using pnorm():
pnorm(q = 1, mean = 0, sd = 1) - pnorm(q = -1, mean = 0, sd = 1)
## [1] 0.6826895
We can also use a property of \(\Phi(z)\): namely, that \(\Phi(z) = 1 - \Phi(-z)\), which implies \(\Phi(z) - \Phi(-z) = [1 - \Phi(-z)] - \Phi(-z)\), which is \(1 - 2 \Phi(-z)\). With pnorm(), that is:
1 - 2* pnorm(q = -1, mean = 0, sd = 1)
## [1] 0.6826895
Definition: Let \(X\) and \(Y\) be random variables. The joint distribution function of \(X\) and \(Y\) is the function \(F : \mathbb{R}^2 \to [0,1]\), defined by \(F(x,y) = \mathbb{P}(X \leq x, Y \leq y)\).
Exercise: Using a Joint PDF
Let \(X\) and \(Y\) be two jointly continuous random
variables with the following joint PDF: \[\begin{equation*}
f_{XY}(x,y) = \begin{cases}
x + cy^2 & \text{if } 0 \leq x \leq 1, 0 \leq y \leq 1\\
0 & \text{otherwise}
\end{cases}
\end{equation*}\]
a) Find the constant \(c\) that makes \(f_{XY}\) a valid joint PDF.
b) Find \(\mathbb{P}(0 \leq X \leq \frac{1}{2}, 0 \leq Y \leq \frac{1}{2})\).
Solution:
Remember that \(\displaystyle \int_{X}
\int_{Y} f(x,y)\,dx\,dy = 1\). Since \(X\) and \(Y\) are each defined over \([0,1]\), we simply integrate the PDF \(f_{XY}(x,y) = x + cy^2\) over those bounds.
\[\begin{equation*}
\begin{aligned}
1 & = \displaystyle \int_{X} \int_{Y} f(x,y)\,dx\,dy\\
& = \int_{0}^{1} \int_{0}^{1} x + cy^2\,dx\,dy\\
& = \int_{0}^{1} [\frac{1}{2}x^2 + cy^2x]_{x=0}^{x=1}\,dy\\
& = \int_{0}^{1} \frac{1}{2} + cy^2\,dy\\
& = [\frac{1}{2}y + \frac{1}{3}cy^3]_{y=0}^{y=1}\\
& = \frac{1}{2} + \frac{1}{3}c\\
\implies c &= \frac{3}{2}
\end{aligned}
\end{equation*}\]
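As a numerical sanity check, we can plug \(c = \frac{3}{2}\) back in and verify that the joint PDF integrates to 1 over the unit square, using nested integrate() calls (integrate() expects a vectorized function, hence the sapply()):
# Integrate over x for each fixed y, then integrate that result over y
inner <- function(y) {
  sapply(y, function(yy) integrate(function(x) x + 1.5 * yy^2,
                                   lower = 0, upper = 1)$value)
}
integrate(inner, lower = 0, upper = 1)$value
## [1] 1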
Now, for part b), we change the bounds of integration to the region of interest, \(0 \leq x \leq \frac{1}{2}\) and \(0 \leq y \leq \frac{1}{2}\). Using our result that \(c = \frac{3}{2}\) from part a), we evaluate: \[\begin{equation*}
\begin{aligned}
\mathbb{P}(0 \leq X \leq \frac{1}{2}, 0 \leq Y \leq \frac{1}{2}) & =
\displaystyle \int_{0}^{1/2} \int_{0}^{1/2} (x + \frac{3}{2}y^2)
\,dx\,dy\\
& = \int_{0}^{1/2} [\frac{1}{2}x^2 + \frac{3}{2}y^2
x]_{x=0}^{x=\frac{1}{2}}\,dy\\
& = \int_{0}^{1/2} (\frac{1}{8} + \frac{3}{4}y^2)\,dy\\
& = [\frac{1}{8}y + \frac{3}{12}y^3]_{y=0}^{y=\frac{1}{2}}\\
& = \frac{1}{8}(\frac{1}{2}) + \frac{3}{12}(\frac{1}{2})^3\\
& = \frac{3}{32}
\end{aligned}
\end{equation*}\]
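The same nested-integration sketch, with the upper bounds changed to \(\frac{1}{2}\), recovers \(\frac{3}{32} = 0.09375\):
inner <- function(y) {
  sapply(y, function(yy) integrate(function(x) x + 1.5 * yy^2,
                                   lower = 0, upper = 0.5)$value)
}
integrate(inner, lower = 0, upper = 0.5)$value
## [1] 0.09375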
Definition: Multivariate Normal
An \(n\)-dimensional multivariate normal random vector \(\mathbf{x} = (x_1, \dots, x_n)\) has the following density function: \[\begin{equation*}
f(\mathbf{x}) = \frac{1}{\sqrt{(2 \pi)^n |\Sigma|}} \exp
\left[-\frac{1}{2} (\mathbf{x} - \mu)^{\top} \Sigma^{-1}(\mathbf{x} -
\mu)\right]
\end{equation*}\] where \(\mu\) is an \(n \times 1\) vector of means and \(\Sigma\) is an \(n \times n\) positive definite covariance matrix: its diagonal entries are the variances \(\sigma_{x_1}^2, \dots, \sigma_{x_n}^2\), and its off-diagonal entries are the covariances between pairs of components.
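To make the formula concrete, here is a minimal sketch that evaluates the density by hand in base R for an assumed bivariate example (the particular \(\mu\), \(\Sigma\), and evaluation point are made up for illustration):
mu <- c(0, 0)                                # assumed mean vector
Sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2) # assumed covariance matrix
x <- c(1, -1)                                # assumed point of evaluation
n <- length(x)
# Direct translation of the density formula above
quad <- drop(t(x - mu) %*% solve(Sigma) %*% (x - mu)) # quadratic form
exp(-0.5 * quad) / sqrt((2 * pi)^n * det(Sigma))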
Definition: Conditional distributions
Let \(X\) and \(Y\) be random variables with marginal
functions \(f_{X}(x)\) and \(f_{Y}(y)\) respectively, and the joint
probability function \(f(x,y)\). The
conditional distribution of \(Y\) given \(X\) is defined by: \[\begin{equation*}
f(y|x) = \frac{f(x,y)}{f_{X}(x)}
\end{equation*}\]
Definition: Independence
\(X\) and \(Y\) are said to be independent if \(f(x,y) = f_X(x) \cdot f_{Y}(y)\); that is, if the joint distribution is the product of the marginal distributions.
Definition: Marginal distribution
Let \(X\) and \(Y\) be random variables with joint probability function \(f(x,y)\). Then, the marginal distribution of \(X\), \(f_{X}(x)\), is as follows: \[\begin{equation*}
f_{X}(x) = \begin{cases}
\displaystyle \sum_{y} f(x,y) & \text{if $X$ and $Y$ are discrete}\\
\displaystyle \int_{Y} f(x,y)\,dy & \text{if $X$ and $Y$ are continuous}
\end{cases}
\end{equation*}\]
Exercise: Using Joint, Marginal, and Conditional
Distributions
Let \(X\) and \(Y\) be random variables. Prove that if
\(X\) and \(Y\) are independent, then \(f(y|x) = f_{Y}(y)\).
Solution: By definition, \(f(y|x) = \frac{f(x,y)}{f_{X}(x)}\). Independence means that \(f(x,y) = f_X(x) \cdot f_{Y}(y)\), so we can substitute \(f(y|x) = \frac{f_X(x) \cdot f_{Y}(y)}{f_{X}(x)} = f_{Y}(y)\).
We can summarize most distributions with just a few numbers:
1) Central Tendency: where is the center of the distribution?
2) Spread: how spread out is the distribution around the center?
Definition: Expectation
Let \(X\) be a random variable with
probability mass/density function \(f_{X}(x)\). Then the expectation
of \(X\) is defined as:
\[\begin{equation*}
\mathbb{E}[X] = \begin{cases}
\displaystyle \sum_{X} x \cdot f(x) & \text{if $X$ is discrete}\\
\displaystyle \int_{X} x \cdot f(x)\, dx & \text{if $X$ is
continuous}
\end{cases}
\end{equation*}\]
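As a quick illustration of the discrete case, take a fair six-sided die (an assumed example, not from the text above): the expectation is just the probability-weighted sum of the outcomes.
x <- 1:6           # outcomes of a fair die
px <- rep(1/6, 6)  # each outcome has probability 1/6
sum(x * px)        # E[X] = sum over x of x * f(x)
## [1] 3.5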
Let \(X\) and \(Y\) be random variables, and \(a\) and \(b\) be constants. Then:
- \(\mathbb{E}[a] = a\)
- \(\mathbb{E}[a \cdot X] = a \cdot
\mathbb{E}[X]\)
- \(\mathbb{E}[a \cdot X + b] = a \cdot
\mathbb{E}[X] + b\)
- \(\mathbb{E}[a \cdot X + b \cdot Y] = a
\cdot \mathbb{E}[X] + b \cdot \mathbb{E}[Y]\)
- If \(X\) and \(Y\) are independent, then \(\mathbb{E}[X \cdot Y] = \mathbb{E}[X] \cdot
\mathbb{E}[Y]\)
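These properties are easy to spot-check by simulation. Here is a minimal sketch of the third property, with assumed values \(a = 2\) and \(b = 3\) and an arbitrary distribution for \(X\):
set.seed(7)
x <- rnorm(1e5, mean = 1, sd = 2)
a <- 2; b <- 3  # assumed constants for illustration
mean(a * x + b) # simulated E[aX + b]
a * mean(x) + b # a * E[X] + b; the two should match almost exactly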
Law of Iterated Expectations
Let \(X\) and \(Y\) be random variables. Then, \(\mathbb{E}[X] = \mathbb{E}\left[\mathbb{E}[X |
Y]\right]\)
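A Monte Carlo sketch makes this concrete. The toy model below (\(Y \sim \text{Bern}(0.5)\) and \(X \mid Y \sim \mathcal{N}(Y, 1)\), so that \(\mathbb{E}[X \mid Y] = Y\)) is an assumed example, not from the text:
set.seed(1)
y <- rbinom(1e5, size = 1, prob = 0.5) # Y ~ Bernoulli(0.5)
x <- rnorm(1e5, mean = y, sd = 1)      # X | Y ~ N(Y, 1), so E[X | Y] = Y
mean(x) # direct estimate of E[X]; both should be close to 0.5
mean(y) # estimate of E[E[X | Y]] = E[Y]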
Definition: Variance
Let \(X\) and \(Y\) be random variables, and \(a\) be a constant. The variance is
defined as:
\[\begin{equation*}
\mathbb{V}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] -
\mathbb{E}[X]^2
\end{equation*}\]
The variance has some important properties:
- \(\mathbb{V}(X + a) =
\mathbb{V}(X)\)
- \(\mathbb{V}(a \cdot X) = a^2 \cdot
\mathbb{V}(X)\)
- The covariance is \(\text{Cov}(X,Y)
= \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[X
\cdot Y] - \mathbb{E}[X] \cdot \mathbb{E}[Y]\)
- \(\mathbb{V}(X \pm Y) = \mathbb{V}(X) + \mathbb{V}(Y) \pm 2 \cdot \text{Cov}(X,Y)\)
- If \(X\) and \(Y\) are independent, \(\text{Cov}(X,Y) = 0\), meaning \(\mathbb{V}(X \pm Y) = \mathbb{V}(X) +
\mathbb{V}(Y)\)
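As before, we can spot-check these rules by simulation (the particular distributions below are arbitrary choices for illustration):
set.seed(42)
x <- rnorm(1e5, mean = 0, sd = 2)
y <- rnorm(1e5, mean = 1, sd = 3) # drawn independently of x
var(3 * x) / var(x) # should be (essentially exactly) 3^2 = 9
var(x + y)          # should be close to var(x) + var(y) = 4 + 9 = 13,
var(x) + var(y)     # since Cov(X, Y) is approximately 0 here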
Exercise: Deriving a Variance
Let \(X \sim \text{Bern}(p)\). We
showed in class that \(\mathbb{E}[X] =
p\). Find \(\mathbb{V}(X)\).
Solution: \(\mathbb{V}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2\). We know \(\mathbb{E}[X] = p\), so \(\mathbb{E}[X]^2 = p^2\). We need to find \(\mathbb{E}[X^2]\): \[\begin{equation*} \begin{aligned} \mathbb{E}[X^2] & = \mathbb{P}(X = 1) (1)^2 + \mathbb{P}(X = 0) (0)^2\\ & = \mathbb{P}(X = 1) = p \end{aligned} \end{equation*}\]
Thus, \(\mathbb{V}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = p - p^2\), which can be factored to \(p \cdot (1-p)\).
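A quick simulation check, with an assumed \(p = 0.3\), agrees with this result:
p <- 0.3
x <- rbinom(1e5, size = 1, prob = p)
var(x)      # sample variance of the simulated draws; close to 0.21
p * (1 - p) # theoretical variance
## [1] 0.21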