Data Science Inference and Modeling
The textbook for the Data Science course series is freely available online.
This course corresponds to the textbook chapters Statistical Inference and Statistical Models.
Learning Objectives
• The concepts necessary to define estimates and margins of error (populations, parameters, estimates, and standard errors) in order to make predictions about data
• How to use models to aggregate data from different sources
• The very basics of Bayesian statistics and predictive modeling
Course Overview
You will apply the central limit theorem to assess how close a sample estimate is to the population parameter
of interest.
You will learn how to calculate confidence intervals and learn about the relationship between confidence
intervals and p-values.
You will learn about statistical models in the context of election forecasting.
You will learn about Bayesian statistics through looking at examples from rare disease diagnosis and baseball.
You will learn about election forecasting, building on what you’ve learned in the previous sections about
statistical modeling and Bayesian statistics.
Section 7: Association Tests
You will learn how to use association and chi-squared tests to perform inference for binary, categorical, and
ordinal data through an example looking at research funding rates.
Introduction to Inference
• statistical inference, the process of deducing characteristics of a population using data from a random
sample
• the statistical concepts necessary to define estimates and margins of error
• how to forecast future results and estimate the precision of our forecast
• how to calculate and interpret confidence intervals and p-values
Key points
• Information gathered from a small random sample can be used to infer characteristics of the entire
population.
• Opinion polls are useful when asking everyone in the population is impossible.
• A common use for opinion polls is determining voter preferences in political elections for the purposes
of forecasting election results.
• The spread of a poll is the estimated difference between support for two candidates or options.
Section 1 Overview
Section 1 introduces you to parameters and estimates.
After completing Section 1, you will be able to define and work with populations, parameters, samples, and estimates.
The corresponding textbook sections are available online.
Key points
• The task of statistical inference is to estimate an unknown population parameter using observed data
from a sample.
• In a sampling model, the collection of elements in the urn is called the population.
• A parameter is a number that summarizes data for an entire population.
• A sample is observed data from a subset of the population.
• An estimate is a summary of the observed data about a parameter that we believe is informative. It
is a data-driven guess of the population parameter.
• We want to predict the proportion of blue beads in the urn, the parameter 𝑝. The proportion of red beads in the urn is 1 − 𝑝, and the spread is 2𝑝 − 1.
• The sample proportion is a random variable. Sampling gives random results drawn from the population
distribution.
if(!require(tidyverse)) install.packages("tidyverse")
## -- Conflicts ----------------------------------------------------------------------------------------
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
if(!require(dslabs)) install.packages("dslabs")
library(tidyverse)
library(dslabs)
take_poll(25) # draw 25 beads
[Figure: bar plot of blue and red bead counts from take_poll(25).]
The Sample Average
The corresponding textbook sections are available online.
Key points
• Many common data science tasks can be framed as estimating a parameter from a sample.
• We illustrate statistical inference by walking through the process to estimate 𝑝. From the estimate of
𝑝, we can easily calculate an estimate of the spread, 2𝑝 − 1.
• Consider the random variable 𝑋 that is 1 if a blue bead is chosen and 0 if a red bead is chosen. The
proportion of blue beads in 𝑁 draws is the average of the draws 𝑋1 , ..., 𝑋𝑁 .
• 𝑋̄ is the sample average. In statistics, a bar on top of a symbol denotes the average. 𝑋̄ is a random
variable because it is the average of random draws - each time we take a sample, 𝑋̄ is different.
𝑋̄ = (𝑋1 + 𝑋2 + ... + 𝑋𝑁)/𝑁
• The number of blue beads drawn in 𝑁 draws, 𝑁𝑋̄, is 𝑁 times the sample proportion.
However, we do not know the true proportion: we are trying to estimate this parameter 𝑝.
• A poll taken in advance of an election estimates 𝑝 for that moment, not for election day.
• In order to predict election results, forecasters try to use early estimates of 𝑝 to predict 𝑝 on election
day. We discuss some approaches in later sections.
• When interpreting values of 𝑋̄, it is important to remember that 𝑋̄ is a random variable with an
expected value and standard error that represents the sample proportion of positive events.
• The expected value of 𝑋̄ is the parameter of interest 𝑝. This follows from the fact that 𝑋̄ is the sum
of independent draws of a random variable times a constant 1/𝑁 .
E(𝑋̄) = 𝑝
• As the number of draws 𝑁 increases, the standard error of our estimate 𝑋̄ decreases. The standard
error of the average 𝑋̄ over 𝑁 draws is:
SE(𝑋̄) = √(𝑝(1 − 𝑝)/𝑁)
• In theory, we can get more accurate estimates of 𝑝 by increasing 𝑁 . In practice, there are limits on
the size of 𝑁 due to costs, as well as other factors we discuss later.
• We can also use other random variable equations to determine the expected value of the sum of draws
E(𝑆) and standard error of the sum of draws SE(𝑆).
E(𝑆) = 𝑁𝑝
SE(𝑆) = √(𝑁𝑝(1 − 𝑝))
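As a quick check of these formulas, they can be evaluated directly in R (the values of p and N below are hypothetical, chosen only for illustration):
p <- 0.25 # hypothetical proportion of blue beads
N <- 25 # hypothetical number of draws
N * p # E(S), the expected number of blue beads
sqrt(N * p * (1 - p)) # SE(S)
p # E(X_bar), the expected sample proportion
sqrt(p * (1 - p) / N) # SE(X_bar)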
Assessment - Parameters and Estimates
1. Suppose you poll a population in which a proportion 𝑝 of voters are Democrats and 1 − 𝑝 are Republicans.
Your sample size is 𝑁 = 25. Consider the random variable 𝑆, which is the total number of Democrats in
your sample.
What is the expected value of this random variable 𝑆?
□ A. E(𝑆) = 25(1 − 𝑝)
⊠ B. E(𝑆) = 25𝑝
□ C. E(𝑆) = √(25𝑝(1 − 𝑝))
□ D. E(𝑆) = 𝑝
2. Again, consider the random variable 𝑆, which is the total number of Democrats in your sample of 25
voters.
The variable 𝑝 describes the proportion of Democrats in the sample, whereas 1 − 𝑝 describes the proportion
of Republicans.
What is the standard error of 𝑆?
□ A. SE(𝑆) = √25 ⋅ 𝑝(1 − 𝑝)
□ B. SE(𝑆) = 25𝑝
□ C. SE(𝑆) = 25(1 − 𝑝)
⊠ D. SE(𝑆) = √(25𝑝(1 − 𝑝))
3. Consider the random variable 𝑆/𝑁, which is equivalent to the sample average that we have been denoting as 𝑋̄.
The variable 𝑁 represents the sample size and 𝑝 is the proportion of Democrats in the population.
What is the expected value of 𝑋̄?
⊠ A. E(𝑋̄) = 𝑝
□ B. E(𝑋̄) = 𝑁𝑝
□ C. E(𝑋̄) = 𝑁(1 − 𝑝)
□ D. E(𝑋̄) = 1 − 𝑝
4. What is the standard error of the sample average, 𝑋̄?
The variable 𝑁 represents the sample size and 𝑝 is the proportion of Democrats in the population.
□ A. SE(𝑋̄) = √(𝑁𝑝(1 − 𝑝))
⊠ B. SE(𝑋̄) = √(𝑝(1 − 𝑝)/𝑁)
□ C. SE(𝑋̄) = √(𝑝(1 − 𝑝))
□ D. SE(𝑋̄) = √𝑁
5. Write a line of code that calculates the standard error se of a sample average when you poll 25 people
in the population.
Generate a sequence of 100 proportions of Democrats p that vary from 0 (no Democrats) to 1 (all Democrats).
Plot se versus p for the 100 different proportions.
# `N` represents the number of people polled
N <- 25
# Create a variable `p` that contains 100 proportions ranging from 0 to 1 using the `seq` function
p <- seq(0, 1, length.out = 100)
# Create a variable `se` that contains the standard error of each sample average
se <- sqrt(p * (1 - p)/N)
# Plot `se` versus `p`
plot(p, se)
[Figure: se versus p for N = 25.]
6. Using the same code as in the previous exercise, create a for-loop that generates three plots of p versus
se when the sample sizes equal 𝑁 = 25, 𝑁 = 100, and 𝑁 = 1000.
# The vector `p` contains 100 proportions of Democrats ranging from 0 to 1 using the `seq` function
p <- seq(0, 1, length = 100)
# Define a vector `sample_sizes` containing the three sample sizes
sample_sizes <- c(25, 100, 1000)
# Write a for-loop that calculates the standard error `se` for every value of `p` for each of the three sample sizes
for (N in sample_sizes)
{
se <- sqrt(p * (1 - p)/N)
plot(p,se,ylim = c(0,0.1))
}
[Figure: three plots of se versus p, for N = 25, N = 100, and N = 1000; the standard errors shrink as N grows.]
7. Our estimate for the difference in proportions of Democrats and Republicans is 𝑑 = 𝑋̄ − (1 − 𝑋̄).
Which derivation correctly uses the rules we learned about sums of random variables and scaled random variables to derive the expected value of 𝑑?
8. Our estimate for the difference in proportions of Democrats and Republicans is 𝑑 = 𝑋̄ − (1 − 𝑋̄).
Which derivation correctly uses the rules we learned about sums of random variables and scaled random variables to derive the standard error of 𝑑?
9. Say the actual proportion of Democratic voters is 𝑝 = 0.45. In this case, the Republican party is winning by a relatively large margin of 𝑑 = −0.1, or a 10% margin of victory. What is the standard error of the spread 2𝑋̄ − 1 in this case?
# `p` represents the proportion of Democratic voters
p <- 0.45
# `N` represents the number of people polled
N <- 25
# Calculate the standard error of the spread. Print this value to the console.
2*sqrt((p*(1-p)/N))
## [1] 0.1989975
10. So far we have said that the difference between the proportion of Democratic voters and Republican
voters is about 10% and that the standard error of this spread is about 0.2 when 𝑁 = 25.
Select the statement that explains why this sample size is sufficient or not.
□ A. This sample size is sufficient because the expected value of our estimate 2𝑋̄ −1 is 𝑑 so our prediction
will be right on.
⊠ B. This sample size is too small because the standard error is larger than the spread.
□ C. This sample size is sufficient because the standard error of about 0.2 is much smaller than the spread
of 10%.
□ D. Without knowing p, we have no way of knowing that increasing our sample size would actually
improve our standard error.
Section 2 Overview
In Section 2, you will look at the Central Limit Theorem in practice.
After completing Section 2, you will be able to:
• Use the Central Limit Theorem to calculate the probability that a sample estimate 𝑋̄ is close to the
population proportion 𝑝.
• Run a Monte Carlo simulation to corroborate theoretical results built using probability theory.
• Estimate the spread based on estimates of 𝑋̄ and ŜE(𝑋̄).
• Understand why bias can mean that larger sample sizes aren’t necessarily better.
Key points
• Because 𝑋̄ is the sum of random draws divided by a constant, the distribution of 𝑋̄ is approximately normal.
• We can convert 𝑋̄ to a standard normal random variable 𝑍:
𝑍 = (𝑋̄ − E(𝑋̄))/SE(𝑋̄)
• The Central Limit Theorem (CLT) still works if 𝑋̄ is used in place of 𝑝. This is called a plug-in
estimate. Hats over values denote estimates. Therefore:
ŜE(𝑋̄) = √(𝑋̄(1 − 𝑋̄)/𝑁)
Using the CLT, the probability that 𝑋̄ is within .01 of the actual value of 𝑝 is:
Pr(𝑍 ≤ .01/√(𝑋̄(1 − 𝑋̄)/𝑁)) − Pr(𝑍 ≤ −.01/√(𝑋̄(1 − 𝑋̄)/𝑁))
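For example, plugging in the textbook's values (assumed here) of 𝑋̄ = 0.48 and 𝑁 = 25, this probability evaluates to the result shown below:
x_hat <- 0.48
se <- sqrt(x_hat * (1 - x_hat) / 25) # plug-in estimate of the standard error
pnorm(0.01 / se) - pnorm(-0.01 / se) # probability X_bar is within .01 of p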
## [1] 0.07971926
Margin of Error
• The margin of error is defined as 2 times the standard error of the estimate 𝑋̄.
• There is about a 95% chance that 𝑋̄ will be within two standard errors of the actual parameter 𝑝.
• We can run Monte Carlo simulations to compare with theoretical results assuming a value of 𝑝.
• In practice, 𝑝 is unknown. We can corroborate theoretical results by running Monte Carlo simulations
with one or several values of 𝑝.
• One practical choice for 𝑝 when modeling is 𝑋̄, the observed value of the estimate in a sample.
Code: Histogram and QQ-plot of Monte Carlo results
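The plotting code below assumes a vector x_hat of simulated sample proportions; a minimal sketch to generate it, assuming p = 0.45, N = 1000, and B = 10,000 replicates as in the textbook:
p <- 0.45
N <- 1000
B <- 10000
x_hat <- replicate(B, {
x <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1-p, p))
mean(x) # sample proportion for one simulated poll
})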
if(!require(gridExtra)) install.packages("gridExtra")
##
## Attaching package: 'gridExtra'
library(gridExtra)
p1 <- data.frame(x_hat = x_hat) %>%
ggplot(aes(x_hat)) +
geom_histogram(binwidth = 0.005, color = "black")
p2 <- data.frame(x_hat = x_hat) %>%
ggplot(aes(sample = x_hat)) +
stat_qq(dparams = list(mean = mean(x_hat), sd = sd(x_hat))) +
geom_abline() +
ylab("X_hat") +
xlab("Theoretical normal")
grid.arrange(p1, p2, nrow=1)
[Figure: histogram of x_hat (left) and QQ-plot of x_hat against the theoretical normal (right); the simulated values are approximately normal, centered near 0.45.]
The Spread
• An extremely large poll would theoretically be able to predict election results almost perfectly.
• These sample sizes are not practical. In addition to cost concerns, polling doesn’t reach everyone in
the population (eventual voters) with equal probability, and it also may include data from outside our
population (people who will not end up voting).
• These systematic errors in polling are called bias. We will learn more about bias in the future.
Code: Plotting margin of error in an extremely large poll over a range of values of p
N <- 100000
p <- seq(0.35, 0.65, length = 100)
SE <- sapply(p, function(x) 2*sqrt(x*(1-x)/N))
data.frame(p = p, SE = SE) %>%
ggplot(aes(p, SE)) +
geom_line()
[Figure: margin of error SE versus p for N = 100,000; the margin of error stays near 0.003 across the whole range of p.]
Calculate the sample average if the proportion of Democrats equals 0.45 and the sample size is 100.
# Write a function called `take_sample` that takes `p` and `N` as arguments and returns the average of `N` random draws
take_sample <- function(p, N) {
x <- sample(c(0,1), size = N, replace = TRUE, prob = c(1-p, p))
return(mean(x))
}
# Define `p` as the proportion of Democrats and `N` as the number of people polled
p <- 0.45
N <- 100
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# Call the `take_sample` function to determine the sample average of `N` randomly selected people from a population containing a proportion of Democrats equal to `p`. Print this value to the console.
take_sample(p, N)
## [1] 0.46
2. Assume the proportion of Democrats in the population 𝑝 equals 0.45 and that your sample size 𝑁 is
100 polled voters.
The take_sample function you defined previously generates our estimate, 𝑋.̄
Replicate the random sampling 10,000 times and calculate 𝑝 − 𝑋̄ for each random sample. Save these
differences as a vector called errors. Find the average of errors and plot a histogram of the distribution.
# The variable `B` specifies the number of times we want the sample to be replicated
B <- 10000
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# Create an object called `errors` that replicates subtracting the result of the `take_sample` function from `p` for `B` replications
errors <- replicate(B, p - take_sample(p, N))
# Calculate the mean of the errors. Print this value to the console.
mean(errors)
## [1] -4.9e-05
hist(errors)
[Figure: histogram of errors, centered at 0 and roughly symmetric.]
3. In the last exercise, you made a vector of differences between the actual value for 𝑝 and an estimate, 𝑋̄.
We called these differences between the actual and estimated values errors.
The errors object has already been loaded for you. Use the hist function to plot a histogram of the values
contained in the vector errors. Which statement best describes the distribution of the errors?
4. In practice, the error is not observed because we do not know the actual proportion of Democratic voters, 𝑝. However, we can describe the size of the error by constructing a simulation.
What is the average size of the error if we define the size by taking the absolute value |𝑝 − 𝑋̄|?
# The variable `B` specifies the number of times we want the sample to be replicated
B <- 10000
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# We generated `errors` by subtracting the estimate from the actual proportion of Democratic voters
errors <- replicate(B, p - take_sample(p, N))
# Calculate the mean of the absolute value of each simulated error. Print this value to the console.
mean(abs(errors))
## [1] 0.039267
5. The standard error is related to the typical size of the error we make when predicting.
We say size because, as we just saw, the errors are centered around 0. In that sense, the typical error is
0. For mathematical reasons related to the central limit theorem, we actually use the standard deviation of
errors rather than the average of the absolute values.
As we have discussed, the standard error is the square root of the average squared distance (𝑋̄ − 𝑝)². The standard deviation is defined as the square root of the average squared distance from the average.
Calculate the standard deviation of the spread.
# Define `p` as the proportion of Democrats in the population being polled
p <- 0.45
# The variable `B` specifies the number of times we want the sample to be replicated
B <- 10000
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# We generated `errors` by subtracting the estimate from the actual proportion of Democratic voters
errors <- replicate(B, p - take_sample(p, N))
# Calculate the standard deviation of `errors`
sd(errors)
## [1] 0.04949939
6. The theory we just learned tells us what this standard deviation is going to be because it is the standard
error of 𝑋.̄
Estimate the standard error given an expected value of 0.45 and a sample size of 100.
# Calculate the theoretical standard error
sqrt(0.45 * (1 - 0.45) / 100)
## [1] 0.04974937
7. In practice, we don’t know 𝑝, so we construct an estimate of the theoretical prediction by plugging in 𝑋̄ for 𝑝. Calculate the standard error of the estimate: ŜE(𝑋̄).
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# Define `X` as a random sample of `N` voters with a probability of picking a Democrat ('1') equal to `p`
X <- sample(c(0,1), size = N, replace = TRUE, prob = c(1-p, p))
# Define `X_bar` as the average sampled proportion
X_bar <- mean(X)
# Calculate the standard error of the estimate. Print the result to the console.
se <- sqrt((X_bar*(1-X_bar)/N))
se
## [1] 0.04983974
8. The standard error estimates obtained from the Monte Carlo simulation, the theoretical prediction, and
the estimate of the theoretical prediction are all very close, which tells us that the theory is working.
This gives us a practical approach to knowing the typical error we will make if we predict 𝑝 with 𝑋̄. The theoretical result gives us an idea of how large a sample size is required to obtain the precision we need.
Earlier we learned that the largest standard errors occur for 𝑝 = 0.5.
Create a plot of the largest standard error for 𝑁 ranging from 100 to 5,000.
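The code that produced the plot was lost in extraction; a sketch consistent with the exercise, using p = 0.5 because that value maximizes the standard error:
p <- 0.5
N <- seq(100, 5000, length.out = 100)
se <- sqrt(p * (1 - p) / N) # largest possible standard error for each sample size
plot(N, se)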
[Figure: largest standard error se versus sample size N.]
Based on this plot, how large does the sample size have to be to have a standard error of about 1%?
□ A. 100
□ B. 500
⊠ C. 2,500
□ D. 4,000
9. For 𝑁 = 100, the central limit theorem tells us that the distribution of 𝑋̄ is…
□ A. practically equal to 𝑝.
⊠ B. approximately normal with expected value 𝑝 and standard error √𝑝(1 − 𝑝)/𝑁 .
□ C. approximately normal with expected value 𝑋̄ and standard error √(𝑋̄(1 − 𝑋̄)/𝑁).
□ D. not a random variable.
10. We calculated a vector errors that contained, for each simulated sample, the difference between the actual value 𝑝 and our estimate 𝑋̄. The central limit theorem tells us that the distribution of these errors is…
□ A. practically equal to 0.
⊠ B. approximately normal with expected value 0 and standard error √𝑝(1 − 𝑝)/𝑁 .
□ C. approximately normal with expected value p and standard error √𝑝(1 − 𝑝)/𝑁 .
□ D. not a random variable.
11. Make a qq-plot of the errors you generated previously to see if they follow a normal distribution.
# The variable `B` specifies the number of times we want the sample to be replicated
B <- 10000
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# Generate `errors` by subtracting the estimate from the actual proportion of Democratic voters
errors <- replicate(B, p - take_sample(p, N))
# Make a qq-plot of the errors
qqnorm(errors)
qqline(errors)
[Figure: normal Q-Q plot of the errors; the points fall close to the line, indicating approximate normality.]
12. If 𝑝 = 0.45 and 𝑁 = 100, use the central limit theorem to estimate the probability that 𝑋̄ > 0.5.
# Calculate the probability that the estimated proportion of Democrats in the population is greater than 0.5. Print this value to the console.
1-pnorm(0.5, p, sqrt(p*(1-p)/N))
## [1] 0.1574393
13. Assume you are in a practical situation and you don’t know 𝑝. Use the estimate of the standard error `se_hat` computed in exercise 7 to calculate the probability that your estimate is off by more than 0.01 in either direction.
# Calculate the probability that the error is 0.01 or larger
1-pnorm(0.01,0,se_hat) + pnorm(-0.01,0,se_hat)
## [1] 0.8414493
Section 3 Overview
In Section 3, you will look at confidence intervals and p-values.
After completing Section 3, you will be able to calculate confidence intervals and interpret p-values.
Confidence Intervals
• We can use statistical theory to compute the probability that a given interval contains the true pa-
rameter 𝑝.
• 95% confidence intervals are intervals constructed to have a 95% chance of including 𝑝. The margin of
error is approximately a 95% confidence interval.
• The start and end of these confidence intervals are random variables.
• To calculate any size confidence interval, we need to calculate the value 𝑧 for which Pr(−𝑧 ≤ 𝑍 ≤
𝑧) equals the desired confidence. For example, a 99% confidence interval requires calculating 𝑧 for
Pr(−𝑧 ≤ 𝑍 ≤ 𝑧) = 0.99.
• For a confidence interval of size 𝑞, we solve for 𝑧 using the quantile 1 − (1 − 𝑞)/2, i.e. z <- qnorm(1 - (1 - q)/2).
• To determine a 95% confidence interval, use z <- qnorm(0.975). This value is slightly smaller than
2 times the standard error.
data("nhtemp")
data.frame(year = as.numeric(time(nhtemp)), temperature = as.numeric(nhtemp)) %>%
ggplot(aes(year, temperature)) +
geom_point() +
geom_smooth() +
ggtitle("Average Yearly Temperatures in New Haven")
[Figure: Average Yearly Temperatures in New Haven, scatterplot with a smooth trend; temperatures range roughly from 49 to 53 degrees.]
p <- 0.45
N <- 1000
X <- sample(c(0,1), size = N, replace = TRUE, prob = c(1-p, p)) # generate N observations
X_hat <- mean(X) # calculate X_hat
SE_hat <- sqrt(X_hat*(1-X_hat)/N) # calculate SE_hat, SE of the mean of N observations
c(X_hat - 2*SE_hat, X_hat + 2*SE_hat) # build interval of 2*SE above and below mean
z <- qnorm(0.995) # calculate z for a 99% confidence interval
pnorm(qnorm(0.995)) # demonstrating that qnorm gives the z value for a given probability
## [1] 0.995
pnorm(qnorm(1 - 0.995)) # demonstrating symmetry of the normal distribution
## [1] 0.005
pnorm(z) - pnorm(-z) # demonstrating that this z value gives correct probability for interval
## [1] 0.99
• We can run a Monte Carlo simulation to confirm that a 95% confidence interval contains the true value
of 𝑝 95% of the time.
• A plot of confidence intervals from this simulation demonstrates that most intervals include 𝑝, but
roughly 5% of intervals miss the true value of 𝑝.
B <- 10000
inside <- replicate(B, {
X <- sample(c(0,1), size = N, replace = TRUE, prob = c(1-p, p))
X_hat <- mean(X)
SE_hat <- sqrt(X_hat*(1-X_hat)/N)
between(p, X_hat - 2*SE_hat, X_hat + 2*SE_hat) # TRUE if p in confidence interval
})
mean(inside)
## [1] 0.9566
Power
• If we are trying to predict the result of an election, then a confidence interval that includes a spread
of 0 (a tie) is not helpful.
• A confidence interval that includes a spread of 0 does not imply a close election; it means the sample size is too small.
• Power is the probability of detecting an effect when there is a true effect to find. Power increases as
sample size increases, because larger sample size means smaller standard error.
N <- 25
X_hat <- 0.48
(2*X_hat - 1) + c(-2, 2)*2*sqrt(X_hat*(1-X_hat)/N)
p-Values
• The null hypothesis is the hypothesis that there is no effect. In this case, the null hypothesis is that
the spread is 0, or 𝑝 = 0.5.
• The p-value is the probability of detecting an effect of a certain size or larger when the null hypothesis
is true.
• We can convert the probability of seeing an observed value under the null hypothesis into a standard
normal random variable. We compute the value of 𝑧 that corresponds to the observed result, and then
use that 𝑧 to compute the p-value.
• If a 95% confidence interval does not include our observed value, then the p-value must be smaller
than 0.05.
• It is preferable to report confidence intervals instead of p-values, as confidence intervals give information
about the size of the estimate and p-values do not.
N <- 100
z <- sqrt(N) * 0.02 / 0.5 # z-score for an observed spread of 0.02 under the null p = 0.5
1 - (pnorm(z) - pnorm(-z))
## [1] 0.6891565
The p-value is the probability of observing a value as extreme or more extreme than the result given that
the null hypothesis is true.
In the context of the normal distribution, this refers to the probability of observing a Z-score whose absolute
value is as high or higher than the Z-score of interest.
Suppose we want to find the p-value of an observation 2 standard deviations larger than the mean. This
means we are looking for anything with |𝑧| ≥ 2.
Figure 1: Standard normal distribution (centered at z = 0 with a standard deviation of 1)
Graphically, the p-value gives the probability of an observation that’s at least as far away from the mean or farther. The plot shows a standard normal distribution (centered at 𝑧 = 0 with a standard deviation of 1). The shaded tails are the regions of the graph that are 2 or more standard deviations away from the mean; their area is the proportion of the area under the curve with z-scores as extreme or more extreme than the given value, which is the p-value.
The right tail can be found with 1-pnorm(2). We want to have both tails, though, because we want to find
the probability of any observation as far away from the mean or farther, in either direction. (This is what’s
meant by a two-tailed p-value.) Because the distribution is symmetrical, the right and left tails are the same
size and we know that our desired value is just 2*(1-pnorm(2)).
Recall that, by default, pnorm() gives the CDF for a normal distribution with a mean of 𝜇 = 0 and standard
deviation of 𝜎 = 1. To find p-values for a given z-score z in a normal distribution with mean mu and standard
deviation sigma, use 2*(1-pnorm(z, mu, sigma)) instead.
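For example, the two-tailed p-value for an observation 2 standard deviations above the mean:
2 * (1 - pnorm(2)) # two-tailed p-value for |z| >= 2
## [1] 0.04550026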
1. For the following exercises, we will use actual poll data from the 2016 election.
The exercises will contain pre-loaded data from the dslabs package.
library(dslabs)
data("polls_us_election_2016")
We will use all the national polls that ended within a few weeks before the election.
Assume there are only two candidates and construct a 95% confidence interval for the election night propor-
tion 𝑝.
# Generate an object `polls` that contains data filtered for polls that ended on or after October 31, 2016 in the United States
polls <- filter(polls_us_election_2016, enddate >= "2016-10-31" & state == "U.S.")
# How many rows does `polls` contain? Print this value to the console.
nrow(polls)
## [1] 70
# Assign the sample size of the first poll in `polls` to a variable called `N`. Print this value to the console.
N <- head(polls$samplesize,1)
N
## [1] 2220
# For the first poll in `polls`, assign the estimated percentage of Clinton voters to a variable called `X_hat`
X_hat <- (head(polls$rawpoll_clinton,1)/100)
X_hat
## [1] 0.47
# Calculate the standard error of `X_hat` and save it to a variable called `se_hat`. Print this value to the console.
se_hat <- sqrt(X_hat*(1-X_hat)/N)
se_hat
## [1] 0.01059279
# Use `qnorm` to calculate the 95% confidence interval for the proportion of Clinton voters. Save the lower and then the upper confidence interval to a variable called `ci`.
ci <- c(X_hat - qnorm(0.975)*se_hat, X_hat + qnorm(0.975)*se_hat)
ci
2. Create a new object called pollster_results that contains the pollster’s name, the end date of the
poll, the proportion of voters who declared a vote for Clinton, the standard error of this estimate, and
the lower and upper bounds of the confidence interval for the estimate.
# The `polls` object that filtered all the data by date and nation has already been loaded. Examine it using the `head` function.
head(polls)
## 4 YouGov B 3677
## 5 Gravis Marketing B- 16639
## 6 Fox News/Anderson Robbins Research/Shaw & Company Research A 1295
## population rawpoll_clinton rawpoll_trump rawpoll_johnson rawpoll_mcmullin
## 1 lv 47.00 43.00 4.00 NA
## 2 lv 38.03 35.69 5.46 NA
## 3 lv 42.00 39.00 6.00 NA
## 4 lv 45.00 41.00 5.00 NA
## 5 rv 47.00 43.00 3.00 NA
## 6 lv 48.00 44.00 3.00 NA
## adjpoll_clinton adjpoll_trump adjpoll_johnson adjpoll_mcmullin
## 1 45.20163 41.72430 4.626221 NA
## 2 43.34557 41.21439 5.175792 NA
## 3 42.02638 38.81620 6.844734 NA
## 4 45.65676 40.92004 6.069454 NA
## 5 46.84089 42.33184 3.726098 NA
## 6 49.02208 43.95631 3.057876 NA
# Create a new object called `pollster_results` that contains columns for pollster name, end date, X_hat, se_hat, lower, and upper
polls <- mutate(polls, X_hat = polls$rawpoll_clinton/100, se_hat = sqrt(X_hat*(1-X_hat)/polls$samplesize), lower = X_hat - qnorm(0.975)*se_hat, upper = X_hat + qnorm(0.975)*se_hat)
pollster_results <- select(polls, pollster, enddate, X_hat, se_hat, lower, upper)
pollster_results
## 31 IBD/TIPP 2016-11-03 0.4440
## 32 IBD/TIPP 2016-11-05 0.4300
## 33 ABC News/Washington Post 2016-11-02 0.4700
## 34 ABC News/Washington Post 2016-11-01 0.4700
## 35 ABC News/Washington Post 2016-10-31 0.4600
## 36 Ipsos 2016-11-03 0.4320
## 37 IBD/TIPP 2016-11-04 0.4420
## 38 YouGov 2016-11-01 0.4600
## 39 IBD/TIPP 2016-10-31 0.4460
## 40 Ipsos 2016-11-02 0.4550
## 41 Rasmussen Reports/Pulse Opinion Research 2016-11-03 0.4400
## 42 The Times-Picayune/Lucid 2016-11-06 0.4500
## 43 Ipsos 2016-11-01 0.4470
## 44 IBD/TIPP 2016-11-02 0.4400
## 45 IBD/TIPP 2016-11-01 0.4400
## 46 Rasmussen Reports/Pulse Opinion Research 2016-11-02 0.4200
## 47 Ipsos 2016-10-31 0.4400
## 48 The Times-Picayune/Lucid 2016-11-05 0.4500
## 49 Rasmussen Reports/Pulse Opinion Research 2016-10-31 0.4400
## 50 Google Consumer Surveys 2016-10-31 0.3769
## 51 CVOTER International 2016-11-05 0.4925
## 52 Rasmussen Reports/Pulse Opinion Research 2016-11-01 0.4400
## 53 CVOTER International 2016-11-04 0.4906
## 54 The Times-Picayune/Lucid 2016-11-04 0.4500
## 55 USC Dornsife/LA Times 2016-11-06 0.4323
## 56 CVOTER International 2016-11-03 0.4853
## 57 The Times-Picayune/Lucid 2016-11-03 0.4400
## 58 USC Dornsife/LA Times 2016-11-05 0.4263
## 59 CVOTER International 2016-11-02 0.4878
## 60 USC Dornsife/LA Times 2016-11-04 0.4256
## 61 CVOTER International 2016-11-01 0.4881
## 62 The Times-Picayune/Lucid 2016-11-02 0.4400
## 63 Gravis Marketing 2016-10-31 0.4600
## 64 USC Dornsife/LA Times 2016-11-03 0.4338
## 65 The Times-Picayune/Lucid 2016-11-01 0.4300
## 66 USC Dornsife/LA Times 2016-11-02 0.4247
## 67 Gravis Marketing 2016-11-02 0.4700
## 68 USC Dornsife/LA Times 2016-11-01 0.4236
## 69 The Times-Picayune/Lucid 2016-10-31 0.4200
## 70 USC Dornsife/LA Times 2016-10-31 0.4328
## se_hat lower upper
## 1 0.010592790 0.4492385 0.4907615
## 2 0.002978005 0.3744632 0.3861368
## 3 0.010534681 0.3993524 0.4406476
## 4 0.008204286 0.4339199 0.4660801
## 5 0.003869218 0.4624165 0.4775835
## 6 0.013883131 0.4527896 0.5072104
## 7 0.013174309 0.4241788 0.4758212
## 8 0.013863610 0.4128278 0.4671722
## 9 0.014793245 0.3830058 0.4409942
## 10 0.017560908 0.4055813 0.4744187
## 11 0.014725994 0.4511376 0.5088624
## 12 0.018281811 0.4641683 0.5358317
## 13 0.016190357 0.4082675 0.4717325
## 14 0.009908346 0.4305800 0.4694200
## 15 0.009096403 0.4182714 0.4539286
## 16 0.015722570 0.4451843 0.5068157
## 17 0.012400526 0.4647954 0.5134046
## 18 0.012923005 0.4246714 0.4753286
## 19 0.001883809 0.4663078 0.4736922
## 20 0.012845233 0.4248238 0.4751762
## 21 0.016304940 0.4580429 0.5219571
## 22 0.010413043 0.4165908 0.4574092
## 23 0.014966841 0.4256655 0.4843345
## 24 0.016339838 0.4179745 0.4820255
## 25 0.011340235 0.4477735 0.4922265
## 26 0.010451057 0.4095163 0.4504837
## 27 0.012170890 0.4561455 0.5038545
## 28 0.001704722 0.4256588 0.4323412
## 29 0.015337369 0.3769393 0.4370607
## 30 0.013249383 0.4440317 0.4959683
## 31 0.016580236 0.4115033 0.4764967
## 32 0.016475089 0.3977094 0.4622906
## 33 0.014711237 0.4411665 0.4988335
## 34 0.014610041 0.4413648 0.4986352
## 35 0.014496630 0.4315871 0.4884129
## 36 0.011018764 0.4104036 0.4535964
## 37 0.017514599 0.4076720 0.4763280
## 38 0.014193655 0.4321809 0.4878191
## 39 0.015579317 0.4154651 0.4765349
## 40 0.011358664 0.4327374 0.4772626
## 41 0.012816656 0.4148798 0.4651202
## 42 0.009786814 0.4308182 0.4691818
## 43 0.011810940 0.4238510 0.4701490
## 44 0.016858185 0.4069586 0.4730414
## 45 0.016907006 0.4068629 0.4731371
## 46 0.012743626 0.3950230 0.4449770
## 47 0.012973281 0.4145728 0.4654272
## 48 0.009898535 0.4305992 0.4694008
## 49 0.012816656 0.4148798 0.4651202
## 50 0.003107749 0.3708089 0.3829911
## 51 0.012609413 0.4677860 0.5172140
## 52 0.012816656 0.4148798 0.4651202
## 53 0.012998976 0.4651225 0.5160775
## 54 0.009830663 0.4307323 0.4692677
## 55 0.009144248 0.4143776 0.4502224
## 56 0.013381202 0.4590733 0.5115267
## 57 0.009684792 0.4210182 0.4589818
## 58 0.009047108 0.4085680 0.4440320
## 59 0.013711286 0.4609264 0.5146736
## 60 0.009046705 0.4078688 0.4433312
## 61 0.013441133 0.4617559 0.5144441
## 62 0.009697721 0.4209928 0.4590072
## 63 0.006807590 0.4466574 0.4733426
## 64 0.009106200 0.4159522 0.4516478
## 65 0.009677647 0.4110322 0.4489678
## 66 0.009119319 0.4068265 0.4425735
## 67 0.010114336 0.4501763 0.4898237
## 68 0.009015504 0.4059299 0.4412701
## 69 0.009679479 0.4010286 0.4389714
## 70 0.008834895 0.4154839 0.4501161
3. The final tally for the popular vote was Clinton 48.2% and Trump 46.1%. Add a column called hit
to pollster_results that states if the confidence interval included the true proportion 𝑝 = 0.482 or
not. What proportion of confidence intervals included 𝑝?
# The `pollster_results` object has already been loaded. Examine it using the `head` function.
head(pollster_results)
# Add a logical variable called `hit` that indicates whether the actual value exists within the confidence interval of each poll, then calculate the proportion of hits
avg_hit <- pollster_results %>% mutate(hit=(lower<0.482 & upper>0.482)) %>% summarize(mean(hit))
avg_hit
## mean(hit)
## 1 0.3142857
4. If these confidence intervals are constructed correctly, and the theory holds up, what proportion of
confidence intervals should include 𝑝?
□ A. 0.05
□ B. 0.31
□ C. 0.50
⊠ D. 0.95
5. A much smaller proportion of the polls than expected produce confidence intervals containing 𝑝.
Notice that most polls that fail to include 𝑝 are underestimating. The rationale for this is that undecided
voters historically divide evenly between the two main candidates on election day.
In this case, it is more informative to estimate the spread or the difference between the proportion of two
candidates 𝑑, or 0.482 − 0.461 = 0.021 for this election.
Assume that there are only two parties and that 𝑑 = 2𝑝 − 1. Construct a 95% confidence interval for the difference in proportions on election night.
# Add a statement to this line of code that will add a new column named `d_hat` to `polls`. The new column should contain the difference in the proportions of voters.
polls <- polls_us_election_2016 %>% filter(enddate >= "2016-10-31" & state == "U.S.") %>%
mutate(d_hat = rawpoll_clinton/100 - rawpoll_trump/100)
# Assign the sample size of the first poll in `polls` to a variable called `N`. Print this value to the console.
N <- polls$samplesize[1]
N
## [1] 2220
# Assign the difference `d_hat` of the first poll in `polls` to a variable called `d_hat`. Print this value to the console.
d_hat <- polls$d_hat[1]
d_hat
## [1] 0.04
# Assign the proportion of Clinton voters `X_hat` based on `d_hat`. Print this value to the console.
X_hat <- (d_hat + 1) / 2
X_hat
## [1] 0.52
# Calculate the standard error of the spread and save it to a variable called `se_hat`. Print this value to the console.
se_hat <- 2*sqrt(X_hat*(1-X_hat)/N)
se_hat
## [1] 0.02120683
# Use `qnorm` to calculate the 95% confidence interval for the difference in the proportions of voters. Save the lower and then the upper confidence interval to a variable called `ci`.
ci <- c(d_hat - qnorm(0.975)*se_hat, d_hat + qnorm(0.975)*se_hat)
ci
6. Create a new object called pollster_results that contains the pollster’s name, the end date of the poll, the difference in the proportion of voters who declared a vote for either candidate, and the lower and upper bounds of the confidence interval for the estimate.
# The subset `polls` data with 'd_hat' already calculated has been loaded. Examine it using the `head` function.
head(polls)
## 2 Google Consumer Surveys B 26574
## 3 Ipsos A- 2195
## 4 YouGov B 3677
## 5 Gravis Marketing B- 16639
## 6 Fox News/Anderson Robbins Research/Shaw & Company Research A 1295
## population rawpoll_clinton rawpoll_trump rawpoll_johnson rawpoll_mcmullin
## 1 lv 47.00 43.00 4.00 NA
## 2 lv 38.03 35.69 5.46 NA
## 3 lv 42.00 39.00 6.00 NA
## 4 lv 45.00 41.00 5.00 NA
## 5 rv 47.00 43.00 3.00 NA
## 6 lv 48.00 44.00 3.00 NA
## adjpoll_clinton adjpoll_trump adjpoll_johnson adjpoll_mcmullin d_hat
## 1 45.20163 41.72430 4.626221 NA 0.0400
## 2 43.34557 41.21439 5.175792 NA 0.0234
## 3 42.02638 38.81620 6.844734 NA 0.0300
## 4 45.65676 40.92004 6.069454 NA 0.0400
## 5 46.84089 42.33184 3.726098 NA 0.0400
## 6 49.02208 43.95631 3.057876 NA 0.0400
# Create a new object called `pollster_results` that contains columns for pollster name, end date, d_hat, lower, and upper
d_hat = polls$rawpoll_clinton/100 - polls$rawpoll_trump/100
X_hat = (d_hat + 1) / 2
polls <- mutate(polls, X_hat, se_hat = 2 * sqrt(X_hat * (1 - X_hat) / samplesize), lower = d_hat - qnorm(0.975)*se_hat, upper = d_hat + qnorm(0.975)*se_hat)
pollster_results <- select(polls, pollster, enddate, d_hat, lower, upper)
pollster_results
## pollster enddate
## 1 ABC News/Washington Post 2016-11-06
## 2 Google Consumer Surveys 2016-11-07
## 3 Ipsos 2016-11-06
## 4 YouGov 2016-11-07
## 5 Gravis Marketing 2016-11-06
## 6 Fox News/Anderson Robbins Research/Shaw & Company Research 2016-11-06
## 7 CBS News/New York Times 2016-11-06
## 8 NBC News/Wall Street Journal 2016-11-05
## 9 IBD/TIPP 2016-11-07
## 10 Selzer & Company 2016-11-06
## 11 Angus Reid Global 2016-11-04
## 12 Monmouth University 2016-11-06
## 13 Marist College 2016-11-03
## 14 The Times-Picayune/Lucid 2016-11-07
## 15 USC Dornsife/LA Times 2016-11-07
## 16 RKM Research and Communications, Inc. 2016-11-05
## 17 CVOTER International 2016-11-06
## 18 Morning Consult 2016-11-05
## 19 SurveyMonkey 2016-11-06
## 20 Rasmussen Reports/Pulse Opinion Research 2016-11-06
## 21 Insights West 2016-11-07
## 22 RAND (American Life Panel) 2016-11-01
## 23 Fox News/Anderson Robbins Research/Shaw & Company Research 2016-11-03
## 24 CBS News/New York Times 2016-11-01
## 25 ABC News/Washington Post 2016-11-05
## 26 Ipsos 2016-11-04
## 27 ABC News/Washington Post 2016-11-04
## 28 YouGov 2016-11-06
## 29 IBD/TIPP 2016-11-06
## 30 ABC News/Washington Post 2016-11-03
## 31 IBD/TIPP 2016-11-03
## 32 IBD/TIPP 2016-11-05
## 33 ABC News/Washington Post 2016-11-02
## 34 ABC News/Washington Post 2016-11-01
## 35 ABC News/Washington Post 2016-10-31
## 36 Ipsos 2016-11-03
## 37 IBD/TIPP 2016-11-04
## 38 YouGov 2016-11-01
## 39 IBD/TIPP 2016-10-31
## 40 Ipsos 2016-11-02
## 41 Rasmussen Reports/Pulse Opinion Research 2016-11-03
## 42 The Times-Picayune/Lucid 2016-11-06
## 43 Ipsos 2016-11-01
## 44 IBD/TIPP 2016-11-02
## 45 IBD/TIPP 2016-11-01
## 46 Rasmussen Reports/Pulse Opinion Research 2016-11-02
## 47 Ipsos 2016-10-31
## 48 The Times-Picayune/Lucid 2016-11-05
## 49 Rasmussen Reports/Pulse Opinion Research 2016-10-31
## 50 Google Consumer Surveys 2016-10-31
## 51 CVOTER International 2016-11-05
## 52 Rasmussen Reports/Pulse Opinion Research 2016-11-01
## 53 CVOTER International 2016-11-04
## 54 The Times-Picayune/Lucid 2016-11-04
## 55 USC Dornsife/LA Times 2016-11-06
## 56 CVOTER International 2016-11-03
## 57 The Times-Picayune/Lucid 2016-11-03
## 58 USC Dornsife/LA Times 2016-11-05
## 59 CVOTER International 2016-11-02
## 60 USC Dornsife/LA Times 2016-11-04
## 61 CVOTER International 2016-11-01
## 62 The Times-Picayune/Lucid 2016-11-02
## 63 Gravis Marketing 2016-10-31
## 64 USC Dornsife/LA Times 2016-11-03
## 65 The Times-Picayune/Lucid 2016-11-01
## 66 USC Dornsife/LA Times 2016-11-02
## 67 Gravis Marketing 2016-11-02
## 68 USC Dornsife/LA Times 2016-11-01
## 69 The Times-Picayune/Lucid 2016-10-31
## 70 USC Dornsife/LA Times 2016-10-31
## d_hat lower upper
## 1 0.0400 -0.001564627 0.0815646272
## 2 0.0234 0.011380104 0.0354198955
## 3 0.0300 -0.011815309 0.0718153088
## 4 0.0400 0.007703641 0.0722963589
## 5 0.0400 0.024817728 0.0551822719
## 6 0.0400 -0.014420872 0.0944208716
## 7 0.0400 -0.011860967 0.0918609675
## 8 0.0400 -0.014696100 0.0946961005
## 9 -0.0150 -0.073901373 0.0439013728
## 10 0.0300 -0.039307332 0.0993073320
## 11 0.0400 -0.017724837 0.0977248370
## 12 0.0600 -0.011534270 0.1315342703
## 13 0.0100 -0.053923780 0.0739237800
## 14 0.0500 0.011013152 0.0889868476
## 15 -0.0323 -0.068233293 0.0036332933
## 16 0.0320 -0.029670864 0.0936708643
## 17 0.0278 -0.020801931 0.0764019309
## 18 0.0300 -0.020889533 0.0808895334
## 19 0.0600 0.052615604 0.0673843956
## 20 0.0200 -0.030595930 0.0705959303
## 21 0.0400 -0.023875814 0.1038758144
## 22 0.0910 0.050024415 0.1319755848
## 23 0.0150 -0.043901373 0.0739013728
## 24 0.0300 -0.034344689 0.0943446886
## 25 0.0400 -0.004497495 0.0844974954
## 26 0.0400 -0.001341759 0.0813417590
## 27 0.0500 0.002312496 0.0976875039
## 28 0.0390 0.032254341 0.0457456589
## 29 -0.0240 -0.085171524 0.0371715236
## 30 0.0400 -0.011988727 0.0919887265
## 31 0.0050 -0.060404028 0.0704040277
## 32 -0.0100 -0.075220256 0.0552202561
## 33 0.0300 -0.027745070 0.0877450695
## 34 0.0200 -0.037362198 0.0773621983
## 35 0.0000 -0.057008466 0.0570084657
## 36 0.0490 0.005454535 0.0925454654
## 37 0.0050 -0.064121736 0.0741217361
## 38 0.0300 -0.025791885 0.0857918847
## 39 0.0090 -0.052426619 0.0704266191
## 40 0.0820 0.037443982 0.1265560180
## 41 0.0000 -0.050606052 0.0506060525
## 42 0.0500 0.011491350 0.0885086495
## 43 0.0730 0.026563876 0.1194361238
## 44 -0.0010 -0.067563833 0.0655638334
## 45 -0.0040 -0.070756104 0.0627561042
## 46 -0.0300 -0.080583275 0.0205832746
## 47 0.0680 0.016894089 0.1191059111
## 48 0.0500 0.011051757 0.0889482429
## 49 0.0000 -0.050606052 0.0506060525
## 50 0.0262 0.013635277 0.0387647229
## 51 0.0333 -0.016106137 0.0827061365
## 52 0.0000 -0.050606052 0.0506060525
## 53 0.0124 -0.038560140 0.0633601401
## 54 0.0600 0.021340150 0.0986598497
## 55 -0.0475 -0.083637121 -0.0113628793
## 56 0.0009 -0.051576011 0.0533760106
## 57 0.0500 0.011807815 0.0881921851
## 58 -0.0553 -0.091100799 -0.0194992008
## 59 0.0056 -0.048162418 0.0593624177
## 60 -0.0540 -0.089809343 -0.0181906570
## 61 0.0067 -0.046002019 0.0594020190
## 62 0.0500 0.011756829 0.0882431712
## 63 0.0100 -0.016769729 0.0367697294
## 64 -0.0351 -0.071090499 0.0008904993
## 65 0.0300 -0.008295761 0.0682957614
## 66 -0.0503 -0.086413709 -0.0141862908
## 67 0.0200 -0.019711083 0.0597110831
## 68 -0.0547 -0.090406512 -0.0189934879
## 69 0.0200 -0.018430368 0.0584303678
## 70 -0.0362 -0.071126335 -0.0012736645
7. What proportion of confidence intervals for the difference between the proportions of voters included 𝑑, the actual difference on election day?
# The `pollster_results` object has already been loaded. Examine it using the `head` function.
head(pollster_results)
# Add a logical variable called `hit` that indicates whether the actual value (0.021) exists within the confidence interval of each poll, then calculate the proportion of hits
avg_hit <- pollster_results %>% mutate(hit=(lower<0.021 & upper>0.021)) %>% summarize(mean(hit))
avg_hit
## mean(hit)
## 1 0.7714286
8. Although the proportion of confidence intervals that include the actual difference between the proportions of voters increases substantially, it is still lower than 0.95.
In the next chapter, we learn the reason for this. To motivate our next exercises, calculate the difference between each poll’s estimate 𝑑̂ and the actual 𝑑 = 0.021. Stratify this difference, or error, by pollster in a plot.
# The `polls` object has already been loaded. Examine it using the `head` function.
head(polls)
## 5 U.S. 2016-11-03 2016-11-06
## 6 U.S. 2016-11-03 2016-11-06
## pollster grade samplesize
## 1 ABC News/Washington Post A+ 2220
## 2 Google Consumer Surveys B 26574
## 3 Ipsos A- 2195
## 4 YouGov B 3677
## 5 Gravis Marketing B- 16639
## 6 Fox News/Anderson Robbins Research/Shaw & Company Research A 1295
## population rawpoll_clinton rawpoll_trump rawpoll_johnson rawpoll_mcmullin
## 1 lv 47.00 43.00 4.00 NA
## 2 lv 38.03 35.69 5.46 NA
## 3 lv 42.00 39.00 6.00 NA
## 4 lv 45.00 41.00 5.00 NA
## 5 rv 47.00 43.00 3.00 NA
## 6 lv 48.00 44.00 3.00 NA
## adjpoll_clinton adjpoll_trump adjpoll_johnson adjpoll_mcmullin d_hat X_hat
## 1 45.20163 41.72430 4.626221 NA 0.0400 0.5200
## 2 43.34557 41.21439 5.175792 NA 0.0234 0.5117
## 3 42.02638 38.81620 6.844734 NA 0.0300 0.5150
## 4 45.65676 40.92004 6.069454 NA 0.0400 0.5200
## 5 46.84089 42.33184 3.726098 NA 0.0400 0.5200
## 6 49.02208 43.95631 3.057876 NA 0.0400 0.5200
## se_hat lower upper
## 1 0.021206832 -0.001564627 0.08156463
## 2 0.006132712 0.011380104 0.03541990
## 3 0.021334733 -0.011815309 0.07181531
## 4 0.016478037 0.007703641 0.07229636
## 5 0.007746199 0.024817728 0.05518227
## 6 0.027766261 -0.014420872 0.09442087
# Add a variable called `error` to the object `polls` that contains the difference between d_hat and the actual difference on election day (0.021)
error <- polls$d_hat - 0.021
polls <- mutate(polls, error)
polls %>% ggplot(aes(x = pollster, y = error)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
[Figure: poll error (d_hat − 0.021) by pollster, with pollster names on the x-axis.]
9. Remake the plot you made for the previous exercise, but only for pollsters that took five or more polls.
You can use dplyr tools group_by and n to group data by a variable of interest and then count the number
of observations in the groups. The function filter filters data piped into it by your specified condition.
For example:
# The `polls` object has already been loaded. Examine it using the `head` function.
head(polls)
## 6 Fox News/Anderson Robbins Research/Shaw & Company Research A 1295
## population rawpoll_clinton rawpoll_trump rawpoll_johnson rawpoll_mcmullin
## 1 lv 47.00 43.00 4.00 NA
## 2 lv 38.03 35.69 5.46 NA
## 3 lv 42.00 39.00 6.00 NA
## 4 lv 45.00 41.00 5.00 NA
## 5 rv 47.00 43.00 3.00 NA
## 6 lv 48.00 44.00 3.00 NA
## adjpoll_clinton adjpoll_trump adjpoll_johnson adjpoll_mcmullin d_hat X_hat
## 1 45.20163 41.72430 4.626221 NA 0.0400 0.5200
## 2 43.34557 41.21439 5.175792 NA 0.0234 0.5117
## 3 42.02638 38.81620 6.844734 NA 0.0300 0.5150
## 4 45.65676 40.92004 6.069454 NA 0.0400 0.5200
## 5 46.84089 42.33184 3.726098 NA 0.0400 0.5200
## 6 49.02208 43.95631 3.057876 NA 0.0400 0.5200
## se_hat lower upper error
## 1 0.021206832 -0.001564627 0.08156463 0.0190
## 2 0.006132712 0.011380104 0.03541990 0.0024
## 3 0.021334733 -0.011815309 0.07181531 0.0090
## 4 0.016478037 0.007703641 0.07229636 0.0190
## 5 0.007746199 0.024817728 0.05518227 0.0190
## 6 0.027766261 -0.014420872 0.09442087 0.0190
# Add a variable called `error` to the object `polls` that contains the difference between d_hat and the actual difference on election day (0.021)
error <- polls$d_hat - 0.021
polls <- mutate(polls, error)
polls %>% group_by(pollster) %>% filter(n() >= 5)
## # A tibble: 48 x 21
## # Groups: pollster [7]
## state startdate enddate pollster grade samplesize population
## <fct> <date> <date> <fct> <fct> <int> <chr>
## 1 U.S. 2016-11-03 2016-11-06 ABC New~ A+ 2220 lv
## 2 U.S. 2016-11-02 2016-11-06 Ipsos A- 2195 lv
## 3 U.S. 2016-11-04 2016-11-07 IBD/TIPP A- 1107 lv
## 4 U.S. 2016-11-05 2016-11-07 The Tim~ <NA> 2521 lv
## 5 U.S. 2016-11-01 2016-11-07 USC Dor~ <NA> 2972 lv
## 6 U.S. 2016-10-31 2016-11-06 CVOTER ~ C+ 1625 lv
## 7 U.S. 2016-11-02 2016-11-06 Rasmuss~ C+ 1500 lv
## 8 U.S. 2016-11-02 2016-11-05 ABC New~ A+ 1937 lv
## 9 U.S. 2016-10-31 2016-11-04 Ipsos A- 2244 lv
## 10 U.S. 2016-11-01 2016-11-04 ABC New~ A+ 1685 lv
## # ... with 38 more rows, and 14 more variables: rawpoll_clinton <dbl>,
## # rawpoll_trump <dbl>, rawpoll_johnson <dbl>, rawpoll_mcmullin <dbl>,
## # adjpoll_clinton <dbl>, adjpoll_trump <dbl>, adjpoll_johnson <dbl>,
## # adjpoll_mcmullin <dbl>, d_hat <dbl>, X_hat <dbl>, se_hat <dbl>,
## # lower <dbl>, upper <dbl>, error <dbl>
[Figure: poll error by pollster, restricted to pollsters with five or more polls.]
Section 4 Overview
In Section 4, you will look at statistical models in the context of election polling and forecasting.
After completing Section 4, you will be able to:
• Understand how aggregating data from different sources, as poll aggregators do for poll data, can
improve the precision of a prediction.
• Understand how to fit a multilevel model to the data to forecast, for example, election results.
• Explain why a simple aggregation of data is insufficient to combine results because of factors such as
pollster bias.
• Use a data-driven model to account for additional types of sampling variability such as pollster-to-
pollster variability.
Poll Aggregators
• Poll aggregators combine the results of many polls to simulate polls with a large sample size and
therefore generate more precise estimates than individual polls.
• Polls can be simulated with a Monte Carlo simulation and used to construct an estimate of the spread
and confidence intervals.
• The actual data science exercise of forecasting elections involves more complex statistical modeling,
but these underlying ideas still apply.
d <- 0.039
Ns <- c(1298, 533, 1342, 897, 774, 254, 812, 324, 1291, 1056, 2172, 516)
p <- (d+1)/2
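The code between the definitions above and the outputs below was lost in extraction; a sketch following the textbook's aggregation approach (the exact printed values depend on the random draws):
# simulate one poll of size N for each sample size in Ns
confidence_intervals <- sapply(Ns, function(N) {
X <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - p, p))
X_hat <- mean(X)
SE_hat <- sqrt(X_hat * (1 - X_hat) / N)
2 * c(X_hat, X_hat - 2 * SE_hat, X_hat + 2 * SE_hat) - 1 # spread estimate and interval
})
polls <- data.frame(poll = 1:ncol(confidence_intervals),
t(confidence_intervals), sample_size = Ns)
names(polls) <- c("poll", "estimate", "low", "high", "sample_size")
# aggregate the polls: average of the estimates, weighted by sample size
d_hat <- polls %>%
summarize(avg = sum(estimate * sample_size) / sum(sample_size)) %>%
pull(avg)
p_hat <- (d_hat + 1) / 2
moe <- 2 * 1.96 * sqrt(p_hat * (1 - p_hat) / sum(polls$sample_size))
round(d_hat * 100, 1)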
## [1] 3.6
round(moe*100, 1)
## [1] 1.8
• Different poll aggregators generate different models of election results from the same poll data. This
is because they use different statistical models.
• We will use actual polling data about the popular vote from the 2016 US presidential election to learn
the principles of statistical modeling.
• We analyze real 2016 US polling data organized by FiveThirtyEight. We start by using reliable national
polls taken within the week before the election to generate an urn model.
• Consider 𝑝 the proportion voting for Clinton and 1 − 𝑝 the proportion voting for Trump. We are
interested in the spread 𝑑 = 2𝑝 − 1.
• Poll results are approximately normal random variables with expected value equal to the spread 𝑑 and standard error 2√(𝑝(1 − 𝑝)/𝑁).
• Our initial estimate of the spread did not include the actual spread. Part of the reason is that different
pollsters have different numbers of polls in our dataset, and each pollster has a bias.
• Pollster bias reflects the fact that repeated polls by a given pollster have an expected value different
from the actual spread and different from other pollsters. Each pollster has a different bias.
• The urn model does not account for pollster bias. We will develop a more flexible data-driven model
that can account for effects like bias.
names(polls_us_election_2016)
# keep only national polls from week before election with a grade considered reliable
polls <- polls_us_election_2016 %>%
filter(state == "U.S." & enddate >= "2016-10-31" &
(grade %in% c("A+", "A", "A-", "B+") | is.na(grade)))
# add the spread estimate for each poll
polls <- polls %>% mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100)
# compute estimated spread for combined polls
d_hat <- polls %>%
summarize(d_hat = sum(spread * samplesize) / sum(samplesize)) %>%
.$d_hat
[Figure: histogram of the spread estimates from the individual polls.]
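The table of poll counts per pollster shown below is produced by grouping the polls by pollster (call reconstructed from the textbook, assuming the polls object above):
polls %>% group_by(pollster) %>% summarize(n())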
## # A tibble: 15 x 2
## pollster `n()`
## <fct> <int>
## 1 ABC News/Washington Post 7
## 2 Angus Reid Global 1
## 3 CBS News/New York Times 2
## 4 Fox News/Anderson Robbins Research/Shaw & Company Research 2
## 5 IBD/TIPP 8
## 6 Insights West 1
## 7 Ipsos 6
## 8 Marist College 1
## 9 Monmouth University 1
## 10 Morning Consult 1
## 11 NBC News/Wall Street Journal 1
## 12 RKM Research and Communications, Inc. 1
## 13 Selzer & Company 1
## 14 The Times-Picayune/Lucid 8
## 15 USC Dornsife/LA Times 8
[Figure: spread estimates by pollster for pollsters with multiple polls; pollster-to-pollster differences are visible.]
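The per-pollster standard errors shown below follow from the sample-size formula, using the plug-in estimate p̂ = (d̂ + 1)/2 and each pollster's median sample size (code reconstructed from the textbook; it assumes the polls and d_hat objects above):
p_hat <- (d_hat + 1) / 2
polls %>% group_by(pollster) %>%
filter(n() >= 5) %>%
summarize(se = 2 * sqrt(p_hat * (1 - p_hat) / median(samplesize)))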
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 2
## pollster se
## <fct> <dbl>
## 1 ABC News/Washington Post 0.0265
## 2 IBD/TIPP 0.0333
## 3 Ipsos 0.0225
## 4 The Times-Picayune/Lucid 0.0196
## 5 USC Dornsife/LA Times 0.0183
Data-Driven Models
• Instead of using an urn model where each poll is a random draw from the same distribution of voters,
we instead define a model using an urn that contains poll results from all possible pollsters.
• We assume the expected value of this model is the actual spread 𝑑 = 2𝑝 − 1.
• Our new standard error 𝜎 now factors in pollster-to-pollster variability. It can no longer be calculated
from 𝑝 or 𝑑 and is an unknown parameter.
• The central limit theorem still works to estimate the sample average of many polls 𝑋1 , ..., 𝑋𝑁 because the average of the sum of many random variables is a normally distributed random variable with expected value 𝑑 and standard error 𝜎/√𝑁.
• We can estimate the unobserved 𝜎 as the sample standard deviation, which is calculated with the sd
function.
Code
Note that to compute the exact 95% confidence interval, we would use qnorm(.975) instead of 1.96.
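The code for this part did not survive extraction; a sketch following the textbook's data-driven approach, assuming the polls object defined earlier:
# keep the most recent poll from each pollster
one_poll_per_pollster <- polls %>% group_by(pollster) %>%
filter(enddate == max(enddate)) %>%
ungroup()
# estimate the unknown sigma with the sample standard deviation
sd(one_poll_per_pollster$spread)
# construct a 95% confidence interval for the spread d
results <- one_poll_per_pollster %>%
summarize(avg = mean(spread), se = sd(spread) / sqrt(length(spread))) %>%
mutate(start = avg - 1.96 * se, end = avg + 1.96 * se)
round(results * 100, 1)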
[Figure: histogram of the spreads, using one poll per pollster.]
1. We have been using urn models to motivate the use of probability models.
However, most data science applications are not related to data obtained from urns. More common are data
that come from individuals. Probability plays a role because the data come from a random sample. The
random sample is taken from a population and the urn serves as an analogy for the population.
Let’s revisit the heights dataset. For now, consider x to be the heights of all males in the data set. Math-
ematically speaking, x is our population. Using the urn analogy, we have an urn with the values of x in
it.
What are the population average and standard deviation of our population?
data(heights)
x <- heights %>% filter(sex == "Male") %>%
.$height
# Calculate the population average. Print this value to the console.
mean(x)
## [1] 69.31475
# Calculate the population standard deviation. Print this value to the console.
standaard_deviatie <- sd(x)
standaard_deviatie
## [1] 3.611024
2. Call the population average computed above 𝜇 and the standard deviation 𝜎. Now take a sample of
size 50, with replacement, and construct an estimate for 𝜇 and 𝜎.
# The vector of all male heights in our population `x` has already been loaded for you. You can examine it using the `head` function.
head(x)
## [1] 75 70 68 74 61 67
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# Define `N` as the sample size and take a random sample `X` from our population `x`
N <- 50
X <- sample(x, N, replace = TRUE)
# Calculate the sample average. Print this value to the console.
mean(X)
## [1] 70.47293
# Calculate the sample standard deviation. Print this value to the console.
sigma <- sd(X)
sigma
## [1] 3.426742
3. What does the central limit theorem tell us about the sample average and how it is related to 𝜇, the population average?
□ A. It is identical to 𝜇.
⊠ B. It is a random variable with expected value 𝜇 and standard error 𝜎/√𝑁.
□ C. It is a random variable with expected value 𝜇 and standard error 𝜎.
□ D. It underestimates 𝜇.
4. We will use 𝑋̄ as our estimate of the heights in the population from our sample of size 𝑁. We know from previous exercises that the standard error of our estimate 𝑋̄ − 𝜇 is 𝜎/√𝑁.
# The vector of all male heights in our population `x` has already been loaded for you. You can examine it using the `head` function.
head(x)
## [1] 75 70 68 74 61 67
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# Define `N` as the sample size and take a random sample `X` from our population `x`
N <- 50
X <- sample(x, N, replace = TRUE)
# Define `se` as the standard error of the estimate. Print this value to the console.
X_hat <- mean(X)
se <- sd(X)/sqrt(N)
se
## [1] 0.4846145
# Construct a 95% confidence interval for the population average based on our sample. Save the lower and then the upper confidence interval to a variable called `ci`.
ci <- c(qnorm(0.025, X_hat, se), qnorm(0.975, X_hat, se))
ci
5. Now run a Monte Carlo simulation in which you compute 10,000 confidence intervals as you have just
done.
# Define `mu` as the population average
mu <- mean(x)
# Define `N` as the sample size and `B` as the number of times to replicate the simulation
N <- 50
B <- 10000
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# Define an object `res` that contains a logical vector for simulated intervals that contain mu
res <- replicate(B, {
X <- sample(x, N, replace = TRUE)
se <- sd(X) / sqrt(N)
interval <- c(qnorm(0.025, mean(X), se) , qnorm(0.975, mean(X), se))
between(mu, interval[1], interval[2])
})
# Calculate the proportion of results in `res` that include mu. Print this value to the console.
mean(res)
## [1] 0.9479
6. In this section, we used visualization to motivate the presence of pollster bias in election polls.
Here we will examine that bias more rigorously. Let’s consider two pollsters that conducted daily polls and look at national polls for the month before the election.
Is there a poll bias? Make a plot of the spreads for each poll.
# These lines of code filter for the polls we want and calculate the spreads
polls <- polls_us_election_2016 %>%
filter(pollster %in% c("Rasmussen Reports/Pulse Opinion Research","The Times-Picayune/Lucid") &
enddate >= "2016-10-15" &
state == "U.S.") %>%
mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100)
# Make a plot of the spreads for each pollster
polls %>% ggplot(aes(pollster, spread)) + geom_boxplot() + geom_point()
[Figure: spreads by pollster; one pollster's spreads sit consistently above the other's.]
However, these data are subject to variability. Perhaps the differences we observe are due to chance. Under
the urn model, both pollsters should have the same expected value: the election day difference, 𝑑.
We will model the observed data 𝑌𝑖𝑗 in the following way:
𝑌𝑖𝑗 = 𝑑 + 𝑏𝑖 + 𝜀𝑖𝑗
with 𝑖 = 1, 2 indexing the two pollsters, 𝑏𝑖 the bias for pollster 𝑖, and 𝜀𝑖𝑗 poll to poll chance variability. We
assume the 𝜀 are independent from each other, have expected value 0 and standard deviation 𝜎𝑖 regardless
of 𝑗.
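To make the model concrete, here is a small simulation sketch (all parameter values are hypothetical):
set.seed(2)
d <- 0.021 # hypothetical true spread
b <- c(-0.02, 0.03) # hypothetical pollster biases b_1, b_2
s <- c(0.02, 0.03) # hypothetical within-pollster standard deviations
J <- 5 # polls per pollster
# simulate Y_ij = d + b_i + eps_ij for pollsters i = 1, 2
Y <- sapply(1:2, function(i) d + b[i] + rnorm(J, 0, s[i]))
colMeans(Y) # the pollster averages estimate d + b_i, not d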
7. Which of the following statements best reflects what we need to know to determine if our data fit the urn model?
□ A. Is 𝜀𝑖𝑗 = 0?
□ B. How close are 𝑌𝑖𝑗 to 𝑑?
⊠ C. Is 𝑏1 ≠ 𝑏2 ?
□ D. Are 𝑏1 = 0 and 𝑏2 = 0?
8. Recall our model:
𝑌𝑖𝑗 = 𝑑 + 𝑏𝑖 + 𝜀𝑖𝑗
On the right side of this model, only 𝜀𝑖𝑗 is a random variable; the other two values are constants.
What is the expected value of 𝑌1𝑗 ?
⊠ A. 𝑑 + 𝑏1
□ B. 𝑏1 + 𝜀𝑖𝑗
□ C. 𝑑
□ D. 𝑑 + 𝑏1 + 𝜀𝑖𝑗
9. Suppose we define 𝑌1̄ as the average of poll results from the first pollster and 𝜎1 as the standard deviation of the first pollster.
10. Now we define 𝑌2̄ as the average of poll results from the second pollster.
11. Using what we learned by answering the previous questions, what is the expected value of 𝑌2̄ − 𝑌1̄ ?
□ A. (𝑏2 − 𝑏1 )2
□ B. (𝑏2 − 𝑏1 )/√𝑁
□ C. 𝑏2 + 𝑏1
⊠ D. 𝑏2 − 𝑏1
12. Using what we learned by answering the questions above, what is the standard error of 𝑌2̄ − 𝑌1̄ ?
13. The answer to the previous question depends on 𝜎1 and 𝜎2 , which we don’t know.
We learned that we can estimate these values using the sample standard deviation.
Compute the estimates of 𝜎1 and 𝜎2 .
# The `polls` data have already been loaded for you. Use the `head` function to examine them.
head(polls)
## state startdate enddate pollster grade
## 1 U.S. 2016-11-05 2016-11-07 The Times-Picayune/Lucid <NA>
## 2 U.S. 2016-11-02 2016-11-06 Rasmussen Reports/Pulse Opinion Research C+
## 3 U.S. 2016-11-01 2016-11-03 Rasmussen Reports/Pulse Opinion Research C+
## 4 U.S. 2016-11-04 2016-11-06 The Times-Picayune/Lucid <NA>
## 5 U.S. 2016-10-31 2016-11-02 Rasmussen Reports/Pulse Opinion Research C+
## 6 U.S. 2016-11-03 2016-11-05 The Times-Picayune/Lucid <NA>
## samplesize population rawpoll_clinton rawpoll_trump rawpoll_johnson
## 1 2521 lv 45 40 5
## 2 1500 lv 45 43 4
## 3 1500 lv 44 44 4
## 4 2584 lv 45 40 5
## 5 1500 lv 42 45 4
## 6 2526 lv 45 40 5
## rawpoll_mcmullin adjpoll_clinton adjpoll_trump adjpoll_johnson
## 1 NA 45.13966 42.26495 3.679914
## 2 NA 45.56041 43.13745 4.418502
## 3 NA 44.66353 44.28981 4.331246
## 4 NA 45.22830 42.30433 3.770880
## 5 NA 42.67626 45.41689 4.239500
## 6 NA 45.31041 42.34422 3.779955
## adjpoll_mcmullin spread
## 1 NA 0.05
## 2 NA 0.02
## 3 NA 0.00
## 4 NA 0.05
## 5 NA -0.03
## 6 NA 0.05
# Create an object called `sigma` that contains a column for `pollster` and a column for `s`, the standard deviation of the spread
sigma <- polls %>% group_by(pollster) %>% summarize(s = sd(spread))
# Print the contents of `sigma` to the console
sigma
## # A tibble: 2 x 2
## pollster s
## <fct> <dbl>
## 1 Rasmussen Reports/Pulse Opinion Research 0.0177
## 2 The Times-Picayune/Lucid 0.0268
14. What does the central limit theorem tell us about the distribution of the differences between the
pollster averages, 𝑌2̄ − 𝑌1̄ ?
□ A. The central limit theorem cannot tell us anything because this difference is not the average of a
sample.
□ B. Because 𝑌𝑖𝑗 are approximately normal, the averages are normal too.
⊠ C. If we assume 𝑁2 and 𝑁1 are large enough, 𝑌2̄ and 𝑌1̄ , and their difference, are approximately normal.
□ D. These data do not contain vectors of 0 and 1, so the central limit theorem does not apply.
15. We have constructed a random variable that has expected value 𝑏2 − 𝑏1 , the pollster bias difference.
If our model holds, then this random variable has an approximately normal distribution. The standard error
of this random variable depends on 𝜎1 and 𝜎2 , but we can use the sample standard deviations we computed
earlier. We have everything we need to answer our initial question: is 𝑏2 − 𝑏1 different from 0?
Construct a 95% confidence interval for the difference between 𝑏2 and 𝑏1 . Does this interval contain zero?
# The `polls` data have already been loaded for you. Use the `head` function to examine them.
head(polls)
# Create an object called `res` that summarizes the average, standard deviation, and number of polls for the two pollsters
res <- polls %>% group_by(pollster) %>% summarise(avg = mean(spread), s = sd(spread), N=n())
# Store the difference between the larger average and the smaller in a variable called `estimate`. Print this value to the console.
estimate <- max(res$avg) - min(res$avg)
estimate
## [1] 0.05229167
# Store the standard error of the estimates as a variable called `se_hat`. Print this value to the console.
se_hat <- sqrt(res$s[2]^2/res$N[2] + res$s[1]^2/res$N[1])
se_hat
## [1] 0.007031433
# Calculate the 95% confidence interval of the spreads. Save the lower and then the upper confidence interval to a variable called `ci`.
ci <- c(estimate - qnorm(0.975)*se_hat, estimate + qnorm(0.975)*se_hat)
ci
16. The confidence interval tells us there is a relatively strong pollster effect resulting in a difference of about
5%. Random variability does not seem to explain it. Compute a p-value to relate our estimate to the
hypothesis that there is no pollster effect.
# We made an object `res` to summarize the average, standard deviation, and number of polls for the two pollsters
res <- polls %>% group_by(pollster) %>%
summarize(avg = mean(spread), s = sd(spread), N = n())
# The variables `estimate` and `se_hat` contain the spread estimates and standard error, respectively.
estimate <- res$avg[2] - res$avg[1]
se_hat <- sqrt(res$s[2]^2/res$N[2] + res$s[1]^2/res$N[1])
# Calculate the p-value of the difference. Print this value to the console.
2 * (1 - pnorm(estimate / se_hat, 0, 1))
## [1] 1.030287e-13
17. We compute a statistic called the t-statistic by dividing our estimate of 𝑏2 − 𝑏1 by its estimated standard
error:
(𝑌2̄ − 𝑌1̄ ) / √(𝑠2²/𝑁2 + 𝑠1²/𝑁1 )
Later we will learn of another approximation for the distribution of this statistic for values of 𝑁2 and
𝑁1 that aren’t large enough for the CLT.
Note that our data have more than two pollsters. We can also test for a pollster effect using all pollsters, not
just two. The idea is to compare the variability across polls to variability within polls. We can construct
statistics to test for effects and approximate their distribution. The area of statistics that does this is called
Analysis of Variance or ANOVA. We do not cover it here, but ANOVA provides a very useful set of tools to
answer questions such as: is there a pollster effect?
Compute the average and standard deviation for each pollster and examine the variability across the averages
and how it compares to the variability within the pollsters, summarized by the standard deviation.
# Execute the following lines of code to filter the polling data and calculate the spread
polls <- polls_us_election_2016 %>%
filter(enddate >= "2016-10-15" &
state == "U.S.") %>%
group_by(pollster) %>%
filter(n() >= 5) %>%
mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100) %>%
ungroup()
# Create an object called `var` that contains columns for the pollster, mean spread, and standard deviation of the spread
var <- polls %>% group_by(pollster) %>% summarise(avg = mean(spread), s = sd(spread))
var
## # A tibble: 11 x 3
## pollster avg s
## <fct> <dbl> <dbl>
## 1 ABC News/Washington Post 0.0373 0.0339
## 2 CVOTER International 0.0279 0.0180
## 3 Google Consumer Surveys 0.0303 0.0185
## 4 Gravis Marketing 0.0160 0.0152
## 5 IBD/TIPP 0.00105 0.0168
## 6 Ipsos 0.0553 0.0195
## 7 Morning Consult 0.0414 0.0146
## 8 Rasmussen Reports/Pulse Opinion Research 0.000625 0.0177
## 9 The Times-Picayune/Lucid 0.0529 0.0268
## 10 USC Dornsife/LA Times -0.0213 0.0207
## 11 YouGov 0.0398 0.00709
Section 5 Overview
In Section 5, you will learn about Bayesian statistics through looking at examples from rare disease diagnosis
and baseball.
After completing Section 5, you will be able to:
• Apply Bayes’ theorem to calculate the probability of A given B.
• Understand how to use hierarchical models to make better predictions by considering multiple levels of
variability.
Bayesian Statistics
• In the urn model, it does not make sense to talk about the probability of 𝑝 being greater than a certain
value because 𝑝 is a fixed value.
• With Bayesian statistics, we assume that 𝑝 is in fact random, which allows us to calculate probabilities
related to 𝑝.
• Hierarchical models describe variability at different levels and incorporate all these levels into a model
for estimating 𝑝.
Bayes’ Theorem
• Bayes’ Theorem states that the probability of event A happening given event B is equal to the proba-
bility of both A and B divided by the probability of event B:
Pr(𝐴 ∣ 𝐵) = Pr(𝐵 ∣ 𝐴) Pr(𝐴) / Pr(𝐵)
• Bayes’ Theorem shows that a test for a very rare disease will have a high percentage of false positives
even if the accuracy of the test is high.
Pr(𝐷 = 1 ∣ +) = Pr(+ ∣ 𝐷 = 1) ⋅ Pr(𝐷 = 1) / Pr(+)
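The table below comes from a Monte Carlo simulation of testing 100,000 people for a rare disease. A minimal sketch that would produce a table of this form (the prevalence of 0.00025 and test accuracy of 0.99 are the textbook example's assumptions):
prev <- 0.00025   # assumed disease prevalence
N <- 100000       # number of simulated individuals
accuracy <- 0.99  # assumed test accuracy
outcome <- sample(c("Disease", "Healthy"), N, replace = TRUE, prob = c(prev, 1 - prev))
test <- vector("character", N)
# diseased individuals test positive with probability `accuracy`
test[outcome == "Disease"] <- sample(c("+", "-"), sum(outcome == "Disease"),
replace = TRUE, prob = c(accuracy, 1 - accuracy))
# healthy individuals test negative with probability `accuracy`
test[outcome == "Healthy"] <- sample(c("-", "+"), sum(outcome == "Healthy"),
replace = TRUE, prob = c(accuracy, 1 - accuracy))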
table(outcome, test)
## test
## outcome - +
## Disease 0 27
## Healthy 99013 960
Bayes in Practice
• The techniques we have used up until now are referred to as frequentist statistics as they consider
only the frequency of outcomes in a dataset and do not include any outside information. Frequentist
statistics allow us to compute confidence intervals and p-values.
• Frequentist statistics can have problems when sample sizes are small and when the data are extreme
compared to historical results.
• Bayesian statistics allows prior knowledge to modify observed results, which alters our conclusions
about event probabilities.
• Hierarchical models use multiple levels of variability to model results. They are hierarchical because
values in the lower levels of the model are computed using values from higher levels of the model.
• We model baseball player batting average using a hierarchical model with two levels of variability:
– 𝑝 ∼ 𝑁 (𝜇, 𝜏 ) describes player-to-player variability in natural ability to hit, which has a mean 𝜇
and standard deviation 𝜏 .
– 𝑌 ∣ 𝑝 ∼ 𝑁 (𝑝, 𝜎) describes a player’s observed batting average given their ability 𝑝, which has a
mean 𝑝 and standard deviation 𝜎 = √𝑝(1 − 𝑝)/𝑁 . This represents variability due to luck.
– In Bayesian hierarchical models, the first level is called the prior distribution and the second level
is called the sampling distribution.
• The posterior distribution allows us to compute the probability distribution of 𝑝 given that we have
observed data 𝑌 .
• By the continuous version of Bayes’ rule, the expected value of the posterior distribution 𝑝 given 𝑌 = 𝑦
is a weighted average between the prior mean 𝜇 and the observed data 𝑌 :
E(𝑝 ∣ 𝑦) = 𝐵𝜇 + (1 − 𝐵)𝑌 where 𝐵 = 𝜎²/(𝜎² + 𝜏²)
• The standard error of the posterior distribution is given by SE(𝑝 ∣ 𝑌 )² = 1/(1/𝜎² + 1/𝜏²). Note that you will
need to take the square root of both sides to solve for the standard error.
• This Bayesian approach is also known as shrinking. When 𝜎 is large, 𝐵 is close to 1 and our prediction
of 𝑝 shrinks towards the mean (𝜇). When 𝜎 is small, 𝐵 is close to 0 and our prediction of 𝑝 is more
weighted towards the observed data 𝑌 .
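As a worked example of these formulas, consider the textbook's batting-average case. A minimal sketch (the specific numbers — a .450 observed average over 20 at-bats, a league mean of .275, and 𝜏 = 0.027 — are the textbook's assumptions):
Y <- 0.450               # observed batting average
N <- 20                  # number of at-bats
mu <- 0.275              # prior mean: historical league average
tau <- 0.027             # prior sd: player-to-player variability
sigma <- sqrt(Y*(1 - Y)/N)        # sd of the observed average given ability
B <- sigma^2/(sigma^2 + tau^2)    # weight on the prior mean
B*mu + (1 - B)*Y                  # posterior mean, approximately 0.285
The posterior mean shrinks the extreme observation 0.450 most of the way back toward the prior mean 0.275, because with only 20 at-bats 𝜎 is much larger than 𝜏.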
Assessment - Bayesian Statistics
1. In 1999 in England, Sally Clark was found guilty of the murder of two of her sons.
Both infants were found dead in the morning, one in 1996 and another in 1998, and she claimed the cause
of death was sudden infant death syndrome (SIDS). No evidence of physical harm was found on the two
infants so the main piece of evidence against her was the testimony of Professor Sir Roy Meadow, who
testified that the chances of two infants dying of SIDS was 1 in 73 million. He arrived at this figure by
finding that the rate of SIDS was 1 in 8,500 and then calculating that the chance of two SIDS cases was
8, 500 × 8, 500 ≈ 73 million.
Based on what we’ve learned throughout this course, which statement best describes a potential flaw in Sir
Meadow’s reasoning?
⊠ A. Sir Meadow assumed the second death was independent of the first son being affected, thereby
ignoring possible genetic causes.
□ B. There is no flaw. The multiplicative rule always applies in this way: Pr(A and B) = Pr(𝐴)Pr(𝐵)
□ C. Sir Meadow should have added the probabilities: Pr(A and B) = Pr(𝐴) + Pr(𝐵)
□ D. The rate of SIDS is too low to perform these types of statistics.
2. Let’s assume that there is in fact a genetic component to SIDS and that the probability
Pr(second case of SIDS ∣ first case of SIDS) = 1/100 is much higher than 1 in 8,500.
# Define `Pr_1` as the probability of the first son dying of SIDS and `Pr_2` as the probability of the second son dying of SIDS
Pr_1 <- 1/8500
Pr_2 <- 1/100
# Calculate the probability of both sons dying of SIDS. Print this value to the console.
Pr_1 * Pr_2
## [1] 1.176471e-06
3. Many press reports stated that the expert claimed the probability of Sally Clark being innocent as 1
in 73 million.
Perhaps the jury and judge also interpreted the testimony this way. This probability can be written like
this:
Pr(mother is a murderer ∣ two children found dead with no evidence of harm)
Bayes’ rule tells us this probability is equal to:
Pr(two children found dead with no evidence of harm ∣ mother is a murderer) ⋅ Pr(mother is a murderer) /
Pr(two children found dead with no evidence of harm)
4. Assume that the probability of a murderer finding a way to kill her two children without leaving
evidence of physical harm is:
Pr(two children found dead with no evidence of harm ∣ mother is a murderer) = 0.50
Assume that the murder rate among mothers is 1 in 1,000,000.
Pr(mother is a murderer) = 1/1, 000, 000
According to Bayes’ rule, what is the probability of:
Pr(mother is a murderer ∣ two children found dead with no evidence of harm)
# Define `Pr_BA` as the probability that two children die without evidence of harm, given that their mother is a murderer
Pr_BA <- 0.50
# Define `Pr_A` as the murder rate among mothers
Pr_A <- 1/1000000
# Define `Pr_B` as the probability that two children die without evidence of harm (approximated by the SIDS probability)
Pr_B <- Pr_1 * Pr_2
# Define Pr_AB as the probability that a mother is a murderer, given that her two children died with no evidence of harm
Pr_AB <- (Pr_BA * Pr_A)/Pr_B
Pr_AB
## [1] 0.425
5. After Sally Clark was found guilty, the Royal Statistical Society issued a statement saying that there
was “no statistical basis” for the expert’s claim.
They expressed concern at the “misuse of statistics in the courts”. Eventually, Sally Clark was acquitted in
June 2003.
In addition to misusing the multiplicative rule as we saw earlier, what else did Sir Meadow miss?
6. Florida is one of the most closely watched states in the U.S. election because it has many electoral
votes and the election is generally close.
Create a table with the poll spread results from Florida taken during the last days before the election using
the sample code.
The CLT tells us that the average of these spreads is approximately normal. Calculate a spread average and
provide an estimate of the standard error.
# Create an object `polls` that contains the spread of predictions for each candidate in Florida during the last days before the election
polls <- polls_us_election_2016 %>%
filter(state == "Florida" & enddate >= "2016-11-04" ) %>%
mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100)
# Create an object called `results` that has two columns containing the average spread (`avg`) and the standard error of the spread (`se`)
results <- polls %>% summarize(avg = mean(spread), se = sd(spread)/sqrt(n()))
results
## avg se
## 1 0.004154545 0.007218692
7. Assume a Bayesian model sets the prior distribution for Florida’s election night spread 𝑑 to be normal
with expected value 𝜇 and standard deviation 𝜏 . What are the interpretations of 𝜇 and 𝜏 ?
□ A. 𝜇 and 𝜏 are arbitrary numbers that let us make probability statements about 𝑑.
⊠ B. 𝜇 and 𝜏 summarize what we would predict for Florida before seeing any polls.
□ C. 𝜇 and 𝜏 summarize what we want to be true. We therefore set 𝜇 at 0.10 and 𝜏 at 0.01.
□ D. The choice of prior has no effect on the Bayesian analysis.
8. The CLT tells us that our estimate of the spread 𝑑 ̂ has a normal distribution with expected value 𝑑
and standard deviation 𝜎, which we calculated in a previous exercise.
Use the formulas for the posterior distribution to calculate the expected value of the posterior distribution
if we set 𝜇 = 0 and 𝜏 = 0.01.
# The `results` object has already been loaded. Examine the values stored: `avg` and `se` of the spread
results
## avg se
## 1 0.004154545 0.007218692
# Define a variable called `sigma` that contains the standard error in the object `results`
sigma <- results$se
# Define a variable called `Y` that contains the average in the object `results`
Y <- results$avg
# Define `mu` as the prior mean and `tau` as the prior standard deviation
mu <- 0
tau <- 0.01
# Define a variable `B` using `sigma` and `tau`
B <- sigma^2/(sigma^2+tau^2)
# Calculate the expected value of the posterior distribution. Print this value to the console.
exp_value <- B*mu + (1-B)*Y
exp_value
## [1] 0.002731286
9. Compute the standard error of the posterior distribution.
# Compute the standard error of the posterior distribution. Print this value to the console.
SE <- sqrt(1/((1/sigma^2)+(1/tau^2)))
SE
## [1] 0.005853024
10. Using the fact that the posterior distribution is normal, create an interval that has a 95% chance of
containing the actual spread, centered at the posterior expected value.
# Here are the variables we have defined in previous exercises
mu <- 0
tau <- 0.01
sigma <- results$se
Y <- results$avg
B <- sigma^2 / (sigma^2 + tau^2)
se <- sqrt( 1/ (1/sigma^2 + 1/tau^2))
# Construct the 95% credible interval. Save the lower and then the upper confidence interval to a variable called `ci`.
ev <- B*mu + (1-B)*Y
ci <- c(ev - qnorm(0.975) * se, ev + qnorm(0.975) * se)
ci
11. According to this analysis, what was the probability that Trump wins Florida?
# Assign the expected value of the posterior distribution to the variable `exp_value`
exp_value <- B*mu + (1-B)*Y
# Assign the standard error of the posterior distribution to the variable `se`
se <- sqrt( 1/ (1/sigma^2 + 1/tau^2))
# Using the `pnorm` function, calculate the probability that the actual spread was less than 0 (in Trump's favor). Print this value to the console.
pnorm(0, exp_value, se)
## [1] 0.3203769
12. We had set the prior standard deviation 𝜏 to 0.01, reflecting that these races are often close.
Change the prior variance to include values ranging from 0.005 to 0.05 and observe how the probability of
Trump winning Florida changes by making a plot.
# Create a function called `p_calc` that generates `B` and calculates the probability of the spread being less than 0 for a given `tau`
p_calc <- function(tau) {
B <- sigma^2 / (sigma^2 + tau^2)
ev <- B*mu + (1-B)*Y
se <- sqrt(1 / (1/sigma^2 + 1/tau^2))
pnorm(0,ev,se)
}
# Define a vector `taus` of prior standard deviations ranging from 0.005 to 0.05
taus <- seq(0.005, 0.05, len = 100)
# Create a vector called `ps` by applying the function `p_calc` across values in `taus`
ps <- p_calc(taus)
# Plot `taus` on the x-axis and `ps` on the y-axis
plot(taus, ps)
[Figure: probability of Trump winning Florida (`ps`, roughly 0.30 to 0.36) as a function of the prior standard deviation `taus`]
Section 6 Overview
In Section 6, you will learn about election forecasting, building on what you’ve learned in the previous
sections about statistical modeling and Bayesian statistics.
After completing Section 6, you will be able to:
• Understand how pollsters use hierarchical models to forecast the results of elections.
• Incorporate multiple sources of variability into a mathematical model to make predictions.
• Construct confidence intervals that better model deviations such as those seen in election data using
the t-distribution.
Election Forecasting
• In our model:
– The spread 𝑑 ∼ 𝑁 (𝜇, 𝜏 ) describes our best guess in the absence of polling data. We set 𝜇 = 0 and
𝜏 = 0.035 using historical data.
– The average of observed data 𝑋̄ ∣ 𝑑 ∼ 𝑁 (𝑑, 𝜎) describes randomness due to sampling and the
pollster effect.
• Because the posterior distribution is normal, we can report a 95% credible interval that has a 95%
chance of overlapping the parameter using E(𝑝 ∣ 𝑌 ) and SE(𝑝 ∣ 𝑌 ).
• Given an estimate of E(𝑝 ∣ 𝑌 ) and SE(𝑝 ∣ 𝑌 ), we can use pnorm to compute the probability that 𝑑 > 0.
• It is common to see a general bias that affects all pollsters in the same way. This bias cannot be
predicted or measured before the election. We will include a term in later models to account for this
variability.
Code: Computing the posterior mean, standard error, credible interval and probability
Note that to compute an exact 95% credible interval, we would use qnorm(.975) instead of 1.96.
mu <- 0
tau <- 0.035
sigma <- results$se
Y <- results$avg
B <- sigma^2 / (sigma^2 + tau^2)
posterior_mean <- B*mu + (1-B)*Y
posterior_se <- sqrt(1 / (1/sigma^2 + 1/tau^2))
posterior_mean
## [1] 0.02808534
posterior_se
## [1] 0.006149604
# probability of d > 0
1 - pnorm(0, posterior_mean, posterior_se)
## [1] 0.9999975
Mathematical Representations of Models
• If we collect several polls with measured spreads 𝑋1 , … , 𝑋𝐽 , each with a sample size of 𝑁 , these random
variables have expected value 𝑑 and standard error 2√(𝑝(1 − 𝑝)/𝑁 ).
• We represent each measurement as 𝑋𝑖,𝑗 = 𝑑 + 𝑏 + ℎ𝑖 + 𝜖𝑖,𝑗 where:
– The index 𝑖 represents the different pollsters
– The index 𝑗 represents the different polls
– 𝑋𝑖,𝑗 is the 𝑗th poll by the 𝑖th pollster
– 𝑑 is the actual spread of the election
– 𝑏 is the general bias affecting all pollsters
– ℎ𝑖 represents the house effect for the 𝑖th pollster
– 𝜖𝑖,𝑗 represents the random error associated with the 𝑖, 𝑗th poll.
• The sample average is now 𝑋̄ = 𝑑 + 𝑏 + (1/𝑁) ∑𝑖 𝜀𝑖 with standard deviation SE(𝑋̄) = √(𝜎²/𝑁 + 𝜎𝑏²).
• The standard error of the general bias 𝜎𝑏 does not get reduced by averaging multiple polls, which
increases the variability of our final estimate.
# simulate J polls from a single pollster
J <- 6
N <- 2000
d <- .021
p <- (d+1)/2
X <- d + rnorm(J, 0, 2*sqrt(p*(1-p)/N))
# simulate J polls from each of I pollsters (no house effect yet)
I <- 5
J <- 6
N <- 2000
d <- .021
p <- (d+1)/2
X <- sapply(1:I, function(i){
d + rnorm(J, 0, 2*sqrt(p*(1-p)/N))
})
# simulate J polls from each of I pollsters, including a house effect h[i]
I <- 5
J <- 6
N <- 2000
d <- .021
p <- (d+1)/2
h <- rnorm(I, 0, 0.025) # assume standard error of pollster-to-pollster variability is 0.025
X <- sapply(1:I, function(i){
d + h[i] + rnorm(J, 0, 2*sqrt(p*(1-p)/N))
})
Code: Calculating probability of 𝑑 > 0 with general bias
Note that sigma now includes an estimate of the variability due to general bias 𝜎𝑏 = .025.
mu <- 0
tau <- 0.035
sigma <- sqrt(results$se^2 + .025^2)
Y <- results$avg
B <- sigma^2 / (sigma^2 + tau^2)
posterior_mean <- B*mu + (1-B)*Y
posterior_se <- sqrt(1 / (1/sigma^2 + 1/tau^2))
# probability of d > 0
1 - pnorm(0, posterior_mean, posterior_se)
## [1] 0.8174373
• In the US election, each state has a certain number of votes that are won all-or-nothing based on the
popular vote result in that state (with minor exceptions not discussed here).
• We use the left_join() function to combine the number of electoral votes with our poll results.
• For each state, we apply a Bayesian approach to generate an Election Day prediction of 𝑑 . We keep our prior simple
by assuming an expected value of 0 and a standard deviation based on recent history of 0.02.
• We can run a Monte Carlo simulation that for each iteration simulates poll results in each state using
that state’s average and standard deviation, awards electoral votes for each state to Clinton if the
spread is greater than 0, then compares the number of electoral votes won to the number of votes
required to win the election (over 269).
• If we run a Monte Carlo simulation for the electoral college without accounting for general bias, we
overestimate Clinton’s chances of winning at over 99%.
• If we include a general bias term, the estimated probability of Clinton winning decreases significantly.
head(results_us_election_2016)
## state electoral_votes clinton trump others
## 1 California 55 61.7 31.6 6.7
## 2 Texas 38 43.2 52.2 4.5
## 3 Florida 29 47.8 49.0 3.2
## 4 New York 29 59.0 36.5 4.5
## 5 Illinois 20 55.8 38.8 5.4
## 6 Pennsylvania 20 47.9 48.6 3.6
Code: Computing the average and standard deviation for each state
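The summary can be produced along these lines; a sketch consistent with the output below, assuming the textbook's filters (recent state polls, congressional districts excluded, graded B+ or better or ungraded):
results <- polls_us_election_2016 %>%
filter(state != "U.S." &
!grepl("CD", state) &
enddate >= "2016-10-31" &
(grade %in% c("A+", "A", "A-", "B+") | is.na(grade))) %>%
mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100) %>%
group_by(state) %>%
summarize(avg = mean(spread), sd = sd(spread), n = n()) %>%
mutate(state = as.character(state))
results
# join the electoral votes and actual results for each state
results <- left_join(results, results_us_election_2016, by = "state")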
## # A tibble: 47 x 4
## state avg sd n
## <chr> <dbl> <dbl> <int>
## 1 Florida 0.00356 0.0163 7
## 2 North Carolina -0.00730 0.0306 9
## 3 Ohio -0.0104 0.0252 6
## 4 Nevada 0.0169 0.0441 7
## 5 Iowa -0.0197 0.0437 3
## 6 Michigan 0.0209 0.0203 6
## 7 Arizona -0.0326 0.0270 9
## 8 Pennsylvania 0.0353 0.0116 9
## 9 New Mexico 0.0389 0.0226 6
## 10 Georgia -0.0448 0.0238 4
## # ... with 37 more rows
# states with no polls: note Rhode Island and District of Columbia = Democrat
results_us_election_2016 %>% filter(!state %in% results$state)
# assigns sd to states with just one poll as median of other sd values
results <- results %>%
mutate(sd = ifelse(is.na(sd), median(results$sd, na.rm = TRUE), sd))
mu <- 0
tau <- 0.02
results %>% mutate(sigma = sd/sqrt(n),
B = sigma^2/ (sigma^2 + tau^2),
posterior_mean = B*mu + (1-B)*avg,
posterior_se = sqrt( 1 / (1/sigma^2 + 1/tau^2))) %>%
arrange(abs(posterior_mean))
## # A tibble: 47 x 12
## state avg sd n electoral_votes clinton trump others sigma
## <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Flor~ 0.00356 0.0163 7 29 47.8 49 3.2 0.00618
## 2 Nort~ -0.00730 0.0306 9 15 46.2 49.8 4 0.0102
## 3 Iowa -0.0197 0.0437 3 6 41.7 51.1 7.1 0.0252
## 4 Ohio -0.0104 0.0252 6 18 43.5 51.7 4.8 0.0103
## 5 Neva~ 0.0169 0.0441 7 6 47.9 45.5 6.6 0.0167
## 6 Mich~ 0.0209 0.0203 6 16 47.3 47.5 5.2 0.00827
## 7 Ariz~ -0.0326 0.0270 9 11 45.1 48.7 6.2 0.00898
## 8 New ~ 0.0389 0.0226 6 5 48.3 40 11.7 0.00921
## 9 Geor~ -0.0448 0.0238 4 16 45.9 51 3.1 0.0119
## 10 Penn~ 0.0353 0.0116 9 20 47.9 48.6 3.6 0.00387
## # ... with 37 more rows, and 3 more variables: B <dbl>, posterior_mean <dbl>,
## # posterior_se <dbl>
Code: Monte Carlo simulation of Election Night results (no general bias)
mu <- 0
tau <- 0.02
clinton_EV <- replicate(1000, {
results %>% mutate(sigma = sd/sqrt(n),
B = sigma^2/ (sigma^2 + tau^2),
posterior_mean = B*mu + (1-B)*avg,
posterior_se = sqrt( 1 / (1/sigma^2 + 1/tau^2)),
simulated_result = rnorm(length(posterior_mean), posterior_mean, posterior_se),
clinton = ifelse(simulated_result > 0, electoral_votes, 0)) %>% # award votes
summarize(clinton = sum(clinton)) %>% # total votes for Clinton
.$clinton + 7 # 7 votes for Rhode Island and DC
})
mean(clinton_EV > 269) # over 269 votes wins election
## [1] 0.998
# histogram of outcomes
data.frame(clinton_EV) %>%
ggplot(aes(clinton_EV)) +
geom_histogram(binwidth = 1) +
geom_vline(xintercept = 269)
[Figure: histogram of simulated Clinton electoral votes (clinton_EV), with a vertical line at 269]
Code: Monte Carlo simulation of Election Night results (with general bias)
mu <- 0
tau <- 0.02
bias_sd <- 0.03
clinton_EV_2 <- replicate(1000, {
results %>% mutate(sigma = sqrt(sd^2/(n) + bias_sd^2), # added bias_sd term
B = sigma^2/ (sigma^2 + tau^2),
posterior_mean = B*mu + (1-B)*avg,
posterior_se = sqrt( 1 / (1/sigma^2 + 1/tau^2)),
simulated_result = rnorm(length(posterior_mean), posterior_mean, posterior_se),
clinton = ifelse(simulated_result > 0, electoral_votes, 0)) %>% # award votes
summarize(clinton = sum(clinton)) %>% # total votes for Clinton
.$clinton + 7 # 7 votes for Rhode Island and DC
})
mean(clinton_EV_2 > 269) # over 269 votes wins election
## [1] 0.832
Forecasting
• In poll results, 𝑝 is not fixed over time. Variability within a single pollster comes from time variation.
• In order to forecast, our model must include a bias term 𝑏𝑡 to model the time effect.
• Pollsters also try to estimate 𝑓(𝑡), the trend of 𝑝 given time 𝑡, using a model like:
𝑌𝑖,𝑗,𝑡 = 𝑑 + 𝑏 + ℎ𝑖 + 𝑏𝑡 + 𝑓(𝑡) + 𝜀𝑖,𝑗,𝑡
• Once we decide on a model, we can use historical data and current data to estimate the necessary
parameters to make predictions.
## empirical theoretical
## 1 0.04025194 0.03256719
[Figure: histogram of spreads (count axis)]
Code: Trend across time for several pollsters
polls_us_election_2016 %>%
filter(state == "U.S." & enddate >= "2016-07-01") %>%
group_by(pollster) %>%
filter(n() >= 10) %>%
ungroup() %>%
mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100) %>%
ggplot(aes(enddate, spread)) +
geom_smooth(method = "loess", span = 0.1) +
geom_point(aes(color = pollster), show.legend = FALSE, alpha = 0.6)
[Figure: spread over time with a loess trend curve]
polls_us_election_2016 %>%
filter(state == "U.S." & enddate >= "2016-07-01") %>%
select(enddate, pollster, rawpoll_clinton, rawpoll_trump) %>%
rename(Clinton = rawpoll_clinton, Trump = rawpoll_trump) %>%
gather(candidate, percentage, -enddate, -pollster) %>%
mutate(candidate = factor(candidate, levels = c("Trump", "Clinton"))) %>%
group_by(pollster) %>%
filter(n() >= 10) %>%
ungroup() %>%
ggplot(aes(enddate, percentage, color = candidate)) +
geom_point(show.legend = FALSE, alpha = 0.4) +
geom_smooth(method = "loess", span = 0.15) +
scale_y_continuous(limits = c(30, 50))
[Figure: Clinton and Trump poll percentages over time with loess trend curves, y-axis from 30 to 50]
1. For each poll in the polling data set, use the CLT to create a 95% confidence interval for the spread.
Create a new table called cis that contains columns for the lower and upper limits of the confidence intervals.
# Create a table called `polls` that filters by state, date, and reports the spread
polls <- polls_us_election_2016 %>%
filter(state != "U.S." & enddate >= "2016-10-31") %>%
mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100)
# Create an object called `cis` that has the columns indicated in the instructions
cis <- polls %>% mutate(X_hat = (spread+1)/2, se = 2*sqrt(X_hat*(1-X_hat)/samplesize),
lower = spread - qnorm(0.975)*se, upper = spread + qnorm(0.975)*se) %>%
select(state, startdate, enddate, pollster, grade, spread, lower, upper)
2. You can add the final result to the cis table you just created using the left_join function as shown
in the sample code.
Now determine how often the 95% confidence interval includes the actual result.
# Add the actual election results to the `cis` table using `left_join`
add <- results_us_election_2016 %>% mutate(actual_spread = clinton/100 - trump/100) %>% select(state, actual_spread)
ci_data <- cis %>% mutate(state = as.character(state)) %>% left_join(add, by = "state")
# Create an object called `p_hits` that summarizes the proportion of confidence intervals that contain the actual result
p_hits <- ci_data %>% mutate(hit = lower <= actual_spread & upper >= actual_spread) %>% summarize(proportion_hits = mean(hit))
p_hits
## proportion_hits
## 1 0.66133
3. Show only pollsters with at least 5 polls and order them from best to worst. Show the number of polls
conducted by each pollster and the FiveThirtyEight grade of each pollster.
# Create an object called `p_hits` that summarizes the proportion of hits for each pollster that has at least 5 polls
p_hits <- ci_data %>% mutate(hit = lower <= actual_spread & upper >= actual_spread) %>%
group_by(pollster) %>% filter(n() >= 5) %>%
summarize(proportion_hits = mean(hit), n = n(), grade = grade[1]) %>%
arrange(desc(proportion_hits))
p_hits
## # A tibble: 13 x 4
## pollster proportion_hits n grade
## <fct> <dbl> <int> <fct>
## 1 Quinnipiac University 1 6 A-
## 2 Emerson College 0.909 11 B
## 3 Public Policy Polling 0.889 9 B+
## 4 University of New Hampshire 0.857 7 B+
## 5 Ipsos 0.807 119 A-
## 6 Mitchell Research & Communications 0.8 5 D
## 7 Gravis Marketing 0.783 23 B-
## 8 Trafalgar Group 0.778 9 C
## 9 Rasmussen Reports/Pulse Opinion Research 0.774 31 C+
## 10 Remington 0.667 9 <NA>
## 11 Google Consumer Surveys 0.588 102 B
## 12 SurveyMonkey 0.577 357 C-
## 13 YouGov 0.544 57 B
4. Repeat the previous exercise, but instead of pollster, stratify by state. Here we can’t show grades.
# The `cis` data have already been loaded for you
add <- results_us_election_2016 %>% mutate(actual_spread = clinton/100 - trump/100) %>% select(state, actual_spread)
ci_data <- cis %>% mutate(state = as.character(state)) %>% left_join(add, by = "state")
# Create an object called `p_hits` that summarizes the proportion of hits for each state that has more than 5 polls
p_hits <- ci_data %>% mutate(hit = lower <= actual_spread & upper >= actual_spread) %>%
group_by(state) %>% filter(n() > 5) %>%
summarize(proportion_hits = mean(hit), n = n()) %>%
arrange(desc(proportion_hits))
p_hits
## # A tibble: 51 x 3
## state proportion_hits n
## <chr> <dbl> <int>
## 1 Connecticut 1 13
## 2 Delaware 1 12
## 3 Rhode Island 1 10
## 4 New Mexico 0.941 17
## 5 Washington 0.933 15
## 6 Oregon 0.929 14
## 7 Illinois 0.923 13
## 8 Nevada 0.923 26
## 9 Maine 0.917 12
## 10 Montana 0.917 12
## # ... with 41 more rows
# The `p_hits` data have already been loaded for you. Use the `head` function to examine it.
head(p_hits)
## # A tibble: 6 x 3
## state proportion_hits n
## <chr> <dbl> <int>
## 1 Connecticut 1 13
## 2 Delaware 1 12
## 3 Rhode Island 1 10
## 4 New Mexico 0.941 17
## 5 Washington 0.933 15
## 6 Oregon 0.929 14
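5. Make a barplot of the proportion of hits for each state. A minimal sketch of code that would produce the figure summarized below:
p_hits %>% mutate(state = reorder(state, proportion_hits)) %>%
ggplot(aes(state, proportion_hits)) +
geom_bar(stat = "identity") +
coord_flip()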
[Figure: barplot of proportion_hits by state, ordered from highest (Rhode Island, Delaware, Connecticut) to lowest (Alabama, Tennessee)]
6. Even if a forecaster’s confidence intervals are incorrect, the overall predictions will do better if they
correctly call the right winner.
Add two columns to the cis table by computing, for each poll, the difference between the predicted spread
and the actual spread, and define a column hit that is true if the signs are the same.
# The `cis` data have already been loaded. Examine it using the `head` function.
head(cis)
# Create an object called `errors` that calculates the difference between the predicted and actual spread and indicates if the correct winner was predicted
errors <- cis %>% mutate(error = spread - ci_data$actual_spread, hit = sign(spread) == sign(ci_data$actual_spread))
7. Create an object called p_hits that contains the proportion of instances when the sign of the actual
spread matches the predicted spread for states with 5 or more polls.
Make a barplot based on the result from the previous exercise that shows the proportion of times the sign
of the spread matched the actual result for the data in p_hits.
# Create an object called `errors` that calculates the difference between the predicted and actual spread and indicates if the correct winner was predicted
errors <- cis %>% mutate(error = spread - ci_data$actual_spread, hit = sign(spread) == sign(ci_data$actual_spread))
# Create an object called `p_hits` that summarizes the proportion of hits for each state that has 5 or more polls
p_hits <- errors %>% group_by(state) %>% filter(n() >= 5) %>% summarize(proportion_hits = mean(hit), n = n())
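A minimal sketch of the barplot code that would produce the figure summarized below:
p_hits %>% mutate(state = reorder(state, proportion_hits)) %>%
ggplot(aes(state, proportion_hits)) +
geom_bar(stat = "identity") +
coord_flip()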
[Figure: barplot of the proportion of polls predicting the correct winner by state; Wisconsin, Michigan, and Pennsylvania sit at the bottom]
8. In the previous graph, we see that most states’ polls predicted the correct winner 100% of the time.
Only a few states’ polls were incorrect more than 25% of the time. Wisconsin got every single poll wrong.
In Pennsylvania and Michigan, more than 90% of the polls had the signs wrong.
Make a histogram of the errors. What is the median of these errors?
# The `errors` data have already been loaded. Examine them using the `head` function.
head(errors)
# Generate a histogram of the error
hist(errors$error)
[Figure: histogram of errors$error, centered slightly above 0]
# Calculate the median of the errors. Print this value to the console.
median(errors$error)
## [1] 0.037
9. We see that, at the state level, the median error was slightly in favor of Clinton. The distribution is not
centered at 0, but at 0.037. This value represents the general bias we described in an earlier section.
Create a boxplot to examine if the bias was general to all states or if it affected some states differently. Filter
the data to include only pollsters with grades B+ or higher.
# The `errors` data have already been loaded. Examine them using the `head` function.
head(errors)
# Create a boxplot showing the errors by state for polls with grades B+ or higher
errors %>% filter(grade %in% c("A+","A","A-","B+") | is.na(grade)) %>%
mutate(state = reorder(state, error)) %>%
ggplot(aes(state, error)) +
geom_boxplot() +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
[Figure: boxplots of error by state for polls graded B+ or higher, ordered by error]
Repeat the previous exercise to plot the errors for each state, but only include states with five good polls or
more.
# The `errors` data have already been loaded. Examine them using the `head` function.
head(errors)
# Create a boxplot showing the errors by state for states with at least 5 polls with grades B+ or higher
errors %>% filter(grade %in% c("A+","A","A-","B+") | is.na(grade)) %>%
group_by(state) %>% filter(n() >= 5) %>% ungroup() %>%
mutate(state = reorder(state, error)) %>%
ggplot(aes(state, error)) +
geom_boxplot() +
geom_point()
[Figure: boxplots of error by state, restricted to states with at least 5 polls graded B+ or higher]
The t-Distribution
• In models where we must estimate two parameters, 𝑝 and 𝜎, the Central Limit Theorem can result in
overconfident confidence intervals for sample sizes smaller than approximately 30.
• If the population data are known to follow a normal distribution, theory tells us how much larger to
make confidence intervals to account for estimation of 𝜎.
• Given 𝑠 as an estimate of 𝜎, then 𝑍 = (𝑋̄ − 𝑑)/(𝑠/√𝑁 ) follows a t-distribution with 𝑁 − 1 degrees of freedom.
• Degrees of freedom determine the weight of the tails of the distribution. Small values of degrees of
freedom lead to increased probabilities of extreme values.
• We can determine confidence intervals using the t-distribution instead of the normal distribution by
calculating the desired quantile with the function qt().
z <- qt(0.975, nrow(one_poll_per_pollster) - 1)
one_poll_per_pollster %>%
summarize(avg = mean(spread), moe = z*sd(spread)/sqrt(length(spread))) %>%
mutate(start = avg - moe, end = avg + moe)
## # A tibble: 1 x 4
## avg moe start end
## <dbl> <dbl> <dbl> <dbl>
## 1 0.0290 0.0134 0.0156 0.0424
# compare the t-distribution quantile with 14 degrees of freedom to the normal quantile
qt(0.975, 14)
## [1] 2.144787
qnorm(0.975)
## [1] 1.959964
1. We know that, with a normal distribution, only 5% of values are more than 2 standard deviations away
from the mean.
Calculate the probability of seeing t-distributed random variables being more than 2 in absolute value when
the degrees of freedom are 3.
# Calculate the probability of seeing t-distributed random variables being more than 2 in absolute value
1 - pt(2, df = 3) + pt(-2, df = 3)
## [1] 0.139326
2. Now use sapply to compute the same probability for degrees of freedom from 3 to 50.
Make a plot and notice when this probability converges to the normal distribution’s 5%.
# Make a function called 'pt_func' that calculates the probability that a value is more than |2| for any given degrees of freedom
pt_func <- function(df) {
1 - pt(2, df) + pt(-2, df)
}
# Generate a vector 'df' of degrees of freedom from 3 to 50
df <- seq(3, 50)
# Generate a vector 'probs' that uses the `pt_func` function to calculate the probabilities
probs <- sapply(df, pt_func)
# Plot 'df' on the x-axis and 'probs' on the y-axis
plot(df, probs)
[Figure: probability of |t| > 2 versus degrees of freedom, falling from about 0.14 at df = 3 toward the normal distribution's 5% as df grows]
3. In a previous section, we repeatedly took random samples of 50 heights from a distribution of heights.
We noticed that about 95% of the samples had confidence intervals spanning the true population mean.
Re-do this Monte Carlo simulation, but now instead of 𝑁 = 50, use 𝑁 = 15. Notice what happens to the
proportion of hits.
# Create variables for the mean height 'mu', the sample size 'N', and the number of times the simulation is run, 'B'
mu <- mean(x)
N <- 15
B <- 10000
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# Generate a logical vector 'res' that contains the results of the simulations
res <- replicate(B, {
X <- sample(x, N, replace = TRUE)
interval <- mean(X) + c(-1,1)*qnorm(0.975)*sd(X)/sqrt(N)
between(mu, interval[1], interval[2])
}
)
# Calculate the proportion of times the simulation produced values within the 95% confidence interval. Print this value to the console.
p_hit <- mean(res)
p_hit
## [1] 0.9331
4. 𝑁 = 15 is not that big. We know that heights are normally distributed, so the t-distribution should
apply.
Repeat the previous Monte Carlo simulation using the t-distribution instead of using the normal distribution
to construct the confidence intervals.
What proportion of the 95% confidence intervals span the actual mean height now?
# The vector of filtered heights 'x' has already been loaded for you. Calculate the mean.
mu <- mean(x)
# Generate a logical vector 'res' that contains the results of the simulations using the t-distribution
res <- replicate(B, {
X <- sample(x, N, replace = TRUE)
interval <- mean(X) + c(-1,1)*qt(0.975, N - 1) * sd(X) / sqrt(N)
between(mu, interval[1], interval[2])
}
)
# Calculate the proportion of times the simulation produced values within the 95% confidence interval. Print this value to the console.
p_hit <- mean(res)
p_hit
## [1] 0.9512
Why did the t-distribution confidence intervals work better?
⊠ A. The t-distribution takes the variability into account and generates larger confidence intervals.
□ B. Because the t-distribution shifts the intervals in the direction towards the actual mean.
□ C. This was just a chance occurrence. If we run it again, the CLT will work better.
□ D. The t-distribution is always a better approximation than the normal distribution.
Section 7 Overview
In Section 7, you will learn how to use association and chi-squared tests to perform inference for binary,
categorical, and ordinal data through an example looking at research funding rates.
After completing Section 7, you will be able to:
• Use association and chi-squared tests to perform inference on binary, categorical, and ordinal data.
• Calculate an odds ratio to get an idea of the magnitude of an observed effect.
Association Tests
The textbook for this section is available here up to and including the textbook section on two-by-two tables.
Key points
• We learn how to determine the probability that an observation is due to random variability given
categorical, binary or ordinal data.
• Fisher’s exact test determines the p-value as the probability of observing an outcome as extreme or
more extreme than the observed outcome given the null distribution.
• Data from a binary experiment are often summarized in two-by-two tables.
• The p-value can be calculated from a two-by-two table using Fisher’s exact test with the function
fisher.test().
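The comparison below uses the research_funding_rates data from dslabs. A sketch consistent with the output, following the textbook's approach (the totals object is reused in the code that follows):
data(research_funding_rates)
# total applications and awards across all disciplines
totals <- research_funding_rates %>%
select(-discipline) %>%
summarize_all(sum) %>%
summarize(yes_men = awards_men,
no_men = applications_men - awards_men,
yes_women = awards_women,
no_women = applications_women - awards_women)
# percent of men and of women awarded funding
totals %>% summarize(percent_men = yes_men/(yes_men + no_men),
percent_women = yes_women/(yes_women + no_women))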
## percent_men percent_women
## 1 0.17737 0.1489899
Code: Two-by-two table and p-value for the Lady Tasting Tea problem
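The output below is fisher.test() applied to the classic experiment in which the taster correctly identifies 3 of the 4 cups in each group. A minimal sketch of the two-by-two table and test (following the textbook):
tab <- matrix(c(3, 1, 1, 3), 2, 2)
rownames(tab) <- c("Poured Before", "Poured After")
colnames(tab) <- c("Guessed Before", "Guessed After")
# one-sided test: are the odds of guessing correctly greater than 1?
fisher.test(tab, alternative = "greater")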
##
## Fisher's Exact Test for Count Data
##
## data: tab
## p-value = 0.2429
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 0.3135693 Inf
## sample estimates:
## odds ratio
## 6.408309
Chi-Squared Tests
• If the sums of the rows and the sums of the columns in the two-by-two table are fixed, then the
hypergeometric distribution and Fisher’s exact test can be used. Otherwise, we must use the chi-
squared test.
• The chi-squared test compares the observed two-by-two table to the two-by-two table expected by the
null hypothesis and asks how likely it is that we see a deviation as large as observed or larger by chance.
• The function chisq.test() takes a two-by-two table and returns the p-value from the chi-squared
test.
• The odds ratio states how many times larger the odds of an outcome are for one group relative to
another group.
• A small p-value does not imply a large odds ratio. If a finding has a small p-value but also a small
odds ratio, it may not be a practically significant or scientifically significant finding.
• Because the odds ratio is a ratio of ratios, there is no simple way to use the Central Limit Theorem
to compute confidence intervals. There are advanced methods for computing confidence intervals for
odds ratios that we do not discuss here.
# overall funding rate (using the `totals` object sketched above)
rate <- (totals$yes_men + totals$yes_women) /
(totals$yes_men + totals$no_men + totals$yes_women + totals$no_women)
rate
## [1] 0.1654269
The observed two-by-two table:
## # A tibble: 2 x 3
## awarded men women
## <chr> <dbl> <dbl>
## 1 no 1345 1011
## 2 yes 290 177
And the two-by-two table expected under the null hypothesis of equal award rates:
## # A tibble: 2 x 3
## awarded men women
## <chr> <dbl> <dbl>
## 1 no 1365. 991.
## 2 yes 270. 197.
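The chi-squared test and odds ratio below operate on the observed counts stored in a data frame two_by_two, and on odds_men and odds_women. A sketch of their construction, assuming the totals object sketched earlier:
two_by_two <- data.frame(awarded = c("no", "yes"),
men = c(totals$no_men, totals$yes_men),
women = c(totals$no_women, totals$yes_women))
# odds of receiving an award for men: Pr(awarded)/Pr(not awarded)
odds_men <- (two_by_two$men[2]/sum(two_by_two$men)) /
(two_by_two$men[1]/sum(two_by_two$men))
# odds of receiving an award for women
odds_women <- (two_by_two$women[2]/sum(two_by_two$women)) /
(two_by_two$women[1]/sum(two_by_two$women))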
# chi-squared test
chisq_test <- two_by_two %>%
select(-awarded) %>% chisq.test()
chisq_test$p.value
## [1] 0.05091372
# odds ratio - how many times larger odds are for men than women
odds_men/odds_women
## [1] 1.231554
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: .
## X-squared = 39.935, df = 1, p-value = 2.625e-10
1. In a previous exercise, we determined whether or not each poll predicted the correct winner for their
state in the 2016 U.S. presidential election.
Each poll was also assigned a grade by the poll aggregator. Now we’re going to determine if polls rated A-
made better predictions than polls rated C-.
In this exercise, filter the errors data for just polls with grades A- and C-. Calculate the proportion of times
each grade of poll predicted the correct winner.
# The 'errors' data have already been loaded. Examine them using the `head` function.
head(errors)
## 6 Georgia 2016-11-06 2016-11-06 Landmark Communications B -0.03
## lower upper error hit
## 1 -0.001331221 0.0413312213 -0.063 TRUE
## 2 -0.005634504 0.1056345040 -0.004 TRUE
## 3 -0.139125210 -0.0008747905 0.024 TRUE
## 4 0.004774064 0.1152259363 0.067 FALSE
## 5 -0.069295191 0.0692951912 0.036 FALSE
## 6 -0.086553820 0.0265538203 0.021 TRUE
# Generate an object called 'totals' that contains the numbers of good and bad predictions for polls rated A- and C-
totals <- errors %>% filter(grade %in% c("A-", "C-")) %>%
group_by(grade,hit) %>% summarize(num = n()) %>% spread(grade, num)
totals
## # A tibble: 2 x 3
## hit `C-` `A-`
## <lgl> <int> <int>
## 1 FALSE 50 26
## 2 TRUE 311 106
# Proportion of correct winner predictions for grade A- polls
totals[[2,3]]/sum(totals[[3]])
## [1] 0.8030303
# Proportion of correct winner predictions for grade C- polls
totals[[2,2]]/sum(totals[[2]])
## [1] 0.8614958
2. We found that the A- polls predicted the correct winner about 80% of the time in their states and C-
polls predicted the correct winner about 86% of the time. Use a chi-squared test to determine if these
proportions are different.
# The 'totals' data have already been loaded. Examine them using the `head` function.
head(totals)
## # A tibble: 2 x 3
## hit `C-` `A-`
## <lgl> <int> <int>
## 1 FALSE 50 26
## 2 TRUE 311 106
# Perform a chi-squared test on the hit data. Save the results as an object called 'chisq_test'.
chisq_test <- totals %>%
select(-hit) %>% chisq.test()
# Print the p-value of the chi-squared test to the console
chisq_test$p.value
## [1] 0.1467902
3. It doesn’t look like the grade A- polls performed significantly differently than the grade C- polls in
their states.
Calculate the odds ratio to determine the magnitude of the difference in performance between these two
grades of polls.
# The 'totals' data have already been loaded. Examine them using the `head` function.
head(totals)
## # A tibble: 2 x 3
## hit `C-` `A-`
## <lgl> <int> <int>
## 1 FALSE 50 26
## 2 TRUE 311 106
# Generate a variable called `odds_C` that contains the odds of getting the prediction right for grade C- polls
odds_C <- (totals[[2,2]] / sum(totals[[2]])) /
(totals[[1,2]] / sum(totals[[2]]))
# Generate a variable called `odds_A` that contains the odds of getting the prediction right for grade A- polls
odds_A <- (totals[[2,3]] / sum(totals[[3]])) /
(totals[[1,3]] / sum(totals[[3]]))
# Calculate the odds ratio to determine how many times larger the odds are for grade A- polls than for grade C- polls
odds_A/odds_C
## [1] 0.6554539
4. We did not find meaningful differences between the poll results from grade A- and grade C- polls in
this subset of the data, which only contains polls for about a week before the election.
Imagine we expanded our analysis to include all election polls and we repeat our analysis. In this hypothetical
scenario, we get that the p-value for the difference in prediction success is 0.0015 and the odds ratio describing
the effect size of the performance of grade A- over grade C- polls is 1.07.
Based on what we learned in the last section, which statement reflects the best interpretation of this result?
□ A. The p-value is below 0.05, so there is a significant difference. Grade A- polls are significantly better
at predicting winners.
□ B. The p-value is too close to 0.05 to call this a significant difference. We do not observe a difference
in performance.
⊠ C. The p-value is below 0.05, but the odds ratio is very close to 1. There is not a scientifically significant
difference in performance.
□ D. The p-value is below 0.05 and the odds ratio indicates that grade A- polls perform significantly
better than grade C- polls.
Comprehensive Assessment: Brexit
Directions
There are 12 multi-part problems in this comprehensive assessment that review concepts from the entire
course. The problems are split over 3 pages. Make sure you read the instructions carefully and run all
pre-exercise code.
For numeric entry problems, you have 10 attempts to input the correct answer. For true/false problems, you
have 2 attempts.
If you have questions, visit the “Brexit poll analysis” discussion forum that follows the assessment.
IMPORTANT: Some of these exercises use dslabs datasets that were added in a July 2019 update. Make
sure your package is up to date with the command install.packages("dslabs").
Overview
In June 2016, the United Kingdom (UK) held a referendum to determine whether the country would “Re-
main” in the European Union (EU) or “Leave” the EU. This referendum is commonly known as Brexit.
Although the media and others interpreted poll results as forecasting “Remain” (𝑝 > 0.5), the actual pro-
portion that voted “Remain” was only 48.1% (𝑝 = 0.481) and the UK thus voted to leave the EU. Pollsters
in the UK were criticized for overestimating support for “Remain”.
In this project, you will analyze real Brexit polling data to develop polling models to forecast Brexit results.
You will write your own code in R and enter the answers on the edX platform.
Important definitions
• Final proportion voting “Remain”: 𝑝 = 0.481
• Actual spread: 𝑑 = 2𝑝 − 1 = −0.038
Data Import
Import the brexit_polls polling data from the dslabs package and set options for the analysis:
# suggested libraries and options
library(tidyverse)
library(dslabs)
data(brexit_polls)
options(digits = 3)
Question 1: The final proportion of voters choosing “Remain” was p=0.481. Consider a poll
with a sample of N=1500 voters. What is the expected total number of voters in the sample choosing
“Remain”?
p <- 0.481
N <- 1500
N*p
## [1] 722
What is the standard error of the total number of voters in the sample choosing “Remain”?
sqrt(N*p*(1-p))
## [1] 19.4
What is the expected value of 𝑋̂, the proportion of “Remain” voters in the sample?
X_hat <- p
X_hat
## [1] 0.481
What is the standard error of 𝑋̂?
sqrt(p*(1-p)/N)
## [1] 0.0129
What is the expected value of 𝑑, the spread between the proportion of “Remain” voters and “Leave” voters?
2*p-1
## [1] -0.038
What is the standard error of 𝑑, the spread between the proportion of “Remain” voters and “Leave” voters?
2*sqrt(p*(1-p)/N)
## [1] 0.0258
Question 2: Load and inspect the brexit_polls dataset from dslabs, which contains actual
polling data for the 6 months before the Brexit vote. Raw proportions of voters preferring “Remain”,
“Leave”, and “Undecided” are available (remain, leave, undecided). The spread is also available (spread),
which is the difference in the raw proportion of voters choosing “Remain” and the raw proportion choosing
“Leave”.
Calculate x_hat for each poll, the estimate of the proportion of voters choosing “Remain” on the referendum
day (𝑝 = 0.481), given the observed spread and the relationship 𝑑̂ = 2𝑋̂ − 1. Use mutate() to add a variable
x_hat to the `brexit_polls` object by filling in the skeleton code below:
brexit_polls <- brexit_polls %>% mutate(x_hat = (spread + 1)/2)
mean(brexit_polls$spread)
## [1] 0.0201
sd(brexit_polls$spread)
## [1] 0.0588
mean(brexit_polls$x_hat)
## [1] 0.51
sd(brexit_polls$x_hat)
## [1] 0.0294
Question 3: Confidence interval of a Brexit poll Consider the first poll in brexit_polls, a YouGov
poll run on the same day as the Brexit referendum:
brexit_polls[1,]
# Define `x_hat` and `se_hat` for the first poll
x_hat <- brexit_polls$x_hat[1]
se_hat <- sqrt(x_hat*(1-x_hat)/brexit_polls$samplesize[1])
# lower bound of the 95% confidence interval
x_hat - qnorm(.975)*se_hat
## [1] 0.506
# upper bound of the 95% confidence interval
x_hat + qnorm(.975)*se_hat
## [1] 0.534
Does the 95% confidence interval predict a winner (does not cover 𝑝 = 0.5)? Does the 95% confidence
interval cover the true value of 𝑝 observed during the referendum?
!between(0.5, x_hat - qnorm(.975)*se_hat, x_hat + qnorm(.975)*se_hat) # predicts winner
## [1] TRUE
between(0.481, x_hat - qnorm(.975)*se_hat, x_hat + qnorm(.975)*se_hat) # covers true p
## [1] FALSE
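Question 4: Create the data frame june_polls containing only Brexit polls ending in June 2016 (enddate of “2016-06-01” and later). A sketch of the construction, following the course instructions (adding the standard error of x_hat, the standard error of the spread, and the confidence interval of the spread):
june_polls <- brexit_polls %>%
filter(enddate >= "2016-06-01") %>%
mutate(se_x_hat = sqrt(x_hat*(1 - x_hat)/samplesize),
se_spread = 2*se_x_hat,
ci_lower_spread = spread - qnorm(0.975)*se_spread,
ci_upper_spread = spread + qnorm(0.975)*se_spread)
# How many polls are in june_polls?
nrow(june_polls)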
## [1] 32
What proportion of polls have a confidence interval that covers the value 0?
mean(june_polls$ci_lower_spread < 0 & june_polls$ci_upper_spread > 0)
## [1] 0.625
What proportion of polls predict “Remain” (confidence interval entirely above 0)?
mean(june_polls$ci_lower_spread > 0)
## [1] 0.125
What proportion of polls have a confidence interval covering the true value of 𝑑?
june_polls <- june_polls %>%
mutate(hit = (2*p-1 > ci_lower_spread) & (2*p-1 < ci_upper_spread))
mean(june_polls$hit)
## [1] 0.562
Question 5: Group and summarize the june_polls object by pollster to find the proportion of
hits for each pollster and the number of polls per pollster. Use arrange() to sort by hit rate.
june_polls %>%
group_by(pollster) %>%
summarize(hit_rate=mean(hit), n()) %>%
arrange(hit_rate)
## # A tibble: 12 x 3
## pollster hit_rate `n()`
## <fct> <dbl> <int>
## 1 BMG Research 0 2
## 2 ORB/Telegraph 0 1
## 3 Populus 0 1
## 4 ComRes 0.333 3
## 5 Ipsos MORI 0.5 2
## 6 Survation 0.5 2
## 7 YouGov 0.556 9
## 8 Opinium 0.667 3
## 9 ORB 0.667 3
## 10 ICM 1 3
## 11 Survation/IG Group 1 1
## 12 TNS 1 2
Which of the following statements are TRUE about the poll hit rates?
□ A. Unbiased polls and pollsters will theoretically cover the correct value of the spread 50% of the time.
□ B. Only one pollster had a 100% success rate in generating confidence intervals that covered the correct
value of the spread.
□ C. The pollster with the highest number of polls covered the correct value of the spread in their
confidence interval over 60% of the time.
□ D. All pollsters produced confidence intervals covering the correct spread in at least 1 of their polls.
⊠ E. The results are consistent with a large general bias that affects all pollsters.
Question 6: Boxplot of Brexit polls by poll type Make a boxplot of the spread in june_polls by
poll type.
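A minimal sketch of code that would produce the boxplot summarized below:
june_polls %>%
ggplot(aes(poll_type, spread)) +
geom_boxplot()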
[Figure: boxplots of spread by poll_type (Online, Telephone), spread roughly −0.10 to 0.10]
Which of the following are TRUE?
□ A. Online polls tend to show support for “Remain” (spread > 0).
⊠ B. Telephone polls tend to show support for “Remain” (spread > 0).
⊠ C. Telephone polls tend to show higher support for “Remain” than online polls (higher spread).
⊠ D. Online polls have a larger interquartile range (IQR) for the spread than telephone polls, indicating
that they are more variable.
⊠ E. Poll type introduces a bias that affects poll results.
Question 7: Calculate the confidence intervals of the spread combined across all polls in
june_polls, grouping by poll type. Recall that to determine the standard error of the spread, you will
need to double the standard error of the estimate.
Use this code (which determines the total sample size per poll type, gives each spread estimate a weight
based on the poll’s sample size, and adds an estimate of p from the combined spread) to begin your analysis:
#What is the lower bound of the 95% confidence interval for online voters?
combined_by_type <- june_polls %>%
group_by(poll_type) %>%
summarize(N = sum(samplesize),
spread = sum(spread*samplesize)/N,
p_hat = (spread + 1)/2,
se_spread = 2*sqrt(p_hat*(1-p_hat)/N),
spread_lower = spread - qnorm(.975)*se_spread,
spread_upper = spread + qnorm(.975)*se_spread)
combined_by_type %>%
filter(poll_type == "Online") %>%
pull(spread_lower)
## [1] -0.0165
What is the upper bound of the 95% confidence interval for online voters?
combined_by_type %>%
filter(poll_type == "Online") %>%
pull(spread_upper)
## [1] 0.00165
Question 8: Interpret the confidence intervals for the combined spreads for each poll type
calculated in the previous problem. Which of the following are TRUE about the confidence intervals
of the combined spreads for different poll types?
Select ALL correct answers.
⊠ A. Neither set of combined polls makes a prediction about the outcome of the Brexit referendum (a
prediction is possible if a confidence interval does not cover 0).
□ B. The confidence interval for online polls is larger than the confidence interval for telephone polls.
⊠ C. The confidence interval for telephone polls covers more positive values than the confidence interval
for online polls.
□ D. The confidence intervals for different poll types do not overlap.
⊠ E. Neither confidence interval covers the true value of 𝑑 = −0.038.
Question 9: Define brexit_hit, with the following code, which computes the confidence inter-
vals for all Brexit polls in 2016 and then calculates whether the confidence interval covers the
actual value of the spread 𝑑 = −0.038: Use brexit_hit to make a two-by-two table of poll type and
hit status. Then use the chisq.test() function to perform a chi-squared test to determine whether the
difference in hit rate is significant.
What is the p-value of the chi-squared test comparing the hit rate of online and telephone polls?
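A sketch of the brexit_hit definition consistent with the description above, using 𝑝̂ = (spread + 1)/2 and the actual spread 𝑑 = −0.038:
brexit_hit <- brexit_polls %>%
mutate(p_hat = (spread + 1)/2,
se_spread = 2*sqrt(p_hat*(1 - p_hat)/samplesize),
spread_lower = spread - qnorm(.975)*se_spread,
spread_upper = spread + qnorm(.975)*se_spread,
hit = spread_lower < -0.038 & spread_upper > -0.038) %>%
select(poll_type, hit)
# two-by-two table of poll type and hit status, then the chi-squared test
brexit_chisq <- table(brexit_hit$poll_type, brexit_hit$hit)
chisq.test(brexit_chisq)$p.value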
## [1] 0.00101
Determine which poll type has a higher probability of producing a confidence interval that covers the correct
value of the spread. Also determine whether this difference is statistically significant at a p-value cutoff of
0.05.
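A sketch of the comparison, assuming the brexit_hit object defined above:
# hit rate by poll type; TRUE means online polls cover the correct spread more often
hit_rate <- brexit_hit %>% group_by(poll_type) %>% summarize(avg = mean(hit))
hit_rate$avg[hit_rate$poll_type == "Online"] > hit_rate$avg[hit_rate$poll_type == "Telephone"]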
## [1] TRUE
# statistically significant
chisq.test(brexit_chisq)$p.value < 0.05
## [1] TRUE
⊠ A. Online polls are more likely to cover the correct value of the spread and this difference is statistically
significant.
□ B. Online polls are more likely to cover the correct value of the spread, but this difference is not
statistically significant.
□ C. Telephone polls are more likely to cover the correct value of the spread and this difference is
statistically significant.
□ D. Telephone polls are more likely to cover the correct value of the spread, but this difference is not
statistically significant.
Question 10: Use the two-by-two table constructed in the previous exercise to calculate the
odds ratio between the hit rate of online and telephone polls to determine the magnitude
of the difference in performance between the poll types. Calculate the odds that an online poll
generates a confidence interval that covers the actual value of the spread.
# from previous question
brexit_chisq <- table(brexit_hit$poll_type, brexit_hit$hit)
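A sketch of the computation consistent with the output, indexing the table by its dimension names:
# odds that an online poll covers the actual value of the spread
online_odds <- brexit_chisq["Online", "TRUE"]/brexit_chisq["Online", "FALSE"]
online_odds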
## [1] 1.3
Calculate the odds that a telephone poll generates a confidence interval that covers the actual value of the
spread.
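A sketch of the corresponding computation for telephone polls:
phone_odds <- brexit_chisq["Telephone", "TRUE"]/brexit_chisq["Telephone", "FALSE"]
phone_odds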
## [1] 0.312
Calculate the odds ratio to determine how many times larger the odds are for online polls to hit versus
telephone polls.
online_odds/phone_odds
## [1] 4.15
Question 11: Use brexit_polls to make a plot of the spread (spread) over time (enddate)
colored by poll type (poll_type). Use geom_smooth() with method = "loess" to plot smooth curves
with a span of 0.4. Include the individual data points colored by poll type. Add a horizontal line indicating
the final value of 𝑑 = −.038.
brexit_polls %>%
ggplot(aes(enddate, spread, color = poll_type)) +
geom_smooth(method = "loess", span = 0.4) +
geom_point() +
geom_hline(aes(yintercept = -.038))
[Figure: spread over time colored by poll type (Online, Telephone), with loess curves and a horizontal line at 𝑑 = −0.038]
⊠ B.
Question 12: Use the following code to create the object brexit_long, which has a column vote
containing the three possible votes on a Brexit poll (“remain”, “leave”, “undecided”) and a
column proportion containing the raw proportion choosing that vote option on the given poll:
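A sketch of the construction, following the course's description (gathering the three raw proportion columns into long format):
brexit_long <- brexit_polls %>%
gather(vote, proportion, "remain":"undecided") %>%
mutate(vote = factor(vote))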
Make a graph of proportion over time colored by vote. Add a smooth trendline with geom_smooth() and
method = "loess" with a span of 0.3.
brexit_long %>%
ggplot(aes(enddate, proportion, col=vote)) +
geom_point() +
geom_smooth(method = "loess", span=0.3)
[Figure: proportion over time colored by vote (leave, remain, undecided), with loess curves; proportions range from about 0.0 to above 0.4]
⊠ A. The percentage of undecided voters declines over time but is still around 10% throughout June.
⊠ B. Over most of the date range, the confidence bands for “Leave” and “Remain” overlap.
⊠ C. Over most of the date range, the confidence bands for “Leave” and “Remain” are below 50%.
⊠ D. In the first half of June, “Leave” was polling higher than “Remain”, although this difference was
within the confidence intervals.
□ E. At the time of the election in late June, the percentage voting “Leave” is trending upwards.