Confidence Intervals and Hypothesis Tests: 2.1 Binomial Data
This chapter focuses on how to draw conclusions about populations from sample data. We'll start by looking at binary data (e.g., polling): we'll learn how to estimate the true proportion of 1s and 0s with confidence intervals, and then test whether that proportion is significantly different from some baseline value using hypothesis testing. Then, we'll extend what we've learned to continuous measurements.
Suppose we're conducting a yes/no survey of a few randomly sampled people¹, and we want to use the results of our survey to determine the answers for the overall population. Each response xᵢ is 1 for "yes" and 0 for "no", and we estimate the population proportion p of "yes" answers with the sample proportion p̂ = (1/n) Σᵢ xᵢ.
Notice that p̂ is a random quantity, since it depends on the random quantities xᵢ. In statistical lingo, p̂ is known as an estimator for p. Also notice that, except for the factor of 1/n in front, p̂ is almost a binomial random variable (that is, np̂ ∼ B(n, p)). We can compute its expectation and variance using the properties we reviewed:
    E[p̂] = (1/n)·np = p,                      (2.1)
    var[p̂] = (1/n²)·np(1−p) = p(1−p)/n.        (2.2)
¹We'll talk about how to choose and sample those people in Chapter 7.
Since the expectation of p̂ is equal to the true value of what p̂ is trying to estimate (namely p),
we say that p̂ is an unbiased estimator for p. Reassuringly, we can see that another good
property of p̂ is that its variance decreases as the number of samples increases.
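Both properties are easy to check numerically. Here's a minimal simulation sketch (in Python with NumPy; the values of p, n, and the number of simulated surveys are arbitrary illustrative choices) that compares the empirical mean and variance of p̂ against (2.1) and (2.2):

    import numpy as np

    rng = np.random.default_rng(0)
    p, n, num_surveys = 0.3, 100, 200_000   # illustrative values

    # Each row is one survey of n yes/no answers; p_hat is the per-survey proportion.
    samples = rng.binomial(1, p, size=(num_surveys, n))
    p_hat = samples.mean(axis=1)

    print("empirical  E[p_hat]  =", p_hat.mean())    # close to p = 0.3
    print("empirical var[p_hat] =", p_hat.var())     # close to p(1-p)/n
    print("theoretical variance =", p * (1 - p) / n)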
Figure 2.1: (a) The sampling distribution of the estimator p̂, i.e., the distribution of values for p̂ given a fixed true value p = 0.5. (b) The 95% confidence interval for a particular observed p̂ of 0.49 (with a true value of p = 0.5). Note that in this case the interval contains the true value p. Whenever we draw a set of samples, there's a 95% chance that the interval we get contains the true value p.
Since p̂ is approximately Gaussian, it will fall within about 2 standard deviations of its mean p roughly 95% of the time. This gives us the interval

    p̂ ± 2·√(p(1−p)/n),                      (2.3)

where 2 is the coefficient determined by the desired confidence level and √(p(1−p)/n) is the standard deviation of p̂.
We can define similar confidence intervals, where the standard deviation remains the same, but the coefficient depends on the desired confidence. Because our variables are approximately Gaussian, these coefficients are easy to remember for 95% and 99% confidence; in general, we'll have to look them up or have our software compute them.
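For instance, the coefficient for any confidence level can be read off the standard normal distribution. A minimal sketch using SciPy (the levels shown are just examples):

    from scipy.stats import norm

    # Two-sided coefficient: the value z* such that P(-z* <= Z <= z*) = level
    # for a standard normal Z.
    for level in (0.90, 0.95, 0.99):
        z_star = norm.ppf(1 - (1 - level) / 2)
        print(f"{level:.0%} confidence -> coefficient {z_star:.3f}")
    # 90% -> 1.645, 95% -> 1.960 (the "2" used above), 99% -> 2.576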
But there's a problem with these formulas: they require us to know p in order to compute confidence intervals! Since we don't actually know p (if we did, we wouldn't need a confidence interval), we'll approximate it with p̂, so that (2.3) becomes
    p̂ ± 2·√(p̂(1−p̂)/n).                      (2.5)
This approximation is reasonable if p̂ is close to p, which we expect to normally be the case.
If the approximation is not as good, there are several more robust (but more complex) ways
to compute the confidence interval.
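As a concrete sketch (with made-up survey responses), the interval (2.5) can be computed directly from the data:

    import numpy as np
    from scipy.stats import norm

    # Hypothetical yes/no survey answers (1 = yes, 0 = no).
    x = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
    n = len(x)
    p_hat = x.mean()

    z = norm.ppf(0.975)                      # about 1.96; the "2" in (2.5)
    se = np.sqrt(p_hat * (1 - p_hat) / n)    # estimated standard deviation of p_hat
    lo, hi = p_hat - z * se, p_hat + z * se
    print(f"p_hat = {p_hat:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")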
Figure 2.2: Multiple 95% confidence intervals computed from different sets of data, each with the same true parameter p = 0.4 (shown by the horizontal line). Each confidence interval represents what we might have gotten if we had collected new data and then computed a confidence interval from that new data. Across different datasets, about 95% of the intervals contain the true value. But, once we have a confidence interval, we can't draw any conclusions about where in the interval the true value is.
Interpretation
It’s important not to misinterpret what a confidence interval is! This interval tells us nothing
about the distribution of the true parameter p. In fact, p is a fixed (i.e., deterministic)
unknown number! Imagine that we sampled n values for xi and computed p̂ along with a
95% confidence interval. Now imagine that we repeated this whole process a huge number
of times (including sampling new values for xi ). Then about 5% of the confidence intervals
constructed won’t actually contain the true p. Furthermore, if p is in a confidence interval,
we don’t know where exactly within the interval p is.
Furthermore, adding an extra 4% to go from a 95% confidence interval to a 99% confidence interval doesn't mean that there's a 4% chance that p lies in the extra area you added! The next example illustrates this.
In summary, a 95% confidence interval gives us a region constructed so that, if we redid the survey from scratch many times, the interval we compute would contain the true value p 95% of the time. This is illustrated in Figure 2.2.
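This interpretation is easy to verify by simulation. The sketch below (illustrative values of p and n) repeats the survey many times, forms interval (2.5) each time, and counts how often the interval contains the true p:

    import numpy as np

    rng = np.random.default_rng(1)
    p_true, n, trials = 0.4, 200, 50_000     # illustrative values

    covered = 0
    for _ in range(trials):
        x = rng.binomial(1, p_true, size=n)
        p_hat = x.mean()
        half = 2 * np.sqrt(p_hat * (1 - p_hat) / n)   # interval (2.5) with coefficient 2
        covered += (p_hat - half <= p_true <= p_hat + half)

    print("fraction of intervals containing the true p:", covered / trials)  # ~0.95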
Usually (but not always), the null hypothesis corresponds to a baseline or boring finding,
and the alternative hypothesis corresponds to some interesting finding. Once we have the
two hypotheses, we’ll use the data to test which hypothesis we should believe. “Significance”
is usually defined in terms of a probability threshold α, such that we deem a particular result
significant if the probability of obtaining a result at least that extreme under the null distribution is less than α. A common value for α is 0.05, corresponding to a 1/20 chance of falsely rejecting a true null hypothesis. The probability, under the null hypothesis, of obtaining a result at least as extreme as the one we observed is known as the p-value.
This framework is typically used when we want to disprove the null hypothesis and show
the value we obtained is significantly different from the null value. In the case of polling,
this may correspond to showing that a candidate has significantly more than 50% support.
In the case of a drug trial, it may correspond to showing that the recovery rate for patients
given a particular drug is significantly more than some baseline rate.
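In the polling case, a minimal sketch of such a test (with hypothetical poll numbers, using the exact binomial tail probability rather than a normal approximation) might look like:

    from scipy.stats import binom

    n, k = 1000, 540          # hypothetical poll: 540 of 1000 respondents support the candidate
    p0 = 0.5                  # null hypothesis: support is exactly 50%

    # One-tailed p-value: probability under H0 of seeing k or more supporters.
    p_value = binom.sf(k - 1, n, p0)
    print(f"p-value = {p_value:.4f}")   # roughly 0.006 < 0.05, so we reject H0 at alpha = 0.05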
Here are some definitions:
• In a one-tailed hypothesis test, we choose one direction for our alternative hypoth-
esis: we either hypothesize that the test statistic is “significantly big”, or that the test
statistic is “significantly small”.
• A false positive or Type I error happens when the null hypothesis is true, but we
reject it. Note that the probability of a Type I error is α.
• A false negative or Type II error happens when the null hypothesis is false, but we fail to reject it.²
• The statistical power of a test is the probability of rejecting the null hypothesis when it's false (or equivalently, 1 − (probability of a Type II error)).
Power is usually computed based on a particular assumed value for the quantity being tested: "if the value is actually ___, then the power of this test is ___." It also depends on the threshold determined by α.
It’s often useful when deciding how many samples to acquire in an experiment, as we’ll
see later.
²Notice our careful choice of words here: if our result isn't significant, we can't say that we accept the null hypothesis. The hypothesis testing framework only lets us say that we fail to reject it.
Figure 2.3: The null distribution (centered at p0) and an alternative distribution (centered at pa), with the significance threshold p∗ marked between them.
Example
The concepts above are illustrated in Figure 2.3. Here, the null hypothesis H0 is that p = p0 ,
and the alternative hypothesis Ha is that p > p0 : this is a one-sided test. In particular, we’ll
use the value pa as the alternative value so that we can compute power. The null distribution
is shown on the left, and an alternative distribution is shown on the right. The α = 0.05 threshold for rejecting the null hypothesis is shown as p∗.
• When the null hypothesis is true, p̂ is generated from the null (left) distribution, and we
make the correct decision if p̂ < p∗ and make a Type I error (false positive) otherwise.
• When the alternative hypothesis is true, and if the true proportion p is actually pa , p̂
is generated from the right distribution, and we make the correct decision when p̂ > p∗
and make a Type II error (false negative) otherwise.
The power is the probability of making the correct decision when the alternative hypothesis
is true. The probability of a Type I error (false positive) is shown in blue, the probability of
a Type II error (false negative) is shown in red, and the power is shown in yellow and blue
combined (it’s the area under the right curve minus the red part).
Notice that the choice of threshold balances between Type I and Type II errors: if we always reject the null hypothesis, then the probability of a Type I error is 1 and the probability of a Type II error is 0, and vice versa if we always fail to reject the null hypothesis.
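Using a normal approximation to the two sampling distributions in Figure 2.3, the threshold p∗ and the resulting power can be computed directly. The values of p0, pa, n, and α in this sketch are hypothetical:

    import numpy as np
    from scipy.stats import norm

    p0, pa, n, alpha = 0.50, 0.55, 400, 0.05     # hypothetical null value, alternative, sample size

    se0 = np.sqrt(p0 * (1 - p0) / n)             # std. dev. of p_hat under H0
    sea = np.sqrt(pa * (1 - pa) / n)             # std. dev. of p_hat under the alternative

    p_star = p0 + norm.ppf(1 - alpha) * se0      # reject H0 when p_hat > p_star
    power = norm.sf((p_star - pa) / sea)         # P(p_hat > p_star) when the true proportion is pa

    print(f"threshold p* = {p_star:.3f}")        # about 0.541
    print(f"power        = {power:.2f}")         # probability of correctly rejecting H0, about 0.64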
Figure 2.4: Results of a simulated drug trial measuring the effects of statin drugs on lifespan. The top figure shows the
lifespan of subjects who did not receive treatment, and the bottom figure shows the lifespan of subjects who did receive
it.
Figure 2.4 shows results from a simulated drug trial.ᵃ At first glance, it seems clear that people who
received the drug (bottom) tended to have a higher lifespan than people who didn’t (top), but it’s
important to look at hidden confounds! In this simulation, the drug actually had no effect, but the
disease occurred more often in older people: these older people had a higher average lifespan simply
because they had to live longer to get the drug.
Any statistical test we perform will say that the second distribution has a higher mean than the first one, but this is not because of the treatment: it's because of how we sampled the data!
ᵃFigure from: Støvring, et al. Statin Use and Age at Death: Evidence of a Flawed Analysis. The American Journal of Cardiology, 2007.
So far we've only talked about binomial random variables, but what about continuous random variables? Let's focus on estimating the mean of a random variable given observations of it. As you can probably guess, our estimator will be µ̂ = (1/n) Σᵢ xᵢ.
We’ll start with the case where we know the true population standard deviation; call it σ.
This is somewhat unrealistic, but it’ll help us set up the more general case.
Just like p̂, µ̂ is a random quantity. Its expectation, which we computed in Chapter 1, is µ. Its variance is

    var[µ̂] = var[(1/n) Σᵢ xᵢ]
           = (1/n²) Σᵢ var[xᵢ]
           = (1/n²) · n·σ² = σ²/n.           (2.6)
This quantity (or to be exact, the square root of this quantity) is known as the standard error of the mean. In general, the standard deviation of the sampling distribution of a particular statistic is called the standard error of that statistic.
Since µ̂ is the sum of many independent random variables, it's approximately Gaussian. If we subtract its mean µ and divide by its standard deviation σ/√n (both of which are deterministic), we'll get a standard normal random variable. This will be our test statistic:

    z = (µ̂ − µ) / (σ/√n).                    (2.7)
Hypothesis testing
In the case of hypothesis testing, we know µ (it’s the mean of the null distribution), and
we can compute the probability of getting z or something more extreme. Your software of
choice will typically do this by using the fact that z has a standard normal distribution and
report the probability to you. This is known as a z-test.
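A minimal z-test sketch (with made-up measurements and an assumed known σ; the two-sided p-value doubles the one-sided tail):

    import numpy as np
    from scipy.stats import norm

    x = np.array([5.1, 4.8, 5.6, 5.3, 4.9, 5.4, 5.2, 5.0, 5.5, 4.7])  # made-up measurements
    mu0, sigma = 5.0, 0.3        # null mean and (assumed known) population std. dev.

    mu_hat = x.mean()
    z = (mu_hat - mu0) / (sigma / np.sqrt(len(x)))   # test statistic (2.7)
    p_two_sided = 2 * norm.sf(abs(z))

    print(f"z = {z:.2f}, two-sided p-value = {p_two_sided:.3f}")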
Confidence intervals
What about a confidence interval? Since z is a standard normal random variable, it has probability 0.95 of being within 2 standard deviations of its mean. We can derive the confidence interval with a bit of algebra:
    P(−2 ≤ z ≤ 2) ≈ 0.95
    P(−2 ≤ (µ̂ − µ)/(σ/√n) ≤ 2) ≈ 0.95
    P(−2·σ/√n ≤ µ̂ − µ ≤ 2·σ/√n) ≈ 0.95
    P(µ̂ − 2·σ/√n ≤ µ ≤ µ̂ + 2·σ/√n) ≈ 0.95,

where in the last line the coefficient 2 and the standard deviation σ/√n appear on each side.
This says that the probability that µ is within the interval µ̂ ± 2σ/√n is 0.95. But remember: the only thing that's random in this story is µ̂! So when we use the word "probability" here, it's referring only to the randomness in µ̂. Don't forget that µ isn't random!
Also, remember that we chose the confidence level 0.95 (and therefore the threshold 2) somewhat arbitrarily, and we could just as easily compute a 99% confidence interval (which would correspond to a threshold of about 2.6) or an interval for any other level of confidence: we could compute the threshold by using the standard normal distribution.
Finally, note that for a two-tailed hypothesis test, the threshold at which we declare significance for some particular α is determined by the same coefficient as a confidence interval with confidence level 1 − α: the test rejects exactly when the hypothesized value falls outside that interval. Can you show why this is true?
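Here's a small numerical sketch of that equivalence (all values arbitrary): a two-tailed z-test at level α rejects exactly when the hypothesized mean falls outside the 1 − α confidence interval.

    import numpy as np
    from scipy.stats import norm

    mu_hat, sigma, n, mu0, alpha = 10.4, 2.0, 50, 10.0, 0.05   # all illustrative

    z_star = norm.ppf(1 - alpha / 2)
    se = sigma / np.sqrt(n)

    # Condition 1: the two-tailed test rejects H0.
    reject = abs((mu_hat - mu0) / se) > z_star
    # Condition 2: mu0 lies outside the (1 - alpha) confidence interval.
    outside_ci = not (mu_hat - z_star * se <= mu0 <= mu_hat + z_star * se)

    print(reject, outside_ci)   # always the same answer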
Statistical power
Figure 2.5: A funnel plot showing conception statistics from fertility clinics in the UK. The x-axis indicates the sample size;
in this case that’s the number of conception attempts (cycles). The y-axis indicates the quantity of interest; in this case
that’s the success rate for conceiving. The funnels (dashed lines) indicate thresholds for being significantly different from
the null value of 32% (the national average). This figure comes from http://understandinguncertainty.org/fertility.
Figure 2.5 is an example of a funnel plot. We see that with a small number of samples, it’s difficult to
judge any of the clinics as significantly different from the baseline value, since exceptionally high/low
values could just be due to chance. However, as the number of cycles increases, the probability of
consistently obtaining large values by chance decreases, and we can declare clinics like Lister and CARE
Nottingham significantly better than average: while other clinics have similar success rates over fewer
cycles, these two have a high success rate over many cycles. So, we can be more certain that the higher
success rates are not just due to chance and are in fact meaningful.
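The funnel boundaries themselves are just significance thresholds around the baseline, one for each sample size. A rough sketch using the normal approximation and the 32% baseline from the caption (the clinic sizes are made up):

    import numpy as np
    from scipy.stats import norm

    p0 = 0.32                                        # national average success rate (baseline)
    cycles = np.array([50, 100, 500, 1000, 5000])    # illustrative clinic sizes
    z = norm.ppf(0.975)                              # 95% two-sided threshold

    half_width = z * np.sqrt(p0 * (1 - p0) / cycles)
    for n, h in zip(cycles, half_width):
        print(f"n = {n:5d}: significant if success rate is outside "
              f"({p0 - h:.3f}, {p0 + h:.3f})")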
When we don't know the true standard deviation σ, we'll estimate it from the data as σ̂ (the sample standard deviation) and use it in place of σ. Our test statistic then becomes

    t = (µ̂ − µ) / (σ̂/√n).                    (2.8)
Since the numerator and denominator are both random, this is no longer Gaussian. The denominator is a roughly χ²-distributed quantity³, and the overall statistic is t-distributed. In this case, our t distribution has n − 1 degrees of freedom.
Confidence intervals and hypothesis tests proceed just as in the known-σ case with only two changes: using σ̂ instead of σ, and using a t distribution with n − 1 degrees of freedom instead of a Gaussian distribution. The confidence interval requires only µ̂ and the standard error σ̂/√n, while the hypothesis test also requires a hypothesis, in the form of a value for µ.
For example, a 95% confidence interval might look like
    µ̂ ± t∗ · σ̂/√n.                           (2.9)
To determine the coefficient t∗, we need the value such that the t distribution has 95% of its probability between −t∗ and t∗. This depends on the degrees of freedom (the only parameter of the t distribution) and can easily be looked up in a table or computed by any software package. For example, if n = 10, then the t distribution has n − 1 = 9 degrees of freedom, and t∗ ≈ 2.26. Notice that this produces a wider interval than the corresponding Gaussian-based confidence interval from before: since we don't know the standard deviation and have to estimate it, we're less certain about our estimate µ̂.
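A minimal sketch of computing t∗ and the interval (2.9) with SciPy (the data are made up):

    import numpy as np
    from scipy.stats import t

    x = np.array([4.9, 5.3, 5.1, 4.7, 5.6, 5.0, 5.2, 4.8, 5.4, 5.1])  # made-up sample, n = 10
    n = len(x)
    mu_hat = x.mean()
    sigma_hat = x.std(ddof=1)                 # sample standard deviation

    t_star = t.ppf(0.975, df=n - 1)           # about 2.26 for 9 degrees of freedom
    half = t_star * sigma_hat / np.sqrt(n)
    print(f"95% CI: {mu_hat:.2f} +/- {half:.2f}")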
To derive the t-test, we assumed that our data points were normally distributed. But, the
t-test is fairly robust to violations of this assumption.
So far, we’ve looked at the case of having one sample and determining whether it’s sig-
nificantly greater than some hypothesized amount. But what about the case where we’re
interested in the difference between two samples? We’re usually interested in testing whether
the difference is significantly different from zero. There are a few different ways of dealing
with this, depending on the underlying data.
³In fact, the quantity (n−1)σ̂²/σ² is χ²-distributed with n − 1 degrees of freedom, and the test statistic t = [(µ̂ − µ)/(σ/√n)] · [σ√(n−1)/(σ̂√(n−1))] is therefore t-distributed.
• In the case of matched pairs, we have a "before" value and an "after" value for each data point (for example, the scores of students before and after a class). Matching the pairs helps control the variance due to other factors, so we can simply look at the differences xᵢ^post − xᵢ^pre for each data point and perform a one-sample test against a null mean of 0.
• In the case of two samples with pooled variance, the means of the two samples might be different (this is usually the hypothesis we test), but the variances of each sample are assumed to be the same. This assumption allows us to combine, or pool, all the data points when estimating the sample variance. So, when computing the standard error of the difference in means, we'll use the pooled estimate

    sp² = [(n1 − 1)σ̂1² + (n2 − 1)σ̂2²] / (n1 + n2 − 2),

giving a standard error of sp·√(1/n1 + 1/n2) (see the sketch after this list). This test still provides reasonably good power, since we're using all the data to estimate sp.
In this setting, where the two groups have the same variance, we say the data are homoskedastic.
• In the general case of two samples with separate (not pooled) variance, the variances
must be estimated separately. The result isn’t quite a t distribution, and this variant
is often known as Welch’s t-test. It’s important to keep in mind that this test will
have lower statistical power since we are using less data to estimate each quantity.
But, unless you have solid evidence that the variances are in fact equal, it’s best to be
conservative and stick with this test.
In this setting, where the two groups have different variances, we say the data are
heteroskedastic.
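All three variants are available as standard SciPy routines; here's a minimal sketch with made-up data (the paired case could equivalently be run as a one-sample test on the differences):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    pre  = rng.normal(70, 10, size=30)        # made-up "before" scores
    post = pre + rng.normal(3, 5, size=30)    # made-up "after" scores, paired with `pre`
    a = rng.normal(10.0, 2.0, size=40)        # made-up independent samples
    b = rng.normal(11.0, 2.5, size=35)

    # Matched pairs: paired t-test (equivalent to a one-sample test on post - pre).
    print(stats.ttest_rel(post, pre))

    # Two samples, pooled variance (homoskedastic).
    print(stats.ttest_ind(a, b, equal_var=True))

    # Two samples, separate variances: Welch's t-test (heteroskedastic).
    print(stats.ttest_ind(a, b, equal_var=False))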
• Rejecting the null hypothesis: You can never be completely sure that the null
hypothesis is false from using a hypothesis test! Any statement stronger than “the
data do not support the null hypothesis” should be made with extreme caution.
Figure 2.6: Sampling distributions for p = 0.5 (black) and p = 0.501 (blue) for n = 1000000. Note the scale of the x-axis:
the large number of samples dramatically reduces the variance of each distribution.
• The probability of two children independently dying suddenly from natural causes like Sudden Infant Death Syndrome (SIDS) is 1 in 73 million. Such an event would occur by chance only once every 100 years, which was presented as evidence that the deaths were not natural.
• If the death was not due to two independent cases of SIDS (as asserted above), the only other
possibility was that they were murdered.
The assumption of independence in the first item was later shown to be incorrect: the two children were
not only genetically similar but also were raised in similar environments, causing dependence between
the two events. This wrongful assumption of independence is a common error in statistical analysis.
The probability then goes up dramatically.ᵃ
Also, showing the unlikeliness of two chance deaths does not imply any particular alternative! Even if it
were true, it doesn’t make sense to consider the “1 in 73 million claim” by itself: it has to be compared
to the probability of two murders (which was later estimated to be even lower). This second error is
known as the prosecutor’s fallacy. In fact, tests later showed bacterial infection in one of the children!
ᵃSee Royal Statistical Society concerned by issues raised in Sally Clark Case, October 2001.