Topic 17: Simple Hypotheses
November, 2011

1 Overview and Terminology
Statistical hypothesis testing is designed to address the question: Do the data provide sufficient evidence to conclude
that we must depart from our original assumption concerning the state of nature?
The logic of hypothesis testing is similar to that faced by a jury in a criminal trial: Is the evidence provided by the
prosecutor sufficient for the jury to depart from its original assumption that the defendant is not guilty of the charges
brought before the court?
Two of the jury's possible actions are
Find the defendant guilty.
Find the defendant not guilty.
The weight of evidence that is necessary to find the defendant guilty depends on the type of trial. In a criminal
trial the stated standard is that the prosecution must prove that the defendant is guilty beyond any reasonable doubt.
In civil trials, the burden of proof may be the intermediate level of clear and convincing evidence or the lower level of
the preponderance of evidence.
Given the level of evidence needed, a prosecutor's task is to present the evidence in the most powerful and convincing
manner possible. We shall see these notions reflected in the nature of hypothesis testing.
The simplest set-up for understanding the issues of statistical hypothesis testing is the case of two values θ0, θ1 in the
parameter space. We write the test, known as a simple hypothesis, as

H0: θ = θ0   versus   H1: θ = θ1.
hypothesis tests

                      H0 is true       H1 is true
  reject H0           type I error     OK
  fail to reject H0   OK               type II error

criminal trials

                      the defendant is innocent   the defendant is guilty
  convict             type I error                OK
  do not convict      OK                          type II error
Thus, the higher level of evidence necessary to secure a conviction in a criminal trial corresponds to a lower significance
level. This analogy should not be taken too far. The nature of the data and of the decision-making process are quite
dissimilar. For example, the prosecutor and the defense attorney are not always out to find the most honest manner
to present information. In statistical inference for hypothesis testing, the goal is something that all participants in the
endeavor ought to share.
The decision for the test is often based on first determining a critical region C. Data x in this region are deemed
too unlikely to have occurred when the null hypothesis is true. Thus, the decision is

reject H0 if and only if x ∈ C.
Given a choice α for the size of the test, the choice of a critical region C is called best or most powerful if, for any
other choice of critical region C̃ for a size α test, i.e., both critical regions lead to the same type I error probability,

α = Pθ0{X ∈ C} = Pθ0{X ∈ C̃},

but perhaps different type II error probabilities,

β = Pθ1{X ∉ C},   β̃ = Pθ1{X ∉ C̃},

we have the lowest probability of a type II error (β ≤ β̃) associated to the critical region C.
Many critical regions are determined either by the consequences of the Neyman-Pearson lemma or by using analogies
to this fundamental lemma. Rather than presenting a proof of this lemma, we will provide some intuition for the choice
of critical region through the following game.
We will make a single observation X that can take values from -11 to 11 and, based on that observation, decide
whether or not to reject the null hypothesis. Basing a decision on a single observation, of course, is not the usual
circumstance for hypothesis testing. We will first continue on this line of reasoning to articulate the logic behind the
Neyman-Pearson lemma before examining more typical and reasonable data collection protocols.
To begin the game, corresponding to values for x running from -11 to 11, write a row of the numbers from 0 up to
10 and back down to 0, and add an additional 0 at each end. These numbers add to give 100. Now, scramble the numbers
and write them under the first row. This can be created and displayed quickly in R using the commands:
> x<-c(-11:11)
> L0<-c(0,0:10,9:0,0)
> L1<-sample(L0,length(L0))
> data.frame(x,L0,L1)
The top row, giving the values of L0, represents the likelihood for our one observation under the null hypothesis.
The bottom row, giving the values of L1, represents the likelihood under the alternative hypothesis. Note that the
values for L1 are a rearrangement of the values for L0. Here is the output from one simulation.
x       -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7   8   9  10  11
L0(x)     0   0   1   2   3   4   5   6   7   8   9  10   9   8   7   6   5   4   3   2   1   0   0
L1(x)     3   8   7   5   7   1   3  10   6   0   6   4   2   5   0   1   0   4   0   8   2   9   9
The goal of this game is to pick values x so that your accumulated points from your likelihood L0 increase as quickly
as possible while keeping your opponent's points from L1 as low as possible. The natural start is to pick values of x so
that L1(x) = 0. Then, the points you collect begin to add up without your opponent gaining anything. We find 4 such
values for x and record them along with running totals for L0 and L1.
x    L0 total   L1 total
-2        8          0
 3       15          0
 5       20          0
 7       23          0
Being ahead by a score of 23 to 0 can be translated into a best critical region in the following way. Take as
our critical region C all the values for x except -2, 3, 5, and 7. Then, because the L0-total is 23 points out of a possible
100, we find that

the size of the test α = 0.77 and the power of the test 1 − β = 1.00,

because there is no chance of a type II error with this critical region. If the result of our one observation is one of -2, 3,
5, or 7, then we are never incorrect in failing to reject H0.
Understanding the next choice is crucial. The candidates are

x = 4, with L0(4) = 6 against L1(4) = 1, and x = 1, with L0(1) = 9 against L1(1) = 2.

The choice 6 against 1 is better than 9 against 2. One way to see this is to note that choosing 6 against 1 twice will
put us in a better place than the single choice of 9 against 2. Indeed, after choosing 6 against 1, a choice of 3 against
1 puts us in at least as good a position as the single choice of 9 against 2. The central point is that the best choice
comes down to picking the remaining value for x that has the highest ratio of L0(x) to L1(x).
Now we can pick the next few candidates, keeping track of the size α and the power 1 − β of the test, with the choice of
critical region being the values of x not yet chosen.
x    L0(x)/L1(x)   L0 total   L1 total      α    1 − β
-2        ∞              8          0     0.92    1.00
 3        ∞             15          0     0.85    1.00
 5        ∞             20          0     0.80    1.00
 7        ∞             23          0     0.77    1.00
 4        6             29          1     0.71    0.99
 1       9/2            38          3     0.62    0.97
-6        4             42          4     0.58    0.96
 0       5/2            52          8     0.48    0.92
-5       5/3            57         11     0.43    0.89
From this exercise we see how the likelihood ratio test is the choice for a most powerful test. For example, for
these likelihoods, the last column states that for an α = 0.43 level test, the best critical region consists of those values of x so
that

L1(x)/L0(x) > 3/5.

The power is 1 − β = 0.89 and thus the type II error probability is β = 0.11. In genuine examples, we will typically
look for a level α much below 0.43 and we will make not one observation but many. We now summarize carefully the
insights from this game before examining more genuine examples. A proof of this theorem is provided in Section 4.
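As a quick numerical check of the totals in the table (an added illustration, not part of the original notes), the size and power of this α = 0.43 region can be recomputed in R; the L1 values below are copied from the simulation output displayed above.

> x <- -11:11
> L0 <- c(0,0:10,9:0,0)
> L1 <- c(3,8,7,5,7,1,3,10,6,0,6,4,2,5,0,1,0,4,0,8,2,9,9)    # copied from the table above
> removed <- c(-2,3,5,7,4,1,-6,0,-5)                         # the nine values chosen in the game
> C <- !(x %in% removed)                                     # critical region: all x not yet chosen
> sum(L0[C])/100
[1] 0.43
> sum(L1[C])/100
[1] 0.89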
Theorem 1 (Neyman-Pearson Lemma). Let L(θ|x) denote the likelihood function for the random variable X
corresponding to the probability measure Pθ. If there exists a critical region C of size α and a nonnegative constant kα
such that

L(θ1|x)/L(θ0|x) ≥ kα for x ∈ C   and   L(θ1|x)/L(θ0|x) < kα for x ∉ C,   (1)

then C is the most powerful critical region of size α.
Figure 1: Receiver operating characteristic. The graph of α = P{X ∈ C | H0 is true} (significance) versus 1 − β = P{X ∈ C | H1 is true} (power) in the example. The horizontal axis is also called the false positive fraction (FPF). The vertical axis, the power 1 − β, is also called the true positive fraction (TPF).
We, thus, reject the null hypothesis if and only if the likelihood ratio exceeds a value kα, with

α = Pθ0{ L(θ1|X)/L(θ0|X) ≥ kα }.
We shall learn that many of the standard tests use critical values for the t-statistic, the chi-square statistic, or the
F-statistic. These critical values are related to the critical value kα in extensions of the ideas of likelihood ratios.
Using R, we can complete the table for L0 total and L1 total.
> o<-order(L1/L0)
> sumL0<-cumsum(L0[o])
> sumL1<-cumsum(L1[o])
> alpha<-1-sumL0/100
> beta<-sumL1/100
> data.frame(x[o],L0[o],L1[o],sumL0,sumL1,alpha,1-beta)
The completed curve, known as the receiver operating characteristic (ROC), is shown in the figure above. The
ROC shows the inevitable trade-offs between type I and type II errors. For example, by the mere fact that the graph
is increasing, we can see that setting a more rigorous test by lowering the level of significance (decreasing
the value on the horizontal axis) necessarily reduces the power (decreasing the value on the vertical axis).
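Continuing the R session above, the curve can be drawn with one more command; this is a minimal added sketch, with the point (1,1), corresponding to the test that always rejects, appended to close the curve.

> plot(c(rev(alpha), 1), c(rev(1 - beta), 1), type="l", xlab="alpha", ylab="1 - beta")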
Examples
Example 2. Let X = (X1, . . . , Xn) be independent normal observations with unknown mean μ and known variance
σ0². The hypothesis is

H0: μ = μ0   versus   H1: μ = μ1.
For the moment, consider the case in which μ1 > μ0. The likelihood ratio is

L(μ1|x)/L(μ0|x)
  = [ (2πσ0²)^(-1/2) exp(-(x1 - μ1)²/(2σ0²)) · · · (2πσ0²)^(-1/2) exp(-(xn - μ1)²/(2σ0²)) ]
    / [ (2πσ0²)^(-1/2) exp(-(x1 - μ0)²/(2σ0²)) · · · (2πσ0²)^(-1/2) exp(-(xn - μ0)²/(2σ0²)) ]
  = exp(-(1/(2σ0²)) Σ_{i=1}^n (xi - μ1)²) / exp(-(1/(2σ0²)) Σ_{i=1}^n (xi - μ0)²)
  = exp(-(1/(2σ0²)) Σ_{i=1}^n [(xi - μ1)² - (xi - μ0)²])
  = exp(((μ1 - μ0)/(2σ0²)) Σ_{i=1}^n (2xi - μ1 - μ0)).
Because the exponential function is increasing, the likelihood ratio test (1) is equivalent to

((μ1 - μ0)/(2σ0²)) Σ_{i=1}^n (2xi - μ1 - μ0) ≥ ln kα,   (2)

which, because μ1 > μ0, holds precisely when x̄ is sufficiently large. Under the null hypothesis, the standardized version
of X̄,

Z = (X̄ - μ0)/(σ0/√n),   (3)

is a standard normal. Set zα so that P{Z ≥ zα} = α. Then, by rearranging (3), we can determine k̃α:

X̄ ≥ μ0 + (σ0/√n) zα = k̃α.
Equivalently, we can use the standardized score Z as our test statistic and zα as the critical value. Note that the
only role played by μ1, the value of the mean under the alternative, is that it is greater than μ0. However, it will play a
role in determining the power of the test.
Exercise 3. In the example above, give the value of k̃α explicitly in terms of kα, μ0, μ1, σ0², and n.
Exercise 4. Modify the calculations in the example above to show that, for the case μ1 < μ0, using the same value of
zα as above, we reject the null hypothesis precisely when

X̄ ≤ μ0 - (σ0/√n) zα, or Z ≤ -zα.
Exercise 5. Give an intuitive explanation why the power should
• increase as a function of |μ1 - μ0|,
• decrease as a function of σ0², and
• increase as a function of n.
Writing Φ for the distribution function of the standard normal, the type II error probability, in this situation, is

β = Pμ1{X ∉ C} = Pμ1{X̄ < μ0 + (σ0/√n) zα}
  = Pμ1{ (X̄ - μ1)/(σ0/√n) < zα - |μ1 - μ0|/(σ0/√n) }
  = Φ( zα - |μ1 - μ0|/(σ0/√n) ),

and the power is

1 - β = 1 - Φ( zα - |μ1 - μ0|/(σ0/√n) ).   (4)
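Equation (4) translates directly into a short R function. The sketch below is an added illustration rather than part of the original notes; the sample call uses the values from the butterfly example that follows.

> power <- function(mu1, mu0, sigma0, n, alpha) pnorm(abs(mu1 - mu0)/(sigma0/sqrt(n)) - qnorm(1 - alpha))
> power(7, 10, 3, 16, 0.05)    # approximately 0.9907, as in Example 6 below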
Example 6. Mimicry is the similarity of one species to another in a manner that enhances the survivability of one or
both species, the model and the mimic. This similarity can be, for example, in appearance, behavior, sound, or scent.
Let's consider a model butterfly species with mean wingspan μ0 = 10 cm and a mimic species with mean
wingspan μ1 = 7 cm. Both species have standard deviation σ0 = 3 cm. We collect 16 specimens to decide whether the mimic
species has migrated into a given region. If we assume, for the null hypothesis, that the habitat under study is populated
by the model species, then
a type I error is falsely concluding that the species is the mimic when indeed the model species is resident, and
a type II error is falsely concluding that the species is the model when indeed the mimic species has invaded.
If our action is to begin an eradication program when we conclude the mimic has invaded, then a type I error would result in the
eradication of the resident model species, and a type II error would result in letting the invasion by the mimic take
its course.
To begin, we set a significance level. The choice of an α = 0.05 test means that we are accepting a 5% chance of
making a type I error. If the goal is to design a test that has the lowest type II error probability, then the Neyman-Pearson
lemma tells us that the critical region is determined by a threshold level kα for the likelihood ratio:

C = { x : L(μ1|x)/L(μ0|x) ≥ kα }.
This region can also be defined as

C = { x : x̄ ≤ k̃α } = { x : (x̄ - μ0)/(σ0/√n) ≤ -zα }.

Under the null hypothesis, X̄ has a normal distribution with mean μ0 = 10 and standard deviation σ0/√n = 3/√16 = 3/4.
Thus, using the distribution function of the normal, we can find either k̃α

> qnorm(0.05,10,3/4)
[1] 8.76636

or -zα,

> qnorm(0.05)
[1] -1.644854

Thus, the critical value is k̃α = 8.767 for the test statistic X̄ and -zα = -1.645 for the test statistic Z. Now let's
look at the data.
> x
 [1]  8.9  2.4 12.1 10.0  9.2  3.7 13.9  9.1  8.8  4.5  8.2 10.2 ...
> mean(x)
[1] 8.93125
Figure 2: Left: (black) Density of X̄ for normal data under the null hypothesis, μ0 = 10 and σ0/√n = 3/√16 = 3/4. With an α = 0.05 level
test, the critical value is k̃α = μ0 - σ0 zα/√n = 8.766. The area to the left of the dashed line and below the density function is α. The alternatives
shown are μ1 = 9 and 8 (in blue) and μ1 = 7 (in red). The areas below these curves and to the left of the dashed line give the power 1 − β. These values
are 0.3777, 0.8466, and 0.9907 for the respective alternatives μ1 = 9, 8, and 7. Right: The corresponding receiver operating characteristic curves, the
power 1 − β versus the significance α, using equation (4). The powers for an α = 0.05 test are indicated by the intersections of the dashed line with the
receiver operating characteristic curves.
Then x̄ = 8.931 and

Z = (8.93125 - 10)/(3/√16) = -1.425.

Since k̃α = 8.766 < 8.931, or equivalently -zα = -1.645 < -1.425, we fail to reject the null hypothesis.
A type II error is falsely failing to conclude that the mimic species has inhabited the study area when indeed it
has. To compute the probability of a type II error, note that for α = 0.05, we substitute into (4):

zα - |μ1 - μ0|/(σ0/√n) = 1.645 - 3/(3/√16) = -2.355.

> pnorm(-2.355)
[1] 0.009261353

Thus the power 1 - β = 1 - 0.0093 = 0.9907.
Let's expand the examination of equation (4). As we move the alternative value μ1 downward, the density of X̄
moves leftward. The values for μ1 = 9, 8, and 7 are displayed on the left in Figure 2. This shift in the values is a
way of saying that the alternative becomes more and more distinct as μ1 decreases. The mimic species becomes
easier and easier to detect. We express this by showing that the test is more and more powerful with decreasing values
of μ1. This is displayed by the increasing area under the density curve to the left of the dashed line, from 0.3777 for
the alternative μ1 = 9 to 0.9907 for μ1 = 7. We can also see this relationship in the receiver operating characteristic,
the graph of the power 1 − β versus the significance α. This is displayed for α = 0.05 by the dashed line.
Often, we wish to know in advance the number of observations n needed to obtain a given power. In this case, we
use (4) with a fixed value of α, the size of the test, and determine the power of the test as a function of n. We display
this in Figure 3 with the value α = 0.01. Notice how the number of observations needed to achieve a desired power
is high when the wingspan of the mimic species is close to that of the model species.
Figure 3: Power as a function of the number of observations for an α = 0.01 level test. The null hypothesis has μ0 = 10. The alternatives shown
are μ1 = 9 and 8 (in blue) and μ1 = 7 (in red). Here σ0 = 3. The low level for α is chosen to reflect the desire to have a stringent criterion for
rejecting the null hypothesis that the resident species is the model species.
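A power-versus-n curve of the kind shown in Figure 3 can be sketched with a few lines of R using equation (4); this is an added illustration with α = 0.01 and the wingspan values from the example.

> n <- 1:100
> power.n <- function(mu1) pnorm(abs(mu1 - 10)/(3/sqrt(n)) - qnorm(1 - 0.01))
> plot(n, power.n(7), type="l", col="red", ylim=c(0,1), xlab="observations", ylab="power")
> lines(n, power.n(8), col="blue")
> lines(n, power.n(9), col="blue")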
The example above is called the z-test. If n is sufficiently large, then even if the data are not normally distributed,
X̄ is well approximated by a normal distribution and, as long as the variance σ0² is known, the z-test is used in this
case. In addition, the z-test can be used when g(X̄1, . . . , X̄n) can be approximated by a normal distribution using the
delta method.
Example 7 (Bernoulli trials). Here X = (X1, . . . , Xn) is a sequence of Bernoulli trials with unknown success
probability p; the likelihood is

L(p|x) = (1 - p)^n (p/(1 - p))^(x1+···+xn).
For the test

H0: p = p0   versus   H1: p = p1,

the likelihood ratio is

L(p1|x)/L(p0|x) = ((1 - p1)/(1 - p0))^n ( (p1/(1 - p1)) / (p0/(1 - p0)) )^(x1+···+xn).   (5)
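The ratio (5) depends on the data only through the number of successes x1 + · · · + xn and is monotone in it, which anticipates the next exercise. A small added R sketch, with hypothetical values p0 = 0.6 and p1 = 0.8, makes this visible.

> LR <- function(s, n, p0, p1) ((1 - p1)/(1 - p0))^n * ((p1/(1 - p1))/(p0/(1 - p0)))^s
> LR(0:20, 20, 0.6, 0.8)    # increasing in the number of successes s because p0 < p1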
Exercise 8. Show that the likelihood ratio (5) results in a test to reject H0 whenever

Σ_{i=1}^n xi ≥ k̃α when p0 < p1,   or   Σ_{i=1}^n xi ≤ k̃α when p0 > p1.   (6)
In words, if the alternative is a higher proportion than the null hypothesis, we reject H0 when the data have too
many successes. If the alternative is lower than the null, we reject H0 when the data do not have enough successes.
In either situation, the number of successes N = Σ_{i=1}^n Xi has a Bin(n, p0) distribution under the null hypothesis.
Thus, in the case p0 < p1, we choose k̃α so that

Pp0{ Σ_{i=1}^n Xi ≥ k̃α } ≤ α.   (7)
In general, we cannot choose k̃α to obtain exactly the value α. Thus, we take the minimum value of k̃α that achieves the
inequality in (7).
To give a concrete example, take p0 = 0.6 and n = 20 and look at a part of the cumulative distribution function.

x                    13      14      15      16      17      18      19       20
FN(x) = P{N ≤ x}  0.7500  0.8744  0.9491  0.9840  0.9964  0.9994  0.99996      1
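These cumulative probabilities, and the smallest value of k̃α satisfying (7), can be computed directly in R. In this added sketch, the level α = 0.05 is an assumed value chosen for illustration; the text above does not fix one.

> round(pbinom(13:20, 20, 0.6), 4)              # the values F_N(x) tabulated above
> k <- 0:20
> min(k[1 - pbinom(k - 1, 20, 0.6) <= 0.05])    # smallest k with P{N >= k} <= 0.05
[1] 17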
For n sufficiently large, we can use the normal approximation: with the sample proportion p̂ = (1/n) Σ_{i=1}^n Xi,

Z = (p̂ - p0)/√(p0(1 - p0)/n)

is approximately a standard normal random variable, and we perform the z-test as in the previous exercise.
For example, if we take p0 = 1/2 and p1 = 3/5 and α = 0.05, then with 60 heads in 100 coin tosses,

Z = (0.60 - 0.50)/√(0.50(1 - 0.50)/100) = (0.60 - 0.50)/0.05 = 2.

> qnorm(0.95)
[1] 1.644854

Thus, 2 > 1.645 = z0.05 and we reject the null hypothesis.
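As an added side check, the normal approximation can be compared with the exact binomial tail probability under H0.

> 1 - pnorm(2)                # normal approximation to the tail probability, about 0.023
> 1 - pbinom(59, 100, 0.5)    # exact P{N >= 60} under H0; slightly larger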
Example 11. We are examining the survivability of bee hives over a given winter. Typically, for a given region, the
probability of survival is p0 = 0.7. We are checking to see if, for a particularly mild winter, this probability moved up
to p1 = 0.8. This leads us to consider the hypotheses

H0: p = p0   versus   H1: p = p1

for a test of the probability that a feral bee hive survives a winter. If we use the central limit theorem, then, under the
null hypothesis,

z = (p̂ - p0)/√(p0(1 - p0)/n)

has a distribution approximately that of a standard normal random variable. For an α level test, the critical value is
zα, where α is the probability that a standard normal is at least zα. If the significance level is α = 0.05, then we will
reject H0 for any value of z > zα = 1.645.
For this study, 112 colonies have been chosen and 88 survive. Thus, p̂ = 88/112 = 0.7857 and

z = (0.7857 - 0.7)/√(0.7(1 - 0.7)/112) = 1.979.

Consequently, we reject H0.
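The arithmetic can be verified with a line of R; this check is an addition, not part of the original computation.

> phat <- 88/112
> (phat - 0.7)/sqrt(0.7*0.3/112)    # approximately 1.979, exceeding z_0.05 = 1.645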
Proof of the Neyman-Pearson Lemma

As before, we use the symbols β and β̃ to denote, respectively, the probabilities of type II error for the critical regions
C and C̃. The Neyman-Pearson lemma is the statement that

β̃ ≥ β.   (8)

Divide both critical regions C and C̃ into two disjoint subsets: the subset S = C ∩ C̃ that the critical regions share, and
the subsets E = C\C̃ and Ẽ = C̃\C that are exclusive to one region. In symbols, we write this as

C = S ∪ E, and C̃ = S ∪ Ẽ.   (9)

Figure 4: The critical region C, as determined by the Neyman-Pearson lemma, is indicated by the circle on the left. The circle on the right is the
critical region C̃ for a second level α test. Thus, C = S ∪ E and C̃ = S ∪ Ẽ.
The powers of the two tests can be written as integrals,

1 - β = Pθ1{X ∈ C} = ∫_C L(θ1|x) dx   and   1 - β̃ = Pθ1{X ∈ C̃} = ∫_C̃ L(θ1|x) dx.

Now subtract from both of the integrals the quantity Pθ1{X ∈ S}, the probability that the null hypothesis would be
rejected by both tests, to obtain

β̃ - β = Pθ1{X ∈ E} - Pθ1{X ∈ Ẽ}.   (10)

We can use the likelihood ratio criterion on each of the two terms on the right.
For x ∈ E, x is in the critical region C and consequently L(θ1|x) ≥ kα L(θ0|x), and

Pθ1{X ∈ E} = ∫_E L(θ1|x) dx ≥ kα ∫_E L(θ0|x) dx = kα Pθ0{X ∈ E}.

For x ∈ Ẽ, x is not in the critical region C and consequently L(θ1|x) ≤ kα L(θ0|x), and

Pθ1{X ∈ Ẽ} = ∫_Ẽ L(θ1|x) dx ≤ kα ∫_Ẽ L(θ0|x) dx = kα Pθ0{X ∈ Ẽ}.

Because both critical regions have size α, Pθ0{X ∈ E} = α - Pθ0{X ∈ S} = Pθ0{X ∈ Ẽ}, and therefore

β̃ - β ≥ kα Pθ0{X ∈ E} - kα Pθ0{X ∈ Ẽ} = 0,

which is the claim (8).
Answers to selected exercises

3. The likelihood ratio test rejects H0 when

L(μ1|x)/L(μ0|x) = exp( ((μ1 - μ0)/(2σ0²)) Σ_{i=1}^n (2xi - μ1 - μ0) ) ≥ kα.

Thus,

((μ1 - μ0)/(2σ0²)) Σ_{i=1}^n (2xi - μ1 - μ0) ≥ ln kα
Σ_{i=1}^n (2xi - μ1 - μ0) ≥ (2σ0²/(μ1 - μ0)) ln kα
2x̄ - μ1 - μ0 ≥ (2σ0²/(n(μ1 - μ0))) ln kα
x̄ ≥ (σ0²/(n(μ1 - μ0))) ln kα + (μ1 + μ0)/2 = k̃α.

4. Since μ1 < μ0, division by μ1 - μ0 changes the direction of the inequality. The rest of the argument proceeds as
before, and we obtain x̄ ≤ k̃α.
5. If power means easier to distinguish using the data, then this is true when the means are farther apart, the measurements are less variable, or the number of measurements increases. This can be seen explicitly in the power equation
(4).
8. For the likelihood ratio (5), take the logarithm to obtain

ln( L(p1|x)/L(p0|x) ) = n ln( (1 - p1)/(1 - p0) ) + (x1 + · · · + xn) ln( (p1/(1 - p1)) / (p0/(1 - p0)) ) ≥ ln kα.

If p0 < p1, then the ratio in the expression for the logarithm in the second term is greater than 1 and, consequently, the
logarithm is positive. Thus, we isolate the sum Σ_{i=1}^n xi to give the test (6). For p0 > p1, the logarithm is negative and
the direction of the inequality in (6) is reversed.