Null Hypothesis Significance Testing III

Class 19, 18.05


Jeremy Orloff and Jonathan Bloom

1 Learning Goals

1. Given hypotheses and data, be able to identify an appropriate significance
test from a list of common ones.

2. Given hypotheses, data, and a suggested significance test, know how to look up details
and apply the significance test.

2 Introduction

In these notes we will collect together some of the most common significance tests, though
by necessity we will leave out many other useful ones. Still, all significance tests follow the
same basic pattern in their design and implementation, so by learning the ones we include
you should be able to easily apply other ones as needed.
Designing a null hypothesis significance test (NHST):

• Specify null and alternative hypotheses.

• Choose a test statistic whose null distribution and alternative distribution(s) are
known.

• Specify a rejection region. Most often this is done implicitly by specifying a significance
level α and a method for computing p-values based on the tails of the null
distribution.

• Compute power using the alternative distribution(s).

Running a NHST:

• Collect data and compute the test statistic.

• Check if the test statistic is in the rejection region. Most often this is done implicitly
by checking if p < α. If so, we ‘reject the null hypothesis in favor of the alternative
hypothesis’. Otherwise we conclude ‘the data does not support rejecting the null
hypothesis’.

Note the careful phrasing: when we fail to reject H0, we do not conclude that H0 is true.
The failure to reject may have other causes. For example, we might not have enough data
to clearly distinguish H0 and HA, whereas more data would indicate that we should reject
H0.

18.05 class 19, Null Hypothesis Significance Testing III, Spring 2014 2

3 Population parameters and sample statistics

Example 1. If we randomly select 10 men from a population and measure their heights,
we say we have sampled the heights from the population. In this case the sample mean, say
x̄, is the mean of the sampled heights. It is a statistic and we know its value explicitly. On
the other hand, the true average height of the population, say µ, is unknown and we can
only estimate its value. We call µ a population parameter.
The main purpose of significance testing is to use sample statistics to draw conclusions about
population parameters. For example, we might test if the average height of men in a given
population is greater than 70 inches.

4 A gallery of common significance tests related to the normal distribution

We will show a number of tests that all assume normal data. For completeness we will
include the z and t tests we’ve already explored.
You shouldn’t try to memorize these tests. It is a hopeless task to memorize the tests given
here and even more hopeless to memorize all the tests we’ve left out. Rather, your goal
should be to be able to find the correct test when you need it. Pay attention to the types
of hypotheses the tests are designed to distinguish and the assumptions about the data
needed for the test to be valid. We will work through the details of these tests in class and
on homework.
The null distributions for these tests are all related to the normal distribution by
explicit formulas. We will not go into the details of these distributions or the arguments
showing how they arise as the null distributions in our significance tests. However, the
arguments are accessible to anyone who knows calculus and is interested in understanding
them. Given the name of any distribution, you can easily look up the details of its construction
and properties online. You can also use R to explore the distribution numerically
and graphically.
When analyzing data with any of these tests one thing of key importance is to verify that
the assumptions are true or at least approximately true. For example, you shouldn’t use a
test that assumes the data is normal unless you’ve checked that the data is approximately
normal.
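One way to check approximate normality in R is sketched below. This is only an illustration: the data are simulated (hypothetical, not from these notes), and the choice of a normal Q-Q comparison plus the Shapiro-Wilk test is our assumption about what a reasonable check might look like.

```r
# Hypothetical data: 50 simulated measurements (not from the notes).
set.seed(1)
x <- rnorm(50, mean = 70, sd = 3)

# Normal Q-Q comparison. With plot.it = FALSE we only get the quantile pairs;
# drop that argument (and call qqline(x)) to draw the usual plot.
qq <- qqnorm(x, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)   # close to 1 when the data look normal
print(qq_cor)

# Shapiro-Wilk test: H0 is that the data come from a normal distribution.
sw <- shapiro.test(x)
print(sw$p.value)           # a small p-value is evidence against normality
```

As with any significance test, a large p-value here does not prove the data are normal. It only says this check found no clear evidence against normality.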
The script class19.r contains examples of using R to run some of these tests. It is posted in
our usual place for R code.

4.1 z-test

• Use: Test if the population mean equals a hypothesized mean.


• Data: x1 , x2 , . . . , xn .
• Assumptions: The data are independent normal samples:
xi ∼ N (µ, σ 2 ) where µ is unknown, but σ is known.
• H0 : For a specified µ0 , µ = µ0 .

• HA :
Two-sided: µ ≠ µ0
one-sided-greater: µ > µ0
one-sided-less: µ < µ0
• Test statistic: $z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
• Null distribution: f (z | H0 ) is the pdf of Z ∼ N (0, 1).
• p-value:
Two-sided: p = P (|Z| > |z|) = 2*(1-pnorm(abs(z), 0, 1))
one-sided-greater: p = P (Z > z) = 1 - pnorm(z, 0, 1)
one-sided-less: p = P (Z < z) = pnorm(z, 0, 1)
• R code: There does not seem to be a single R function to run a z-test. Of course it
is easy enough to get R to compute the z score and p-value.

Example 2. We quickly reprise our example from the class 17 notes.


IQ is normally distributed in the population according to a N(100, 15²) distribution. We
suspect that most MIT students have above average IQ so we frame the following hypotheses.
H0 = MIT student IQs are distributed identically to the general population
= MIT IQs follow a N(100, 15²) distribution.
HA = MIT student IQs tend to be higher than those of the general population
= the average MIT student IQ is greater than 100.
Notice that HA is one-sided.
Suppose we test 9 students and find they have an average IQ of x̄ = 112. Can we reject H0
at a significance level α = 0.05?
answer: Our test statistic is
$$z = \frac{\bar{x} - 100}{15/\sqrt{9}} = \frac{36}{15} = 2.4.$$

The right-sided p-value is therefore

p = P (Z ≥ 2.4) = 1- pnorm(2.4,0,1) = 0.0081975.

Since p ≤ α we reject the null hypothesis in favor of the alternative hypothesis that MIT
students have higher IQs on average.
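Since base R has no single z-test function, the computation in Example 2 can be sketched directly from the formulas in this section. The variable names are our own choice.

```r
# One-sided (greater) z-test for Example 2.
mu0   <- 100   # mean under H0
sigma <- 15    # known standard deviation
n     <- 9     # number of students tested
xbar  <- 112   # observed sample mean
alpha <- 0.05  # significance level

z <- (xbar - mu0) / (sigma / sqrt(n))   # test statistic
p <- 1 - pnorm(z, 0, 1)                 # right-tail p-value

print(z)           # 2.4
print(p)           # approximately 0.0082
print(p <= alpha)  # TRUE, so we reject H0
```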

4.2 One-sample t-test of the mean

• Use: Test if the population mean equals a hypothesized mean.


• Data: x1 , x2 , . . . , xn .
• Assumptions: The data are independent normal samples:
xi ∼ N (µ, σ 2 ) where both µ and σ are unknown.
• H0 : For a specified µ0 , µ = µ0

• HA :
Two-sided: µ ≠ µ0
one-sided-greater: µ > µ0
one-sided-less: µ < µ0
• Test statistic: $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$,
where $s^2$ is the sample variance: $s^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
• Null distribution: f (t | H0 ) is the pdf of T ∼ t(n − 1).
(Student t-distribution with n − 1 degrees of freedom)
• p-value:
Two-sided: p = P (|T | > |t|) = 2*(1-pt(abs(t), n-1))
one-sided-greater: p = P (T > t) = 1 - pt(t, n-1)
one-sided-less: p = P (T < t) = pt(t, n-1)
• R code example: For data x = 1, 3, 5, 7, 2 we can run a one-sample t-test with H0 :
µ = 2.5 using the R commands:
x = c(1, 3, 5, 7, 2)
t.test(x, mu = 2.5, alternative = "two.sided")
This will return several pieces of information including the mean of the data, the t-value
and the two-sided p-value. See the help for this function for other argument settings.
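To connect the formulas with the R function, here is a small sketch that computes the one-sample t statistic and two-sided p-value by hand for the data above and compares them with what t.test reports. The variable names are our own.

```r
x   <- c(1, 3, 5, 7, 2)
mu0 <- 2.5
n   <- length(x)

# By hand, following the formulas in this section.
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
p      <- 2 * (1 - pt(abs(t_stat), n - 1))

# Using the built-in function.
result <- t.test(x, mu = mu0, alternative = "two.sided")

print(c(by_hand = t_stat, t.test = unname(result$statistic)))
print(c(by_hand = p, t.test = result$p.value))
```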

Example 3. Look in the class 18 notes or slides for an example of this test. The class 19
example R code also gives an example.

4.3 Two-sample t-test for comparing means

4.3.1 The case of equal variances

We start by describing the test assuming equal variances.

• Use: Test if the population means from two populations differ by a hypothesized
amount.
• Data: x1 , x2 , . . . , xn and y1 , y2 , . . . , ym .
• Assumptions: Both groups of data are independent normal samples:
xi ∼ N (µx , σ²)
yj ∼ N (µy , σ²)
where both µx and µy are unknown and possibly different. The variance σ² is unknown,
but the same for both groups.
• H0 : For a specified µ0 : µx − µy = µ0
• HA :
Two-sided: µx − µy ≠ µ0
one-sided-greater: µx − µy > µ0
one-sided-less: µx − µy < µ0

• Test statistic: $t = \dfrac{\bar{x} - \bar{y} - \mu_0}{s_P}$,
where $s_x^2$ and $s_y^2$ are the sample variances of the x and y data respectively, and $s_P^2$
is (sometimes called) the pooled sample variance:

$$s_P^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2} \left( \frac{1}{n} + \frac{1}{m} \right) \quad\text{and}\quad df = n + m - 2$$

• Null distribution: f (t | H0 ) is the pdf of T ∼ t(df ), the t-distribution with df =
n + m − 2 degrees of freedom.
• p-value:
Two-sided: p = P (|T | > |t|) = 2*(1-pt(abs(t), df))
one-sided-greater: p = P (T > t) = 1 - pt(t, df)
one-sided-less: p = P (T < t) = pt(t, df)
• R code: The R function t.test will run a two-sample t-test. See the example code
in class19.r

Example 4. Look in the class 18 notes or slides for an example of the two-sample t-test.
Notes: 1. Most often the test is done with µ0 = 0. That is, the null hypothesis is that the
means are equal, i.e. µx − µy = 0.
2. If the x and y data have the same length, n, then the formula for $s_P^2$ becomes simpler:

$$s_P^2 = \frac{s_x^2 + s_y^2}{n}$$
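As a sketch of the equal-variance case, the code below computes the pooled statistic by hand and compares it with t.test using var.equal=TRUE. The data are hypothetical, made up purely for illustration, and the variable names are our own.

```r
# Hypothetical data, made up for illustration (not from the notes).
x <- c(5.1, 4.8, 6.0, 5.7, 5.3)
y <- c(4.2, 4.9, 4.4, 5.0, 4.1, 4.6)
n <- length(x)
m <- length(y)
mu0 <- 0   # H0: the two population means are equal

# By hand: pooled variance, degrees of freedom, t statistic, two-sided p-value.
df     <- n + m - 2
sp2    <- ((n - 1) * var(x) + (m - 1) * var(y)) / df * (1 / n + 1 / m)
t_stat <- (mean(x) - mean(y) - mu0) / sqrt(sp2)
p      <- 2 * (1 - pt(abs(t_stat), df))

# Built-in function with the equal-variance assumption.
result <- t.test(x, y, mu = mu0, var.equal = TRUE)

print(c(by_hand = t_stat, t.test = unname(result$statistic)))
print(c(by_hand = p, t.test = result$p.value))
```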

4.3.2 The case of unequal variances

There is a form of the t-test for when the variances are not assumed equal. It is sometimes
called Welch’s t-test.
This looks exactly the same as the case of equal variances except for a small change in the
assumptions and in the formulas for the pooled variance and the degrees of freedom:

• Use: Test if the population means from two populations differ by a hypothesized
amount.
• Data: x1 , x2 , . . . , xn and y1 , y2 , . . . , ym .
• Assumptions: Both groups of data are independent normal samples:
xi ∼ N (µx , σx²)
yj ∼ N (µy , σy²)
where both µx and µy are unknown and possibly different. The variances σx² and σy²
are unknown and not assumed to be equal.
• H0 , HA : Exactly the same as the case of equal variances.
• Test statistic: $t = \dfrac{\bar{x} - \bar{y} - \mu_0}{s_P}$,
where $s_x^2$ and $s_y^2$ are the sample variances of the x and y data respectively, and $s_P^2$
is (sometimes called) the pooled sample variance:

$$s_P^2 = \frac{s_x^2}{n} + \frac{s_y^2}{m} \quad\text{and}\quad df = \frac{(s_x^2/n + s_y^2/m)^2}{(s_x^2/n)^2/(n-1) + (s_y^2/m)^2/(m-1)}$$

• Null distribution: f (t | H0 ) is the pdf of T ∼ t(df ), the t distribution with df degrees
of freedom.
• p-value: Exactly the same as the case of equal variances.
• R code: The function t.test also handles this case by setting the argument var.equal=FALSE.
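Here is a sketch of Welch’s test computed by hand from the formulas above and checked against t.test with var.equal=FALSE. The data are hypothetical, made up purely for illustration.

```r
# Hypothetical data, made up for illustration (not from the notes).
x <- c(5.1, 4.8, 6.0, 5.7, 5.3)
y <- c(4.2, 4.9, 4.4, 5.0, 4.1, 4.6)
n <- length(x)
m <- length(y)

vx <- var(x) / n   # s_x^2 / n
vy <- var(y) / m   # s_y^2 / m

sp2    <- vx + vy                                         # variance of xbar - ybar
df     <- sp2^2 / (vx^2 / (n - 1) + vy^2 / (m - 1))       # usually not an integer
t_stat <- (mean(x) - mean(y) - 0) / sqrt(sp2)             # mu0 = 0
p      <- 2 * (1 - pt(abs(t_stat), df))

result <- t.test(x, y, mu = 0, var.equal = FALSE)

print(c(by_hand = df, t.test = unname(result$parameter)))
print(c(by_hand = p, t.test = result$p.value))
```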

4.3.3 The paired two-sample t-test

When the data naturally comes in pairs (xi , yi ), we can use the paired two-sample t-test.
(After checking the assumptions are valid!)
Example 5. To measure the effectiveness of a cholesterol lowering medication we might
test each subject before and after treatment with the drug. So for each subject we have
a pair of measurements: xi = cholesterol level before treatment and yi = cholesterol level
after treatment.
Example 6. To measure the effectiveness of a cancer treatment we might pair each subject
who received the treatment with one who did not. In this case we would want to pair subjects
who are similar in terms of stage of the disease, age, sex, etc.

• Use: Test if the average difference between paired values in a population equals a
hypothesized value.
• Data: x1 , x2 , . . . , xn and y1 , y2 , . . . , yn must have the same length.
• Assumptions: The differences wi = xi − yi between the paired samples are independent
draws from a normal distribution N(µ, σ²), where µ and σ are unknown.
• NOTE: This is just a one-sample t-test using wi .
• H0 : For a specified µ0 , µ = µ0 .
• HA :
Two-sided: µ ≠ µ0
one-sided-greater: µ > µ0
one-sided-less: µ < µ0
• Test statistic: $t = \dfrac{\bar{w} - \mu_0}{s/\sqrt{n}}$,
where $s^2$ is the sample variance: $s^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (w_i - \bar{w})^2$
• Null distribution: f (t | H0 ) is the pdf of T ∼ t(n − 1).
(Student t-distribution with n − 1 degrees of freedom)
• p-value:
Two-sided: p = P (|T | > |t|) = 2*(1-pt(abs(t), n-1))
one-sided-greater: p = P (T > t) = 1 - pt(t, n-1)
one-sided-less: p = P (T < t) = pt(t, n-1)

• R code: The R function t.test will do a paired two-sample test if you set the argument
paired=TRUE. You can also run a one-sample t-test on x − y. There are examples
of both of these in class19.r

Example 7. The following example is taken from Rice [1].

To study the effect of cigarette smoking on platelet aggregation Levine (1973) drew blood
samples from 11 subjects before and after they smoked a cigarette and measured the extent
to which platelets aggregated. Here is the data:
Before      25  25  27  44  30  67  53  53  52  60  28
After       27  29  37  56  46  82  57  80  61  59  43
Difference   2   4  10  12  16  15   4  27   9  -1  15
The null hypothesis is that smoking had no effect on platelet aggregation, i.e. that the difference
should have mean µ0 = 0. We ran a paired two-sample t-test to test this hypothesis.
Here is the R code (it is also in class19.r):
before.cig = c(25,25,27,44,30,67,53,53,52,60,28)
after.cig = c(27,29,37,56,46,82,57,80,61,59,43)
mu0 = 0
result = t.test(after.cig, before.cig, alternative="two.sided", mu=mu0, paired=TRUE)
print(result)
Here is the output:
Paired t-test
data: after.cig and before.cig
t = 4.2716, df = 10, p-value = 0.001633
alternative hypothesis: true difference in means is not equal to 0
mean of the differences: 10.27273
We got the same results with the one-sample t-test:
t.test(after.cig - before.cig, mu=0)
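To see where the numbers in the output come from, we can also sketch the computation by hand from the differences, using the one-sample formulas with wi = after − before. The variable names w, t_stat and p are our own.

```r
before.cig <- c(25, 25, 27, 44, 30, 67, 53, 53, 52, 60, 28)
after.cig  <- c(27, 29, 37, 56, 46, 82, 57, 80, 61, 59, 43)

w   <- after.cig - before.cig   # paired differences
n   <- length(w)
mu0 <- 0

t_stat <- (mean(w) - mu0) / (sd(w) / sqrt(n))   # t statistic with n - 1 = 10 df
p      <- 2 * (1 - pt(abs(t_stat), n - 1))      # two-sided p-value

print(mean(w))   # mean of the differences, 10.27273
print(t_stat)    # 4.2716, matching the t.test output
print(p)         # 0.001633, matching the t.test output
```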

4.4 One-way ANOVA (F -test for equal means)

• Use: Test if the population means from n groups are all the same.
• Data: (n groups, m samples from each group)
x1,1 , x1,2 , . . . , x1,m
x2,1 , x2,2 , . . . , x2,m
...
xn,1 , xn,2 , . . . , xn,m
• Assumptions: Data for each group is an independent normal sample drawn from
distributions with (possibly) different means but the same variance:
x1,j ∼ N (µ1 , σ 2 )
x2,j ∼ N (µ2 , σ 2 )
...
xn,j ∼ N (µn , σ 2 )
[1] John Rice, Mathematical Statistics and Data Analysis, 2nd edition, p. 412. This example references P. H.
Levine (1973), An acute effect of cigarette smoking on platelet function. Circulation, 48, 619-623.

The group means µi are unknown and possibly different. The variance σ² is unknown,
but the same for all groups.
• H0 : All the means are identical µ1 = µ2 = . . . = µn .
• HA : Not all the means are the same.
• Test statistic: $w = \dfrac{MS_B}{MS_W}$, where
$\bar{x}_i$ = mean of group i
$= \dfrac{x_{i,1} + x_{i,2} + \ldots + x_{i,m}}{m}$.
$\bar{x}$ = grand mean of all the data.
$s_i^2$ = sample variance of group i
$= \dfrac{1}{m-1} \sum_{j=1}^{m} (x_{i,j} - \bar{x}_i)^2$.
$MS_B$ = between group variance
= m × sample variance of group means
$= \dfrac{m}{n-1} \sum_{i=1}^{n} (\bar{x}_i - \bar{x})^2$.
$MS_W$ = average within group variance
= sample mean of $s_1^2, \ldots, s_n^2$
$= \dfrac{s_1^2 + s_2^2 + \ldots + s_n^2}{n}$
• Idea: If the µi are all equal, this ratio should be near 1. If they are not equal then
MSB should be larger while MSW should remain about the same, so w should be
larger. We won’t give a proof of this.
• Null distribution: f (w | H0 ) is the pdf of W ∼ F (n − 1, n(m − 1)).
This is the F -distribution with (n − 1) and n(m − 1) degrees of freedom. Several
F -distributions are plotted below.
• p-value: p = P (W > w) = 1 - pf(w, n-1, n*(m-1))
[Figure: pdfs of the F(3,4), F(10,15) and F(30,15) distributions, plotted for x from 0 to 10.]
Notes: 1. ANOVA tests whether all the means are the same. It does not test whether
some subset of the means are the same.
2. There is a test where the variances are not assumed equal.
3. There is a test where the groups don’t all have the same number of samples.
4. R has a function aov() to run ANOVA tests. See:
https://personality-project.org/r/r.guide/r.anova.html#oneway
http://en.wikipedia.org/wiki/F-test

Example 8. The table shows patients’ perceived level of pain (on a scale of 1 to 6) after
3 different medical procedures.

T1 T2 T3
2 3 2
4 4 1
1 6 3
5 1 3
3 4 5

(1) Set up and run an F-test comparing the means of these 3 treatments.
(2) Based on the test, what might you conclude about the treatments?
answer: Using the code below, the F statistic is 0.325 and the p-value is 0.729. At any
reasonable significance level we will fail to reject the null hypothesis that the average pain
level is the same for all three treatments.
Note, it is not reasonable to conclude that the null hypothesis is true. With just 5 data
points per procedure we might simply lack the power to distinguish different means.
R code to perform the test:
# DATA ----
T1 = c(2,4,1,5,3)
T2 = c(3,4,6,1,4)
T3 = c(2,1,3,3,5)
procedure = c(rep('T1',length(T1)),rep('T2',length(T2)),rep('T3',length(T3)))
pain = c(T1,T2,T3)
data.pain = data.frame(procedure,pain)
aov.data = aov(pain~procedure,data=data.pain) # do the analysis of variance
print(summary(aov.data)) # show the summary table
# class19.r also shows code to compute the ANOVA by hand.
The summary shows a p-value (shown as Pr(>F)) of 0.729. Therefore we do not reject the
null hypothesis that all three group population means are the same.
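The same F statistic can be computed directly from the definitions of MSB and MSW given above. The following is our own sketch of such a by-hand computation, not necessarily the code in class19.r.

```r
T1 <- c(2, 4, 1, 5, 3)
T2 <- c(3, 4, 6, 1, 4)
T3 <- c(2, 1, 3, 3, 5)
groups <- list(T1, T2, T3)

n <- length(groups)   # number of groups
m <- length(T1)       # samples per group (the same for every group)

group_means <- sapply(groups, mean)
group_vars  <- sapply(groups, var)

msb <- m * var(group_means)   # between group variance
msw <- mean(group_vars)       # average within group variance
w   <- msb / msw              # F statistic
p   <- 1 - pf(w, n - 1, n * (m - 1))

print(w)   # 0.325
print(p)   # approximately 0.729
```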

4.5 Chi-square test for goodness of fit

This is a test of how well a hypothesized probability distribution fits a set of data. The test
statistic is called a chi-square statistic and the null distribution associated with the chi-square
statistic is the chi-square distribution. It is denoted by χ²(df ) where the parameter df is
called the degrees of freedom.
Suppose we have an unknown probability mass function given by the following table.
Outcomes ω1 ω2 ... ωn
Probabilities p1 p2 ... pn
In the chi-square test for goodness of fit we hypothesize a set of values for the probabilities.
Typically we will hypothesize that the probabilities follow a known distribution with certain
parameters, e.g. binomial, Poisson, multinomial. The test then tries to determine if this
set of probabilities could have reasonably generated the data we collected.

• Use: Test whether discrete data fits a specific finite probability mass function.
• Data: An observed count Oi for each possible outcome ωi .
• Assumptions: None
• H0 : The data was drawn from a specific discrete distribution.
• HA : The data was drawn from a different distribution.
• Test statistic: The data consists of observed counts Oi for each ωi . From the null hypothesis
probability table we get a set of expected counts Ei . There are two statistics
that we can use:

Likelihood ratio statistic: $G = 2 \sum O_i \ln\left(\dfrac{O_i}{E_i}\right)$
Pearson’s chi-square statistic: $X^2 = \sum \dfrac{(O_i - E_i)^2}{E_i}$.

It is a theorem that under the null hypothesis X² ≈ G and both are approximately
chi-square. Before computers, X² was used because it was easier to compute. Now,
it is better to use G although you will still see X² used quite often.
• Degrees of freedom df : For chi-square tests the number of degrees of freedom can be
a bit tricky. In this case df = n − 1. It is computed as the number of cell counts
that can be freely set under HA consistent with the statistics needed to compute the
expected cell counts assuming H0 .
• Null distribution: Assuming H0 , both statistics (approximately) follow a chi-square
distribution with df degrees of freedom. That is, both f (G | H0 ) and f (X² | H0 ) have
the same pdf as Y ∼ χ²(df ).
• p-value:
p = P (Y > G) = 1 - pchisq(G, df)
p = P (Y > X²) = 1 - pchisq(X2, df)
• R code: The R function chisq.test can be used to do the computations for a chi-square
test using X². For G you either have to do it by hand or find a package that has
a function. (It will probably be called likelihood.test or G.test.)

Notes. 1. When the likelihood ratio statistic G is used the test is also called a G-test or
a likelihood ratio test.
Example 9. First chi-square example. Suppose we have an experiment that produces
numerical data. For this experiment the possible outcomes are 0, 1, 2, 3, 4, 5 or more. We
run 51 trials and count the frequency of each outcome, getting the following data:
Outcomes 0 1 2 3 4 ≥5
Observed counts 3 10 15 13 7 3
Suppose our null hypothesis H0 is that the data is drawn from 51 trials of a binomial(8,
0.5) distribution and our alternative hypothesis HA is that the data is drawn from some
other distribution. Do all of the following:
1. Make a table of the observed and expected counts.
2. Compute both the likelihood ratio statistic G and Pearson’s chi-square statistic X².

3. Compute the degrees of freedom of the null distribution.
4. Compute the p-values corresponding to G and X².
answer: All of the R code used for this example is in class19.r.
1. Assuming H0 the data truly comes from a binomial(8, 0.5) distribution. We have 51
total observations, so the expected count for each outcome is just 51 times its probability.
We computed the binomial(8, 0.5) probabilities and expected counts in R:
Outcomes           0       1       2       3       4       ≥5
Observed counts    3       10      15      13      7       3
H0 probabilities   0.0039  0.0313  0.1094  0.2188  0.2734  0.3633
Expected counts    0.20    1.59    5.58    11.16   13.95   18.53
2. Using the formulas above we compute that X² = 116.41 and G = 66.08.
3. The only statistic used in computing the expected counts was the total number of
observations 51. So, the degrees of freedom is 5, i.e. we can set 5 of the cell counts freely
and the last is determined by requiring that the total number is 51.
4. The p-values are pG = 1 - pchisq(G, 5) and pX2 = 1 - pchisq(X2, 5). Both p-values
are effectively 0. For almost any significance level we would reject H0 in favor of
HA.
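The steps of this example can be sketched in R as follows. This is our own sketch, not necessarily the code in class19.r. The expected counts are 51 times the binomial(8, 0.5) probabilities, with the last cell collecting all outcomes of 5 or more.

```r
observed <- c(3, 10, 15, 13, 7, 3)   # counts for outcomes 0, 1, 2, 3, 4, >= 5

# binomial(8, 0.5) probabilities for 0..4, plus P(outcome >= 5) as the last cell
probs    <- c(dbinom(0:4, 8, 0.5), 1 - pbinom(4, 8, 0.5))
expected <- sum(observed) * probs    # 51 trials in total

X2 <- sum((observed - expected)^2 / expected)        # Pearson's chi-square statistic
G  <- 2 * sum(observed * log(observed / expected))   # likelihood ratio statistic

df   <- length(observed) - 1   # 6 cells, one constraint (the total count)
p_X2 <- 1 - pchisq(X2, df)
p_G  <- 1 - pchisq(G, df)

print(round(expected, 2))
print(c(X2 = X2, G = G))       # about 116.41 and 66.08
print(c(p_X2 = p_X2, p_G = p_G))
```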
Example 10. (Degrees of freedom.) Suppose we have the same data as in the previous
example, but our null hypothesis is that the data comes from independent trials of a
binomial(8, θ) distribution, where θ can be anything. (HA is that the data comes from some
other distribution.) In this case we must estimate θ from the data, e.g. using the MLE.
In total we have computed two values from the data: the total number of counts and the
estimate of θ. So, the degrees of freedom is 6 − 2 = 4.
Example 11. Mendel’s genetic experiments (Adapted from Rice, Mathematical Statistics
and Data Analysis, 2nd ed., example C, p. 314)
In one of his experiments on peas Mendel crossed 556 smooth, yellow male peas with
wrinkled green female peas. Assuming the smooth and wrinkled genes occur with equal
frequency we’d expect 1/4 of the pea population to have two smooth genes (SS), 1/4 to
have two wrinkled genes (ss), and the remaining 1/2 would be heterozygous Ss. We also
expect these fractions for yellow (Y ) and green (y) genes. If the color and smoothness
genes are inherited independently and smooth and yellow are both dominant we’d expect
the following table of frequencies for phenotypes.
           Yellow   Green
Smooth     9/16     3/16     3/4
Wrinkled   3/16     1/16     1/4
           3/4      1/4      1
Probability table for the null hypothesis

So from the 556 crosses the expected number of smooth yellow peas is 556 × 9/16 = 312.75.
Likewise for the other possibilities. Here is a table giving the observed and expected counts
from Mendel’s experiments.

                  Observed count   Expected count
Smooth yellow     315              312.75
Smooth green      108              104.25
Wrinkled yellow   102              104.25
Wrinkled green    31               34.75
The null hypothesis is that the observed counts are random samples distributed according
to the frequency table given above. We use the counts to compute our statistics.
The likelihood ratio statistic is
$$G = 2 \sum O_i \ln\left(\frac{O_i}{E_i}\right) = 2 \left( 315 \ln\frac{315}{312.75} + 108 \ln\frac{108}{104.25} + 102 \ln\frac{102}{104.25} + 31 \ln\frac{31}{34.75} \right) = 0.618$$

Pearson’s chi-square statistic is
$$X^2 = \sum \frac{(O_i - E_i)^2}{E_i} = \frac{2.25^2}{312.75} + \frac{3.75^2}{104.25} + \frac{2.25^2}{104.25} + \frac{3.75^2}{34.75} = 0.604$$
You can see that the two statistics are very close. This is usually the case. In general the
likelihood ratio statistic is more robust and should be preferred.
The degrees of freedom is 3, because there are 4 observed quantities and one relation between
them, i.e. they sum to 556. So, under the null hypothesis G follows a χ²(3) distribution.
Using R to compute the p-value we get

p = 1- pchisq(0.618, 3) = 0.892

Assuming the null hypothesis we would see data at least this extreme almost 90% of the
time. We would not reject the null hypothesis for any reasonable significance level.
The p-value using Pearson’s statistic is 0.895, nearly identical.
The script class19.r shows these calculations and also how to use chisq.test to run a
chi-square test directly.
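A sketch of these calculations, written by us and not necessarily identical to class19.r, is given below. It computes G by hand and uses chisq.test with its p argument for Pearson’s statistic.

```r
# Observed counts: smooth yellow, smooth green, wrinkled yellow, wrinkled green
observed <- c(315, 108, 102, 31)
probs    <- c(9, 3, 3, 1) / 16        # probabilities under H0
expected <- sum(observed) * probs     # 312.75, 104.25, 104.25, 34.75

# Likelihood ratio statistic by hand
G   <- 2 * sum(observed * log(observed / expected))
p_G <- 1 - pchisq(G, 3)

# Pearson's statistic using the built-in function
result <- chisq.test(observed, p = probs)

print(c(G = G, p_G = p_G))   # about 0.618 and 0.892
print(result$statistic)      # X-squared, about 0.604
print(result$p.value)
```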

4.6 Chi-square test for homogeneity

This is a test to see if several independent sets of random data are all drawn from the same
distribution. (The meaning of homogeneity in this case is that all the distributions are the
same.)

• Use: Test whether m different independent sets of discrete data are drawn from the
same distribution.
• Outcomes: ω1 , ω2 , . . . , ωn are the possible outcomes. These are the same for each set
of data.
• Data: We assume m independent sets of data giving counts for each of the possible
outcomes. That is, for data set i we have an observed count Oi,j for each possible
outcome ωj .

• Assumptions: None
• H0 : Each data set is drawn from the same distribution. (We don’t specify what this
distribution is.)
• HA : The data sets are not all drawn from the same distribution.
• Test statistic: See the example below. There are mn cells containing counts for each
outcome for each data set. Using the null hypothesis we can estimate expected counts
for each of the data sets. The statistics X² and G are computed exactly as above.
• Degrees of freedom df : (m − 1)(n − 1). (See the example below.)
• Null distribution: χ²(df ). The p-values are computed just as in the chi-square test
for goodness of fit.
• R code: The R function chisq.test can be used to do the computations for a chi-square
test using X². For G you either have to do it by hand or find a package that has
a function. (It will probably be called likelihood.test or G.test.)

Example 12. Someone claims to have found a long lost work by William Shakespeare.
She asks you to test whether or not the play was actually written by Shakespeare.
You go to http://www.opensourceshakespeare.org and pick a random 12 pages from
King Lear and count the use of common words. You do the same thing for the ‘long lost
work’. You get the following table of counts.

Word             a     an    this   that
King Lear        150   30    30     90
Long lost work   90    20    10     80
Using this data, set up and evaluate a significance test of the claim that the long lost book
is by William Shakespeare. Use a significance level of 0.1.
answer: The null hypothesis H0 : For the 4 words counted the long lost book has the same
relative frequencies as the counts taken from King Lear.
The total word count of both books combined is 500, so the maximum likelihood estimate
of the relative frequencies assuming H0 is simply the total count for each word divided by
the total word count.
Word                       a         an       this     that      Total count
King Lear                  150       30       30       90        300
Long lost work             90        20       10       80        200
totals                     240       50       40       170       500
rel. frequencies under H0  240/500   50/500   40/500   170/500   500/500
Now the expected counts for each book under H0 are the total count for that book times
the relative frequencies in the above table. The following table gives the counts: (observed,
expected) for each book.
Word             a            an         this       that         Totals
King Lear        (150, 144)   (30, 30)   (30, 24)   (90, 102)    (300, 300)
Long lost work   (90, 96)     (20, 20)   (10, 16)   (80, 68)     (200, 200)
Totals           (240, 240)   (50, 50)   (40, 40)   (170, 170)   (500, 500)

The chi-square statistic is
$$X^2 = \sum \frac{(O_i - E_i)^2}{E_i} = \frac{6^2}{144} + \frac{0^2}{30} + \frac{6^2}{24} + \frac{12^2}{102} + \frac{6^2}{96} + \frac{0^2}{20} + \frac{6^2}{16} + \frac{12^2}{68} \approx 7.9$$

There are 8 cells and all the marginal counts are fixed because they were needed to determine
the expected counts. To be consistent with these statistics we could freely set the values
in 3 cells in the table, e.g. 3 of the cells in the King Lear row, then the rest of the cells are
determined in order to make the marginal totals correct. Thus df = 3. (Or we could recall that
df = (m − 1)(n − 1) = (1)(3) = 3, where m = 2 is the number of data sets (rows) and n = 4 is
the number of outcomes (columns).)
Using R we find p = 1-pchisq(7.9,3) = 0.048. Since this is less than our significance
level of 0.1 we reject the null hypothesis that the relative frequencies of the words are the
same in both books.
If we make the further assumption that all of Shakespeare’s plays have similar word frequencies
(which is something we could check) we conclude that the book is probably not
by Shakespeare.
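The computations in Example 12 can be sketched with chisq.test applied to the table of counts. When given a matrix, the function estimates the expected counts from the row and column totals, as we did by hand. The object names here are our own.

```r
counts <- matrix(c(150, 30, 30, 90,
                    90, 20, 10, 80),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("King Lear", "Long lost work"),
                                 c("a", "an", "this", "that")))

result <- chisq.test(counts)

print(result$expected)    # expected counts under H0
print(result$statistic)   # Pearson's X-squared, about 7.9
print(result$parameter)   # degrees of freedom, (2 - 1)(4 - 1) = 3
print(result$p.value)     # about 0.048
```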

4.7 Other tests

There are far too many other tests to even make a dent. We will see some of them in
class and on psets. Again, we urge you to master the paradigm of NHST and recognize the
importance of choosing a test statistic with a known null distribution.
MIT OpenCourseWare
https://ocw.mit.edu

18.05 Introduction to Probability and Statistics


Spring 2014

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.
