Goodness of Fit: Topic 21
O, A, B, and AB.
The actual fractions $p_O$, $p_A$, $p_B$, and $p_{AB}$ of these blood types in the community served by a given blood bank may differ from what is seen in the national database. As a consequence, the local blood bank may choose to alter its distribution of blood supply to more accurately reflect local conditions.
To place this assessment strategy in terms of formal hypothesis testing, let $\pi = (\pi_1, \ldots, \pi_k)$ be postulated values of the probability
$$P\{\text{individual is a member of the } i\text{-th category}\} = \pi_i$$
and let $p = (p_1, \ldots, p_k)$ denote the possible states of nature. Then, the parameter space is
$$\Theta = \left\{p = (p_1, \ldots, p_k);\ p_i \ge 0 \text{ for all } i = 1, \ldots, k,\ \sum_{i=1}^k p_i = 1\right\}.$$
This parameter space has $k-1$ free parameters. Once these are chosen, the remaining parameter value is determined by the requirement that the sum of the $p_i$ equals 1. Thus, $\dim(\Theta) = k - 1$.
The hypothesis is
$$H_0: p_i = \pi_i \text{ for all } i = 1, \ldots, k \quad \text{versus} \quad H_1: p_i \ne \pi_i \text{ for some } i.$$
The parameter space for the null hypothesis is a single point $\pi = (\pi_1, \ldots, \pi_k)$. Thus, $\dim(\Theta_0) = 0$. Consequently, the likelihood ratio test will have a chi-square test statistic with $\dim(\Theta) - \dim(\Theta_0) = k - 1$ degrees of freedom. The data $x = (x_1, \ldots, x_n)$ are the categories for each of the $n$ observations.
Let's use the likelihood ratio criterion to create a test for the distribution of human blood types in a given population. For the data
$$x = \{O, B, O, A, A, A, A, A, O, AB\}$$
on the blood types of tested individuals, then, in the case of independent observations, the likelihood is
$$L(p|x) = p_O^3\, p_A^5\, p_B\, p_{AB}.$$
Notice that the likelihood has a factor of $p_i$ whenever an observation takes on the value $i$. In other words, if we summarize the data using
$$n_i = \#\{\text{observations from category } i\}$$
to create $n = (n_1, n_2, \ldots, n_k)$, a vector that records the number of observations in each category, then the likelihood function is
$$L(p|n) = p_1^{n_1} \cdots p_k^{n_k}. \qquad (21.2)$$
The likelihood ratio is the ratio of the maximum value of the likelihood under the null hypothesis and the maximum likelihood for any parameter value. In this case, the numerator is the likelihood evaluated at $\pi$.
$$\Lambda(n) = \frac{L(\pi|n)}{L(\hat p|n)} = \frac{\pi_1^{n_1} \pi_2^{n_2} \cdots \pi_k^{n_k}}{\hat p_1^{n_1} \hat p_2^{n_2} \cdots \hat p_k^{n_k}} = \left(\frac{\pi_1}{\hat p_1}\right)^{n_1} \cdots \left(\frac{\pi_k}{\hat p_k}\right)^{n_k}. \qquad (21.3)$$
To find the maximum likelihood estimator $\hat p$, we, as usual, begin by taking the logarithm in (21.2),
$$\ln L(p|n) = \sum_{i=1}^k n_i \ln p_i.$$
Because not every set of values for pi is admissible, we cannot just take derivatives, set them equal to 0 and solve.
Indeed, we must find a maximum under the constraint
$$s(p) = \sum_{i=1}^k p_i = 1.$$
The maximization problem is now stated in terms of the method of Lagrange multipliers. This method tells us that at the maximum likelihood estimator $(\hat p_1, \ldots, \hat p_k)$, the gradient of $\ln L(p|n)$ is proportional to the gradient of the constraint $s(p)$. To explain this briefly, recall that the gradient of a function is a vector that is perpendicular to a level set of that function. In this case,
$$\nabla_p \ln L(p|n) = \left(\frac{n_1}{p_1}, \ldots, \frac{n_k}{p_k}\right) \quad \text{and} \quad \nabla_p s(p) = (1, \ldots, 1).$$
Now imagine walking along the set of parameter values of p given by the constraint s(p) = 1, keeping track of
the values of the function $\ln L(p|n)$. If the walk takes us from a value of this function below $\ell_0$ to values above $\ell_0$, then (see Figure 21.1) the level surfaces
$$\{p;\ s(p) = 1\}$$
and
$$\{p;\ \ln L(p|n) = \ell_0\}$$
intersect. Consequently, the gradients
$$\nabla_p s(p) \quad \text{and} \quad \nabla_p \ln L(p|n)$$
point in different directions on the intersection of these two surfaces. At a local maximum or minimum of the log-likelihood function, the level surfaces are tangent and the two gradients are parallel. In other words, these two gradient vectors are related by a constant of proportionality, $\lambda$, known as the Lagrange multiplier. Consequently, at extreme values,
$$\nabla_p \ln L(\hat p|n) = \lambda \nabla_p s(\hat p).$$
$$\left(\frac{\partial}{\partial p_1} \ln L(\hat p|n), \ldots, \frac{\partial}{\partial p_k} \ln L(\hat p|n)\right) = \lambda \left(\frac{\partial}{\partial p_1} s(\hat p), \ldots, \frac{\partial}{\partial p_k} s(\hat p)\right)$$
$$\left(\frac{n_1}{\hat p_1}, \ldots, \frac{n_k}{\hat p_k}\right) = \lambda (1, \ldots, 1)$$
Figure 21.1: Lagrange multipliers. Level sets of the log-likelihood function are shown in dashed blue; the level set $\{s(p) = 1\}$ is shown in black. The gradients for the log-likelihood function and the constraint are indicated by dashed blue and black arrows, respectively. At the maximum, these two arrows are parallel. Their ratio $\lambda$ is called the Lagrange multiplier. If we view the blue dashed lines as elevation contour lines and the black line as a trail, crossing a contour line indicates walking either uphill or downhill. When the trail reaches its highest elevation, the trail is tangent to a contour line and the gradient for the hill is perpendicular to the trail.
Each of the components of the two vectors must be equal. In other words,
$$\frac{n_i}{\hat p_i} = \lambda, \qquad n_i = \lambda \hat p_i \quad \text{for all } i = 1, \ldots, k. \qquad (21.4)$$
Now sum this equality for all values of $i$ and use the constraint $s(\hat p) = 1$ to obtain
$$n = \sum_{i=1}^k n_i = \lambda \sum_{i=1}^k \hat p_i = \lambda s(\hat p) = \lambda.$$
Thus, $\lambda = n$ and the maximum likelihood estimator is $\hat p_i = n_i/n$, the proportion of observations in category $i$. Next, create the vector $N = (N_1, \ldots, N_k)$ to be the vector of observed numbers of occurrences for each category $i$. In the example, we have the vector $(3, 5, 1, 1)$ for the number of occurrences of the 4 blood types.
When the null hypothesis holds true, $-2 \ln \Lambda(N)$ has approximately a $\chi^2_{k-1}$ distribution. Using (21.5), we obtain the likelihood ratio test statistic
$$-2 \ln \Lambda(N) = -2 \sum_{i=1}^k N_i \ln \frac{n\pi_i}{N_i} = 2 \sum_{i=1}^k N_i \ln \frac{N_i}{n\pi_i}.$$
The last equality uses the identity $\ln(1/x) = -\ln x$ for the logarithm of reciprocals.
The test statistic $-2 \ln \Lambda_n(n)$ is generally rewritten using the notation $O_i = n_i$ for the number of observed occurrences of $i$ and $E_i = n\pi_i$ for the number of expected occurrences of $i$ as given by $H_0$. Then, we can write the test statistic as
$$-2 \ln \Lambda_n(O) = 2 \sum_{i=1}^k O_i \ln \frac{O_i}{E_i}. \qquad (21.6)$$
This is called the $G^2$ test statistic. Thus, we can perform our inference on the hypothesis by evaluating $G^2$. The $p$-value will be the probability that a $\chi^2_{k-1}$ random variable takes a value greater than $-2 \ln \Lambda_n(O)$.
The traditional method for a test of goodness of fit uses, instead of the $G^2$ statistic, the chi-square statistic
$$\chi^2 = \sum_{i=1}^k \frac{(E_i - O_i)^2}{E_i}. \qquad (21.7)$$
This was introduced between 1895 and 1900 by Karl Pearson and consequently has been in use for longer than the concept of likelihood ratio tests. We establish the relation between (21.6) and (21.7) through the following two exercises.
Exercise 21.2. Show the relationship between the $G^2$ and $\chi^2$ statistics in (21.6) and (21.7) by applying the quadratic Taylor polynomial approximation for the natural logarithm,
$$\ln(1 + \epsilon_i) \approx \epsilon_i - \frac{1}{2}\epsilon_i^2,$$
and keeping terms up to the square of $\epsilon_i$, where $\epsilon_i = (O_i - E_i)/E_i$.

           i          1     2    ···    k
           observed   O1    O2   ···    Ok
           expected   E1    E2   ···    Ek
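Before working through the algebra, it can help to see numerically how close the two statistics are. Here is a quick sketch in Python (the counts below are hypothetical, chosen only for illustration; the text's own computations use R):

```python
import math

# Hypothetical counts (not from the text): n = 100 observations in k = 4
# categories, with a uniform null hypothesis pi_i = 1/4, so E_i = 25.
observed = [25, 30, 28, 17]
expected = [25, 25, 25, 25]

# G^2 statistic from (21.6): 2 * sum of O_i * ln(O_i / E_i)
g2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected))

# Pearson chi-square statistic from (21.7): sum of (E_i - O_i)^2 / E_i
chi2 = sum((e - o) ** 2 / e for o, e in zip(observed, expected))
```

For these counts, the two statistics agree to within a few tenths, and the agreement improves as the observed counts move closer to the expected counts, as the Taylor expansion in the exercise predicts.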
Example 21.3. The Red Cross recommends that a blood bank maintain 44% blood type O, 42% blood type A, 10% blood type B, and 4% blood type AB. You suspect that the distribution of blood types in Tucson is not the same as the recommendation. In this case, the hypothesis is
$$H_0: p_O = 0.44,\ p_A = 0.42,\ p_B = 0.10,\ p_{AB} = 0.04 \quad \text{versus} \quad H_1: \text{at least one } p_i \text{ is unequal to the given values.}$$
Based on 400 observations, we observe 228 for type O, 124 for type A, 40 for type B, and 8 for type AB. The expected counts are computed as $400\pi_i$ using the values in $H_0$. This gives the table
type O A B AB
observed 228 124 40 8
expected 176 168 40 16
Using this table, we can compute the value of either (21.6) or (21.7). The chisq.test command in R uses (21.7). The program computes the expected number of observations.
> chisq.test(c(228,124,40,8),p=c(0.44,0.42,0.10,0.04))
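The arithmetic behind this call can be sketched in Python as an illustrative check (the text itself works in R):

```python
observed = [228, 124, 40, 8]          # counts for types O, A, B, AB
null_probs = [0.44, 0.42, 0.10, 0.04]
n = sum(observed)                     # 400 observations

# Expected counts under H0: E_i = n * pi_i, giving 176, 168, 40, 16
expected = [n * p for p in null_probs]

# Pearson chi-square statistic (21.7)
chi2 = sum((e - o) ** 2 / e for o, e in zip(observed, expected))
# chi2 is approximately 30.8874, the value quoted with Figure 21.3
```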
Example 21.4. Is sudden infant death syndrome (SIDS) seasonal? Here we are hypothesizing that 1/4 of the occurrences of sudden infant death syndrome take place in each of the spring, summer, fall, and winter. Let $p_1, p_2, p_3$, and $p_4$ be the respective probabilities for these events. Then the hypothesis takes the form
$$H_0: p_1 = p_2 = p_3 = p_4 = \frac{1}{4} \quad \text{versus} \quad H_1: \text{at least one } p_i \text{ is unequal to } \frac{1}{4}.$$
To test this hypothesis, public health officials from King County, Washington, collect data on $n = 322$ cases, finding
$$n_1 = 78, \quad n_2 = 71, \quad n_3 = 87, \quad n_4 = 86$$
Figure 21.2: The heights of the bars for each blood-type category (O, A, B, AB) are the standardized scores (21.8). Thus, blood type O is overrepresented and types A and AB are underrepresented compared to the expectations under the null hypothesis.
for deaths in the spring, summer, fall, and winter, respectively. Thus, we find more occurrences of SIDS in the fall and winter. Is this difference statistically significant, or are these differences better explained by chance fluctuations?
We carry out the chi-square test. In this case, each of the 4 categories is equally probable. Because this is the default value in R, we need not include it in the command.
> chisq.test(c(78,71,87,86))
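Again, the computation this call performs can be sketched in Python (an illustrative check only):

```python
observed = [78, 71, 87, 86]   # spring, summer, fall, winter
n = sum(observed)             # 322 cases
expected = n / 4              # E_i = 80.5 under the uniform null hypothesis

# Pearson chi-square statistic (21.7) with the same expected count in each season
chi2 = sum((expected - o) ** 2 / expected for o in observed)
# chi2 is approximately 2.0994 with 3 degrees of freedom
```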
Example 21.5 (Hardy-Weinberg equilibrium). As we saw with Gregor Mendel's pea experiments, the two-allele Hardy-Weinberg principle states that after two generations of random mating the genotypic frequencies can be represented by a binomial distribution. So, if a population is segregating for two alleles $A_1$ and $A_2$ at an autosomal locus with frequencies $p_1$ and $p_2$, then random mating would give a proportion
$$p_{11} = p_1^2 \text{ for the } A_1A_1 \text{ genotype}, \quad p_{12} = 2p_1p_2 \text{ for the } A_1A_2 \text{ genotype}, \quad \text{and} \quad p_{22} = p_2^2 \text{ for the } A_2A_2 \text{ genotype}. \qquad (21.9)$$
Then, with both genes in the homozygous genotype and half the genes in the heterozygous genotype, we find that
$$p_1 = p_{11} + \frac{1}{2} p_{12}, \qquad p_2 = p_{22} + \frac{1}{2} p_{12}. \qquad (21.10)$$
Our parameter space $\Theta = \{(p_{11}, p_{12}, p_{22});\ p_{11} + p_{12} + p_{22} = 1\}$ is 2-dimensional. $\Theta_0$, the parameter space for the null hypothesis, consists of those values that satisfy (21.9). With the choice of $p_1$, the value of $p_2$ is determined because $p_1 + p_2 = 1$. Thus, $\dim(\Theta_0) = 1$. Consequently, the chi-square test statistic will have $2 - 1 = 1$ degree of freedom.
McDonald et al. (1996) examined variation at the CVJ5 locus in the American oyster, Crassostrea virginica. There were two alleles, L and S, and the genotype frequencies in Panacea, Florida were 14 LL, 21 LS, and 25 SS. So,
$$\hat p_{11} = \frac{14}{60}, \qquad \hat p_{12} = \frac{21}{60}, \qquad \hat p_{22} = \frac{25}{60}.$$
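The standard continuation of this computation (sketched below in Python as a hedged illustration, not as the text's own work) estimates the allele frequency $p_1$ via (21.10), forms the Hardy-Weinberg expected counts via (21.9), and computes the chi-square statistic with 1 degree of freedom:

```python
# Genotype counts for LL, LS, SS at the CVJ5 locus (Panacea, Florida)
n_LL, n_LS, n_SS = 14, 21, 25
n = n_LL + n_LS + n_SS   # 60 oysters

# Estimated allele frequencies from (21.10): p1 = p11 + p12/2
p1 = (n_LL + n_LS / 2) / n
p2 = 1 - p1

# Expected genotype counts under Hardy-Weinberg equilibrium (21.9)
expected = [n * p1 ** 2, n * 2 * p1 * p2, n * p2 ** 2]
observed = [n_LL, n_LS, n_SS]

# Pearson chi-square statistic; df = 2 - 1 = 1 because p1 is estimated
chi2 = sum((e - o) ** 2 / e for o, e in zip(observed, expected))
```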
Figure 21.3: Plot of the chi-square density function with 3 degrees of freedom. The black vertical bar indicates the value of the test statistic in Example 21.4. The area 0.552 under the curve to the right of the vertical line is the p-value for this test. This is much too high to reject the null hypothesis. The red vertical lines show the critical values for a test with significance $\alpha = 0.05$ (to the left) and $\alpha = 0.01$ (to the right). Thus, the area under the curve to the right of these vertical lines is 0.05 and 0.01, respectively. These values can be found using qchisq(1-α,3). We can also see that the test statistic value of 30.8874 in Example 21.3 has a very low p-value.
           B1     B2    ···    Bc    total
   A1      O11    O12   ···    O1c    O1·
   A2      O21    O22   ···    O2c    O2·
   ⋮        ⋮      ⋮     ⋱      ⋮      ⋮
   Ar      Or1    Or2   ···    Orc    Or·
   total   O·1    O·2   ···    O·c     n
Example 21.7. Returning to the study of the smoking habits of 5375 high school children in Tucson in 1967, here is a
two-way table summarizing some of the results.
                       student    student
                       smokes     does not smoke    total
   2 parents smoke       400          1380           1780
   1 parent smokes       416          1823           2239
   0 parents smoke       188          1168           1356
   total                1004          4371           5375
For a contingency table, the null hypothesis we shall consider is that the factors $A$ and $B$ are independent. To set the parameters for this model, we define
$$p_{i\cdot} = \sum_{j=1}^c p_{ij} = P\{\text{an individual is a member of category } A_i\}$$
and
$$p_{\cdot j} = \sum_{i=1}^r p_{ij} = P\{\text{an individual is a member of category } B_j\}.$$
Follow the procedure as before for the goodness of fit test to end with a $G^2$ and its corresponding $\chi^2$ test statistic. The $G^2$ statistic follows from the likelihood ratio test criterion. The $\chi^2$ statistic is a second-order Taylor series approximation to $G^2$:
$$2 \sum_{i=1}^r \sum_{j=1}^c O_{ij} \ln \frac{O_{ij}}{E_{ij}} \approx \sum_{i=1}^r \sum_{j=1}^c \frac{(O_{ij} - E_{ij})^2}{E_{ij}}.$$
The null hypothesis $p_{ij} = p_{i\cdot} p_{\cdot j}$ can be written in terms of observed and expected observations as
$$\frac{E_{ij}}{n} = \frac{O_{i\cdot}}{n} \cdot \frac{O_{\cdot j}}{n}, \quad \text{or} \quad E_{ij} = \frac{O_{i\cdot} O_{\cdot j}}{n}.$$
The test statistic, under the null hypothesis, has a $\chi^2$ distribution. To determine the number of degrees of freedom, consider the following. Start with a contingency table with no entries but with the prescribed marginal values.

           B1     B2    ···    Bc    total
   A1                                 O1·
   A2                                 O2·
   ⋮                                   ⋮
   Ar                                 Or·
   total   O·1    O·2   ···    O·c     n
The number of degrees of freedom is the number of values that we can place in the contingency table before all the remaining values are determined. To begin, fill in the first row with values $E_{11}, E_{12}, \ldots, E_{1,c-1}$. The final value $E_{1,c}$ is determined by the other values in the row and the constraint that the row sum must be $O_{1\cdot}$. Continue filling the rows, noting that the value in column $c$ is determined by the constraint on the row sum. Finally, when the time comes to fill in the bottom row $r$, notice that all the values are determined by the constraints on the column sums $O_{\cdot j}$. Thus, we can fill $c - 1$ values in each of the $r - 1$ rows before the remaining values are determined. Thus, the number of degrees of freedom is $(r - 1) \times (c - 1)$.
Example 21.8. Returning to the data set on smoking habits in Tucson, we find that the expected table is
                       student    student
                       smokes     does not smoke    total
   2 parents smoke     332.49        1447.51         1780
   1 parent smokes     418.22        1820.78         2239
   0 parents smoke     253.29        1102.71         1356
   total                 1004           4371         5375
For example,
$$E_{11} = \frac{O_{1\cdot} O_{\cdot 1}}{n} = \frac{1780 \cdot 1004}{5375} = 332.49.$$
To compute the chi-square statistic:
$$\chi^2 = \frac{(400 - 332.49)^2}{332.49} + \frac{(1380 - 1447.51)^2}{1447.51} + \frac{(416 - 418.22)^2}{418.22} + \frac{(1823 - 1820.78)^2}{1820.78} + \frac{(188 - 253.29)^2}{253.29} + \frac{(1168 - 1102.71)^2}{1102.71}$$
$$= 13.71 + 3.15 + 0.012 + 0.003 + 16.83 + 3.866 = 37.57.$$
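The whole computation can be sketched in Python (an illustrative check; the text itself uses R's chisq.test):

```python
# Observed counts: rows = 2/1/0 parents smoke, columns = student smokes / does not
observed = [[400, 1380], [416, 1823], [188, 1168]]

row_totals = [sum(row) for row in observed]          # 1780, 2239, 1356
col_totals = [sum(col) for col in zip(*observed)]    # 1004, 4371
n = sum(row_totals)                                  # 5375

# Expected counts under independence: E_ij = (row total)(column total)/n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square statistic, summed over all six cells
chi2 = sum((e - o) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))
# chi2 is approximately 37.57 with (3-1)(2-1) = 2 degrees of freedom
```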
> smoking<-matrix(c(400,416,188,1380,1823,1168),ncol=2)
> chisq.test(smoking)

data: smoking
X-squared = 37.5663, df = 2, p-value = 6.959e-09

We can look at the residuals $(O_{ij} - E_{ij})/\sqrt{E_{ij}}$ for the entries in the $\chi^2$ test as follows.
> smokingtest<-chisq.test(smoking)
> residuals(smokingtest)
[,1] [,2]
[1,] 3.7025160 -1.77448934
[2,] -0.1087684 0.05212898
[3,] -4.1022973 1.96609088
Notice that if we square these values, we obtain the entries found in computing the test statistic.
> residuals(smokingtest)^2
[,1] [,2]
[1,] 13.70862455 3.14881241
[2,] 0.01183057 0.00271743
[3,] 16.82884348 3.86551335
Exercise 21.9. Make three horizontally placed chigrams that summarize the residuals for this $\chi^2$ test in the example above.
Exercise 21.10 (two-by-two tables). Here the contingency table can be thought of as two sets of Bernoulli trials, as shown.

               group 1    group 2    total
   successes     x1         x2       x1 + x2
           B1     B2    total
   A1      O11    O12    O1·
   A2      O21    O22    O2·
   total   O·1    O·2     n
The idea behind Fisher's exact test is to begin with an empty table:
           B1     B2    total
   A1                    O1·
   A2                    O2·
   total   O·1    O·2     n
and a null hypothesis that uses equally likely outcomes to fill in the table. We will use as an analogy the model of
mark and recapture. Normally the goal is to find n, the total population. In this case, we assume that this population
size is known and will consider the case that the individuals in the two captures are independent. This is assumed in
the mark and recapture protocol. Here we test this independence.
In this regard,
A1 - an individual in the first capture and thus tagged.
A2 - an individual not in the first capture and thus not tagged.
B1 - an individual in the second capture.
B2 - an individual not in the second capture.
Then, from the point of view of the A classification:
We have $O_{1\cdot}$ from a population $n$ with the $A_1$ classification (tagged individuals). This can be accomplished in
$$\binom{n}{O_{1\cdot}} = \frac{n!}{O_{1\cdot}!\, O_{2\cdot}!}$$
ways. The remaining $O_{2\cdot} = n - O_{1\cdot}$ have the $A_2$ classification (untagged individuals). Next, we fill in the values for the B classification.
From the $O_{\cdot 1}$ belonging to category $B_1$ (individuals in the second capture), $O_{11}$ also belong to $A_1$ (have a tag). This outcome can be accomplished in
$$\binom{O_{\cdot 1}}{O_{11}} = \frac{O_{\cdot 1}!}{O_{11}!\, O_{21}!}$$
ways.
From the $O_{\cdot 2}$ belonging to category $B_2$ (individuals not in the second capture), $O_{12}$ also belong to $A_1$ (have a tag). This outcome can be accomplished in
$$\binom{O_{\cdot 2}}{O_{12}} = \frac{O_{\cdot 2}!}{O_{12}!\, O_{22}!}$$
ways.
Under the null hypothesis, every individual can be placed in any group, provided we have the given marginal information. In this case, the probability of the table above has the formula from the hypergeometric distribution
$$\frac{\binom{O_{\cdot 1}}{O_{11}} \binom{O_{\cdot 2}}{O_{12}}}{\binom{n}{O_{1\cdot}}} = \frac{O_{\cdot 1}!/(O_{11}!\, O_{21}!) \cdot O_{\cdot 2}!/(O_{12}!\, O_{22}!)}{n!/(O_{1\cdot}!\, O_{2\cdot}!)} = \frac{O_{1\cdot}!\, O_{2\cdot}!\, O_{\cdot 1}!\, O_{\cdot 2}!}{O_{11}!\, O_{12}!\, O_{21}!\, O_{22}!\, n!}. \qquad (21.11)$$
Notice that the formula is symmetric in the column and row variables. Thus, if we had derived the hypergeometric formula from the point of view of the B classification, we would have obtained exactly the same formula (21.11).
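This symmetry is easy to verify numerically. Here is a sketch in Python using a small hypothetical table (the entries are illustrative, not from the text):

```python
from math import comb, factorial

# A hypothetical 2x2 table (not from the text): O11, O12 / O21, O22
O11, O12, O21, O22 = 3, 2, 1, 4
r1, r2 = O11 + O12, O21 + O22        # row totals O1., O2.
c1, c2 = O11 + O21, O12 + O22        # column totals O.1, O.2
n = r1 + r2

# Probability of the table from (21.11), derived via the A classification
p_A = comb(c1, O11) * comb(c2, O12) / comb(n, r1)

# The same probability derived via the B classification
p_B = comb(r1, O11) * comb(r2, O21) / comb(n, c1)

# Closed form: O1.! O2.! O.1! O.2! / (O11! O12! O21! O22! n!)
p_closed = (factorial(r1) * factorial(r2) * factorial(c1) * factorial(c2)
            / (factorial(O11) * factorial(O12) * factorial(O21)
               * factorial(O22) * factorial(n)))
```

All three expressions give the same probability, as (21.11) asserts.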
To complete the exact test, we rely on statistical software to do the following:
- compute the hypergeometric probabilities over all possible choices for entries in the cells that result in the given marginal values, and
- rank these probabilities from most likely to least likely.
For a one-sided test of "too rare," the p-value is the sum of the probabilities of those tables ranked lower than that of the data. A similar procedure applies to provide the Fisher exact test for $r \times c$ tables.
Example 21.11. As a test of the assumptions for mark and recapture, we examine a small population of 120 fish. The assumptions are that each fish is equally likely to be captured in the first and second captures and that the two captures are independent. This could be violated, for example, if the tagged fish are not uniformly dispersed in the pond.
Twenty-five are tagged and returned to the pond. For the second capture of 30, seven are tagged. With this information, we can complete the remaining entries in the table.

              in 2nd capture   not in 2nd capture   total
   tagged            7                 18             25
   untagged         23                 72             95
   total            30                 90            120

Fisher's exact test shows a much too high p-value to reject the null hypothesis.
> fish<-matrix(c(7,23,18,72),ncol=2)
> fisher.test(fish)
data: fish
p-value = 0.7958
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.3798574 3.5489546
sample estimates:
odds ratio
1.215303
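The two-sided p-value can be reproduced without R by summing, over all tables with these margins, the hypergeometric probabilities (21.11) that do not exceed the probability of the observed table. This mirrors how fisher.test defines the two-sided test; sketched in Python:

```python
from math import comb

n, tagged, second = 120, 25, 30   # population, first capture, second capture
observed = 7                      # tagged fish seen in the second capture

# Hypergeometric probability of seeing k tagged fish in the second capture
def table_prob(k):
    return comb(tagged, k) * comb(n - tagged, second - k) / comb(n, second)

p_obs = table_prob(observed)

# Two-sided p-value: total probability of all tables no more likely than the data
# (the small relative tolerance guards against floating-point ties)
k_min = max(0, second - (n - tagged))
k_max = min(tagged, second)
p_value = sum(table_prob(k) for k in range(k_min, k_max + 1)
              if table_prob(k) <= p_obs * (1 + 1e-7))
```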
Example 21.13. We now return to a table of hemoglobin genotypes on two Indonesian islands. Recall that heterozygotes are protected against malaria.
genotype AA AE EE
Flores 128 6 0
Sumba 119 78 4
We noted that heterozygotes are rare on Flores and that it appears that malaria is less prevalent there since the
heterozygote does not provide an adaptive advantage. Here are both the chi-square test and the Fisher exact test.
> genotype<-matrix(c(128,119,6,78,0,4),nrow=2)
> genotype
[,1] [,2] [,3]
[1,] 128 6 0
[2,] 119 78 4
> chisq.test(genotype)
data: genotype
X-squared = 54.8356, df = 2, p-value = 1.238e-12
Warning message:
In chisq.test(genotype) : Chi-squared approximation may be incorrect
and
> fisher.test(genotype)
data: genotype
p-value = 3.907e-15
alternative hypothesis: two.sided
Note that R cautions against the use of the chi-square test with these data.
21.1. With $\epsilon_i = (O_i - E_i)/E_i$,
$$\sum_{i=1}^k E_i \epsilon_i = \sum_{i=1}^k E_i \cdot \frac{O_i - E_i}{E_i} = \sum_{i=1}^k (O_i - E_i) = n - n = 0.$$
21.2. We apply the quadratic Taylor polynomial approximation for the natural logarithm,
$$\ln(1 + \epsilon_i) \approx \epsilon_i - \frac{1}{2} \epsilon_i^2,$$
and use the identities in the previous exercise. Keeping terms up to the square of $\epsilon_i$, we find that
$$-2 \ln \Lambda_n(O) = 2 \sum_{i=1}^k O_i \ln \frac{O_i}{E_i} = 2 \sum_{i=1}^k E_i (1 + \epsilon_i) \ln(1 + \epsilon_i)$$
$$\approx 2 \sum_{i=1}^k E_i (1 + \epsilon_i) \left(\epsilon_i - \frac{1}{2}\epsilon_i^2\right) \approx 2 \sum_{i=1}^k E_i \left(\epsilon_i + \frac{1}{2}\epsilon_i^2\right)$$
$$= 2 \sum_{i=1}^k E_i \epsilon_i + \sum_{i=1}^k E_i \epsilon_i^2 = 0 + \sum_{i=1}^k \frac{(E_i - O_i)^2}{E_i}.$$
Figure 21.4: Chigram for the data on teen smoking in Tucson, 1967, with one panel for each group (2 parents, 1 parent, 0 parents) and bars for "smokes" and "does not smoke". R commands found in Exercise 21.9.
from the likelihood ratio computation for the two-sided two sample proportion test.
21.12. The R commands follow:
> chisq.test(fish)
data: fish
X-squared = 0.0168, df = 1, p-value = 0.8967
The p-value is notably higher for the $\chi^2$ test.
337