MATH10282: Introduction To Statistics Lecture Notes
1 Introduction: What is Statistics?
Statistics is:
‘the science of learning from data, and of measuring, controlling, and communicating uncertainty;
and it thereby provides the navigation essential for controlling the course of scientific and societal
advances.’
Davidian, M. and Louis, T.A. (2012), Science.
http://dx.doi.org/10.1126/science.1218685
There are two basic forms: descriptive statistics and inferential statistics. In this course we will discuss both,
with inferential statistics being the major emphasis.
• Descriptive Statistics is primarily about summarizing a given data set through numerical summaries and
graphs, and can be used for exploratory analysis to visualize the information contained in the data and
suggest hypotheses etc.
It is useful and important. It has become more exciting nowadays with people regularly using fancy
interactive computer graphics to display numerical information (e.g. Hans Rosling’s visualisation of the
change in countries’ health and wealth over time – see Youtube).
• Inferential Statistics is concerned with methods for making conclusions about a population using infor-
mation from a sample, and assessing the reliability of, and uncertainty in, these conclusions.
This allows us to make judgements in the presence of uncertainty and variability, which is extremely
important in underpinning evidence-based decision making in science, government, business etc.
Many statistical analyses and calculations are easiest to perform using a computer. We will learn how to
use the statistical software R, which is freely available to download from http://r-project.org for use
on your own computer. A good introductory guide is ‘Introduction to R’ by Venables et al. (2006), which
can be downloaded as a PDF from the R project website, or accessed from the R software itself via the menu
(Help→Manuals).
To interact with R, we type commands into the console, or write script files which contain several commands
for longer analyses. These commands are written in the R computer programming language, whose syntax
is fairly easy to learn. In this way, we can perform mathematical and statistical calculations. R has many
existing built-in functions, and users are also able to create their own functions. The R software also has very
good graphical facilities, which can produce high quality statistical plots. Datasets for use in the R sessions
are available from the course website https://minerva.it.manchester.ac.uk/~saralees/intro.html. You
can download these and store them for use in the lab sessions.
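To give a flavour of the console, here are a few commands of the kind used throughout the course; the numbers are arbitrary illustrative values:

```r
# A few one-line commands of the kind typed at the R console.
x <- c(1.6, 1.9, 2.2, 2.5)   # store a small data vector
mean(x)                      # built-in function: the sample mean
sqrt(25)                     # a mathematical calculation
summary(x)                   # quick numerical summary of the data
```

Each line is evaluated as soon as it is entered, and the result is printed back in the console.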
Some examples of populations and their associated variables of interest are:
(i) All adults in the UK who are eligible to vote; the variable of interest is the political party supported.
(ii) Car batteries of a particular type manufactured by a particular company; the variable of interest is the
lifetime of the battery before failure.
(iii) All adult males working full-time in Manchester; the variable of interest is the person’s gross income.
(iv) All possible outcomes of a planned laboratory experiment; the variable of interest is the value
of a particular measurement.
In general, the variables of interest may be either qualitative or quantitative. Qualitative variables are
either nominal, e.g. gender or political party supported, or ordinal, e.g. a measurement of size grouped into
three categories: small, medium or large. Quantitative variables are either discrete, for example a count, or
continuous, such as the variables income and lifetime above.
We wish to make conclusions, or inferences, about the population characteristics of variables of interest.
One way to do so is to conduct a census, i.e. to collect data for each individual in the population. However
often this is not feasible, due to one or more of the following:
• Testing may be destructive, e.g. (ii), and we need to have some products left to sell!
Instead, we collect data only for a sample, i.e. a subset of the population. We then use the characteristics
of the sample to estimate the characteristics of the population. In order for this procedure to give a good
estimate, the sample must be representative of the population. Otherwise, if an unrepresentative or ‘biased’
sample is used the conclusions will be systematically incorrect.
Some examples of samples from populations are given below:
(i) In an opinion poll in May 2015, a sample of 1000 adults was obtained and asked which political party
they intended to vote for in the upcoming UK General Election on 7 May 2015. A summary of these
responses is:
(ii) A random sample of 40 manufactured car batteries was taken from the production line, and their lifetimes
(in years) determined. The data are as follows, arranged in ascending order for convenience:
1.6, 1.9, 2.2, 2.5, 2.6, 2.6, 2.9, 3.0, 3.0, 3.1,
3.1, 3.1, 3.1, 3.2, 3.2, 3.2, 3.3, 3.3, 3.3, 3.4,
3.4, 3.4, 3.5, 3.5, 3.6, 3.7, 3.7, 3.7, 3.8, 3.8,
3.9, 3.9, 4.1, 4.1, 4.2, 4.3, 4.4, 4.5, 4.7, 4.7
(iii) We could obtain a sample of 500 adult males working full-time in Manchester. The following table
summarizes a hypothetical data set of the annual incomes in thousands of pounds for such a sample.
Interval Frequency Percentage
5 to 15 83 16.6
15 to 25 142 28.4
25 to 35 90 18.0
35 to 45 79 15.8
45 to 55 46 9.2
55 to 65 28 5.6
65 to 75 13 2.6
75 to 85 6 1.2
85 to 95 4 0.8
95 to 105 3 0.6
105 to 115 0 0.0
115 to 125 2 0.4
125 to 135 0 0.0
135 to 145 0 0.0
145 to 155 1 0.2
155 to 165 0 0.0
165 to 175 1 0.2
175 to 185 1 0.2
185 to 195 1 0.2
Totals 500 100.0
The intervals in the table are open on the left and closed on the right, e.g. the first row gives the count
of incomes in the range (5, 15].
• Sampling without replacement: each individual may appear at most once in the sample. A sample of size n can be drawn sequentially; for i = 1, . . . , n:
Step 1. Select an individual at random with equal probability from the remaining population of size N − i + 1.
Step 2. Include the selected individual as the ith member of the sample, and remove the selected individual
from the population, leaving N − i individuals remaining.
• Sampling with replacement: each individual may appear any number of times in the sample, leading to N^n possible samples. The probability of selecting any particular sample is N^{−n}. This can be implemented using a similar sequential algorithm to before, where instead in Step 2 the selected individual is not removed from the population.
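Both sampling schemes are available in R through the built-in sample() function. A minimal sketch, using an artificial population labelled 1 to 10:

```r
# Simple random sampling from a population of size N = 10.
# replace = FALSE implements the sequential scheme above (no repeats);
# replace = TRUE allows each individual to be selected repeatedly.
population <- 1:10
set.seed(1)                                              # for reproducibility
s_wor <- sample(population, size = 4, replace = FALSE)   # without replacement
s_wr  <- sample(population, size = 4, replace = TRUE)    # with replacement
print(s_wor)
print(s_wr)
```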
Example. Let v1 , . . . , vN denote the values of the variable X for the 1st, . . ., N th individuals in the population.
Suppose that interest lies in estimating the population mean of X,
    µ = (1/N) Σ_{j=1}^N vj .
Let X1 , . . . , Xn be the values of X in a sample of size n chosen by sampling without replacement. The
population mean µ can be estimated by
    X̄ = (1/n) Σ_{i=1}^n Xi .
The value of X̄ will be different for different samples, and so X̄ is a random variable because the sample is chosen randomly. Thus, X̄ has its own probability distribution, which is known as its sampling distribution.
How can we measure the performance of the above method of estimating µ? One way is to calculate the expectation and variance of the sampling distribution of X̄. In particular, it can be shown that under sampling without replacement

    E(X̄) = µ .

As a result, X̄ is said to be unbiased. We will study this unbiasedness property further in Chapter 5. Moreover
it is possible to show that under sampling without replacement
    Var(X̄) = (σ²/n) × (N − n)/(N − 1) ,        (1)

where

    σ² = (1/N) Σ_{j=1}^N (vj − µ)² = (1/N) Σ_{j=1}^N vj² − µ² .
We can verify that E(X̄) = µ as follows. First note from the table that the p.m.f. of X̄ is

    x        5/2    2    3/2
    pX̄(x)   1/3   1/3   1/3

Hence

    E(X̄) = Σ_{x∈RX̄} x pX̄(x) = (1/3) × (5/2) + (1/3) × 2 + (1/3) × (3/2) = 2 .
Recall that, for any random variable Y, Var(Y) = E(Y²) − E(Y)². Note further that

    E(X̄²) = Σ_{x∈RX̄} x² pX̄(x) = (1/3) × (5/2)² + (1/3) × 2² + (1/3) × (3/2)² = 25/6 ,

and so Var(X̄) = E(X̄²) − (E X̄)² = 25/6 − 2² = 1/6.
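The same kind of check can be carried out in R by enumerating every equally likely sample from a small population. A sketch, with an arbitrary illustrative population of N = 4 values and samples of size n = 2:

```r
# Check E(Xbar) = mu and formula (1), Var(Xbar) = (sigma^2/n)(N - n)/(N - 1),
# by enumerating all ordered samples of size 2 drawn without replacement
# from a small population.
v <- c(1, 3, 4, 8); N <- length(v); n <- 2
mu <- mean(v)
sigma2 <- mean((v - mu)^2)               # population variance (divisor N)
pairs <- expand.grid(i = 1:N, j = 1:N)
pairs <- pairs[pairs$i != pairs$j, ]     # ordered pairs without replacement
xbar <- (v[pairs$i] + v[pairs$j]) / 2    # sample mean of each possible sample
c(mean(xbar), mu)                        # both equal 4
c(mean((xbar - mu)^2), (sigma2 / n) * (N - n) / (N - 1))  # variances agree
```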
The cumulative distribution function (c.d.f.) of a random variable X is

    FX(x) = P(X ≤ x) .

If X is a continuous random variable then there is also an associated probability density function (p.d.f.) fX(x), which satisfies

    dFX(x)/dx = fX(x) .

If X is a discrete random variable then there is instead a probability mass function (p.m.f.) pX(x) satisfying

    Σ_{t≤x} pX(t) = FX(x) .
We now recall several concepts from MATH10141 Probability I. For a continuous random variable, the popu-
lation mean µ and variance σ 2 of X are
    µ = ∫_{−∞}^{∞} x fX(x) dx ,
    σ² = ∫_{−∞}^{∞} (x − µ)² fX(x) dx .

For a discrete random variable, these quantities are instead defined in terms of the p.m.f.:

    µ = Σ_{x∈RX} x pX(x) ,
    σ² = Σ_{x∈RX} (x − µ)² pX(x) .
More generally, events B1 , . . . , Bn are mutually independent if for every subset {Bi1 , . . . , Bik }, (k ≥ 2) of
{B1 , . . . , Bn },
P(Bi1 ∩ · · · ∩ Bik ) = P(Bi1 ) × · · · × P(Bik ) .
2.2.2 Independent random variables
• If X1 , . . . , Xn are identically distributed with c.d.f. FX (x), then X1 , . . . , Xn are independent if and only
if
P(X1 ≤ x1 , . . . , Xn ≤ xn ) = FX (x1 ) × · · · × FX (xn ) .
• If X1 , . . . , Xn are discrete random variables with common p.m.f. pX(x), then X1 , . . . , Xn are independent if and only if the joint p.m.f. satisfies

    p(x1, . . . , xn) = pX(x1) × · · · × pX(xn) .

• If X1 , . . . , Xn are continuous random variables with common p.d.f. fX(x), then X1 , . . . , Xn are independent if and only if the joint p.d.f. satisfies

    f(x1, . . . , xn) = fX(x1) × · · · × fX(xn) .
The idea of independence is now used to define sampling from a general population. We say that X1 , . . . , Xn
are a random sample from X if X1 , . . . , Xn ∼ FX (x) independently. We may also say that X1 , . . . , Xn is a
random sample from FX (x), fX (x) or pX (x).
Example. Simple random sampling of n individuals with replacement from a finite population of size N with X-values v1 , . . . , vN corresponds to independent random sampling of X1 , . . . , Xn from the p.m.f.

    pX(x) = (1/N) × {number of j such that vj = x} .
Similar to the previous section, we may use the characteristics of the sample to estimate the characteristics
of the population. For example, suppose we are interested in the population mean µ. This may again be
estimated by the sample mean, i.e.
    X̄ = (1/n) Σ_{i=1}^n Xi .

Once again, the value of X̄ is random because X1 , . . . , Xn is a random sample from the population. Moreover, it is again true that

    E(X̄) = µ .
    f.p.c. = (N − n)/(N − 1) ,
which is called the finite population correction (f.p.c.). The difference in Var(X̄) occurs because under sampling without replacement the Xi are not independent. However, the Xi can be considered to be approximately independent when N is large and the sampling proportion n/N is small. In this case,

    f.p.c. = (1 − n/N)/(1 − 1/N) ≈ 1 .
In the remainder of this course we will always assume that X1 , . . . , Xn are sampled independently from a c.d.f.
FX (x).
A sample of n = 50 components was taken from a production line, and their lifetimes (in hours) determined. A
tabulation of the sample values is given overleaf. A possible parametric model for these data is to assume that
they are a random sample from a normal distribution N(µ, σ²). The parameters µ and σ² can be estimated from the sample by µ̂ = x̄ = 334.6 and σ̂² = s² = 15.288.
We can informally investigate how well this distribution fits the data by superimposing the probability
density function of a N (334.6, 3.9122 ) distribution onto a histogram of the data. This is illustrated in the
figure overleaf, which shows the fit to be reasonably good, particularly for data greater than the mean.
Interval Frequency Percentage
323.75 to 326.25 1 2
326.25 to 328.75 0 0
328.75 to 331.25 9 18
331.25 to 333.75 12 24
333.75 to 336.25 11 22
336.25 to 338.75 10 20
338.75 to 341.25 5 10
341.25 to 343.75 1 2
343.75 to 346.25 1 2
Totals 50 100
Figure 1: Histogram of the component lifetime data together with a N(334.6, 3.912²) p.d.f.
This figure can be obtained using the R code below. The lines command draws a curve through the (x, y)
co-ordinates provided.
xx <- comp_lifetime$lifetime                    # observed lifetimes
xv <- seq(320, 350, 0.1)                        # grid of x values for the p.d.f.
yv <- dnorm(xv, mean = mean(xx), sd = sd(xx))   # fitted normal density
hist(xx, freq = FALSE,
     breaks = seq(from = 323.75, to = 346.25, by = 2.5),
     xlim = c(320, 350), ylim = c(0, 0.12),
     main = "Histogram of lifetime data with Normal pdf",
     xlab = "lifetime (hours)")
lines(xv, yv)                                   # superimpose the density curve
The fitted normal distribution appears to be a reasonably good fit to the observed data, thus we may use it
to calculate estimated probabilities. For example, consider the question ‘what is the estimated probability that
a randomly selected component lasts between 330 and 340 hours?’. To answer this, let the random variable
X be the lifetime of a randomly selected component. We require P(330 < X < 340) under the fitted normal
model, X ∼ N (334.6, 3.9122 ):
    P(330 < X < 340) = P( (330.0 − 334.6)/3.912 < (X − 334.6)/3.912 < (340.0 − 334.6)/3.912 )
                     = P(−1.18 < Z < 1.38) , where Z ∼ N(0, 1)
                     = Φ(1.38) − Φ(−1.18) = 0.9162 − 0.1190 = 0.7972 .
Hence, using the fitted normal model we estimate that 79.72% of randomly selected components will have
lifetimes between 330 and 340 hours.
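This calculation can be reproduced in R with the pnorm() function, which evaluates the normal c.d.f. directly without standardizing by hand:

```r
# P(330 < X < 340) for X ~ N(334.6, 3.912^2).
p <- pnorm(340, mean = 334.6, sd = 3.912) - pnorm(330, mean = 334.6, sd = 3.912)
print(round(p, 4))
```

The answer agrees with the hand calculation up to the rounding of the standardized values.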
3.1.2 Manchester income data
If we superimpose a normal density curve onto the histogram for these data, then we see that the symmetric
normal distribution is a poor fit, since the data are skewed. In particular, the normal density extends to
negative income values despite the fact that all of the incomes in the sample are positive.
Figure 2: Histogram of the income data with the p.d.f. of the fitted normal distribution.
xx <- income$income                             # observed incomes (GBP x 1000)
xv <- seq(0, 200, 0.5)
yv <- dnorm(xv, mean = mean(xx), sd = sd(xx))   # fitted normal density
hist(xx, freq = FALSE, breaks = seq(from = 5, to = 195, by = 10),
     ylim = c(0, 0.030), xlab = "income (GBP x 1000)",
     main = "Histogram of income data with Normal pdf")
lines(xv, yv)
One way forward is to look for a transformation which will make the data appear to be more normally
distributed. Because the data are strongly positively skewed on the positive real line one possibility is to take
logarithms.
In the figure below, we see a histogram of the log transformed income data. The fit of the superimposed
normal p.d.f. now looks reasonable, although there are perhaps slightly fewer sample observations than might
be expected according to the normal model in the left-hand tail and centre. There are also some outliers in
the right-hand tail.
[Figure: histogram of the log(income) data with the p.d.f. of the fitted normal distribution superimposed; x-axis log(income), y-axis density.]
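The transformation itself is a single line in R. Since the income data frame is not reproduced here, the sketch below uses simulated lognormal values purely for illustration; the parameter choices are arbitrary:

```r
# Log-transforming positively skewed data, illustrated on simulated
# lognormal "incomes" (the real data would be in income$income).
set.seed(42)
xx <- rlnorm(500, meanlog = 3.3, sdlog = 0.55)   # skewed, positive values
lx <- log(xx)                                    # transformed data
hist(lx, freq = FALSE,
     main = "Histogram of log(income) data with Normal pdf",
     xlab = "log(income)")
xv <- seq(min(lx), max(lx), length.out = 200)
lines(xv, dnorm(xv, mean = mean(lx), sd = sd(lx)))  # fitted normal p.d.f.
```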
Even if it is not clear whether or not we can find a completely satisfactory parametric model, we will see in
a later section that we can still make approximate inferences about the mean income in the population by
appealing to the central limit theorem.
where ‘Other’ includes all other parties. As suggested earlier, we can estimate the probabilities pC , pL , etc. by
the proportions of sampled individuals supporting the corresponding party. Specifically we obtain the following
estimates:
    p̂C = P̂(X = Conservatives) = 369/1000 = 0.369 ,
    p̂L = P̂(X = Labour) = 314/1000 = 0.314 ,
    p̂LD = P̂(X = Liberal Democrats) = 75/1000 = 0.075 ,
    p̂U = P̂(X = UKIP) = 118/1000 = 0.118 ,
    p̂O = P̂(X = Other party) = 124/1000 = 0.124 .
It is beyond the scope of this module to consider a joint probability model for the vector (nC , nL , nLD , nU , nO )
containing the numbers of individuals supporting each of the five possible choices in a sample of size n. How-
ever we may slightly simplify the situation by focussing on whether or not a randomly chosen voter supports
Labour.
Let the random variable XL denote the number of voters out of the 1000 who support Labour. An
appropriate model may be
XL ∼ Bi(n, pL ) ,
with n = 1000, and pL is estimated by p̂L = 0.314. We may use the fitted model to answer various questions,
e.g. ‘what is the estimated probability that in a random sample of 1000 voters at least 330 will support Labour?’.
We require P(XL ≥ 330) under the fitted model Bi(1000, 0.314). It is easiest to use a normal approximation to the binomial distribution (with a continuity correction), which gives

    P(XL ≥ 330) ≈ 1 − Φ( (329.5 − 1000 × 0.314) / √(1000 × 0.314 × 0.686) )
                = 1 − Φ(1.0561) = 0.1455 .
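In R we can compare this normal approximation with the exact binomial tail probability, using pbinom() and pnorm():

```r
# P(XL >= 330) for XL ~ Bi(1000, 0.314): exact tail probability
# versus the continuity-corrected normal approximation.
exact  <- 1 - pbinom(329, size = 1000, prob = 0.314)
approx <- 1 - pnorm(329.5, mean = 1000 * 0.314,
                    sd = sqrt(1000 * 0.314 * 0.686))
print(round(c(exact = exact, approx = approx), 4))
```

The two values are very close, which is typical when n is large and p is not near 0 or 1.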
A statistic is any real-valued function of the sample data,

    h(X1 , . . . , Xn) .

The value of this statistic will usually be different for different samples. As the sample data are random, the statistic is also a random variable. If we repeatedly drew samples of size n, calculating and recording the value of the sample statistic each time, then we would build up its probability distribution. The probability distribution of a sample statistic is referred to as its sampling distribution.
In this section we will see how to analytically determine the sampling distributions of some statistics, while
with certain others we can appeal to the central limit theorem. Simulation techniques can also be used to
investigate sampling distributions of statistics empirically.
4.1 Sample mean
The mean and variance of the distribution FX (x) are denoted by µ and σ 2 respectively. In the case that the
distribution is continuous with p.d.f. fX (x),
    µ = E(X) = ∫_{−∞}^{∞} x fX(x) dx ,   σ² = Var(X) = ∫_{−∞}^{∞} (x − µ)² fX(x) dx .

When the distribution is discrete with p.m.f. pX(x), µ and σ² are defined by:

    µ = E(X) = Σ_{x∈RX} x pX(x) ,   σ² = Var(X) = Σ_{x∈RX} (x − µ)² pX(x) .
Here we have used Var(X1 + . . . + Xn ) = Var(X1 ) + . . . + Var(Xn ), which holds because the Xi are independent.
These results tell us that the sampling distribution of the sample mean X̄ is centered on the common mean µ of each of the sample variables X1 , . . . , Xn (i.e. the mean of the distribution from which the sample is obtained) and has variance equal to the common variance of the Xi divided by n. Thus, as the sample size n increases, the sampling distribution of X̄ becomes more concentrated around the true mean µ.
In the above discussion nothing specific has been said regarding the actual distribution from which the Xi have been sampled. All we are assuming is that the mean and variance of the underlying distribution are both finite.
In the special case that the Xi are normally distributed we can make use of some important results. Let the random variable X ∼ N(µX, σX²) and let the random variable Y ∼ N(µY, σY²), independently of X. Then we have the following results:
(i) X + Y ∼ N(µX + µY, σX² + σY²)
(ii) X − Y ∼ N(µX − µY, σX² + σY²)
These results extend in a straightforward manner to the linear combination of n independent normal random
variables. Let X1 , . . . Xn be n independent normally distributed random variables with E(Xi ) = µi and
Var(Xi ) = σi2 for i = 1, . . . , n. Thus, here the normal distributions for different Xi may have different means
and variances. We then have that
    Σ_{i=1}^n ci Xi ∼ N( Σ_{i=1}^n ci µi , Σ_{i=1}^n ci² σi² ) ,

where the ci ∈ ℝ.
If now the Xi in the sample are i.i.d. N(µ, σ²) random variables then the sample mean, X̄, is a linear combination of the Xi (with ci = 1/n, i = 1, . . . , n, using the notation above). Thus, X̄ is normally distributed with mean µ and variance σ²/n, i.e. X̄ ∼ N(µ, σ²/n). This result enables us to make probabilistic statements about the mean under the assumption of normality.
In the previous section, we saw that the random quantity X̄ has a sampling distribution with mean µ and variance σ²/n. In the special case when we are sampling from a normal distribution, X̄ is also normally distributed. However, there are many situations when we cannot determine the exact form of the distribution of X̄. In such circumstances, we may appeal to the central limit theorem and obtain an approximate distribution.
The central limit theorem: Let X be a random variable with mean µ and variance σ². If X̄n is the mean of a random sample of size n drawn from the distribution of X, then the distribution of the statistic

    (X̄n − µ) / (σ/√n)

tends to the standard normal distribution as n → ∞.
This means that, for a large random sample from a population with mean µ and variance σ², the sample mean X̄n is approximately normally distributed with mean µ and variance σ²/n. Since, for large n, X̄n ∼ N(µ, σ²/n) approximately, we have that Σ_{i=1}^n Xi ∼ N(nµ, nσ²) approximately.
There is no need to specify the form of the underlying distribution FX , which may be either discrete or
continuous, in order to use this result. As a consequence it is of tremendous practical importance.
A common question is ‘how large does n have to be before the normality of X̄ is reasonable?’ The answer depends on the degree of non-normality of the underlying distribution from which the sample has been drawn. The more non-normal FX is, the larger n needs to be. A useful rule-of-thumb is that n should be at least 30.
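The theorem is easy to see empirically. The sketch below repeatedly draws samples of size n = 30 from the heavily skewed Exp(1) distribution (true mean 1, true variance 1) and examines the resulting sample means:

```r
# Sampling distribution of the mean of n = 30 Exp(1) observations.
# The CLT says it is approximately N(1, 1/30).
set.seed(5)
xbar <- replicate(20000, mean(rexp(30, rate = 1)))
cat("mean:", round(mean(xbar), 3), " variance:", round(var(xbar), 4), "\n")
hist(xbar, freq = FALSE, main = "Means of Exp(1) samples, n = 30")
xv <- seq(min(xbar), max(xbar), length.out = 200)
lines(xv, dnorm(xv, mean = 1, sd = sqrt(1 / 30)))  # CLT approximation
```

Despite the strong skewness of the underlying distribution, the histogram of the simulated means is already close to the normal curve at n = 30.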
Example 2. (Income data). What is the approximate probability that the mean gross income based on a new
random sample of size n = 500 lies between 33.0 and 33.5 thousand pounds?
The underlying distribution is not normal, but we can appeal to the central limit theorem to say that X̄500 ∼ N(µ, σ²/500) approximately. We may estimate µ and σ² from the data by µ̂ = x̄ = 33.27 and σ̂² = s² = 503.554, so that s = 22.44. Therefore, using the fitted values of the parameters we may estimate the probability as

    P(33.0 < X̄500 < 33.5) ≈ Φ( (33.50 − 33.27)/(22.44/√500) ) − Φ( (33.00 − 33.27)/(22.44/√500) )
                           ≈ Φ(0.23) − Φ(−0.27) = 0.5910 − 0.3936
                           ≈ 0.1974 .

Hence we estimate the probability that X̄ lies between 33.0 and 33.5 to be 0.1974.
Example 3. Suppose that, in a particular country, the unemployment rate is 9.2%. Suppose that a random
sample of 400 individuals is obtained. What are the approximate probabilities that:
(i) At most 40 of the sampled individuals are unemployed.
(ii) The proportion unemployed is greater than 0.125.
Solution:
Let Xi = 1 if the ith sampled individual is unemployed and Xi = 0 otherwise, so that the Xi are Bernoulli with p = 0.092. By the central limit theorem, Σ_{i=1}^{400} Xi ∼ N(400 × 0.092, 400 × 0.092 × 0.908) = N(36.8, 33.414) approximately.
(i) Using a continuity correction,

    P( Σ_{i=1}^{400} Xi ≤ 40 ) = P( (Σ_{i=1}^{400} Xi − 36.8)/√33.414 ≤ (40.5 − 36.8)/√33.414 )
                               ≈ P(Z ≤ 0.640) , where Z ∼ N(0, 1) approx.
                               = Φ(0.640) = 0.7390 .
(ii) Here, Var(X̄400) = p(1 − p)/n = (0.092 × 0.908)/400 = 0.0002088. Thus,

    P(X̄400 > 0.125) = P( (X̄400 − 0.092)/√0.0002088 > (0.125 − 0.092)/√0.0002088 )
                     ≈ 1 − Φ(2.284)
                     = 1 − 0.9888 = 0.0112 .
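Both parts can be checked in R, comparing the normal approximations against the exact Bi(400, 0.092) distribution:

```r
# Example 3 in R: number unemployed ~ Bi(400, 0.092) approximately N(36.8, 33.414).
n <- 400; p <- 0.092
mu <- n * p; v <- n * p * (1 - p)
# (i) P(at most 40 unemployed): continuity-corrected normal vs exact binomial
approx_i <- pnorm(40.5, mean = mu, sd = sqrt(v))
exact_i  <- pbinom(40, size = n, prob = p)
# (ii) P(sample proportion > 0.125), on the proportion scale
approx_ii <- 1 - pnorm(0.125, mean = p, sd = sqrt(p * (1 - p) / n))
print(round(c(approx_i = approx_i, exact_i = exact_i, approx_ii = approx_ii), 4))
```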
The sample variance is defined by

    S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)² ,

where X1 , . . . , Xn are a random sample from the distribution with c.d.f. FX(·) with mean µ and variance σ².
If FX is any discrete or continuous distribution with a finite variance then
    E(S²) = E[ (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)² ]
          = (1/(n − 1)) E[ Σ_{i=1}^n ((Xi − µ) − (X̄ − µ))² ]
          = (1/(n − 1)) E[ Σ_{i=1}^n { (Xi − µ)² − 2(Xi − µ)(X̄ − µ) + (X̄ − µ)² } ]
          = (1/(n − 1)) E[ Σ_{i=1}^n (Xi − µ)² − 2n(X̄ − µ)(X̄ − µ) + n(X̄ − µ)² ]   using Σ_{i=1}^n (Xi − µ) = n(X̄ − µ)
          = (1/(n − 1)) { Σ_{i=1}^n E[(Xi − µ)²] − 2n E[(X̄ − µ)²] + n E[(X̄ − µ)²] }
          = (1/(n − 1)) { nσ² − 2n(σ²/n) + n(σ²/n) }   since E[(X̄ − µ)²] = Var(X̄) = σ²/n
          = (1/(n − 1)) (n − 1)σ² = σ² .
Hence, we can see that by using the divisor (n − 1) in the definition of S², we obtain a statistic whose sampling distribution is centered on the true value of σ². This would not be the case if we had used the perhaps more intuitively obvious divisor n.
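A short simulation in R illustrates this: over repeated samples, the divisor-(n − 1) estimator averages close to the true σ², while the divisor-n version is systematically too small. The N(0, 4) population and sample size below are arbitrary choices for illustration:

```r
# Compare divisors (n - 1) and n when estimating sigma^2 = 4
# from repeated N(0, 4) samples of size n = 10.
set.seed(2)
n <- 10
s2  <- replicate(20000, var(rnorm(n, mean = 0, sd = 2)))  # divisor n - 1
s2n <- s2 * (n - 1) / n                                   # divisor n
print(c(mean_s2 = mean(s2), mean_s2n = mean(s2n)))        # approx 4 and 3.6
```

The divisor-n average settles near (n − 1)σ²/n = 3.6 rather than 4, exactly as the bias calculation in Example 7 below predicts.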
We will look more specifically at the case when the Xi are sampled from the N (µ, σ 2 ) distribution. In order
to do so, we first need to introduce a new continuous probability distribution, the chi-squared (χ2 ) distribution.
The continuous random variable Y is said to have the χ² distribution with k degrees of freedom (d.f.), written Y ∼ χ²(k), if its p.d.f. is given by

    f(y) = y^{(k/2) − 1} e^{−y/2} / ( 2^{k/2} Γ(k/2) )   for y > 0 ,

and f(y) = 0 otherwise.
Note that this is a special case of the Gamma distribution with parameters α = k/2 and β = 1/2, and that when k = 2, Y ∼ Exp(1/2). The mean and variance are given by E(Y) = k and Var(Y) = 2k.
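These moments can be checked numerically in R from the built-in density dchisq(), here for k = 6:

```r
# Verify E(Y) = k and Var(Y) = 2k for Y ~ chi-squared(k), with k = 6,
# by numerical integration of the density.
k  <- 6
m1 <- integrate(function(y) y * dchisq(y, df = k), 0, Inf)$value     # E(Y)
m2 <- integrate(function(y) y^2 * dchisq(y, df = k), 0, Inf)$value   # E(Y^2)
print(c(mean = m1, variance = m2 - m1^2))   # approx 6 and 12
```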
The p.d.f.s of chi-squared random variables with d.f. = 1, 3, 6, and 12 are shown in the figure below. Note that the p.d.f. becomes more symmetric as the number of degrees of freedom k becomes larger.
[Figure: p.d.f.s of the χ²(k) distribution for k = 1, 3, 6, 12, plotted for 0 ≤ x ≤ 30.]
Let Z1 , . . . , Zk be k i.i.d. standard normal random variables, i.e. each has a N(0, 1) distribution. Then the random variable

    Y = Σ_{i=1}^k Zi²

has a χ²(k) distribution. To see that E(Y) = k, note that

    1 = Var(Zi) = E(Zi²) − [E(Zi)]² = E(Zi²) , since E(Zi) = 0 .

Hence

    E[Y] = E[ Σ_{i=1}^k Zi² ] = Σ_{i=1}^k E(Zi²) = k .
Suppose now the random variables X1 , . . . , Xn are a random sample from the N(µ, σ²) distribution. We have that

    (Xi − µ)/σ ∼ N(0, 1) ,  i = 1, . . . , n ,

so that

    Σ_{i=1}^n ( (Xi − µ)/σ )² ∼ χ²(n) .
If we modify the above by replacing the population mean µ by the sample estimate X̄, the distribution changes and we obtain the following result.
Theorem. If X1 , . . . , Xn ∼ N(µ, σ²) independently, then

    (n − 1)S²/σ² = Σ_{i=1}^n ( (Xi − X̄)/σ )² ∼ χ²(n − 1) .
By replacing µ with X̄, the χ² distribution of the sum of squares has lost one degree of freedom. This is because there is a single linear constraint on the variables (Xi − X̄)/σ, namely Σ_{i=1}^n (Xi − X̄)/σ = 0. Thus we are only summing n − 1 independent squared terms. Important fact: X̄ and S² are independent random variables.
Example 4. Let X1 , . . . , X40 be a random sample of size n = 40 from the N(25, 4²) distribution. Find the probability that the sample variance, S², exceeds 20.
Solution. We need to calculate

    P(S² > 20) = P( 39S²/16 > (39 × 20)/16 )
               = P(Y > 48.75) , where Y ∼ χ²(39)
               = 1 − P(Y < 48.75) = 1 − 0.8638 = 0.1362 ,
where the probability calculation has been carried out using the pchisq command in R:
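In full, with 39 × 20/16 = 48.75:

```r
# P(S^2 > 20) = P(Y > 48.75) for Y ~ chi-squared(39).
p <- 1 - pchisq(39 * 20 / 16, df = 39)
print(round(p, 4))
```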
5 Point estimation
5.1 Introduction
The objective of a statistical analysis is to make inferences about a population based on a sample. Usually we
begin by assuming that the data were generated by a probability model for the population. Such a model will
typically contain one or more parameters θ whose value is unknown. The value of θ needs to be estimated using
the sample data. For example, in previous chapters we have used the sample mean to estimate the population
mean, and the sample proportion to estimate the population proportion.
A given estimation procedure will typically yield different results for different samples, thus under random
sampling from the population the result of the estimation will be a random variable with its own sampling
distribution. In this chapter, we will discuss further the properties that we would like an estimation procedure
to have. We begin to answer questions such as:
5.2 General framework
Let X1 , . . . , Xn be a random sample from a distribution with c.d.f. FX(x; θ), where θ is a parameter whose value is unknown. A (point) estimator of θ, denoted by θ̂, is a real, single-valued function of the sample, i.e.

    θ̂ = h(X1 , . . . , Xn) .

As we have seen already, because the Xi are random variables, the estimator θ̂ is also a random variable whose probability distribution is called its sampling distribution.
The value θ̂ = h(x1 , . . . , xn) assumed for a particular sample x1 , . . . , xn of observed data is called a (point) estimate of θ. Note the point estimate will almost never be exactly equal to the true value of θ, because of sampling error.
Often θ may in fact be a vector of p scalar parameters. In this case, we require p separate estimators for each of the components of θ. For example, the normal distribution has two scalar parameters µ and σ². These could be combined into a single parameter vector, θ = (µ, σ²), for which one possible estimator is θ̂ = (X̄, S²).
Ideally, the sampling distribution of an estimator should be (i) centered on the true value of the parameter and (ii) have small spread. If an estimator has properties (i) and (ii) then we can expect estimates resulting from statistical experiments to be close to the true value of the population parameter we are trying to estimate.
We now define some mathematical concepts formalizing these notions. The bias of a point estimator θ̂ is

    bias(θ̂) = E(θ̂) − θ .

The estimator is said to be unbiased if

    E(θ̂) = θ ,

i.e. if bias(θ̂) = 0. Unbiasedness corresponds to property (i) above, and is generally seen as a desirable property for an estimator. Note that sometimes biased estimators can be modified to obtain unbiased estimators. For example, if E(θ̂) = kθ, where k ≠ 1 is a constant, then bias(θ̂) = (k − 1)θ; however, θ̂/k is an unbiased estimator of θ.
The spread of the sampling distribution can be measured by Var(θ̂). In this context, the standard deviation of θ̂, i.e. √Var(θ̂), is called the standard error. Suppose that we have two different unbiased estimators of θ, called θ̂1 and θ̂2, which are both based on samples of size n. By principle (ii) above, we would prefer the estimator with the smaller variance, i.e. choose θ̂1 if Var(θ̂1) < Var(θ̂2), otherwise choose θ̂2.
Example 5. Let X1 , . . . , Xn be a random sample from a N (µ, σ 2 ) distribution where σ 2 is assumed known.
Recall that the Xi ∼ N(µ, σ²) independently in this case. We can estimate µ by the sample mean, i.e.

    µ̂ = X̄ = (1/n) Σ_{i=1}^n Xi .
We have already seen that E(X̄) = µ, thus bias(X̄) = 0. Moreover, Var(X̄) = σ²/n. Note that Var(X̄) → 0 as n → ∞. Thus, as the sample size increases, the sampling distribution of X̄ becomes more concentrated about the true parameter value µ. The standard error of X̄ is
    s.e.(X̄) = √Var(X̄) = σ/√n .
Note that if σ² were in fact unknown, then this standard error would also need to be estimated from the data, via

    ŝ.e.(X̄) = s/√n .
Importantly, the results E(X̄) = µ and Var(X̄) = σ²/n also hold if X1 , . . . , Xn are sampled independently from any continuous or discrete distribution with mean µ and variance σ². Thus the sample mean is always an unbiased estimator of the population mean.
Example 6. Let X1 , . . . , X5 be a random sample of size n = 5 from a N(µ, σ²) distribution, and consider the estimator

    µ̃ = (1/9)X1 + (2/9)X2 + (3/9)X3 + (2/9)X4 + (1/9)X5 .

We have that

    E[µ̃] = µ/9 + 2µ/9 + 3µ/9 + 2µ/9 + µ/9 = µ ,

and

    Var[µ̃] = σ²/81 + 4σ²/81 + 9σ²/81 + 4σ²/81 + σ²/81 = 19σ²/81 .

Thus, µ̃ is an unbiased estimator of µ with variance 19σ²/81. The sample mean µ̂ = X̄ is also unbiased for µ and has variance σ²/5.
The two estimators µ̂ and µ̃ both have normal sampling distributions centered on µ, but the variance of the sampling distribution of µ̂ is smaller than that of µ̃ because σ²/5 < 19σ²/81. Hence, in practice, we would prefer to use µ̂.
Example 7. Let X1 , . . . , Xn be a random sample from a N(µ, σ²) distribution where now both µ and σ² are assumed to be unknown. We can use X̄ as an estimator of µ and S² as an estimator of σ². We have already seen that
    σ̂² = S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)²

is an unbiased estimator of σ². If instead we had used the divisor n, defining

    σ̃² = (1/n) Σ_{i=1}^n (Xi − X̄)² = ((n − 1)/n) S² ,

we see that E[σ̃²] = ((n − 1)/n) σ². Thus σ̃² is a biased estimator of σ² with bias −σ²/n. Notice that bias(σ̃²) → 0 as n → ∞. We say that σ̃² is asymptotically unbiased. It is common practice to use S², with the denominator n − 1 rather than n. This results in an unbiased estimator of σ² for all values of n.
Exactly the same argument as above could also be made for using S 2 as an estimator of the variance of the
population distribution if the data were from another, non-normal, continuous distribution or even a discrete
distribution. The only prerequisite is that σ 2 is finite in the population distribution. Therefore, calculations
of the sample variance for any set of data should always be based on using divisor (n − 1).
Example 8. Let X1 , . . . , Xn be a random sample of Bernoulli random variables with parameter p which is
unknown. Thus, Xi ∼ Bi(1, p) for i = 1, . . . , n so that E(Xi ) = p and Var(Xi ) = p(1 − p), i = 1, . . . , n.
If we consider estimating p by the proportion of ‘successes’ in the sample then we have
    p̂ = (1/n) Σ_{i=1}^n Xi ,

so that

    E(p̂) = (1/n) Σ_{i=1}^n E(Xi) = (1/n) np = p ,

thus E(p̂) = p. Also,

    Var(p̂) = (1/n²) Σ_{i=1}^n Var(Xi)   (by independence)
            = (1/n²) np(1 − p) = p(1 − p)/n .

Hence, p̂ is an unbiased estimator of p with variance p(1 − p)/n. Notice that the variance of this estimator also tends towards zero as n gets larger.
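This behaviour is easy to see numerically; a sketch with the illustrative value p = 0.3:

```r
# Sampling distribution of the sample proportion p-hat for p = 0.3:
# unbiased, with variance p(1 - p)/n shrinking as n grows.
set.seed(4)
p <- 0.3
for (n in c(20, 200)) {
  phat <- rbinom(40000, size = n, prob = p) / n   # 40000 simulated p-hats
  cat("n =", n, " mean =", round(mean(phat), 3),
      " var =", round(var(phat), 5),
      " theory =", round(p * (1 - p) / n, 5), "\n")
}
```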
Example 9. Let X1 , . . . , Xn be a random sample from a U [θ, θ + 1] distribution where θ is unknown. Thus,
the data are uniformly distributed on an interval of unit length, but the location of that interval is unknown. Consider using the estimator θ̂ = X̄.
Now,

    E(X̄) = (θ + (θ + 1))/2 = (2θ + 1)/2 = θ + 1/2 .

Therefore, bias(X̄) = θ + 1/2 − θ = 1/2, while Var(X̄) = 1/(12n). However, if we instead define θ̂ = X̄ − 1/2 then E(θ̂) = θ and Var(θ̂) = 1/(12n).
• Application of the estimation procedure, or estimator, to a particular observed data set results in an
estimate of the unknown value of the parameter. The estimate will be different for different random data
sets.
• The properties of the sampling distribution (bias, variance) tell us how good our estimator is, and hence
how good our estimate is likely to be.
• Estimation procedures can occasionally give poor estimates due to random sampling error. For good
estimators, the probability of obtaining a poor estimate is lower.
6 Likelihood for discrete data
6.1 The likelihood function
The parameter estimators we have considered so far have mostly been motivated by intuition. For example,
the sample mean X is an intuitive estimator of the population mean. However in many situations, it is not
obvious how to define an appropriate estimator for the parameter(s) of interest.
One method for deriving an estimator, which works for almost any parameter of interest, is the method
of maximum likelihood. The estimators derived in this way typically have good properties. The method
revolves around the likelihood function, which is of great importance throughout Statistics. The likelihood
function is used extensively in estimation and also hypothesis testing, which we discuss in a later chapter.
Let X1 , . . . , Xn be an i.i.d. random sample from the discrete distribution with p.m.f. p(x | θ), where θ is
a parameter whose value is unknown. Given observed data values x1 , . . . , xn from this model, the likelihood
function is defined as
L(θ) = P(X1 = x1 , X2 = x2 , . . . , Xn = xn | θ) .
In other words,
the likelihood is the joint probability of the observed data considered as a function of
the unknown parameter θ.
Example 10. Let x_1, . . . , x_n be a sample obtained from the Poisson(λ) distribution with p.m.f.

    p(x | λ) = λ^x e^{−λ} / x! ,   x = 0, 1, 2, . . .

The likelihood function is therefore

    L(λ) = ∏_{i=1}^n p(x_i | λ) = e^{−nλ} λ^{∑_{i=1}^n x_i} / ∏_{i=1}^n x_i! .

The value of θ at which the likelihood function attains its maximum is called the maximum likelihood estimate of θ. It is a function of the observed data,

    θ̂ = h(x_1, . . . , x_n) .

If we replace the observed values by the corresponding random variables we obtain

    θ̂ = h(X_1, . . . , X_n) ,

in which case θ̂ is a random variable called the maximum likelihood estimator. The maximum likelihood estimator possesses its own sampling distribution, which will be studied in later Statistics modules.
In simple cases, the maximum likelihood estimate can be found by standard calculus techniques, i.e. by solving

    dL(θ)/dθ = 0 .   (3)
However, it is usually much easier algebraically to find the maximum of the log-likelihood l(θ) = log L(θ) because for i.i.d. data,

    log L(θ) = log [ ∏_{i=1}^n p(x_i | θ) ] = ∑_{i=1}^n log p(x_i | θ) .

Hence, the log-likelihood is additive as opposed to the likelihood, which is multiplicative. This is advantageous because it is far easier to differentiate a sum of functions than to differentiate a product of functions.
To find the value of θ that maximizes l(θ) we instead find θ̂ that solves:

    dl(θ)/dθ = ∑_{i=1}^n d log p(x_i | θ)/dθ = 0 .   (4)

The solution is a maximum if d²l(θ)/dθ² < 0 at θ = θ̂. The estimate found by this method, i.e. by maximizing the log-likelihood, is identical to the one found by maximizing the likelihood directly, because the logarithm is a monotonically increasing function.
Example 11. Let X1 , . . . , Xn be a random sample from the Poisson(λ) distribution. Find the maximum
likelihood estimator of λ.
We have seen that

    L(λ) = e^{−nλ} λ^{∑_{i=1}^n X_i} / ∏_{i=1}^n X_i! ,

so that

    l(λ) = −nλ + ( ∑_{i=1}^n X_i ) log λ − log ( ∏_{i=1}^n X_i! ) .

Solving dl(λ)/dλ = 0, we obtain

    dl/dλ |_{λ=λ̂} = −n + ( ∑_{i=1}^n X_i ) / λ̂ = 0 ,   which implies that λ̂ = X̄ .

Checking the second derivative,

    d²l/dλ² |_{λ=λ̂} = − ( ∑_{i=1}^n X_i ) / λ̂² = −n/X̄ < 0 .

Therefore, λ̂ = X̄ is indeed the maximum likelihood estimator of λ. If we have a set of data x_1, . . . , x_n then the maximum likelihood estimate of λ is λ̂ = x̄, the sample mean. This is an intuitively sensible estimate, as the mean of the Poisson(λ) distribution is equal to λ.
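The result can be confirmed numerically: maximizing the Poisson log-likelihood over a fine grid recovers the sample mean. This is an illustrative Python sketch with made-up data (the values in x below are not from the notes):

```python
import math

# Illustrative check: for fixed Poisson data, the log-likelihood
# l(lam) = -n*lam + sum(x)*log(lam) - sum(log(x_i!)) is maximized at
# lam = x-bar.  The data below are made up for illustration only.
x = [9, 12, 8, 11, 10, 7, 13, 10, 9, 11]

def loglik(lam):
    return (-len(x) * lam + sum(x) * math.log(lam)
            - sum(math.lgamma(xi + 1) for xi in x))   # lgamma(k+1) = log(k!)

grid = [2 + 0.001 * k for k in range(12001)]          # lambda from 2 to 14
lam_hat = max(grid, key=loglik)

xbar = sum(x) / len(x)
# lam_hat agrees with the sample mean up to the grid spacing.
```

Because l(λ) is concave, the grid maximizer is simply the grid point nearest to x̄.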
Example 12. Let X1 , . . . , Xn be a random sample from a Bi(1, p) distribution. Find the maximum likelihood
estimator of p.
In this example the likelihood function is

    L(p) = ∏_{i=1}^n p^{X_i} (1 − p)^{1−X_i} = p^{∑_{i=1}^n X_i} (1 − p)^{n − ∑_{i=1}^n X_i} ,

so that the log-likelihood is given by

    l(p) = ( ∑_{i=1}^n X_i ) log p + ( n − ∑_{i=1}^n X_i ) log(1 − p) .

Solving dl/dp |_{p=p̂} = 0, we obtain

    dl/dp |_{p=p̂} = ( ∑_{i=1}^n X_i ) / p̂ − ( n − ∑_{i=1}^n X_i ) / (1 − p̂) = 0 ,

and so

    ∑_{i=1}^n X_i = n p̂ .

Thus, the maximum likelihood estimator of p is p̂ = ( ∑_{i=1}^n X_i ) / n = X̄, i.e. the sample proportion. We have previously seen that this is unbiased for p.
Note that it is worth checking the second derivative at p = p̂:

    d²l/dp² |_{p=p̂} = − ( ∑_{i=1}^n X_i ) / p̂² − ( n − ∑_{i=1}^n X_i ) / (1 − p̂)²
                     = −n/p̂ − n/(1 − p̂)
                     = −n / ( p̂ (1 − p̂) ) ,

which is negative, confirming that p̂ is indeed a maximum.
The likelihood and log-likelihood for Poisson data can be computed in R over a grid of λ values with a function along the following lines:

pois.lik <- function(x, lmin, lmax){
  lval <- seq(lmin, lmax, length = 1000)
  pl <- sapply(lval, function(l) prod(dpois(x, l)))
  lpl <- sapply(lval, function(l) sum(dpois(x, l, log = TRUE)))
  pl.res <- data.frame(lval = lval, pl = pl, lpl = lpl)
  return(pl.res)
}
The data are in the argument x while the minimum and maximum λ values to be considered are passed to the
function in the arguments lmin and lmax.
The function returns a data frame called pl.res comprising three columns. The first contains the sequence
of λ values used, the second contains the corresponding likelihood values and the third the corresponding
log-likelihood values.
Example 13. (Simulated data). The data in this example are a random sample of n = 30 simulated from the Po(λ = 10) distribution. The data are simulated via:

> xp <- rpois(30, 10)

The likelihood and log-likelihood functions for these data are then computed over a grid of λ values between 7 and 13 and plotted, giving Figure 5.
Figure 5: Likelihood (left) and log-likelihood (right) functions for the simulated Poisson data (n = 30, λ = 10).
The maximum likelihood estimate can be computed approximately via direct numerical maximization of
the likelihood or log-likelihood:
> lopt1 <- pl.res4$lval[which.max(pl.res4$pl)]
> lopt1
[1] 10.23123
> lopt2 <- pl.res4$lval[which.max(pl.res4$lpl)]
> lopt2
[1] 10.23123
The maximum likelihood estimate of λ obtained by numerical maximization is 10.23 (to 2 d.p.). We know that the maximum likelihood estimate can also be determined analytically as the sample mean:
> mean(xp)
[1] 10.23333
The reason for the slight discrepancy between the two results is the discretization error arising from the use of
a discrete set of λ values in the first method.
Please note that if you run the above code yourself, you will get slightly different results because you will
have sampled a different set of data using the function rpois.
Example 14. (Australian birth data). The data give the number of births per hour over a 24-hour period on
the 18 December 1997 at the Mater Mother’s Hospital in Brisbane, Australia. There were a total of n = 44
births. At the time, this was a record number of births in one 24-hour period in this hospital. We denote the
number of births in the ith hour by Xi and fit the model

    X_i ∼ Po(λ) ,   i = 1, . . . , 24 ,

to the 24 hourly counts. The data are:
> birth
hour number
1 1 1
2 2 3
3 3 1
4 4 0
5 5 4
6 6 0
7 7 0
8 8 2
9 9 2
10 10 1
11 11 3
12 12 1
13 13 2
14 14 1
15 15 4
16 16 1
17 17 2
18 18 1
19 19 3
20 20 4
21 21 3
22 22 2
23 23 1
24 24 2
Figure 6: The likelihood (left) and log-likelihood (right) functions for the Australian births data (n = 44).
The maximum likelihood estimate is 1.83, which can be found by direct numerical maximization of the likelihood or log-likelihood function. The result agrees with the sample mean, x̄, up to discretization error.
> mean(birth$number)
[1] 1.833333
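The same grid search can be reproduced outside R. The following Python sketch uses the 24 hourly counts read off the printed data frame above and maximizes the Poisson log-likelihood over the grid 0 < λ ≤ 4, as in Figure 6:

```python
import math

# The 24 hourly birth counts from Example 14 (read off the printed data frame).
births = [1, 3, 1, 0, 4, 0, 0, 2, 2, 1, 3, 1,
          2, 1, 4, 1, 2, 1, 3, 4, 3, 2, 1, 2]

def loglik(lam):
    return (-len(births) * lam + sum(births) * math.log(lam)
            - sum(math.lgamma(b + 1) for b in births))

# Grid search over 0 < lambda <= 4 in steps of 0.01.
grid = [0.01 * k for k in range(1, 401)]
lam_hat = max(grid, key=loglik)      # about 1.83

xbar = sum(births) / len(births)     # 44/24 = 1.8333...
```

The grid maximizer 1.83 matches the sample mean 44/24 ≈ 1.8333 up to the grid spacing, exactly as in the text.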
7 Confidence intervals
7.1 Interval estimation
So far in this module, whenever we have fitted a probability model to a data set, we have done so by calculating
point estimates of the values of any unknown parameters θ. However, it is very rare for a point estimate to
be exactly equal to the true parameter value. An alternative approach is to specify an interval, or range,
of plausible parameter values. We would then expect the true parameter value to lie within this interval of
plausible values. We call such an interval an interval estimate of the parameter.
Let X = (X1 , . . . , Xn ) be an independent random sample from a distribution FX (x; θ) with unknown
parameter θ. An interval estimator,
I(X) = [l(X), u(X)]
for θ is defined by two statistics, i.e. functions of the data. The statistic u(X) defines the upper end-point of
the interval, and the statistic l(X) defines the lower end-point of the interval. We will see later how to choose
appropriate statistics for the end-points.
The key property of an interval estimator for θ is its coverage probability. This is defined as the probability that the interval contains, or ‘covers’, the true value of the parameter, i.e.

    P_θ[ l(X) ≤ θ ≤ u(X) ] ,

or equivalently P_θ[ I(X) ∋ θ ]. We use the notation P_θ for probabilities here to emphasize that the probability distributions of l(X) and u(X) depend on θ.
Let α ∈ (0, 1), and suppose that we have been able to find statistics l and u such that the coverage probability satisfies

    P_θ[ l(X) ≤ θ ≤ u(X) ] = 1 − α   for all values of θ .

Then the interval estimator I(X) and, for any particular data set x = (x_1, . . . , x_n), the resulting interval estimate I(x), is referred to as a 100(1 − α)% confidence interval for θ. The proportion 1 − α is referred to as the confidence level, and the interval end-points l(x), u(x) are known as the confidence limits.
To illustrate the idea, let X1 , . . . , Xn be a random sample from N (µ, σ 2 ), with µ unknown but σ 2 known.
Recall that X̄ ∼ N(µ, σ²/n). Thus, if we standardize X̄ then we obtain the random variable

    Z = (X̄ − µ) / (σ/√n) ∼ N(0, 1) .

A crucial property of Z above is that the distribution of Z does not depend on µ or σ, i.e. the right-hand side of the above equation is the same no matter what the value of µ or σ.
Let z_{1−α/2} be such that P(Z ≤ z_{1−α/2}) = 1 − α/2. By symmetry of the normal distribution, it is also true that P(Z ≤ −z_{1−α/2}) = α/2, and furthermore P(−z_{1−α/2} ≤ Z ≤ z_{1−α/2}) = 1 − α. We have therefore that

    P( −z_{1−α/2} ≤ (X̄ − µ)/(σ/√n) ≤ z_{1−α/2} ) = 1 − α .
Moreover, the inequality inside the brackets can be rearranged to show that:

    1 − α = P( −z_{1−α/2} σ/√n − X̄ ≤ −µ ≤ z_{1−α/2} σ/√n − X̄ )
          = P( X̄ − z_{1−α/2} σ/√n ≤ µ ≤ X̄ + z_{1−α/2} σ/√n ) .

Hence the interval estimator

    I(X) = [ X̄ − z_{1−α/2} σ/√n , X̄ + z_{1−α/2} σ/√n ]

is a 100(1 − α)% confidence interval for µ.
[Figure: 95% confidence intervals for µ computed from each of 50 simulated samples, plotted against experiment number.]
In the figure above, each interval is coloured blue if it contains the true value of the parameter (µ = 20)
and green if it does not. The interval contains the true parameter value for 48/50 = 95% of the samples.
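This coverage experiment is easy to reproduce. The following Python sketch (the notes use R; the parameters µ = 20, σ = 2, n = 25 are illustrative assumptions) draws 50 samples, forms the 95% interval each time, and counts how often the true mean is covered:

```python
import math
import random

random.seed(42)

# Sketch of the coverage experiment: draw 50 samples of size n = 25 from
# N(mu = 20, sigma = 2), form the 95% CI for mu each time (sigma known),
# and count how often the interval covers the true mean.
mu, sigma, n, reps = 20.0, 2.0, 25, 50
half = 1.96 * sigma / math.sqrt(n)

covered = 0
for _ in range(reps):
    xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    if xbar - half <= mu <= xbar + half:
        covered += 1

# covered is close to 0.95 * 50 in the long run.
```

Typically around 47 or 48 of the 50 intervals cover µ, matching the behaviour shown in the figure.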
Example 15. The following n = 16 observations are a random sample from a N (µ, 22 ) distribution, where µ
is unknown:
10.43 5.42 11.10 12.41 10.14 7.83 8.84 10.42
10.44 9.65 10.36 11.48 9.33 6.81 10.55 10.41
We want to use the data to construct a 95% confidence interval for µ, i.e. here α = 0.05. The sample mean is x̄ = 9.73 and z_{1−α/2} = z_{0.975} = 1.96, so that the end-points of the 95% CI for µ are given by:

    9.73 ± 1.96 × √(4.0/16) ,
i.e. the interval is (8.75, 10.71). These data were actually sampled (simulated) from a N (10, 22 ) distribution.
Thus the true value µ = 10 is within the CI.
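The interval in Example 15 can be recomputed directly from the listed data. A short Python sketch of the calculation:

```python
import math

# Example 15 recomputed: n = 16 observations from N(mu, 2^2), alpha = 0.05.
data = [10.43, 5.42, 11.10, 12.41, 10.14, 7.83, 8.84, 10.42,
        10.44, 9.65, 10.36, 11.48, 9.33, 6.81, 10.55, 10.41]
sigma, n = 2.0, len(data)

xbar = sum(data) / n                     # about 9.73
half = 1.96 * math.sqrt(sigma**2 / n)    # 1.96 * 0.5 = 0.98
ci = (xbar - half, xbar + half)          # about (8.75, 10.71)
```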
7.2.2 Confidence interval for the mean of a normal distribution, variance unknown
Suppose now that X_1, . . . , X_n are independent draws from a N(µ, σ²) distribution where both µ and σ² are unknown. It is no longer possible to use the confidence interval

    [ x̄ − z_{1−α/2} σ/√n , x̄ + z_{1−α/2} σ/√n ] ,

because σ is unknown. Instead of basing a confidence interval on the random variable

    Z = (X̄ − µ) / (σ/√n) ∼ N(0, 1) ,

we plug in an estimate of σ in the denominator, namely the sample standard deviation S (computed with divisor n − 1), to obtain

    T = (X̄ − µ) / (S/√n) .
Now, because both X and S are random variables the distribution of T is not N (0, 1). The fact that S is also
random induces extra variability into the distribution of T . Thus, for a given value of n, the distribution of T
has a longer tail than that of Z.
We can show that the exact distribution of T above is a Student’s t-distribution with (n−1) degrees of freedom,
denoted t(n − 1) [or sometimes tn−1 in the literature].
In general, if the random variable T has a t-distribution with ν degrees of freedom then its probability density function is given by:

    f_T(x) = [ Γ((ν+1)/2) / ( √(νπ) Γ(ν/2) ) ] ( 1 + x²/ν )^{−(ν+1)/2} ,
for ν > 0 and −∞ < x < ∞. We have that E(T ) = 0 and Var(T ) = ν/(ν − 2), for ν > 2. Moreover, the
distribution is symmetric about the origin. As the parameter ν → ∞, the p.d.f. of T approaches that of the
N (0, 1) distribution.
As an exercise, produce a plot in R of the p.d.f. of the N (0, 1) distribution, together with the p.d.f.s of the
t(5) and t(20) distributions. Use the dt function to compute the value of the t p.d.f. for a given set of x-values.
Define t_{1−α/2} to be the 1 − α/2 point of the t(n − 1) distribution, i.e. if T ∼ t(n − 1) then P(T ≥ t_{1−α/2}) = α/2. Then from the preceding discussion it follows that the random interval

    I(X) = [ X̄ − t_{1−α/2} S/√n , X̄ + t_{1−α/2} S/√n ]

is a 100(1 − α)% confidence interval for µ.
Example 16. Recall the electronic component failure time data introduced in Chapter 3. There are n = 50
observations and we found that x = 334.59 and s2 = 15.288. In Chapter 3 we saw that a normal distribution
with mean and variance equal to the sample values provides a good probability model for the data. As we do
not know the true value of σ², we use the critical value t_{0.975} = 2.0096 for the t(49) distribution. The 95% CI for µ has end-points:

    334.59 ± 2.0096 × √(15.288/50) ,
i.e. I(x) = (333.48, 335.70) which gives a range of plausible values for µ.
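The arithmetic of Example 16 can be checked in a few lines. This Python sketch uses the summary statistics from the text; the t(49) critical value 2.0096 is taken from tables rather than computed:

```python
import math

# Example 16 recomputed from the summary statistics in the text.
n, xbar, s2 = 50, 334.59, 15.288
t_crit = 2.0096                      # 0.975 point of t(49), from tables

half = t_crit * math.sqrt(s2 / n)
ci = (xbar - half, xbar + half)      # about (333.48, 335.70)
```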
7.2.4 Confidence interval for the unknown mean of a non-normal distribution with either known
or unknown variance
Suppose that we now have a ‘large’ random sample from a non-normal distribution, and that we wish to use
the data to construct a confidence interval for the unknown distribution mean µ. We can appeal to the central
limit theorem and construct a 100(1 − α)% CI as follows.
If the variance σ² is known then, by the central limit theorem, for large n the statistic

    Z_1 = (X̄ − µ) / (σ/√n)

is approximately distributed as N(0, 1). Thus an approximate 100(1 − α)% confidence interval for µ is given by

    [ X̄ − z_{1−α/2} σ/√n , X̄ + z_{1−α/2} σ/√n ] .
If the variance is unknown, then we instead plug in the sample standard deviation S for σ to obtain the statistic

    Z_2 = (X̄ − µ) / (S/√n) .

It can be shown that Z_2 is also distributed approximately as N(0, 1) for large n. Thus an approximate 100(1 − α)% confidence interval for µ is given by

    [ X̄ − z_{1−α/2} S/√n , X̄ + z_{1−α/2} S/√n ] .
Example 17. Recall the Manchester income data for adult males which we have clearly seen to be non-normally
distributed. The data set contains n = 500 observations and we have that x̄ = 33.27 and s² = 503.554. By the above discussion, the end-points

    33.27 ± 1.96 × √(503.554/500)
define a 95% confidence interval for µ, namely (31.30, 35.24). This gives a range of plausible values for the
unknown value of µ.
7.2.5 Confidence interval for the unknown variance of a normal distribution, mean also un-
known
Let X1 , . . . , Xn be a random sample from the N (µ, σ 2 ) distribution where both µ and σ 2 are unknown. We
would like to construct a 100(1 − α)% confidence interval for σ 2 .
We know that

    S² = (1/(n−1)) ∑_{i=1}^n (X_i − X̄)²

and that

    (n − 1)S² / σ² ∼ χ²(n − 1) .

Hence,

    P( χ²_{α/2} ≤ (n − 1)S²/σ² ≤ χ²_{1−α/2} ) = 1 − α ,

where χ²_{1−α/2} denotes the (1 − α/2) point of a χ²(n − 1) distribution, i.e. if Y ∼ χ²(n − 1) then P(Y ≤ χ²_{1−α/2}) = 1 − α/2, and similarly for χ²_{α/2}. We can re-arrange the inequalities to give bounds for the parameter σ², as follows:

    P( (n − 1)S²/χ²_{1−α/2} < σ² < (n − 1)S²/χ²_{α/2} ) = 1 − α .
Hence the 100(1 − α)% confidence interval for σ², based on a sample of size n from a normal population, is given by

    [ (n − 1)S²/χ²_{1−α/2} , (n − 1)S²/χ²_{α/2} ] .
The inference is that this random interval contains the true value of σ 2 with probability 1 − α. A 100(1 − α)%
confidence interval for σ can be obtained by taking the square roots of the confidence limits for σ 2 .
Example 18. (Component lifetime data.) For these data n = 50 and s² = 15.288, so that a 95% confidence interval for σ², assuming normality, is given by

    [ 49 × 15.288 / χ²_{0.975} , 49 × 15.288 / χ²_{0.025} ] ,

where the χ² values correspond to a χ² distribution with 49 degrees of freedom. From tables of the χ²(49) distribution we have χ²_{0.025} = 31.5549 and χ²_{0.975} = 70.2224, so that the required confidence interval is given by

    ( 49 × 15.288 / 70.2224 , 49 × 15.288 / 31.5549 ) = (10.668, 23.740) .

A 95% confidence interval for σ is obtained by taking the square roots of these endpoints to give (3.266, 4.872).
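The Example 18 calculation can be verified directly. This Python sketch uses the tabulated χ²(49) quantiles quoted in the text rather than computing them:

```python
import math

# Example 18 recomputed: 95% CI for sigma^2 from n = 50, s^2 = 15.288.
n, s2 = 50, 15.288
chi2_lo, chi2_hi = 31.5549, 70.2224   # 0.025 and 0.975 points of chi2(49)

ci_var = ((n - 1) * s2 / chi2_hi, (n - 1) * s2 / chi2_lo)
ci_sd = (math.sqrt(ci_var[0]), math.sqrt(ci_var[1]))   # CI for sigma
```

Taking square roots of the variance endpoints gives the interval for σ, as described above.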
7.2.6 Confidence interval for a population proportion

Let X_1, . . . , X_n be a random sample from Bi(1, p), i.e. the Bernoulli distribution, where the value of p is unknown. We have already seen that the estimator p̂ = X̄ is an unbiased estimator of p with variance p(1 − p)/n. By the central limit theorem, p̂ ∼ N(p, p(1 − p)/n) approximately for large n. Thus, for large n,
    P( −z_{1−α/2} ≤ (p̂ − p)/√(p(1 − p)/n) ≤ z_{1−α/2} ) ≈ 1 − α .   (5)

In fact it can be shown that the above remains true even if the standard error √(p(1 − p)/n) in the denominator is estimated by √(p̂(1 − p̂)/n), i.e. for large n,

    P( −z_{1−α/2} ≤ (p̂ − p)/√(p̂(1 − p̂)/n) ≤ z_{1−α/2} ) ≈ 1 − α .

Rearranging the inequalities, an approximate 100(1 − α)% confidence interval for p is

    [ p̂ − z_{1−α/2} √(p̂(1 − p̂)/n) , p̂ + z_{1−α/2} √(p̂(1 − p̂)/n) ] .
Example 19. Recall the opinion poll data collected from n = 1000 voters introduced in Chapter 1. We would
like to use these data to obtain a 95% CI for the proportion in the population who support Labour, denoted
by pL . The proportion in the sample supporting Labour was found to be 0.314 which is our sample estimate
of p_L, i.e. p̂_L = 0.314. From the above, our 95% CI has end-points

    0.314 ± 1.96 × √(0.314 × 0.686 / 1000) ,

i.e. the interval (0.285, 0.343), which is a little wider than before. It is this approach which gives rise to the frequent comment that the proportions found in a poll based on 1000 voters are accurate to plus or minus 3%.
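The opinion-poll interval is a one-line calculation. A Python sketch of the formula applied in Example 19:

```python
import math

# Example 19 recomputed: large-sample 95% CI for a proportion,
# with p-hat = 0.314 and n = 1000 as in the poll data.
phat, n = 0.314, 1000

half = 1.96 * math.sqrt(phat * (1 - phat) / n)
ci = (phat - half, phat + half)
```

The half-width is just under 0.03, which is the source of the "plus or minus 3%" rule of thumb for polls of 1000 voters.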
9 Hypothesis testing

A hypothesis test is specified by the following components:

(i) The null hypothesis, denoted by H0, is the hypothesis to be tested. This is usually a ‘conservative’ or ‘skeptical’ hypothesis that we believe by default unless there is significant evidence to the contrary.
(ii) The alternative hypothesis, denoted by H1 , is a hypothesis about the population parameters which
we will accept if there is evidence that H0 should be rejected.
For example, when assessing a new medical treatment it is common for the null hypothesis to correspond
to the statement that the new treatment is no better (or worse) than the old one. The alternative
hypothesis would be that the new treatment is better.
In this module the null hypothesis will always be simple, while the alternative hypothesis may either be simple or composite. For example, consider hypotheses about the value of the mean µ of a normal distribution with known variance σ²: the null hypothesis H0 : µ = µ0 is simple, while an alternative such as H1 : µ > µ0 is composite.
(iii) Test statistic. This is a function of the sample data whose value we will use to decide whether or not
to reject H0 in favour of H1 . Clearly, the test statistic will be a random variable.
(iv) Acceptance and rejection regions. We consider the set of all possible values that the test statistic
may take, i.e. the range space of the statistic, and we examine the distribution of the test statistic under
the assumption that H0 is true. The range space is then divided into two disjoint subsets called the
acceptance region and rejection region.
35
On observing data, if the calculated value of the test statistic falls into the rejection region then we reject
H0 in favour of H1 . If the value of the test statistic falls in the acceptance region then we do not reject
H0 .
The rejection region is usually defined to be a set of extreme values of the test statistic which together have low probability of occurring if H0 is true. Thus, if we observe such a value then this is taken as evidence that H0 is in fact false.
(v) Type I and type II errors. The procedure described in (iv) above can lead to two types of possible errors: a type I error occurs if we reject H0 when H0 is in fact true, while a type II error occurs if we fail to reject H0 when H0 is in fact false.
The probability of making a type I error is denoted by α and is also called the significance level or
size of the test. The value of α is usually specified in advance; the rejection region is chosen in order to
achieve this value. A common choice is α = 0.05. Note that α = P(reject H0 | H0 ).
The probability of making a type II error is β = P(do not reject H0 | H1 ). For a good testing procedure,
β should be small for all values of the parameter included in H1 .
Example 20. Is a die biased or not? It is claimed that a particular die used in a game is biased in favour
of the six. To test this claim the die is rolled 60 times, and each time it is recorded whether or not a six is
obtained. At the end of the experiment the total number of sixes is counted, and this information is used to
decide whether or not the die is biased.
The null hypothesis to be tested is that the die is fair, i.e. P(rolling a six) = 1/6. The alternative hypothesis
is that the die is biased in favour of the six so that P(rolling a six) > 1/6. Let the probability of rolling a six
be denoted by p. We can write the above hypotheses as:
H0 : p = 1/6
H1 : p > 1/6 .
Let X denote the number of sixes thrown in 60 attempts. If H0 is true then X ∼ Bi(60, 1/6), whereas if H1 is
true then X ∼ Bi(60, p), with p > 1/6. H0 is a simple hypothesis, whereas H1 is a composite hypothesis.
If H0 were true, we would expect to see 10 sixes, since E(X) = 10 under H0 . However, the actual number
observed will vary randomly around this value. If we observe a large number of sixes, then this will constitute
evidence against H0 in favour of H1 . The question is, how large does the number of sixes need to be so that
we should reject H0 in favour of H1 ?
The test statistic here is X and the rejection region is

    {x : x > k} ,

for some k ∈ N. We choose the smallest value of k that ensures a significance level α < 0.05, i.e. the smallest k such that

    α = P(X > k | H0) < 0.05 .
Note that for k = 14, P(X > k | H0 ) = 0.0648, while for k = 15, P(X > k | H0 ) = 0.0338. Thus we select
k = 15. In this case, the actual significance level of the test is 0.0338.
When, as in this case, the test statistic is a discrete random variable, for many choices of significance level
there is no corresponding rejection region achieving that significance level exactly (e.g. α = 0.05 above).
36
In summary, under H0 the probability of observing more than 15 sixes in 60 rolls is 0.0338. This event is
sufficiently unlikely under H0 that if it occurs then we reject H0 in favour of H1 . It is possible that by rejecting
H0 we may make a type I error, with probability 0.0338 if H0 is true. If 15 or fewer sixes are obtained, then
this is within the acceptable bounds of random variation under H0 . Thus, in this case we would not reject the
null hypothesis that the die is unbiased. However in making this decision we may be making a type II error,
if H1 is in fact true.
Ideally we would like the probability of rejecting H0 when H1 is true, i.e. the power of the test, to be high. It is straightforward to evaluate this probability for particular values of p > 1/6. Specifically, P(reject H0 | p) = P(X > 15 | p), where X ∼ Bi(60, p). For example, the following values have been computed using R:
p P(reject H0 | p)
0.2 0.1306
0.25 0.4312
0.3 0.7562
Clearly, the larger the true value of p, the more likely we are to correctly reject H0 .
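The significance levels and power values quoted in Example 20 can be reproduced from the binomial tail probability. A Python sketch (the notes computed these in R):

```python
from math import comb

# Example 20 recomputed: P(X > k | p) for X ~ Bi(60, p).
def tail(k, p, n=60):
    """Upper tail P(X > k) = P(X >= k+1) of a Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x)
               for x in range(k + 1, n + 1))

alpha_k14 = tail(14, 1/6)   # significance level if k = 14: about 0.0648
alpha_k15 = tail(15, 1/6)   # significance level if k = 15: about 0.0338
power_02 = tail(15, 0.2)    # power when p = 0.2: about 0.1306
power_03 = tail(15, 0.3)    # power when p = 0.3: about 0.7562
```

These match the values in the text, confirming the choice k = 15 and the power table above.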
9.2 Inference about the mean of a normal distribution when the variance is known
Let X1 , . . . , Xn be a random sample from N (µ, σ 2 ), where the value of µ is unknown but the value of σ 2 is
known. We would like to use the data to make inferences about the value of µ and, in particular, we wish to
test the following hypotheses:
H0 : µ = µ0 vs H1 : µ > µ0 .
The null hypothesis H0 posits that the data are sampled from N (µ0 , σ 2 ). In contrast, the alternative hypothesis
H1 posits that the data arise from N (µ1 , σ 2 ), where µ1 > µ0 is an unspecified value of µ. This is a one-sided
test.
We know that the sample mean, X, is an unbiased estimator of µ. Hence, if the true value of µ is µ0 , then
E[X − µ0 ] = µ0 − µ0 = 0. In contrast, if H1 is true, we would have that E[X − µ0 ] = µ − µ0 > 0. This suggests
that we should reject H0 in favour of H1 if X is ‘significantly’ larger than µ0 , i.e. if X > k, for some k > µ0 .
37
The question is, how much greater than µ0 should x be before we reject H0 ? In other words, what value should
we choose for k?
One way to decide this is to fix the probability of rejecting H0 if H0 is true, i.e. the probability of making a Type I error; the critical value k can then be determined on this basis. This is equivalent to fixing the significance level of the test. Suppose that we do indeed use X̄ as the test statistic, with rejection region

    C = {x̄ : x̄ > k} .

Then

    α = P(X̄ > k | H0) = P( Z > (k − µ0)/(σ/√n) ) ,

where Z = (X̄ − µ0)/(σ/√n) ∼ N(0, 1) under H0. Let z_{1−α} denote the 1 − α point of N(0, 1), i.e. P(Z ≤ z_{1−α}) = 1 − α. From this we see that z_{1−α} = (k − µ0)/(σ/√n) and so

    k = µ0 + z_{1−α} σ/√n .
Thus, H0 is rejected in favour of H1 if the sample mean is greater than µ0 by z1−α standard errors.
Equivalently, we reject H0 in favour of H1 at the 100α% significance level if

    Z = (X̄ − µ0)/(σ/√n) > z_{1−α} .
The standardized version of X given by Z is the most frequently used form of the test statistic in this scenario.
The critical value z1−α can be obtained from standard normal tables. In hypothesis testing it is common to
use α = 0.05, and in this case z0.95 = 1.645.
Suppose now that we wish to use our sample to test the hypotheses
H0 : µ = µ0 vs H1 : µ < µ0 .
This is again a one-sided test. In this case we will reject H0 in favour of H1 if X̄ < k where k < µ0. Using analogous arguments to those used above, we will reject H0 in favour of H1 at the 100α% significance level if

    X̄ < µ0 − z_{1−α} σ/√n ,

or, equivalently, if

    Z = (X̄ − µ0)/(σ/√n) < −z_{1−α} .
For a test having a 5% significance level the critical value is −z0.95 = −1.645.
If in fact our interest is in testing

    H0 : µ = µ0   vs   H1 : µ ≠ µ0 ,
then we now have a two-sided test. We will reject H0 in favour of H1 if X̄ is either significantly greater or significantly less than µ0, i.e. if

    X̄ < k1 or X̄ > k2 .

The critical values k1 < µ0 and k2 > µ0 are chosen so that the significance level is equal to α, i.e.

    P( X̄ < k1 or X̄ > k2 | H0 ) = α .

It seems natural to choose the values of k1 and k2 so that the probability of rejecting H0 is split equally between the upper and lower parts of the rejection region. In other words, we choose k1 and k2 such that

    P( X̄ < k1 | H0 ) = P( X̄ > k2 | H0 ) = α/2 .
For illustration, see the figure below, which shows the p.d.f. of X̄ together with the rejection region.

[Figure: p.d.f. of X̄ under H0; the rejection region {x̄ < k1} ∪ {x̄ > k2} has probability α/2 in each tail, with acceptance region of probability 1 − α between k1 and k2.]
We now find appropriate values of k1 and k2 satisfying this property. We begin with k2. Note that

    α/2 = P(X̄ > k2 | H0 true) = P( (X̄ − µ0)/(σ/√n) > (k2 − µ0)/(σ/√n) )
        = P( Z > (k2 − µ0)/(σ/√n) ) ,   with Z ∼ N(0, 1) .

However, we know that z_{1−α/2} satisfies P(Z ≤ z_{1−α/2}) = 1 − α/2, so that P(Z > z_{1−α/2}) = α/2. Hence,

    (k2 − µ0)/(σ/√n) = z_{1−α/2} ,   and so   k2 = µ0 + z_{1−α/2} σ/√n .
Similarly for k1: we know that P(Z < −z_{1−α/2}) = α/2 and so (k1 − µ0)/(σ/√n) = −z_{1−α/2}. Hence

    k1 = µ0 − z_{1−α/2} σ/√n .

Putting this together, we reject H0 in favour of H1 at the 100α% significance level if

    Z = (X̄ − µ0)/(σ/√n) > z_{1−α/2}   or   Z = (X̄ − µ0)/(σ/√n) < −z_{1−α/2} .
9.2.1 Connection between the two-tailed test and a confidence interval for the mean when the
variance is known
Let X_1, . . . , X_n be a random sample from N(µ, σ²) with µ unknown and σ² known. Recall from Chapter 7 that a 100(1 − α)% confidence interval for µ is given by

    [ X̄ − z_{1−α/2} σ/√n , X̄ + z_{1−α/2} σ/√n ] .

Now consider testing

    H0 : µ = µ0   vs   H1 : µ ≠ µ0

at significance level α. From the discussion above, H0 is not rejected if −z_{1−α/2} ≤ Z ≤ z_{1−α/2}, or, equivalently, if

    X̄ − z_{1−α/2} σ/√n ≤ µ0 ≤ X̄ + z_{1−α/2} σ/√n .
Thus, the values of µ in the confidence interval correspond to values of µ0 for which the corresponding null
hypothesis H0 would not be rejected. In other words, informally, the 100(1 − α)% confidence interval is a set
of values of µ which would ‘pass a hypothesis test at significance level α’. It is in this sense that we can regard
the confidence interval as a set of plausible values of µ given the data.
Example 21. (i) A random sample of n = 25 observations is taken from a normal distribution with unknown
mean but known variance σ 2 = 16. The sample mean is found to be x = 18.2. Test H0 : µ = 20 vs
H1 : µ < 20 at the 5% significance level.
Solution: the test statistic is

    Z = (18.2 − 20.0)/√(16/25) = −2.25 .
The appropriate 5% critical value is −z0.95 = −1.645. The observed value of Z is less than −1.645.
Hence, we reject H0 at the 5% significance level and conclude that the true value of µ in the normal
distribution from which the data are sampled satisfies µ < 20.
(ii) Find the probability that we reject H0 using this testing procedure when the true value of the mean µ is
19.0.
Solution: the null hypothesis is rejected if

    (X̄ − 20.0)/(4/√25) < −1.645 ,

or equivalently if

    X̄ < 20.0 − 1.645 × 4/√25 .

The true distribution of X̄ is N(19.0, 16/25) and so the probability of rejecting H0 is

    P( X̄ < 20.0 − 1.645 × 4/√25 )
      = P( (X̄ − 19.0)/(4/5) < (20.0 − (1.645 × 4/5) − 19.0)/(4/5) )
      = P( (X̄ − 19.0)/(4/5) < −0.395 )
      = Φ(−0.395) = 0.3464 ,

since the true distribution of (X̄ − 19.0)/(4/5) is N(0, 1).
Clearly, the probability of rejecting H0 will increase as the difference µ0 − µ becomes larger. Hence, the
further the true mean from the hypothesized value, the more likely we are to reject H0 . When µ = µ0 the
above is the probability of rejecting H0 when H0 is true, i.e. the significance level. This can be verified
by substituting in µ = µ0 to obtain Φ(−z1−α ) = α.
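The power calculation in Example 21(ii) can be checked numerically, using the standard normal c.d.f. expressed via the error function. A Python sketch:

```python
import math

def Phi(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Example 21(ii) recomputed: power of the lower-tailed z-test when mu = 19.
mu0, mu, sigma, n = 20.0, 19.0, 4.0, 25

k = mu0 - 1.645 * sigma / math.sqrt(n)          # rejection threshold for x-bar
power = Phi((k - mu) / (sigma / math.sqrt(n)))  # about 0.346
```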
Example 22. Suppose now that we have a random sample of n = 50 observations from a normal distribution with unknown mean and known variance σ² = 36. It is found that x̄ = 30.8.
(i) Test H0 : µ = 30 vs H1 : µ ≠ 30 at the 5% significance level.
Solution: here the test statistic is

    Z = (30.8 − 30.0)/√(36/50) = 0.943 .
As the alternative hypothesis is two-sided, we will now reject H0 for either small or large values of Z.
Using a 5% significance level the critical values are −z0.975 = −1.96 and z0.975 = 1.96. The observed
value of Z lies between the two critical values, thus H0 is not rejected at the 5% significance level. We
conclude that there is insufficient evidence to reject the claim that the normal distribution from which
the data arise has mean 30.
(ii) Find the probability that we reject H0 when the true value of the mean µ is 31.0.
Solution: here we require

    1 − P( −1.96 < (X̄ − 30.0)/(6/√50) < 1.96 | µ = 31.0 )
      = 1 − P( 30 − 1.96 × 6/√50 < X̄ < 30 + 1.96 × 6/√50 | µ = 31 )
      = 1 − P( (30 − (1.96 × 6/√50) − 31)/(6/√50) < (X̄ − 31)/(6/√50) < (30 + (1.96 × 6/√50) − 31)/(6/√50) )
      = 1 − [ Φ( (30 − 31)/(6/√50) + 1.96 ) − Φ( (30 − 31)/(6/√50) − 1.96 ) ]
      = 1 − [ Φ(−1.179 + 1.96) − Φ(−1.179 − 1.96) ] = 0.218 .
This probability increases as |µ0 − µ| becomes larger. When µ = µ0 it is equal to α, the significance level.
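The two-sided power calculation in Example 22(ii) can likewise be verified. A Python sketch using the error-function form of Φ:

```python
import math

def Phi(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Example 22(ii) recomputed: power of the two-sided z-test when mu = 31.
mu0, mu, sigma, n = 30.0, 31.0, 6.0, 50
se = sigma / math.sqrt(n)

lo, hi = mu0 - 1.96 * se, mu0 + 1.96 * se      # acceptance region for x-bar
power = 1 - (Phi((hi - mu) / se) - Phi((lo - mu) / se))   # about 0.218
```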
9.3 Inference about the mean of a normal distribution when the variance is unknown
Let X_1, . . . , X_n be a random sample from the N(µ, σ²) distribution, where both µ and σ² are unknown. We want to test the following hypotheses:
H0 : µ = µ0
H1 : µ > µ0
at significance level α. Based on the discussion in the previous section, an appropriate test statistic which measures the discrepancy between µ0 and the sample estimator X̄ is given by

    T = (X̄ − µ0)/(S/√n) ,
where S is the sample standard deviation. This is an estimate of the standardized difference between X and µ0 .
As we have discussed previously, because the statistic T involves the random quantities X and S, its sampling
distribution is no longer N (0, 1). We have seen in Chapter 7 that T ∼ t(n − 1), under the assumption that H0
is true, i.e. T has a Student t-distribution with n − 1 degrees of freedom.
Assuming that the significance level of the test is α, we use one of the following rejection regions, depending
on the alternative hypothesis:
• For the one-sided alternative hypothesis H1 : µ > µ0, we reject H0 if T > t_{1−α}, where t_{1−α} is the 1 − α point of the t(n − 1) distribution.
• For the one-sided alternative hypothesis H1 : µ < µ0, we reject H0 if T < −t_{1−α}.
• For the two-sided alternative hypothesis H1 : µ ≠ µ0, we reject H0 if T > t_{1−α/2} or T < −t_{1−α/2}.
Example 23. The drug 6-mP is used to treat leukaemia. A random sample of 21 patients using 6-mP were
found to have an average remission time of x = 17.1 weeks with a sample standard deviation of s = 10.00
weeks. A previously used drug treatment had a known mean remission time of µ0 = 12.5 weeks. Assuming
that the remission times of patients taking 6-mP are normally distributed with both the mean µ and variance
σ 2 being unknown, test at the 5% significance level whether the mean remission time of patients taking 6-mP
is greater than µ0 = 12.5 weeks.
Solution: We want to test H0 : µ = 12.5 vs H1 : µ > 12.5 at the 5% significance level.
The test statistic is

    T = (x̄ − µ0)/(s/√n) = (17.1 − 12.5)/(10/√21) = 2.108 .
Under H0 , T ∼ t(20). For a one-tailed test at the 5% significance level we will reject H0 if T > 1.725 (from
tables). Our observed value of T is greater than 1.725 and so we reject the null hypothesis that µ = 12.5 at
the 5% significance level and conclude that µ > 12.5, i.e. the drug 6-mP improves remission times compared
to the previous drug treatment.
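The Example 23 test statistic is quick to recompute. This Python sketch takes the t(20) critical value 1.725 from tables, as in the text:

```python
import math

# Example 23 recomputed from the summary statistics in the text.
n, xbar, s, mu0 = 21, 17.1, 10.0, 12.5

t_stat = (xbar - mu0) / (s / math.sqrt(n))   # about 2.108
reject = t_stat > 1.725                      # 0.95 point of t(20), from tables
```

Since the observed statistic exceeds the critical value, H0 is rejected at the 5% level, matching the conclusion above.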
H0 : µ = µ0
H1 : µ > µ0

using the statistic

    Y = (X̄ − µ0) / (S/√n)

defined above, which, by asymptotic (large n) results, has an approximate N (0, 1) distribution when H0
is true (n ≥ 30). Aside from the choice of test statistic, the rejection regions for the various versions of
H1 are otherwise identical to those defined in the case of normal data with a known variance.
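A quick numerical illustration of why this approximation is reasonable for n ≥ 30 (Python's scipy here; R's `qt` and `qnorm` give the same values): the upper 5% point of the t-distribution approaches the N(0, 1) point z0.95 ≈ 1.645 as the degrees of freedom grow.

```python
from scipy import stats

# Upper 5% point of N(0,1)
z = stats.norm.ppf(0.95)  # ≈ 1.645

# Upper 5% points of t(n-1) shrink towards z as n grows,
# which justifies the normal approximation for n >= 30
for n in (10, 30, 100):
    t = stats.t.ppf(0.95, df=n - 1)
    print(n, round(t, 3))  # 10 → 1.833, 30 → 1.699, 100 → 1.660
```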
want to test the following hypotheses:
H0 : p = p0
H1 : p > p0
at significance level α. As we have seen earlier in this module, an unbiased sample estimator of the
parameter p is given by
    p̂ = (1/n) Σ_{i=1}^{n} Xi = X̄n .
By the central limit theorem, p̂ ∼ N (p, p(1 − p)/n) approximately for large n. As a rule of thumb,
n ≥ 9 max{p/(1 − p), (1 − p)/p} guarantees this approximation has a good degree of accuracy. A suitable
test statistic is

    Y = (p̂ − p0) / √(p0 (1 − p0)/n) .

Here we have estimated the standard error of p̂ by √(p0 (1 − p0)/n), which uses the value of p specified
under H0. If H0 is true then Y has an approximate N (0, 1) distribution for large n. Thus, to achieve an
approximate significance level of α, we reject H0 in favour of the above H1 if Y > z1−α .
• For the one-sided alternative hypothesis H1 : p < p0 , to achieve an approximate significance level of
α, we reject H0 if Y < −z1−α .
• For the two-sided alternative hypothesis H1 : p ≠ p0 , to achieve an approximate significance level of
α, we reject H0 if
Y < −z1−α/2 or Y > z1−α/2 .
Example 24. A team of eye surgeons has developed a new technique for an eye operation to restore
the sight of patients blinded by a particular disease. It is known that 30% of patients who undergo an
operation using the old method recover their eyesight.
A total of 225 operations are performed by surgeons in various hospitals using the new method and it
is found that 88 of them are successful in that the patients recover their sight. Can we justify the claim
that the new method is better than the old one? (Use a 1% level of significance).
Solution: Let p be the probability that a patient recovers their eyesight following an operation using
the new technique. We wish to test H0 : p = 0.30 vs H1 : p > 0.30 at the 1% significance level.
Our test statistic is

    Y = (88/225 − 0.30) / √(0.30 × 0.70 / 225) = 2.9823 .
As a check for the approximate normality of the distribution of Y under H0 , we require n ≥ 9 max{0.429, 2.333} ≈ 21, which is true since n = 225.
The approximate 1% critical value, taken from standard normal tables, is 2.3263 which is less than the
observed value of Y . Hence, we reject the null hypothesis at the 1% significance level and conclude that
p > 0.30.
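The calculation in Example 24 can be checked as follows (a sketch in Python's scipy; in R, `qnorm(0.99)` gives the same critical value).

```python
from math import sqrt
from scipy import stats

# Summary data from Example 24 (eye surgery)
n, r, p0 = 225, 88, 0.30
p_hat = r / n

# Y = (p̂ − p0)/√(p0(1 − p0)/n)
y = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# One-sided 1% critical value from N(0,1)
z_crit = stats.norm.ppf(0.99)

print(round(y, 4))       # 2.9823
print(round(z_crit, 4))  # 2.3263
print(y > z_crit)        # True: reject H0
```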
10 Hypothesis testing (Part 3)
Procedures for two independent samples
10.1 Introduction
In this chapter we will extend hypothesis testing to the scenario in which there are two independent samples
of data, and the aim is to make an inference about the difference in the means of the two populations from
which the data have been sampled.
To this end, let X11 , . . . , X1n1 be a random sample of size n1 from a distribution with mean µ1 and variance
σ1². Also, let X21 , . . . , X2n2 be a second random sample, independent from the first, from a distribution with
mean µ2 and variance σ2². Suppose that we wish to test
H0 : µ1 − µ2 = φ,

where φ is a constant (often φ = 0), versus one of the following alternative hypotheses at the 100α% significance
level:

(i) H1 : µ1 − µ2 > φ (one-sided)

(ii) H1 : µ1 − µ2 < φ (one-sided)

(iii) H1 : µ1 − µ2 ≠ φ (two-sided)
10.2 Both underlying distributions normal with known variances σ1² and σ2²
An unbiased estimator of µ1 − µ2 is given by X̄1 − X̄2 , where

    X̄k = (1/nk) Σ_{i=1}^{nk} Xki ,  k = 1, 2 .

The standardized test statistic is

    Z = (X̄1 − X̄2 − φ) / √(σ1²/n1 + σ2²/n2) .

Under H0 , Z ∼ N (0, 1). We again find the critical value of our test by fixing the probability of a type I
error to be α, i.e. P(reject H0 | H0 is true) = α. This idea was described in detail for single sample inference
in Chapter 9. Below we list the rejection regions corresponding to the three possible alternative hypotheses
introduced in Section 10.1.
(i) For H1 : µ1 − µ2 > φ, we reject H0 at the 100α% significance level if Z > z1−α , where z1−α satisfies
Φ(z1−α ) = 1 − α. Equivalently, we reject H0 if

    X̄1 − X̄2 > φ + z1−α √(σ1²/n1 + σ2²/n2) .
E.g. if α = 0.05 then z0.95 = 1.645.
(ii) For H1 : µ1 − µ2 < φ, we reject H0 at the 100α% significance level if Z < −z1−α . Equivalently, we reject
H0 if

    X̄1 − X̄2 < φ − z1−α √(σ1²/n1 + σ2²/n2) .
E.g. if α = 0.05 then −z0.95 = −1.645.
(iii) For H1 : µ1 − µ2 ≠ φ, we reject H0 at the 100α% significance level if |Z| > z1−α/2 . Equivalently, we
reject H0 if

    |(X̄1 − X̄2) − φ| > z1−α/2 √(σ1²/n1 + σ2²/n2) .
E.g. if α = 0.05 then z0.975 = 1.96.
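The known-variance procedure above can be sketched as a short calculation. In the Python sketch below the summary values (the sample means, variances, and sample sizes) are invented purely for illustration; only the rejection rule itself comes from the notes.

```python
from math import sqrt
from scipy import stats

# Illustrative (invented) values -- replace with real data
xbar1, xbar2, phi = 5.2, 4.6, 0.0
var1, var2 = 1.0, 1.5          # known variances σ1², σ2²
n1, n2 = 50, 40

# Z = (X̄1 − X̄2 − φ)/√(σ1²/n1 + σ2²/n2)
se = sqrt(var1 / n1 + var2 / n2)
z = (xbar1 - xbar2 - phi) / se

# Two-sided test at level α: reject H0 if |Z| > z_{1−α/2}
alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)  # 1.96
reject = abs(z) > z_crit
print(round(z, 3), reject)  # 2.502 True
```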
As the true values of σ1² and σ2² are unknown, we estimate them using the sample variances given by

    Sk² = 1/(nk − 1) Σ_{i=1}^{nk} (Xki − X̄k)² ,  k = 1, 2 .
Considering the estimated standardized difference between X̄1 − X̄2 and φ we have that, under H0 ,

    Y = (X̄1 − X̄2 − φ) / √(S1²/n1 + S2²/n2) ∼ N (0, 1) approximately

when n1 and n2 are large, e.g. n1 > 30 and n2 > 30. To achieve an approximate significance level of 100α%,
the rejection regions for the three alternative hypotheses introduced in Section 10.1 are:

(i) For H1 : µ1 − µ2 > φ, reject H0 if Y > z1−α .

(ii) For H1 : µ1 − µ2 < φ, reject H0 if Y < −z1−α .

(iii) For H1 : µ1 − µ2 ≠ φ, reject H0 if |Y | > z1−α/2 .
If we are prepared to assume that the unknown variances of the two normal distributions are equal, i.e.
σ1² = σ2² = σ², then the common variance σ² may be estimated using the pooled estimator described in Chapter 7, i.e.

    Sp² = ((n1 − 1)S1² + (n2 − 1)S2²) / (n1 + n2 − 2) .

The test statistic is then

    T = (X̄1 − X̄2 − φ) / (Sp √(1/n1 + 1/n2)) ,

which can be shown to have a Student t-distribution with (n1 + n2 − 2) degrees of freedom when H0 is true.
The rejection regions for the three alternative hypotheses in Section 10.1 are:
(i) For H1 : µ1 − µ2 > φ, we reject H0 if T > t1−α , where t1−α is the 1 − α point of a t distribution on
n1 + n2 − 2 degrees of freedom.
(ii) For H1 : µ1 − µ2 < φ, we reject H0 if T < −t1−α .

(iii) For H1 : µ1 − µ2 ≠ φ, we reject H0 if |T | > t1−α/2 .

Each rejection region above defines a test with an exact significance level of 100α%.
Example 25. An investigation was carried out comparing a new drug with a placebo. A random sample of
n1 = 40 patients was treated with the new drug, while an independent sample of n2 = 36 patients was given
the placebo. A response was measured for each patient. Under the new drug, the response had sample mean
x1 = 10.13 and sample variance s21 = 4.721. Under placebo, the response had sample mean x2 = 12.16 and
sample variance s22 = 3.368.
Supposing that the responses in both groups are normally distributed, test at the 5% significance level
whether the population mean response under the new drug is the same as that under placebo. Conduct your
analysis assuming that (i) σ1² ≠ σ2² and (ii) σ1² = σ2².
Solution: we are required to test H0 : µ1 = µ2 vs H1 : µ1 ≠ µ2 , where µ1 denotes the (population) mean
response under the new drug, and µ2 denotes the (population) mean response under placebo.
(i) In the case where we assume that σ1² ≠ σ2² , the test statistic is

    Y = (10.13 − 12.16 − 0) / √(4.721/40 + 3.368/36) = −4.413 .
For a two-sided test at the approximate 5% significance level we will reject H0 if |Y | > z0.975 = 1.96. The
observed value of |Y | is 4.413 and so we reject H0 at the approximate 5% level. Hence, we conclude that
the mean response for those receiving the new drug is not equal to the mean response for those receiving
the placebo.
(ii) In the second case, where we assume that σ1² = σ2² , we need to estimate the common variance σ² by

    σ̂² = (39 × 4.721 + 35 × 3.368) / (40 + 36 − 2) = 4.081 .
This time, for a two-sided test at the 5% significance level, we will reject H0 if |T | > t0.975 = 1.993 on 74
degrees of freedom. We have |T | = 4.374 > 1.993 and so we reject H0 at the 5% level and conclude that
the two population means are not equal.
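Both analyses in Example 25 can be reproduced from the summary statistics. The sketch below is in Python's scipy; in R, `t.test(..., var.equal = TRUE)` performs the pooled analysis (ii) from raw data.

```python
from math import sqrt
from scipy import stats

# Summary data from Example 25
n1, xbar1, s2_1 = 40, 10.13, 4.721   # new drug
n2, xbar2, s2_2 = 36, 12.16, 3.368   # placebo

# (i) Unequal variances: large-sample statistic Y
y = (xbar1 - xbar2) / sqrt(s2_1 / n1 + s2_2 / n2)
print(round(y, 3))  # -4.413

# (ii) Equal variances: pooled variance estimate and t statistic
s2_pooled = ((n1 - 1) * s2_1 + (n2 - 1) * s2_2) / (n1 + n2 - 2)
t = (xbar1 - xbar2) / (sqrt(s2_pooled) * sqrt(1 / n1 + 1 / n2))
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)

print(round(s2_pooled, 3))            # 4.081
print(round(t, 3), round(t_crit, 3))  # -4.374 1.993
```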
Below we give a rejection region resulting in an approximate significance level of 100α% for each of the three
alternative hypotheses listed in Section 10.1:
(i) For H1 : µ1 − µ2 > φ, we reject H0 at the approximate 100α% significance level if Y > z1−α .
(ii) For H1 : µ1 − µ2 < φ, we reject H0 at the approximate 100α% significance level if Y < −z1−α .
(iii) For H1 : µ1 − µ2 ≠ φ, we reject H0 at the approximate 100α% significance level if |Y | > z1−α/2 .
If the variances of the two distributions are unknown then we substitute the sample estimators S1² and S2²
and proceed as just described for the case of known variances.
H0 : p1 − p2 = φ,

where φ is a constant (often set equal to zero) against one of the three alternative hypotheses given by

(i) H1 : p1 − p2 > φ (one-sided)

(ii) H1 : p1 − p2 < φ (one-sided)

(iii) H1 : p1 − p2 ≠ φ (two-sided)
at the approximate 100α% significance level. Here we are making an inference about the difference in the
proportions of ‘successes’ in the two underlying populations. When n1 and n2 are both large we have that
    p̂1 − p̂2 ∼ N ( p1 − p2 , p1(1 − p1)/n1 + p2(1 − p2)/n2 )  approximately .

A suitable test statistic is

    Y = (p̂1 − p̂2 − φ) / ŝ.e.(p̂1 − p̂2) ,

where in the denominator the following sample estimate of the standard error of p̂1 − p̂2 has been used:

    ŝ.e.(p̂1 − p̂2) = √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ) .
Provided n1 and n2 are both reasonably large, under H0 the test statistic Y ∼ N (0, 1) approximately by
asymptotic results. Note that
    p̂k = (1/nk) Σ_{i=1}^{nk} Xki = X̄k ,  k = 1, 2 .
(i) For H1 : p1 − p2 > φ, we reject H0 at the approximate 100α% significance level if Y > z1−α
(ii) For H1 : p1 − p2 < φ, we reject H0 at the approximate 100α% significance level if Y < −z1−α
(iii) For H1 : p1 − p2 ≠ φ, we reject H0 at the approximate 100α% significance level if |Y | > z1−α/2
The case H0 : p1 = p2
If φ = 0, then under H0 we have p1 = p2 = p, say. An estimate of the common probability p is given by the
‘pooled estimate’
    p̄ = (r1 + r2) / (n1 + n2) ,

where rk denotes the number of successes in sample k.
In this case it makes sense to use the estimate p̄ when forming the estimated standard error of p̂1 − p̂2 that
appears in the denominator of Y . The revised test statistic for the case when H0 : p1 = p2 is thus

    Y = (p̂1 − p̂2) / √( p̄(1 − p̄)/n1 + p̄(1 − p̄)/n2 ) .
Example 26. In a random sample of n1 = 120 voters from Town I, r1 = 56 indicated that they would support
Labour in a general election. In a second independent random sample of size n2 = 110 from Town II, taken
on the same day as the sample from Town I, r2 = 63 indicated that they would support Labour in a general
election. Carry out an appropriate test at the approximate 5% significance level to examine whether the
proportions of voters supporting Labour are the same in the two towns.
Solution. Let p1 denote the (population) proportion of Labour voters in Town I and p2 denote the
(population) proportion of Labour voters in Town II. We wish to test H0 : p1 − p2 = 0 vs H1 : p1 − p2 ≠ 0 at
the approximate 5% significance level. We have that p̂1 = r1 /n1 = 56/120 = 0.467 and p̂2 = r2 /n2 = 63/110 =
0.573.
Under H0 , we have that p1 = p2 . An estimate of the common value of p is given by

    p̄ = (r1 + r2) / (n1 + n2) = (56 + 63) / (120 + 110) = 119/230 = 0.517 .

The test statistic is then

    Y = (0.467 − 0.573 − 0) / √( (0.517 × 0.483)/120 + (0.517 × 0.483)/110 ) = −1.607 .
We would reject H0 at the approximate 5% level if |Y | > z0.975 = 1.96. The observed value of |Y | = 1.607 <
1.96. Hence, there is insufficient evidence to reject H0 at the approximate 5% level. In other words, there is
insufficient evidence to reject the claim that the proportions supporting Labour in the two towns are equal.
(Note that both n1 , n2 > 9 × max{0.517/0.483, 0.483/0.517} = 9.634, which justifies the normal
approximations for p̂1 and p̂2 under H0 .)
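Example 26 can also be checked numerically. The sketch below works from the exact counts rather than rounded intermediates, so it gives Y ≈ −1.608 where the notes, using rounded values, give −1.607; the conclusion is the same.

```python
from math import sqrt
from scipy import stats

# Summary data from Example 26 (two towns)
n1, r1 = 120, 56
n2, r2 = 110, 63
p1_hat, p2_hat = r1 / n1, r2 / n2

# Pooled estimate of the common proportion under H0: p1 = p2
p_bar = (r1 + r2) / (n1 + n2)

# Y = (p̂1 − p̂2)/√(p̄(1−p̄)/n1 + p̄(1−p̄)/n2)
y = (p1_hat - p2_hat) / sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
z_crit = stats.norm.ppf(0.975)

print(round(p_bar, 3))    # 0.517
print(round(y, 3))        # -1.608 (notes: -1.607 with rounded intermediates)
print(abs(y) > z_crit)    # False: do not reject H0
```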