MATH10282: Introduction To Statistics Lecture Notes
1 Introduction: What is Statistics?
Statistics is:
‘the science of learning from data, and of measuring, controlling, and communicating uncertainty;
and it thereby provides the navigation essential for controlling the course of scientific and societal
advances.’
Davidian, M. and Louis, T.A. (2012), Science.
http://dx.doi.org/10.1126/science.1218685
There are two basic forms: descriptive statistics and inferential statistics. In this course we will discuss both,
with inferential statistics being the major emphasis.
• Descriptive Statistics is primarily about summarizing a given data set through numerical summaries and
graphs, and can be used for exploratory analysis to visualize the information contained in the data and
suggest hypotheses etc.
It is useful and important. It has become more exciting nowadays with people regularly using fancy
interactive computer graphics to display numerical information (e.g. Hans Rosling’s visualisation of the
change in countries’ health and wealth over time – see Youtube).
• Inferential Statistics is concerned with methods for making conclusions about a population using infor-
mation from a sample, and assessing the reliability of, and uncertainty in, these conclusions.
This allows us to make judgements in the presence of uncertainty and variability, which is extremely
important in underpinning evidence-based decision making in science, government, business etc.
Many statistical analyses and calculations are easiest to perform using a computer. We will learn how to
use the statistical software R, which is freely available to download from http://r-project.org for use
on your own computer. A good introductory guide is ‘Introduction to R’ by Venables et al. (2006), which
can be downloaded as a PDF from the R project website, or accessed from the R software itself via the menu
(Help→Manuals).
To interact with R, we type commands into the console, or write script files which contain several commands
for longer analyses. These commands are written in the R computer programming language, whose syntax
is fairly easy to learn. In this way, we can perform mathematical and statistical calculations. R has many
existing built-in functions, and users are also able to create their own functions. The R software also has very
good graphical facilities, which can produce high quality statistical plots. Datasets for use in the R sessions
are available from the course website https://minerva.it.manchester.ac.uk/~saralees/intro.html. You
can download these and store them for use in the lab sessions.
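To give a flavour of the console, here are a few commands of the kind used throughout the course; the numbers are arbitrary illustrative values:

```r
# A few one-line commands of the kind typed at the R console.
x <- c(1.6, 1.9, 2.2, 2.5)   # store a small data vector
mean(x)                      # built-in function: the sample mean
sqrt(25)                     # a mathematical calculation
summary(x)                   # quick numerical summary of the data
```

Each line is evaluated as soon as it is entered, and the result is printed back in the console.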
Some examples of populations and their associated variables of interest are:
(i) All adults in the UK who are eligible to vote; the variable of interest is the political party supported.
(ii) Car batteries of a particular type manufactured by a particular company; the variable of interest is the
lifetime of the battery before failure.
(iii) All adult males working full-time in Manchester; the variable of interest is the person’s gross income.
(iv) All possible outcomes of a planned laboratory experiment; the variable of interest is the value
of a particular measurement.
In general, the variables of interest may be either qualitative or quantitative. Qualitative variables are
either nominal, e.g. gender or political party supported, or ordinal, e.g. a measurement of size grouped into
three categories: small, medium or large. Quantitative variables are either discrete, for example a count, or
continuous, such as the variables income and lifetime above.
We wish to make conclusions, or inferences, about the population characteristics of variables of interest.
One way to do so is to conduct a census, i.e. to collect data for each individual in the population. However
often this is not feasible, due to one or more of the following:
• Testing may be destructive, e.g. (ii), and we need to have some products left to sell!
Instead, we collect data only for a sample, i.e. a subset of the population. We then use the characteristics
of the sample to estimate the characteristics of the population. In order for this procedure to give a good
estimate, the sample must be representative of the population. Otherwise, if an unrepresentative or ‘biased’
sample is used the conclusions will be systematically incorrect.
Some examples of samples from populations are given below:
(i) In an opinion poll in May 2015, a sample of 1000 adults was obtained and asked which political party
they intended to vote for in the upcoming UK General Election on 7 May 2015. A summary of these
responses is:
(ii) A random sample of 40 manufactured car batteries was taken from the production line, and their lifetimes
(in years) determined. The data are as follows, arranged in ascending order for convenience:
1.6, 1.9, 2.2, 2.5, 2.6, 2.6, 2.9, 3.0, 3.0, 3.1,
3.1, 3.1, 3.1, 3.2, 3.2, 3.2, 3.3, 3.3, 3.3, 3.4,
3.4, 3.4, 3.5, 3.5, 3.6, 3.7, 3.7, 3.7, 3.8, 3.8,
3.9, 3.9, 4.1, 4.1, 4.2, 4.3, 4.4, 4.5, 4.7, 4.7
(iii) We could obtain a sample of 500 adult males working full-time in Manchester. The following table
summarizes a hypothetical data set of the annual incomes in thousands of pounds for such a sample.
Interval Frequency Percentage
5 to 15 83 16.6
15 to 25 142 28.4
25 to 35 90 18.0
35 to 45 79 15.8
45 to 55 46 9.2
55 to 65 28 5.6
65 to 75 13 2.6
75 to 85 6 1.2
85 to 95 4 0.8
95 to 105 3 0.6
105 to 115 0 0.0
115 to 125 2 0.4
125 to 135 0 0.0
135 to 145 0 0.0
145 to 155 1 0.2
155 to 165 0 0.0
165 to 175 1 0.2
175 to 185 1 0.2
185 to 195 1 0.2
Totals 500 100.0
The intervals in the table are open on the left and closed on the right, e.g. the first row gives the count
of incomes in the range (5, 15].
• Sampling without replacement: each individual may appear at most once in the sample. A sample of size n can be drawn sequentially; for i = 1, . . . , n:
Step 1. Select an individual at random with equal probability from the remaining population of size N − i + 1.
Step 2. Include the selected individual as the ith member of the sample, and remove the selected individual
from the population, leaving N − i individuals remaining.
• Sampling with replacement: each individual may appear any number of times in the sample, leading to N^n possible samples. The probability of selecting any particular sample is N^{−n}. This can be implemented using a similar sequential algorithm to before, where instead in Step 2 the selected individual is not removed from the population.
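Both sampling schemes are available in R through the built-in sample() function. A minimal sketch, using an artificial population labelled 1 to 10:

```r
# Simple random sampling from a population of size N = 10.
# replace = FALSE implements the sequential scheme above (no repeats);
# replace = TRUE allows each individual to be selected repeatedly.
population <- 1:10
set.seed(1)                                              # for reproducibility
s_wor <- sample(population, size = 4, replace = FALSE)   # without replacement
s_wr  <- sample(population, size = 4, replace = TRUE)    # with replacement
print(s_wor)
print(s_wr)
```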
Example. Let v1 , . . . , vN denote the values of the variable X for the 1st, . . ., N th individuals in the population.
Suppose that interest lies in estimating the population mean of X,
    µ = (1/N) Σ_{j=1}^N vj .
Let X1 , . . . , Xn be the values of X in a sample of size n chosen by sampling without replacement. The
population mean µ can be estimated by
    X̄ = (1/n) Σ_{i=1}^n Xi .
The value of X̄ will be different for different samples, and so X̄ is a random variable because the sample is chosen randomly. Thus, X̄ has its own probability distribution, which is known as its sampling distribution.
How can we measure the performance of the above method of estimating µ? One way is to calculate the expectation and variance of the sampling distribution of X̄. In particular, it can be shown that under sampling without replacement

    E(X̄) = µ .

As a result, X̄ is said to be unbiased. We will study this unbiasedness property further in Chapter 5. Moreover
it is possible to show that under sampling without replacement
    Var(X̄) = (σ²/n) × (N − n)/(N − 1) ,        (1)

where

    σ² = (1/N) Σ_{j=1}^N (vj − µ)² = (1/N) Σ_{j=1}^N vj² − µ² .
We can verify that E(X̄) = µ as follows. First note from the table that the p.m.f. of X̄ is

    x        5/2    2    3/2
    pX̄(x)   1/3   1/3   1/3

Hence

    E(X̄) = Σ_{x∈RX̄} x pX̄(x) = (1/3) × (5/2) + (1/3) × 2 + (1/3) × (3/2) = 2 .
Recall that, for any random variable Y, Var(Y) = E(Y²) − E(Y)². Note further that

    E(X̄²) = Σ_{x∈RX̄} x² pX̄(x) = (1/3) × (5/2)² + (1/3) × 2² + (1/3) × (3/2)² = 25/6 ,

and so Var(X̄) = E(X̄²) − (E X̄)² = 25/6 − 2² = 1/6.
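The same kind of check can be carried out in R by enumerating every equally likely sample from a small population. A sketch, with an arbitrary illustrative population of N = 4 values and samples of size n = 2:

```r
# Check E(Xbar) = mu and formula (1), Var(Xbar) = (sigma^2/n)(N - n)/(N - 1),
# by enumerating all ordered samples of size 2 drawn without replacement
# from a small population.
v <- c(1, 3, 4, 8); N <- length(v); n <- 2
mu <- mean(v)
sigma2 <- mean((v - mu)^2)               # population variance (divisor N)
pairs <- expand.grid(i = 1:N, j = 1:N)
pairs <- pairs[pairs$i != pairs$j, ]     # ordered pairs without replacement
xbar <- (v[pairs$i] + v[pairs$j]) / 2    # sample mean of each possible sample
c(mean(xbar), mu)                        # both equal 4
c(mean((xbar - mu)^2), (sigma2 / n) * (N - n) / (N - 1))  # variances agree
```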
The cumulative distribution function (c.d.f.) of a random variable X is

    FX(x) = P(X ≤ x) .

If X is a continuous random variable then there is also an associated probability density function (p.d.f.) fX(x), which satisfies

    dFX(x)/dx = fX(x) .

If X is a discrete random variable then there is instead a probability mass function (p.m.f.) pX(x) satisfying

    Σ_{t≤x} pX(t) = FX(x) .
We now recall several concepts from MATH10141 Probability I. For a continuous random variable, the popu-
lation mean µ and variance σ 2 of X are
    µ = ∫_{−∞}^{∞} x fX(x) dx ,
    σ² = ∫_{−∞}^{∞} (x − µ)² fX(x) dx .

For a discrete random variable, these quantities are instead defined in terms of the p.m.f.:

    µ = Σ_{x∈RX} x pX(x) ,
    σ² = Σ_{x∈RX} (x − µ)² pX(x) .
More generally, events B1 , . . . , Bn are mutually independent if for every subset {Bi1 , . . . , Bik }, (k ≥ 2) of
{B1 , . . . , Bn },
P(Bi1 ∩ · · · ∩ Bik ) = P(Bi1 ) × · · · × P(Bik ) .
2.2.2 Independent random variables
• If X1 , . . . , Xn are identically distributed with c.d.f. FX (x), then X1 , . . . , Xn are independent if and only
if
P(X1 ≤ x1 , . . . , Xn ≤ xn ) = FX (x1 ) × · · · × FX (xn ) .
• If X1 , . . . , Xn are discrete random variables with common p.m.f. pX(x), then X1 , . . . , Xn are independent if and only if the joint p.m.f. satisfies

    p(x1, . . . , xn) = pX(x1) × · · · × pX(xn) .

• If X1 , . . . , Xn are continuous random variables with common p.d.f. fX(x), then X1 , . . . , Xn are independent if and only if the joint p.d.f. satisfies

    f(x1, . . . , xn) = fX(x1) × · · · × fX(xn) .
The idea of independence is now used to define sampling from a general population. We say that X1 , . . . , Xn
are a random sample from X if X1 , . . . , Xn ∼ FX (x) independently. We may also say that X1 , . . . , Xn is a
random sample from FX (x), fX (x) or pX (x).
Example. Simple random sampling of n individuals with replacement from a finite population of size N with X-values v1 , . . . , vN corresponds to independent random sampling of X1 , . . . , Xn from the p.m.f.

    pX(x) = (1/N) × {number of j such that vj = x} .
Similar to the previous section, we may use the characteristics of the sample to estimate the characteristics
of the population. For example, suppose we are interested in the population mean µ. This may again be
estimated by the sample mean, i.e.
    X̄ = (1/n) Σ_{i=1}^n Xi .

Once again, the value of X̄ is random because X1 , . . . , Xn is a random sample from the population. Moreover, it is again true that

    E(X̄) = µ .
    f.p.c. = (N − n)/(N − 1) ,
which is called the finite population correction (f.p.c.). The difference in Var(X̄) occurs because under sampling without replacement the Xi are not independent. However, the Xi can be considered to be approximately independent when N is large and the sampling proportion n/N is small. In this case,

    f.p.c. = (1 − n/N)/(1 − 1/N) ≈ 1 .
In the remainder of this course we will always assume that X1 , . . . , Xn are sampled independently from a c.d.f.
FX (x).
A sample of n = 50 components was taken from a production line, and their lifetimes (in hours) determined. A
tabulation of the sample values is given overleaf. A possible parametric model for these data is to assume that
they are a random sample from a normal distribution N(µ, σ²). The parameters µ and σ² can be estimated from the sample by µ̂ = x̄ = 334.6 and σ̂² = s² = 15.288.
We can informally investigate how well this distribution fits the data by superimposing the probability
density function of a N (334.6, 3.9122 ) distribution onto a histogram of the data. This is illustrated in the
figure overleaf, which shows the fit to be reasonably good, particularly for data greater than the mean.
Interval Frequency Percentage
323.75 to 326.25 1 2
326.25 to 328.75 0 0
328.75 to 331.25 9 18
331.25 to 333.75 12 24
333.75 to 336.25 11 22
336.25 to 338.75 10 20
338.75 to 341.25 5 10
341.25 to 343.75 1 2
343.75 to 346.25 1 2
Totals 50 100
Figure 1: Histogram of the component lifetime data together with a N(334.6, 3.912²) p.d.f.
This figure can be obtained using the R code below. The lines command draws a curve through the (x, y)
co-ordinates provided.
xx <- comp_lifetime$lifetime                    # observed lifetimes
xv <- seq(320, 350, 0.1)                        # grid of x values for the p.d.f.
yv <- dnorm(xv, mean = mean(xx), sd = sd(xx))   # fitted normal density
hist(xx, freq = FALSE,
     breaks = seq(from = 323.75, to = 346.25, by = 2.5),
     xlim = c(320, 350), ylim = c(0, 0.12),
     main = "Histogram of lifetime data with Normal pdf",
     xlab = "lifetime (hours)")
lines(xv, yv)                                   # superimpose the density curve
The fitted normal distribution appears to be a reasonably good fit to the observed data, thus we may use it
to calculate estimated probabilities. For example, consider the question ‘what is the estimated probability that
a randomly selected component lasts between 330 and 340 hours?’. To answer this, let the random variable
X be the lifetime of a randomly selected component. We require P(330 < X < 340) under the fitted normal
model, X ∼ N (334.6, 3.9122 ):
    P(330 < X < 340) = P( (330.0 − 334.6)/3.912 < (X − 334.6)/3.912 < (340.0 − 334.6)/3.912 )
                     = P(−1.18 < Z < 1.38) , where Z ∼ N(0, 1)
                     = Φ(1.38) − Φ(−1.18) = 0.9162 − 0.1190 = 0.7972 .
Hence, using the fitted normal model we estimate that 79.72% of randomly selected components will have
lifetimes between 330 and 340 hours.
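This calculation can be reproduced in R with the pnorm() function, which evaluates the normal c.d.f. directly without standardizing by hand:

```r
# P(330 < X < 340) for X ~ N(334.6, 3.912^2).
p <- pnorm(340, mean = 334.6, sd = 3.912) - pnorm(330, mean = 334.6, sd = 3.912)
print(round(p, 4))
```

The answer agrees with the hand calculation up to the rounding of the standardized values.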
3.1.2 Manchester income data
If we superimpose a normal density curve onto the histogram for these data, then we see that the symmetric
normal distribution is a poor fit, since the data are skewed. In particular, the normal density extends to
negative income values despite the fact that all of the incomes in the sample are positive.
Figure 2: Histogram of the income data with the p.d.f. of the fitted normal distribution.
xx <- income$income                             # observed incomes (GBP x 1000)
xv <- seq(0, 200, 0.5)
yv <- dnorm(xv, mean = mean(xx), sd = sd(xx))   # fitted normal density
hist(xx, freq = FALSE, breaks = seq(from = 5, to = 195, by = 10),
     ylim = c(0, 0.030), xlab = "income (GBP x 1000)",
     main = "Histogram of income data with Normal pdf")
lines(xv, yv)
One way forward is to look for a transformation which will make the data appear to be more normally
distributed. Because the data are strongly positively skewed on the positive real line one possibility is to take
logarithms.
In the figure below, we see a histogram of the log transformed income data. The fit of the superimposed
normal p.d.f. now looks reasonable, although there are perhaps slightly fewer sample observations than might
be expected according to the normal model in the left-hand tail and centre. There are also some outliers in
the right-hand tail.
[Figure: histogram of the log(income) data with the p.d.f. of the fitted normal distribution superimposed; x-axis log(income), y-axis density.]
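The transformation itself is a single line in R. Since the income data frame is not reproduced here, the sketch below uses simulated lognormal values purely for illustration; the parameter choices are arbitrary:

```r
# Log-transforming positively skewed data, illustrated on simulated
# lognormal "incomes" (the real data would be in income$income).
set.seed(42)
xx <- rlnorm(500, meanlog = 3.3, sdlog = 0.55)   # skewed, positive values
lx <- log(xx)                                    # transformed data
hist(lx, freq = FALSE,
     main = "Histogram of log(income) data with Normal pdf",
     xlab = "log(income)")
xv <- seq(min(lx), max(lx), length.out = 200)
lines(xv, dnorm(xv, mean = mean(lx), sd = sd(lx)))  # fitted normal p.d.f.
```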
Even if it is not clear whether or not we can find a completely satisfactory parametric model, we will see in
a later section that we can still make approximate inferences about the mean income in the population by
appealing to the central limit theorem.
where ‘Other’ includes all other parties. As suggested earlier, we can estimate the probabilities pC , pL , etc. by
the proportions of sampled individuals supporting the corresponding party. Specifically we obtain the following
estimates:
    p̂C = P̂(X = Conservatives) = 369/1000 = 0.369 ,
    p̂L = P̂(X = Labour) = 314/1000 = 0.314 ,
    p̂LD = P̂(X = Liberal Democrats) = 75/1000 = 0.075 ,
    p̂U = P̂(X = UKIP) = 118/1000 = 0.118 ,
    p̂O = P̂(X = Other party) = 124/1000 = 0.124 .
It is beyond the scope of this module to consider a joint probability model for the vector (nC , nL , nLD , nU , nO )
containing the numbers of individuals supporting each of the five possible choices in a sample of size n. How-
ever we may slightly simplify the situation by focussing on whether or not a randomly chosen voter supports
Labour.
Let the random variable XL denote the number of voters out of the 1000 who support Labour. An
appropriate model may be
XL ∼ Bi(n, pL ) ,
with n = 1000, and pL is estimated by p̂L = 0.314. We may use the fitted model to answer various questions,
e.g. ‘what is the estimated probability that in a random sample of 1000 voters at least 330 will support Labour?’.
We require P(XL ≥ 330) under the fitted model Bi(1000, 0.314). It is easiest to use a normal approximation to the binomial distribution (with a continuity correction), which gives

    P(XL ≥ 330) ≈ 1 − Φ( (329.5 − 1000 × 0.314) / √(1000 × 0.314 × 0.686) )
                = 1 − Φ(1.0561) = 0.1455 .
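In R we can compare this normal approximation with the exact binomial tail probability, using pbinom() and pnorm():

```r
# P(XL >= 330) for XL ~ Bi(1000, 0.314): exact tail probability
# versus the continuity-corrected normal approximation.
exact  <- 1 - pbinom(329, size = 1000, prob = 0.314)
approx <- 1 - pnorm(329.5, mean = 1000 * 0.314,
                    sd = sqrt(1000 * 0.314 * 0.686))
print(round(c(exact = exact, approx = approx), 4))
```

The two values are very close, which is typical when n is large and p is not near 0 or 1.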
A statistic is any real-valued function of the sample data,

    h(X1 , . . . , Xn) .

The value of this statistic will usually be different for different samples. As the sample data are random, the statistic is also a random variable. If we repeatedly drew samples of size n, calculating and recording the value of the sample statistic each time, then we would build up its probability distribution. The probability distribution of a sample statistic is referred to as its sampling distribution.
In this section we will see how to analytically determine the sampling distributions of some statistics, while
with certain others we can appeal to the central limit theorem. Simulation techniques can also be used to
investigate sampling distributions of statistics empirically.
4.1 Sample mean
The mean and variance of the distribution FX (x) are denoted by µ and σ 2 respectively. In the case that the
distribution is continuous with p.d.f. fX (x),
    µ = E(X) = ∫_{−∞}^{∞} x fX(x) dx ,   σ² = Var(X) = ∫_{−∞}^{∞} (x − µ)² fX(x) dx .

When the distribution is discrete with p.m.f. pX(x), µ and σ² are defined by:

    µ = E(X) = Σ_{x∈RX} x pX(x) ,   σ² = Var(X) = Σ_{x∈RX} (x − µ)² pX(x) .
Here we have used Var(X1 + . . . + Xn ) = Var(X1 ) + . . . + Var(Xn ), which holds because the Xi are independent.
These results tell us that the sampling distribution of the sample mean X̄ is centered on the common mean µ of each of the sample variables X1 , . . . , Xn (i.e. the mean of the distribution from which the sample is obtained) and has variance equal to the common variance of the Xi divided by n. Thus, as the sample size n increases, the sampling distribution of X̄ becomes more concentrated around the true mean µ.
In the above discussion nothing specific has been said regarding the actual distribution from which the Xi have been sampled. All we are assuming is that the mean and variance of the underlying distribution are both finite.
In the special case that the Xi are normally distributed we can make use of some important results. Let the random variable X ∼ N(µX, σX²) and let the random variable Y ∼ N(µY, σY²), independently of X. Then we have the following results:
(i) X + Y ∼ N(µX + µY, σX² + σY²)
(ii) X − Y ∼ N(µX − µY, σX² + σY²)
These results extend in a straightforward manner to the linear combination of n independent normal random
variables. Let X1 , . . . Xn be n independent normally distributed random variables with E(Xi ) = µi and
Var(Xi ) = σi2 for i = 1, . . . , n. Thus, here the normal distributions for different Xi may have different means
and variances. We then have that
    Σ_{i=1}^n ci Xi ∼ N( Σ_{i=1}^n ci µi , Σ_{i=1}^n ci² σi² ) ,

where the ci ∈ ℝ.
If now the Xi in the sample are i.i.d. N(µ, σ²) random variables then the sample mean, X̄, is a linear combination of the Xi (with ci = 1/n, i = 1, . . . , n, using the notation above). Thus, X̄ is normally distributed with mean µ and variance σ²/n, i.e. X̄ ∼ N(µ, σ²/n). This result enables us to make probabilistic statements about the mean under the assumption of normality.
In the previous section, we saw that the random quantity X̄ has a sampling distribution with mean µ and variance σ²/n. In the special case when we are sampling from a normal distribution, X̄ is also normally distributed. However, there are many situations when we cannot determine the exact form of the distribution of X̄. In such circumstances, we may appeal to the central limit theorem and obtain an approximate distribution.
The central limit theorem: Let X be a random variable with mean µ and variance σ². If X̄n is the mean of a random sample of size n drawn from the distribution of X, then the distribution of the statistic

    (X̄n − µ) / (σ/√n)

tends to the standard normal distribution as n → ∞.
This means that, for a large random sample from a population with mean µ and variance σ², the sample mean X̄n is approximately normally distributed with mean µ and variance σ²/n. Since, for large n, X̄n ∼ N(µ, σ²/n) approximately, we have that Σ_{i=1}^n Xi ∼ N(nµ, nσ²) approximately.
There is no need to specify the form of the underlying distribution FX , which may be either discrete or
continuous, in order to use this result. As a consequence it is of tremendous practical importance.
A common question is ‘how large does n have to be before the normality of X̄ is reasonable?’ The answer depends on the degree of non-normality of the underlying distribution from which the sample has been drawn. The more non-normal FX is, the larger n needs to be. A useful rule-of-thumb is that n should be at least 30.
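The theorem is easy to see empirically. The sketch below repeatedly draws samples of size n = 30 from the heavily skewed Exp(1) distribution (true mean 1, true variance 1) and examines the resulting sample means:

```r
# Sampling distribution of the mean of n = 30 Exp(1) observations.
# The CLT says it is approximately N(1, 1/30).
set.seed(5)
xbar <- replicate(20000, mean(rexp(30, rate = 1)))
cat("mean:", round(mean(xbar), 3), " variance:", round(var(xbar), 4), "\n")
hist(xbar, freq = FALSE, main = "Means of Exp(1) samples, n = 30")
xv <- seq(min(xbar), max(xbar), length.out = 200)
lines(xv, dnorm(xv, mean = 1, sd = sqrt(1 / 30)))  # CLT approximation
```

Despite the strong skewness of the underlying distribution, the histogram of the simulated means is already close to the normal curve at n = 30.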
Example 2. (Income data). What is the approximate probability that the mean gross income based on a new
random sample of size n = 500 lies between 33.0 and 33.5 thousand pounds?
The underlying distribution is not normal, but we can appeal to the central limit theorem to say that X̄500 ∼ N(µ, σ²/500) approximately. We may estimate µ and σ² from the data by µ̂ = x̄ = 33.27 and σ̂² = s² = 503.554, so that s = 22.44. Therefore, using the fitted values of the parameters we may estimate the probability as

    P(33.0 < X̄500 < 33.5) ≈ Φ( (33.50 − 33.27)/(22.44/√500) ) − Φ( (33.00 − 33.27)/(22.44/√500) )
                           ≈ Φ(0.23) − Φ(−0.27) = 0.5910 − 0.3936
                           ≈ 0.1974 .

Hence we estimate the probability that X̄ lies between 33.0 and 33.5 to be 0.1974.
Example 3. Suppose that, in a particular country, the unemployment rate is 9.2%. Suppose that a random
sample of 400 individuals is obtained. What are the approximate probabilities that:
(i) At most 40 of the sampled individuals are unemployed.
(ii) The proportion unemployed is greater than 0.125.
Solution:
Let Xi = 1 if the ith sampled individual is unemployed and Xi = 0 otherwise, so that the Xi are Bernoulli with p = 0.092. By the central limit theorem, Σ_{i=1}^{400} Xi ∼ N(400 × 0.092, 400 × 0.092 × 0.908) = N(36.8, 33.414) approximately.
(i) Using a continuity correction,

    P( Σ_{i=1}^{400} Xi ≤ 40 ) = P( (Σ_{i=1}^{400} Xi − 36.8)/√33.414 ≤ (40.5 − 36.8)/√33.414 )
                               ≈ P(Z ≤ 0.640) , where Z ∼ N(0, 1) approx.
                               = Φ(0.640) = 0.7390 .
(ii) Here, Var(X̄400) = p(1 − p)/n = (0.092 × 0.908)/400 = 0.0002088. Thus,

    P(X̄400 > 0.125) = P( (X̄400 − 0.092)/√0.0002088 > (0.125 − 0.092)/√0.0002088 )
                     ≈ 1 − Φ(2.284)
                     = 1 − 0.9888 = 0.0112 .
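Both parts can be checked in R, comparing the normal approximations against the exact Bi(400, 0.092) distribution:

```r
# Example 3 in R: number unemployed ~ Bi(400, 0.092) approximately N(36.8, 33.414).
n <- 400; p <- 0.092
mu <- n * p; v <- n * p * (1 - p)
# (i) P(at most 40 unemployed): continuity-corrected normal vs exact binomial
approx_i <- pnorm(40.5, mean = mu, sd = sqrt(v))
exact_i  <- pbinom(40, size = n, prob = p)
# (ii) P(sample proportion > 0.125), on the proportion scale
approx_ii <- 1 - pnorm(0.125, mean = p, sd = sqrt(p * (1 - p) / n))
print(round(c(approx_i = approx_i, exact_i = exact_i, approx_ii = approx_ii), 4))
```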
The sample variance is defined by

    S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)² ,

where X1 , . . . , Xn are a random sample from the distribution with c.d.f. FX(·) with mean µ and variance σ².
If FX is any discrete or continuous distribution with a finite variance then
    E(S²) = E[ (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)² ]
          = (1/(n − 1)) E[ Σ_{i=1}^n ((Xi − µ) − (X̄ − µ))² ]
          = (1/(n − 1)) E[ Σ_{i=1}^n { (Xi − µ)² − 2(Xi − µ)(X̄ − µ) + (X̄ − µ)² } ]
          = (1/(n − 1)) E[ Σ_{i=1}^n (Xi − µ)² − 2n(X̄ − µ)(X̄ − µ) + n(X̄ − µ)² ]   using Σ_{i=1}^n (Xi − µ) = n(X̄ − µ)
          = (1/(n − 1)) { Σ_{i=1}^n E[(Xi − µ)²] − 2n E[(X̄ − µ)²] + n E[(X̄ − µ)²] }
          = (1/(n − 1)) { nσ² − 2n(σ²/n) + n(σ²/n) }   since E[(X̄ − µ)²] = Var(X̄) = σ²/n
          = (1/(n − 1)) (n − 1)σ² = σ² .
Hence, we can see that by using the divisor (n − 1) in the definition of S², we obtain a statistic whose sampling distribution is centered on the true value of σ². This would not be the case if we had used the perhaps more intuitively obvious divisor n.
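A short simulation in R illustrates this: over repeated samples, the divisor-(n − 1) estimator averages close to the true σ², while the divisor-n version is systematically too small. The N(0, 4) population and sample size below are arbitrary choices for illustration:

```r
# Compare divisors (n - 1) and n when estimating sigma^2 = 4
# from repeated N(0, 4) samples of size n = 10.
set.seed(2)
n <- 10
s2  <- replicate(20000, var(rnorm(n, mean = 0, sd = 2)))  # divisor n - 1
s2n <- s2 * (n - 1) / n                                   # divisor n
print(c(mean_s2 = mean(s2), mean_s2n = mean(s2n)))        # approx 4 and 3.6
```

The divisor-n average settles near (n − 1)σ²/n = 3.6 rather than 4, exactly as the bias calculation in Example 7 below predicts.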
We will look more specifically at the case when the Xi are sampled from the N (µ, σ 2 ) distribution. In order
to do so, we first need to introduce a new continuous probability distribution, the chi-squared (χ2 ) distribution.
The continuous random variable Y is said to have the χ² distribution with k degrees of freedom (d.f.), written Y ∼ χ²(k), if its p.d.f. is given by

    f(y) = y^{(k/2) − 1} e^{−y/2} / ( 2^{k/2} Γ(k/2) )   for y > 0 ,

and f(y) = 0 otherwise.
Note that this is a special case of the Gamma distribution with parameters α = k/2 and β = 1/2, and that when k = 2, Y ∼ Exp(1/2). The mean and variance are given by E(Y) = k and Var(Y) = 2k.
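These moments can be checked numerically in R from the built-in density dchisq(), here for k = 6:

```r
# Verify E(Y) = k and Var(Y) = 2k for Y ~ chi-squared(k), with k = 6,
# by numerical integration of the density.
k  <- 6
m1 <- integrate(function(y) y * dchisq(y, df = k), 0, Inf)$value     # E(Y)
m2 <- integrate(function(y) y^2 * dchisq(y, df = k), 0, Inf)$value   # E(Y^2)
print(c(mean = m1, variance = m2 - m1^2))   # approx 6 and 12
```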
The p.d.f.s of chi-squared random variables with d.f. = 1, 3, 6, and 12 are shown in the figure below. Note that the p.d.f. becomes more symmetric as the number of degrees of freedom k becomes larger.
[Figure: p.d.f.s of the χ²(k) distribution for k = 1, 3, 6, 12, plotted for 0 ≤ x ≤ 30.]
Let Z1 , . . . , Zk be k i.i.d. standard normal random variables, i.e. each has a N(0, 1) distribution. Then the random variable

    Y = Σ_{i=1}^k Zi²

has a χ²(k) distribution. To see that E(Y) = k, note that

    1 = Var(Zi) = E(Zi²) − [E(Zi)]² = E(Zi²) , since E(Zi) = 0 .

Hence

    E[Y] = E[ Σ_{i=1}^k Zi² ] = Σ_{i=1}^k E(Zi²) = k .
Suppose now the random variables X1 , . . . , Xn are a random sample from the N(µ, σ²) distribution. We have that

    (Xi − µ)/σ ∼ N(0, 1) ,  i = 1, . . . , n ,

so that

    Σ_{i=1}^n ( (Xi − µ)/σ )² ∼ χ²(n) .
If we modify the above by replacing the population mean µ by the sample estimate X̄, the distribution changes and we obtain the following result.
Theorem. If X1 , . . . , Xn ∼ N(µ, σ²) independently, then

    (n − 1)S²/σ² = Σ_{i=1}^n ( (Xi − X̄)/σ )² ∼ χ²(n − 1) .
By replacing µ with X̄, the χ² distribution of the sum of squares has lost one degree of freedom. This is because there is a single linear constraint on the variables (Xi − X̄)/σ, namely Σ_{i=1}^n (Xi − X̄)/σ = 0. Thus we are only summing n − 1 independent squared terms. Important fact: X̄ and S² are independent random variables.
Example 4. Let X1 , . . . , X40 be a random sample of size n = 40 from the N(25, 4²) distribution. Find the probability that the sample variance, S², exceeds 20.
Solution. We need to calculate

    P(S² > 20) = P( 39S²/16 > (39 × 20)/16 )
               = P(Y > 48.75) , where Y ∼ χ²(39)
               = 1 − P(Y < 48.75) = 1 − 0.8638 = 0.1362 ,
where the probability calculation has been carried out using the pchisq command in R:
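In full, with 39 × 20/16 = 48.75:

```r
# P(S^2 > 20) = P(Y > 48.75) for Y ~ chi-squared(39).
p <- 1 - pchisq(39 * 20 / 16, df = 39)
print(round(p, 4))
```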
5 Point estimation
5.1 Introduction
The objective of a statistical analysis is to make inferences about a population based on a sample. Usually we
begin by assuming that the data were generated by a probability model for the population. Such a model will
typically contain one or more parameters θ whose value is unknown. The value of θ needs to be estimated using
the sample data. For example, in previous chapters we have used the sample mean to estimate the population
mean, and the sample proportion to estimate the population proportion.
A given estimation procedure will typically yield different results for different samples, thus under random
sampling from the population the result of the estimation will be a random variable with its own sampling
distribution. In this chapter, we will discuss further the properties that we would like an estimation procedure
to have. We begin to answer questions such as:
5.2 General framework
Let X1 , . . . , Xn be a random sample from a distribution with c.d.f. FX(x; θ), where θ is a parameter whose value is unknown. A (point) estimator of θ, denoted by θ̂, is a real, single-valued function of the sample, i.e.

    θ̂ = h(X1 , . . . , Xn) .

As we have seen already, because the Xi are random variables, the estimator θ̂ is also a random variable whose probability distribution is called its sampling distribution.
The value θ̂ = h(x1 , . . . , xn) assumed for a particular sample x1 , . . . , xn of observed data is called a (point) estimate of θ. Note the point estimate will almost never be exactly equal to the true value of θ, because of sampling error.
Often θ may in fact be a vector of p scalar parameters. In this case, we require p separate estimators for each of the components of θ. For example, the normal distribution has two scalar parameters µ and σ². These could be combined into a single parameter vector, θ = (µ, σ²), for which one possible estimator is θ̂ = (X̄, S²).
Ideally, the sampling distribution of an estimator should be (i) centered on the true value of the parameter and (ii) have small spread. If an estimator has properties (i) and (ii) then we can expect estimates resulting from statistical experiments to be close to the true value of the population parameter we are trying to estimate.
We now define some mathematical concepts formalizing these notions. The bias of a point estimator θ̂ is

    bias(θ̂) = E(θ̂) − θ .

The estimator is said to be unbiased if

    E(θ̂) = θ ,

i.e. if bias(θ̂) = 0. Unbiasedness corresponds to property (i) above, and is generally seen as a desirable property for an estimator. Note that sometimes biased estimators can be modified to obtain unbiased estimators. For example, if E(θ̂) = kθ, where k ≠ 1 is a constant, then bias(θ̂) = (k − 1)θ; however, θ̂/k is an unbiased estimator of θ.
The spread of the sampling distribution can be measured by Var(θ̂). In this context, the standard deviation of θ̂, i.e. √Var(θ̂), is called the standard error. Suppose that we have two different unbiased estimators of θ, called θ̂1 and θ̂2, which are both based on samples of size n. By principle (ii) above, we would prefer the estimator with the smaller variance, i.e. choose θ̂1 if Var(θ̂1) < Var(θ̂2), otherwise choose θ̂2.
Example 5. Let X1 , . . . , Xn be a random sample from a N (µ, σ 2 ) distribution where σ 2 is assumed known.
Recall that the Xi ∼ N(µ, σ²) independently in this case. We can estimate µ by the sample mean, i.e.

    µ̂ = X̄ = (1/n) Σ_{i=1}^n Xi .
We have already seen that E(X̄) = µ, thus bias(X̄) = 0. Moreover, Var(X̄) = σ²/n. Note that Var(X̄) → 0 as n → ∞. Thus, as the sample size increases, the sampling distribution of X̄ becomes more concentrated about the true parameter value µ. The standard error of X̄ is
    s.e.(X̄) = √Var(X̄) = σ/√n .
Note that if σ² were in fact unknown, then this standard error would also need to be estimated from the data, via

    ŝ.e.(X̄) = s/√n .
Importantly, the results E(X̄) = µ and Var(X̄) = σ²/n also hold if X1 , . . . , Xn are sampled independently from any continuous or discrete distribution with mean µ and variance σ². Thus the sample mean is always an unbiased estimator of the population mean.
Example 6. Let X1 , . . . , X5 be a random sample of size n = 5 from a N(µ, σ²) distribution, and consider the estimator

    µ̃ = (1/9)X1 + (2/9)X2 + (3/9)X3 + (2/9)X4 + (1/9)X5 .

We have that

    E[µ̃] = µ/9 + 2µ/9 + 3µ/9 + 2µ/9 + µ/9 = µ ,

and

    Var[µ̃] = σ²/81 + 4σ²/81 + 9σ²/81 + 4σ²/81 + σ²/81 = 19σ²/81 .

Thus, µ̃ is an unbiased estimator of µ with variance 19σ²/81. The sample mean µ̂ = X̄ is also unbiased for µ and has variance σ²/5.
The two estimators µ̂ and µ̃ both have normal sampling distributions centered on µ, but the variance of the sampling distribution of µ̂ is smaller than that of µ̃ because σ²/5 < 19σ²/81. Hence, in practice, we would prefer to use µ̂.
Example 7. Let X1 , . . . , Xn be a random sample from a N(µ, σ²) distribution where now both µ and σ² are assumed to be unknown. We can use X̄ as an estimator of µ and S² as an estimator of σ². We have already seen that
    σ̂² = S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)²

is an unbiased estimator of σ². If instead we had used the divisor n, defining

    σ̃² = (1/n) Σ_{i=1}^n (Xi − X̄)² = ((n − 1)/n) S² ,

we see that E[σ̃²] = ((n − 1)/n) σ². Thus σ̃² is a biased estimator of σ² with bias −σ²/n. Notice that bias(σ̃²) → 0 as n → ∞. We say that σ̃² is asymptotically unbiased. It is common practice to use S², with the denominator n − 1 rather than n. This results in an unbiased estimator of σ² for all values of n.
Exactly the same argument as above could also be made for using S 2 as an estimator of the variance of the
population distribution if the data were from another, non-normal, continuous distribution or even a discrete
distribution. The only prerequisite is that σ 2 is finite in the population distribution. Therefore, calculations
of the sample variance for any set of data should always be based on using divisor (n − 1).
Example 8. Let X1 , . . . , Xn be a random sample of Bernoulli random variables with parameter p which is
unknown. Thus, Xi ∼ Bi(1, p) for i = 1, . . . , n so that E(Xi ) = p and Var(Xi ) = p(1 − p), i = 1, . . . , n.
If we consider estimating p by the proportion of ‘successes’ in the sample then we have
    p̂ = (1/n) Σ_{i=1}^n Xi ,

so that

    E(p̂) = (1/n) Σ_{i=1}^n E(Xi) = (1/n) np = p ,

thus E(p̂) = p. Also,

    Var(p̂) = (1/n²) Σ_{i=1}^n Var(Xi)   (by independence)
            = (1/n²) np(1 − p) = p(1 − p)/n .

Hence, p̂ is an unbiased estimator of p with variance p(1 − p)/n. Notice that the variance of this estimator also tends towards zero as n gets larger.
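This behaviour is easy to see numerically; a sketch with the illustrative value p = 0.3:

```r
# Sampling distribution of the sample proportion p-hat for p = 0.3:
# unbiased, with variance p(1 - p)/n shrinking as n grows.
set.seed(4)
p <- 0.3
for (n in c(20, 200)) {
  phat <- rbinom(40000, size = n, prob = p) / n   # 40000 simulated p-hats
  cat("n =", n, " mean =", round(mean(phat), 3),
      " var =", round(var(phat), 5),
      " theory =", round(p * (1 - p) / n, 5), "\n")
}
```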
Example 9. Let X1 , . . . , Xn be a random sample from a U [θ, θ + 1] distribution where θ is unknown. Thus,
the data are uniformly distributed on an interval of unit length, but the location of that interval is unknown. Consider using the estimator θ̂ = X̄.
Now,

    E(X̄) = (θ + (θ + 1))/2 = (2θ + 1)/2 = θ + 1/2 .

Therefore, bias(X̄) = θ + 1/2 − θ = 1/2, while Var(X̄) = 1/(12n). However, if we instead define θ̂ = X̄ − 1/2 then E(θ̂) = θ and Var(θ̂) = 1/(12n).
• Application of the estimation procedure, or estimator, to a particular observed data set results in an
estimate of the unknown value of the parameter. The estimate will be different for different random data
sets.
• The properties of the sampling distribution (bias, variance) tell us how good our estimator is, and hence
how good our estimate is likely to be.
• Estimation procedures can occasionally give poor estimates due to random sampling error. For good
estimators, the probability of obtaining a poor estimate is lower.
6 Likelihood for discrete data
6.1 The likelihood function
The parameter estimators we have considered so far have mostly been motivated by intuition. For example,
the sample mean X is an intuitive estimator of the population mean. However in many situations, it is not
obvious how to define an appropriate estimator for the parameter(s) of interest.
One method for deriving an estimator, which works for almost any parameter of interest, is the method
of maximum likelihood. The estimators derived in this way typically have good properties. The method
revolves around the likelihood function, which is of great importance throughout Statistics. The likelihood
function is used extensively in estimation and also hypothesis testing, which we discuss in a later chapter.
Let X1 , . . . , Xn be an i.i.d. random sample from the discrete distribution with p.m.f. p(x | θ), where θ is
a parameter whose value is unknown. Given observed data values x1 , . . . , xn from this model, the likelihood
function is defined as
L(θ) = P(X1 = x1 , X2 = x2 , . . . , Xn = xn | θ) .
In other words,
the likelihood is the joint probability of the observed data considered as a function of
the unknown parameter θ.
Example 10. Let x_1, . . . , x_n be a sample obtained from the Poisson(λ) distribution with p.m.f.

    p(x | λ) = λ^x e^{−λ} / x! ,   x = 0, 1, 2, . . .

The likelihood function is therefore

    L(λ) = ∏_{i=1}^n p(x_i | λ) = e^{−nλ} λ^{∑_{i=1}^n x_i} / ∏_{i=1}^n x_i! .

The value of θ at which the likelihood function attains its maximum is called the maximum likelihood estimate of θ. It is a function of the observed data,

    θ̂ = h(x_1, . . . , x_n) .

If we replace the observed values by the corresponding random variables we obtain

    θ̂ = h(X_1, . . . , X_n) ,

in which case θ̂ is a random variable called the maximum likelihood estimator. The maximum likelihood estimator possesses its own sampling distribution, which will be studied in later Statistics modules.
In simple cases, the maximum likelihood estimate can be found by standard calculus techniques, i.e. by solving

    dL(θ)/dθ = 0 .   (3)
However, it is usually much easier algebraically to find the maximum of the log-likelihood l(θ) = log L(θ) because for i.i.d. data,

    log L(θ) = log [ ∏_{i=1}^n p(x_i | θ) ] = ∑_{i=1}^n log p(x_i | θ) .

Hence, the log-likelihood is additive as opposed to the likelihood, which is multiplicative. This is advantageous because it is far easier to differentiate a sum of functions than to differentiate a product of functions.
To find the value of θ that maximizes l(θ) we instead find θ̂ that solves:

    dl(θ)/dθ = ∑_{i=1}^n d log p(x_i | θ)/dθ = 0 .   (4)

The solution is a maximum if d²l(θ)/dθ² < 0 at θ = θ̂. The estimate found by this method, i.e. by maximizing the log-likelihood, is identical to the one found by maximizing the likelihood directly, because the logarithm is a monotonically increasing function.
Example 11. Let X1 , . . . , Xn be a random sample from the Poisson(λ) distribution. Find the maximum
likelihood estimator of λ.
We have seen that

    L(λ) = e^{−nλ} λ^{∑_{i=1}^n X_i} / ∏_{i=1}^n X_i! ,

so that

    l(λ) = −nλ + ( ∑_{i=1}^n X_i ) log λ − log ( ∏_{i=1}^n X_i! ) .

Solving dl(λ)/dλ = 0, we obtain

    dl/dλ |_{λ=λ̂} = −n + ( ∑_{i=1}^n X_i ) / λ̂ = 0 ,   which implies that λ̂ = X̄ .

Checking the second derivative,

    d²l/dλ² |_{λ=λ̂} = − ( ∑_{i=1}^n X_i ) / λ̂² = −n/X̄ < 0 .

Therefore, λ̂ = X̄ is indeed the maximum likelihood estimator of λ. If we have a set of data x_1, . . . , x_n then the maximum likelihood estimate of λ is λ̂ = x̄, the sample mean. This is an intuitively sensible estimate, as the mean of the Poisson(λ) distribution is equal to λ.
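The result can be confirmed numerically: maximizing the Poisson log-likelihood over a fine grid recovers the sample mean. This is an illustrative Python sketch with made-up data (the values in x below are not from the notes):

```python
import math

# Illustrative check: for fixed Poisson data, the log-likelihood
# l(lam) = -n*lam + sum(x)*log(lam) - sum(log(x_i!)) is maximized at
# lam = x-bar.  The data below are made up for illustration only.
x = [9, 12, 8, 11, 10, 7, 13, 10, 9, 11]

def loglik(lam):
    return (-len(x) * lam + sum(x) * math.log(lam)
            - sum(math.lgamma(xi + 1) for xi in x))   # lgamma(k+1) = log(k!)

grid = [2 + 0.001 * k for k in range(12001)]          # lambda from 2 to 14
lam_hat = max(grid, key=loglik)

xbar = sum(x) / len(x)
# lam_hat agrees with the sample mean up to the grid spacing.
```

Because l(λ) is concave, the grid maximizer is simply the grid point nearest to x̄.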
Example 12. Let X1 , . . . , Xn be a random sample from a Bi(1, p) distribution. Find the maximum likelihood
estimator of p.
In this example the likelihood function is

    L(p) = ∏_{i=1}^n p^{X_i} (1 − p)^{1−X_i} = p^{∑_{i=1}^n X_i} (1 − p)^{n − ∑_{i=1}^n X_i} ,

so that the log-likelihood is given by

    l(p) = ( ∑_{i=1}^n X_i ) log p + ( n − ∑_{i=1}^n X_i ) log(1 − p) .

Solving dl/dp |_{p=p̂} = 0, we obtain

    dl/dp |_{p=p̂} = ( ∑_{i=1}^n X_i ) / p̂ − ( n − ∑_{i=1}^n X_i ) / (1 − p̂) = 0 ,

and so

    ∑_{i=1}^n X_i = n p̂ .

Thus, the maximum likelihood estimator of p is p̂ = ( ∑_{i=1}^n X_i ) / n = X̄, i.e. the sample proportion. We have previously seen that this is unbiased for p.
Note that it is worth checking the second derivative at p = p̂:

    d²l/dp² |_{p=p̂} = − ( ∑_{i=1}^n X_i ) / p̂² − ( n − ∑_{i=1}^n X_i ) / (1 − p̂)²
                     = −n/p̂ − n/(1 − p̂)
                     = −n / ( p̂ (1 − p̂) ) ,

which is negative, confirming that p̂ is indeed a maximum.
The likelihood and log-likelihood for Poisson data can be computed in R over a grid of λ values with a function along the following lines:

pois.lik <- function(x, lmin, lmax){
  lval <- seq(lmin, lmax, length = 1000)
  pl <- sapply(lval, function(l) prod(dpois(x, l)))
  lpl <- sapply(lval, function(l) sum(dpois(x, l, log = TRUE)))
  pl.res <- data.frame(lval = lval, pl = pl, lpl = lpl)
  return(pl.res)
}
The data are in the argument x while the minimum and maximum λ values to be considered are passed to the
function in the arguments lmin and lmax.
The function returns a data frame called pl.res comprising three columns. The first contains the sequence
of λ values used, the second contains the corresponding likelihood values and the third the corresponding
log-likelihood values.
Example 13. (Simulated data). The data in this example are a random sample of n = 30 simulated from the Po(λ = 10) distribution. The data are simulated via:

> xp <- rpois(30, 10)

The likelihood and log-likelihood functions for these data are then computed over a grid of λ values between 7 and 13 and plotted, giving Figure 5.
Figure 5: Likelihood (left) and log-likelihood (right) functions for the simulated Poisson data (n = 30, λ = 10).
The maximum likelihood estimate can be computed approximately via direct numerical maximization of
the likelihood or log-likelihood:
> lopt1 <- pl.res4$lval[which.max(pl.res4$pl)]
> lopt1
[1] 10.23123
> lopt2 <- pl.res4$lval[which.max(pl.res4$lpl)]
> lopt2
[1] 10.23123
The maximum likelihood estimate of λ obtained by numerical maximization is 10.23 (to 2 d.p.). We know that the maximum likelihood estimate can also be determined analytically as the sample mean:
> mean(xp)
[1] 10.23333
The reason for the slight discrepancy between the two results is the discretization error arising from the use of
a discrete set of λ values in the first method.
Please note that if you run the above code yourself, you will get slightly different results because you will
have sampled a different set of data using the function rpois.
Example 14. (Australian birth data). The data give the number of births per hour over a 24-hour period on
the 18 December 1997 at the Mater Mother’s Hospital in Brisbane, Australia. There were a total of n = 44
births. At the time, this was a record number of births in one 24-hour period in this hospital. We denote the
number of births in the ith hour by Xi and fit the model

    X_i ∼ Po(λ) ,   i = 1, . . . , 24 ,

to the 24 hourly counts. The data are:
> birth
hour number
1 1 1
2 2 3
3 3 1
4 4 0
5 5 4
6 6 0
7 7 0
8 8 2
9 9 2
10 10 1
11 11 3
12 12 1
13 13 2
14 14 1
15 15 4
16 16 1
17 17 2
18 18 1
19 19 3
20 20 4
21 21 3
22 22 2
23 23 1
24 24 2
Figure 6: The likelihood (left) and log-likelihood (right) functions for the Australian births data (n = 44).
The maximum likelihood estimate is 1.83, which can be found by direct numerical maximization of the likelihood or log-likelihood function. The result agrees with the sample mean, x̄, up to discretization error.
> mean(birth$number)
[1] 1.833333
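The same grid search can be reproduced outside R. The following Python sketch uses the 24 hourly counts read off the printed data frame above and maximizes the Poisson log-likelihood over the grid 0 < λ ≤ 4, as in Figure 6:

```python
import math

# The 24 hourly birth counts from Example 14 (read off the printed data frame).
births = [1, 3, 1, 0, 4, 0, 0, 2, 2, 1, 3, 1,
          2, 1, 4, 1, 2, 1, 3, 4, 3, 2, 1, 2]

def loglik(lam):
    return (-len(births) * lam + sum(births) * math.log(lam)
            - sum(math.lgamma(b + 1) for b in births))

# Grid search over 0 < lambda <= 4 in steps of 0.01.
grid = [0.01 * k for k in range(1, 401)]
lam_hat = max(grid, key=loglik)      # about 1.83

xbar = sum(births) / len(births)     # 44/24 = 1.8333...
```

The grid maximizer 1.83 matches the sample mean 44/24 ≈ 1.8333 up to the grid spacing, exactly as in the text.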
7 Confidence intervals
7.1 Interval estimation
So far in this module, whenever we have fitted a probability model to a data set, we have done so by calculating
point estimates of the values of any unknown parameters θ. However, it is very rare for a point estimate to
be exactly equal to the true parameter value. An alternative approach is to specify an interval, or range,
of plausible parameter values. We would then expect the true parameter value to lie within this interval of
plausible values. We call such an interval an interval estimate of the parameter.
Let X = (X1 , . . . , Xn ) be an independent random sample from a distribution FX (x; θ) with unknown
parameter θ. An interval estimator,
I(X) = [l(X), u(X)]
for θ is defined by two statistics, i.e. functions of the data. The statistic u(X) defines the upper end-point of
the interval, and the statistic l(X) defines the lower end-point of the interval. We will see later how to choose
appropriate statistics for the end-points.
The key property of an interval estimator for θ is its coverage probability. This is defined as the probability that the interval contains, or ‘covers’, the true value of the parameter, i.e.

    P_θ[ l(X) ≤ θ ≤ u(X) ] ,

or equivalently P_θ[ I(X) ∋ θ ]. We use the notation P_θ for probabilities here to emphasize that the probability distributions of l(X) and u(X) depend on θ.
Let α ∈ (0, 1), and suppose that we have been able to find statistics l and u such that the coverage probability satisfies

    P_θ[ l(X) ≤ θ ≤ u(X) ] = 1 − α   for all values of θ .

Then the interval estimator I(X) and, for any particular data set x = (x_1, . . . , x_n), the resulting interval estimate I(x), is referred to as a 100(1 − α)% confidence interval for θ. The proportion 1 − α is referred to as the confidence level, and the interval end-points l(x), u(x) are known as the confidence limits.
To illustrate the idea, let X1 , . . . , Xn be a random sample from N (µ, σ 2 ), with µ unknown but σ 2 known.
Recall that X̄ ∼ N(µ, σ²/n). Thus, if we standardize X̄ then we obtain the random variable

    Z = (X̄ − µ) / (σ/√n) ∼ N(0, 1) .

A crucial property of Z above is that the distribution of Z does not depend on µ or σ, i.e. the right-hand side of the above equation is the same no matter what the value of µ or σ.
Let z_{1−α/2} be such that P(Z ≤ z_{1−α/2}) = 1 − α/2. By symmetry of the normal distribution, it is also true that P(Z ≤ −z_{1−α/2}) = α/2, and furthermore P(−z_{1−α/2} ≤ Z ≤ z_{1−α/2}) = 1 − α. We have therefore that

    P( −z_{1−α/2} ≤ (X̄ − µ)/(σ/√n) ≤ z_{1−α/2} ) = 1 − α .
Moreover, the inequality inside the brackets can be rearranged to show that:

    1 − α = P( −z_{1−α/2} σ/√n − X̄ ≤ −µ ≤ z_{1−α/2} σ/√n − X̄ )
          = P( X̄ − z_{1−α/2} σ/√n ≤ µ ≤ X̄ + z_{1−α/2} σ/√n ) .

Hence the interval estimator

    I(X) = [ X̄ − z_{1−α/2} σ/√n , X̄ + z_{1−α/2} σ/√n ]

is a 100(1 − α)% confidence interval for µ.
[Figure: 95% confidence intervals for µ computed from each of 50 simulated samples, plotted against experiment number.]
In the figure above, each interval is coloured blue if it contains the true value of the parameter (µ = 20)
and green if it does not. The interval contains the true parameter value for 48/50 = 95% of the samples.
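This coverage experiment is easy to reproduce. The following Python sketch (the notes use R; the parameters µ = 20, σ = 2, n = 25 are illustrative assumptions) draws 50 samples, forms the 95% interval each time, and counts how often the true mean is covered:

```python
import math
import random

random.seed(42)

# Sketch of the coverage experiment: draw 50 samples of size n = 25 from
# N(mu = 20, sigma = 2), form the 95% CI for mu each time (sigma known),
# and count how often the interval covers the true mean.
mu, sigma, n, reps = 20.0, 2.0, 25, 50
half = 1.96 * sigma / math.sqrt(n)

covered = 0
for _ in range(reps):
    xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    if xbar - half <= mu <= xbar + half:
        covered += 1

# covered is close to 0.95 * 50 in the long run.
```

Typically around 47 or 48 of the 50 intervals cover µ, matching the behaviour shown in the figure.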
Example 15. The following n = 16 observations are a random sample from a N (µ, 22 ) distribution, where µ
is unknown:
10.43 5.42 11.10 12.41 10.14 7.83 8.84 10.42
10.44 9.65 10.36 11.48 9.33 6.81 10.55 10.41
We want to use the data to construct a 95% confidence interval for µ, i.e. here α = 0.05. The sample mean is x̄ = 9.73 and z_{1−α/2} = z_{0.975} = 1.96, so that the end-points of the 95% CI for µ are given by:

    9.73 ± 1.96 × √(4.0/16) ,
i.e. the interval is (8.75, 10.71). These data were actually sampled (simulated) from a N (10, 22 ) distribution.
Thus the true value µ = 10 is within the CI.
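The interval in Example 15 can be recomputed directly from the listed data. A short Python sketch of the calculation:

```python
import math

# Example 15 recomputed: n = 16 observations from N(mu, 2^2), alpha = 0.05.
data = [10.43, 5.42, 11.10, 12.41, 10.14, 7.83, 8.84, 10.42,
        10.44, 9.65, 10.36, 11.48, 9.33, 6.81, 10.55, 10.41]
sigma, n = 2.0, len(data)

xbar = sum(data) / n                     # about 9.73
half = 1.96 * math.sqrt(sigma**2 / n)    # 1.96 * 0.5 = 0.98
ci = (xbar - half, xbar + half)          # about (8.75, 10.71)
```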
7.2.2 Confidence interval for the mean of a normal distribution, variance unknown
Suppose now that X_1, . . . , X_n are independent draws from a N(µ, σ²) distribution where both µ and σ² are unknown. It is no longer possible to use the confidence interval

    [ x̄ − z_{1−α/2} σ/√n , x̄ + z_{1−α/2} σ/√n ] ,

because σ is unknown. Instead of basing a confidence interval on the random variable

    Z = (X̄ − µ) / (σ/√n) ∼ N(0, 1) ,

we plug in an estimate of σ in the denominator, namely the sample standard deviation S (computed with divisor n − 1), to obtain

    T = (X̄ − µ) / (S/√n) .
Now, because both X and S are random variables the distribution of T is not N (0, 1). The fact that S is also
random induces extra variability into the distribution of T . Thus, for a given value of n, the distribution of T
has a longer tail than that of Z.
We can show that the exact distribution of T above is a Student’s t-distribution with (n−1) degrees of freedom,
denoted t(n − 1) [or sometimes tn−1 in the literature].
In general, if the random variable T has a t-distribution with ν degrees of freedom then its probability density function is given by:

    f_T(x) = [ Γ((ν+1)/2) / ( √(νπ) Γ(ν/2) ) ] ( 1 + x²/ν )^{−(ν+1)/2} ,
for ν > 0 and −∞ < x < ∞. We have that E(T ) = 0 and Var(T ) = ν/(ν − 2), for ν > 2. Moreover, the
distribution is symmetric about the origin. As the parameter ν → ∞, the p.d.f. of T approaches that of the
N (0, 1) distribution.
As an exercise, produce a plot in R of the p.d.f. of the N (0, 1) distribution, together with the p.d.f.s of the
t(5) and t(20) distributions. Use the dt function to compute the value of the t p.d.f. for a given set of x-values.
Define t_{1−α/2} to be the 1 − α/2 point of the t(n − 1) distribution, i.e. if T ∼ t(n − 1) then P(T ≥ t_{1−α/2}) = α/2. Then from the preceding discussion it follows that the random interval

    I(X) = [ X̄ − t_{1−α/2} S/√n , X̄ + t_{1−α/2} S/√n ]

is a 100(1 − α)% confidence interval for µ.
Example 16. Recall the electronic component failure time data introduced in Chapter 3. There are n = 50
observations and we found that x = 334.59 and s2 = 15.288. In Chapter 3 we saw that a normal distribution
with mean and variance equal to the sample values provides a good probability model for the data. As we do
not know the true value of σ², we use the critical value t_{0.975} = 2.0096 for the t(49) distribution. The 95% CI for µ has end-points:

    334.59 ± 2.0096 × √(15.288/50) ,
i.e. I(x) = (333.48, 335.70) which gives a range of plausible values for µ.
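The arithmetic of Example 16 can be checked in a few lines. This Python sketch uses the summary statistics from the text; the t(49) critical value 2.0096 is taken from tables rather than computed:

```python
import math

# Example 16 recomputed from the summary statistics in the text.
n, xbar, s2 = 50, 334.59, 15.288
t_crit = 2.0096                      # 0.975 point of t(49), from tables

half = t_crit * math.sqrt(s2 / n)
ci = (xbar - half, xbar + half)      # about (333.48, 335.70)
```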
7.2.4 Confidence interval for the unknown mean of a non-normal distribution with either known
or unknown variance
Suppose that we now have a ‘large’ random sample from a non-normal distribution, and that we wish to use
the data to construct a confidence interval for the unknown distribution mean µ. We can appeal to the central
limit theorem and construct a 100(1 − α)% CI as follows.
If the variance σ² is known then, by the central limit theorem, for large n the statistic

    Z_1 = (X̄ − µ) / (σ/√n)

is approximately distributed as N(0, 1). Thus an approximate 100(1 − α)% confidence interval for µ is given by

    [ X̄ − z_{1−α/2} σ/√n , X̄ + z_{1−α/2} σ/√n ] .
If the variance is unknown, then we instead plug in the sample standard deviation S for σ to obtain the statistic

    Z_2 = (X̄ − µ) / (S/√n) .

It can be shown that Z_2 is also distributed approximately as N(0, 1) for large n. Thus an approximate 100(1 − α)% confidence interval for µ is given by

    [ X̄ − z_{1−α/2} S/√n , X̄ + z_{1−α/2} S/√n ] .
Example 17. Recall the Manchester income data for adult males which we have clearly seen to be non-normally
distributed. The data set contains n = 500 observations and we have that x̄ = 33.27 and s² = 503.554. By the above discussion, the end-points

    33.27 ± 1.96 × √(503.554/500)
define a 95% confidence interval for µ, namely (31.30, 35.24). This gives a range of plausible values for the
unknown value of µ.
7.2.5 Confidence interval for the unknown variance of a normal distribution, mean also un-
known
Let X1 , . . . , Xn be a random sample from the N (µ, σ 2 ) distribution where both µ and σ 2 are unknown. We
would like to construct a 100(1 − α)% confidence interval for σ 2 .
We know that

    S² = (1/(n−1)) ∑_{i=1}^n (X_i − X̄)²

and that

    (n − 1)S² / σ² ∼ χ²(n − 1) .

Hence,

    P( χ²_{α/2} ≤ (n − 1)S²/σ² ≤ χ²_{1−α/2} ) = 1 − α ,

where χ²_{1−α/2} denotes the (1 − α/2) point of a χ²(n − 1) distribution, i.e. if Y ∼ χ²(n − 1) then P(Y ≤ χ²_{1−α/2}) = 1 − α/2, and similarly for χ²_{α/2}. We can re-arrange the inequalities to give bounds for the parameter σ², as follows:

    P( (n − 1)S²/χ²_{1−α/2} < σ² < (n − 1)S²/χ²_{α/2} ) = 1 − α .
Hence the 100(1 − α)% confidence interval for σ², based on a sample of size n from a normal population, is given by

    [ (n − 1)S²/χ²_{1−α/2} , (n − 1)S²/χ²_{α/2} ] .
The inference is that this random interval contains the true value of σ 2 with probability 1 − α. A 100(1 − α)%
confidence interval for σ can be obtained by taking the square roots of the confidence limits for σ 2 .
Example 18. (Component lifetime data.) For these data n = 50 and s² = 15.288, so that a 95% confidence interval for σ², assuming normality, is given by

    [ 49 × 15.288 / χ²_{0.975} , 49 × 15.288 / χ²_{0.025} ] ,

where the χ² values correspond to a χ² distribution with 49 degrees of freedom. From tables of the χ²(49) distribution we have χ²_{0.025} = 31.5549 and χ²_{0.975} = 70.2224, so that the required confidence interval is given by

    ( 49 × 15.288 / 70.2224 , 49 × 15.288 / 31.5549 ) = (10.668, 23.740) .

A 95% confidence interval for σ is obtained by taking the square roots of these endpoints to give (3.266, 4.872).
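The Example 18 calculation can be verified directly. This Python sketch uses the tabulated χ²(49) quantiles quoted in the text rather than computing them:

```python
import math

# Example 18 recomputed: 95% CI for sigma^2 from n = 50, s^2 = 15.288.
n, s2 = 50, 15.288
chi2_lo, chi2_hi = 31.5549, 70.2224   # 0.025 and 0.975 points of chi2(49)

ci_var = ((n - 1) * s2 / chi2_hi, (n - 1) * s2 / chi2_lo)
ci_sd = (math.sqrt(ci_var[0]), math.sqrt(ci_var[1]))   # CI for sigma
```

Taking square roots of the variance endpoints gives the interval for σ, as described above.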
7.2.6 Confidence interval for a population proportion

Let X_1, . . . , X_n be a random sample from Bi(1, p), i.e. the Bernoulli distribution, where the value of p is unknown. We have already seen that the estimator p̂ = X̄ is an unbiased estimator of p with variance p(1 − p)/n. By the central limit theorem, p̂ ∼ N(p, p(1 − p)/n) approximately for large n. Thus, for large n,
    P( −z_{1−α/2} ≤ (p̂ − p)/√(p(1 − p)/n) ≤ z_{1−α/2} ) ≈ 1 − α .   (5)

In fact it can be shown that the above remains true even if the standard error √(p(1 − p)/n) in the denominator is estimated by √(p̂(1 − p̂)/n), i.e. for large n,

    P( −z_{1−α/2} ≤ (p̂ − p)/√(p̂(1 − p̂)/n) ≤ z_{1−α/2} ) ≈ 1 − α .

Rearranging the inequalities, an approximate 100(1 − α)% confidence interval for p is

    [ p̂ − z_{1−α/2} √(p̂(1 − p̂)/n) , p̂ + z_{1−α/2} √(p̂(1 − p̂)/n) ] .
Example 19. Recall the opinion poll data collected from n = 1000 voters introduced in Chapter 1. We would
like to use these data to obtain a 95% CI for the proportion in the population who support Labour, denoted
by pL . The proportion in the sample supporting Labour was found to be 0.314 which is our sample estimate
of p_L, i.e. p̂_L = 0.314. From the above, our 95% CI has end-points

    0.314 ± 1.96 × √(0.314 × 0.686 / 1000) ,

i.e. the interval (0.285, 0.343), which is a little wider than before. It is this approach which gives rise to the frequent comment that the proportions found in a poll based on 1000 voters are accurate to plus or minus 3%.
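The opinion-poll interval is a one-line calculation. A Python sketch of the formula applied in Example 19:

```python
import math

# Example 19 recomputed: large-sample 95% CI for a proportion,
# with p-hat = 0.314 and n = 1000 as in the poll data.
phat, n = 0.314, 1000

half = 1.96 * math.sqrt(phat * (1 - phat) / n)
ci = (phat - half, phat + half)
```

The half-width is just under 0.03, which is the source of the "plus or minus 3%" rule of thumb for polls of 1000 voters.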
9 Hypothesis testing

A hypothesis test is specified by the following components:

(i) The null hypothesis, denoted by H0, is the hypothesis to be tested. This is usually a ‘conservative’ or ‘skeptical’ hypothesis that we believe by default unless there is significant evidence to the contrary.
(ii) The alternative hypothesis, denoted by H1 , is a hypothesis about the population parameters which
we will accept if there is evidence that H0 should be rejected.
For example, when assessing a new medical treatment it is common for the null hypothesis to correspond
to the statement that the new treatment is no better (or worse) than the old one. The alternative
hypothesis would be that the new treatment is better.
In this module the null hypothesis will always be simple, while the alternative hypothesis may either be simple or composite. For example, consider hypotheses about the value of the mean µ of a normal distribution with known variance σ²: the null hypothesis H0 : µ = µ0 is simple, while an alternative such as H1 : µ > µ0 is composite.
(iii) Test statistic. This is a function of the sample data whose value we will use to decide whether or not
to reject H0 in favour of H1 . Clearly, the test statistic will be a random variable.
(iv) Acceptance and rejection regions. We consider the set of all possible values that the test statistic
may take, i.e. the range space of the statistic, and we examine the distribution of the test statistic under
the assumption that H0 is true. The range space is then divided into two disjoint subsets called the
acceptance region and rejection region.
35
On observing data, if the calculated value of the test statistic falls into the rejection region then we reject
H0 in favour of H1 . If the value of the test statistic falls in the acceptance region then we do not reject
H0 .
The rejection region is usually defined to be a set of extreme values of the test statistic which together have low probability of occurring if H0 is true. Thus, if we observe such a value then this is taken as evidence that H0 is in fact false.
(v) Type I and type II errors. The procedure described in (iv) above can lead to two types of possible errors: a type I error occurs if we reject H0 when H0 is in fact true, while a type II error occurs if we fail to reject H0 when H0 is in fact false.
The probability of making a type I error is denoted by α and is also called the significance level or
size of the test. The value of α is usually specified in advance; the rejection region is chosen in order to
achieve this value. A common choice is α = 0.05. Note that α = P(reject H0 | H0 ).
The probability of making a type II error is β = P(do not reject H0 | H1 ). For a good testing procedure,
β should be small for all values of the parameter included in H1 .
Example 20. Is a die biased or not? It is claimed that a particular die used in a game is biased in favour
of the six. To test this claim the die is rolled 60 times, and each time it is recorded whether or not a six is
obtained. At the end of the experiment the total number of sixes is counted, and this information is used to
decide whether or not the die is biased.
The null hypothesis to be tested is that the die is fair, i.e. P(rolling a six) = 1/6. The alternative hypothesis
is that the die is biased in favour of the six so that P(rolling a six) > 1/6. Let the probability of rolling a six
be denoted by p. We can write the above hypotheses as:
H0 : p = 1/6
H1 : p > 1/6 .
Let X denote the number of sixes thrown in 60 attempts. If H0 is true then X ∼ Bi(60, 1/6), whereas if H1 is
true then X ∼ Bi(60, p), with p > 1/6. H0 is a simple hypothesis, whereas H1 is a composite hypothesis.
If H0 were true, we would expect to see 10 sixes, since E(X) = 10 under H0 . However, the actual number
observed will vary randomly around this value. If we observe a large number of sixes, then this will constitute
evidence against H0 in favour of H1 . The question is, how large does the number of sixes need to be so that
we should reject H0 in favour of H1 ?
The test statistic here is X and the rejection region is

    {x : x > k} ,

for some k ∈ N. We choose the smallest value of k that ensures a significance level α < 0.05, i.e. the smallest k such that

    α = P(X > k | H0) < 0.05 .
Note that for k = 14, P(X > k | H0 ) = 0.0648, while for k = 15, P(X > k | H0 ) = 0.0338. Thus we select
k = 15. In this case, the actual significance level of the test is 0.0338.
When, as in this case, the test statistic is a discrete random variable, for many choices of significance level
there is no corresponding rejection region achieving that significance level exactly (e.g. α = 0.05 above).
36
In summary, under H0 the probability of observing more than 15 sixes in 60 rolls is 0.0338. This event is
sufficiently unlikely under H0 that if it occurs then we reject H0 in favour of H1 . It is possible that by rejecting
H0 we may make a type I error, with probability 0.0338 if H0 is true. If 15 or fewer sixes are obtained, then
this is within the acceptable bounds of random variation under H0 . Thus, in this case we would not reject the
null hypothesis that the die is unbiased. However in making this decision we may be making a type II error,
if H1 is in fact true.
Ideally we would like the probability of rejecting H0 when H1 is true, i.e. the power of the test, to be high. It is straightforward to evaluate this probability for particular values of p > 1/6. Specifically, P(reject H0 | p) = P(X > 15 | p), where X ∼ Bi(60, p). For example, the following values have been computed using R:
p P(reject H0 | p)
0.2 0.1306
0.25 0.4312
0.3 0.7562
Clearly, the larger the true value of p, the more likely we are to correctly reject H0 .
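The significance levels and power values quoted in Example 20 can be reproduced from the binomial tail probability. A Python sketch (the notes computed these in R):

```python
from math import comb

# Example 20 recomputed: P(X > k | p) for X ~ Bi(60, p).
def tail(k, p, n=60):
    """Upper tail P(X > k) = P(X >= k+1) of a Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x)
               for x in range(k + 1, n + 1))

alpha_k14 = tail(14, 1/6)   # significance level if k = 14: about 0.0648
alpha_k15 = tail(15, 1/6)   # significance level if k = 15: about 0.0338
power_02 = tail(15, 0.2)    # power when p = 0.2: about 0.1306
power_03 = tail(15, 0.3)    # power when p = 0.3: about 0.7562
```

These match the values in the text, confirming the choice k = 15 and the power table above.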
9.2 Inference about the mean of a normal distribution when the variance is known
Let X1 , . . . , Xn be a random sample from N (µ, σ 2 ), where the value of µ is unknown but the value of σ 2 is
known. We would like to use the data to make inferences about the value of µ and, in particular, we wish to
test the following hypotheses:
H0 : µ = µ0 vs H1 : µ > µ0 .
The null hypothesis H0 posits that the data are sampled from N (µ0 , σ 2 ). In contrast, the alternative hypothesis
H1 posits that the data arise from N (µ1 , σ 2 ), where µ1 > µ0 is an unspecified value of µ. This is a one-sided
test.
We know that the sample mean, X, is an unbiased estimator of µ. Hence, if the true value of µ is µ0 , then
E[X − µ0 ] = µ0 − µ0 = 0. In contrast, if H1 is true, we would have that E[X − µ0 ] = µ − µ0 > 0. This suggests
that we should reject H0 in favour of H1 if X is ‘significantly’ larger than µ0 , i.e. if X > k, for some k > µ0 .
37
The question is, how much greater than µ0 should x be before we reject H0 ? In other words, what value should
we choose for k?
One way to decide this is to fix the probability of rejecting H0 if H0 is true, i.e. the probability of making a Type I error; the critical value k can then be determined on this basis. This is equivalent to fixing the significance level of the test. Suppose that we do indeed use X̄ as the test statistic, with rejection region

    C = {x̄ : x̄ > k} .

Then

    α = P(X̄ > k | H0) = P( Z > (k − µ0)/(σ/√n) ) ,

where Z = (X̄ − µ0)/(σ/√n) ∼ N(0, 1) under H0. Let z_{1−α} denote the 1 − α point of N(0, 1), i.e. P(Z ≤ z_{1−α}) = 1 − α. From this we see that z_{1−α} = (k − µ0)/(σ/√n) and so

    k = µ0 + z_{1−α} σ/√n .
Thus, H0 is rejected in favour of H1 if the sample mean is greater than µ0 by z1−α standard errors.
Equivalently, we reject H0 in favour of H1 at the 100α% significance level if

    Z = (X̄ − µ0)/(σ/√n) > z_{1−α} .
The standardized version of X given by Z is the most frequently used form of the test statistic in this scenario.
The critical value z1−α can be obtained from standard normal tables. In hypothesis testing it is common to
use α = 0.05, and in this case z0.95 = 1.645.
Suppose now that we wish to use our sample to test the hypotheses
H0 : µ = µ0 vs H1 : µ < µ0 .
This is again a one-sided test. In this case we will reject H0 in favour of H1 if X̄ < k where k < µ0. Using analogous arguments to those used above, we will reject H0 in favour of H1 at the 100α% significance level if

    X̄ < µ0 − z_{1−α} σ/√n ,

or, equivalently, if

    Z = (X̄ − µ0)/(σ/√n) < −z_{1−α} .
For a test having a 5% significance level the critical value is −z0.95 = −1.645.
If in fact our interest is in testing

    H0 : µ = µ0   vs   H1 : µ ≠ µ0 ,
then we now have a two-sided test. We will reject H0 in favour of H1 if X̄ is either significantly greater or significantly less than µ0, i.e. if

    X̄ < k1 or X̄ > k2 .

The critical values k1 < µ0 and k2 > µ0 are chosen so that the significance level is equal to α, i.e.

    P( X̄ < k1 or X̄ > k2 | H0 ) = α .

It seems natural to choose the values of k1 and k2 so that the probability of rejecting H0 is split equally between the upper and lower parts of the rejection region. In other words, we choose k1 and k2 such that

    P( X̄ < k1 | H0 ) = P( X̄ > k2 | H0 ) = α/2 .
For illustration, see the figure below, which shows the p.d.f. of X̄ together with the rejection region.

[Figure: p.d.f. of X̄ under H0; the rejection region {x̄ < k1} ∪ {x̄ > k2} has probability α/2 in each tail, with acceptance region of probability 1 − α between k1 and k2.]
We now find appropriate values of k1 and k2 satisfying this property. We begin with k2. Note that

    α/2 = P(X̄ > k2 | H0 true) = P( (X̄ − µ0)/(σ/√n) > (k2 − µ0)/(σ/√n) )
        = P( Z > (k2 − µ0)/(σ/√n) ) ,   with Z ∼ N(0, 1) .

However, we know that z_{1−α/2} satisfies P(Z ≤ z_{1−α/2}) = 1 − α/2, so that P(Z > z_{1−α/2}) = α/2. Hence,

    (k2 − µ0)/(σ/√n) = z_{1−α/2} ,   and so   k2 = µ0 + z_{1−α/2} σ/√n .
Similarly for k1: we know that P(Z < −z_{1−α/2}) = α/2 and so (k1 − µ0)/(σ/√n) = −z_{1−α/2}. Hence

    k1 = µ0 − z_{1−α/2} σ/√n .

Putting this together, we reject H0 in favour of H1 at the 100α% significance level if

    Z = (X̄ − µ0)/(σ/√n) > z_{1−α/2}   or   Z = (X̄ − µ0)/(σ/√n) < −z_{1−α/2} .
9.2.1 Connection between the two-tailed test and a confidence interval for the mean when the
variance is known
Let X_1, . . . , X_n be a random sample from N(µ, σ²) with µ unknown and σ² known. Recall from Chapter 7 that a 100(1 − α)% confidence interval for µ is given by

    [ X̄ − z_{1−α/2} σ/√n , X̄ + z_{1−α/2} σ/√n ] .

Now consider testing

    H0 : µ = µ0   vs   H1 : µ ≠ µ0

at significance level α. From the discussion above, H0 is not rejected if −z_{1−α/2} ≤ Z ≤ z_{1−α/2}, or, equivalently, if

    X̄ − z_{1−α/2} σ/√n ≤ µ0 ≤ X̄ + z_{1−α/2} σ/√n .
Thus, the values of µ in the confidence interval correspond to values of µ0 for which the corresponding null
hypothesis H0 would not be rejected. In other words, informally, the 100(1 − α)% confidence interval is a set
of values of µ which would ‘pass a hypothesis test at significance level α’. It is in this sense that we can regard
the confidence interval as a set of plausible values of µ given the data.
Example 21. (i) A random sample of n = 25 observations is taken from a normal distribution with unknown
mean but known variance σ 2 = 16. The sample mean is found to be x = 18.2. Test H0 : µ = 20 vs
H1 : µ < 20 at the 5% significance level.
Solution: the test statistic is

    Z = (18.2 − 20.0)/√(16/25) = −2.25 .
The appropriate 5% critical value is −z0.95 = −1.645. The observed value of Z is less than −1.645.
Hence, we reject H0 at the 5% significance level and conclude that the true value of µ in the normal
distribution from which the data are sampled satisfies µ < 20.
(ii) Find the probability that we reject H0 using this testing procedure when the true value of the mean µ is
19.0.
Solution: the null hypothesis is rejected if

    (X̄ − 20.0)/(4/√25) < −1.645 ,

or equivalently if

    X̄ < 20.0 − 1.645 × 4/√25 .

The true distribution of X̄ is N(19.0, 16/25) and so the probability of rejecting H0 is

    P( X̄ < 20.0 − 1.645 × 4/√25 )
      = P( (X̄ − 19.0)/(4/5) < (20.0 − (1.645 × 4/5) − 19.0)/(4/5) )
      = P( (X̄ − 19.0)/(4/5) < −0.395 )
      = Φ(−0.395) = 0.3464 ,

since the true distribution of (X̄ − 19.0)/(4/5) is N(0, 1).
Clearly, the probability of rejecting H0 will increase as the difference µ0 − µ becomes larger. Hence, the
further the true mean from the hypothesized value, the more likely we are to reject H0 . When µ = µ0 the
above is the probability of rejecting H0 when H0 is true, i.e. the significance level. This can be verified
by substituting in µ = µ0 to obtain Φ(−z1−α ) = α.
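The power calculation in Example 21(ii) can be checked numerically, using the standard normal c.d.f. expressed via the error function. A Python sketch:

```python
import math

def Phi(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Example 21(ii) recomputed: power of the lower-tailed z-test when mu = 19.
mu0, mu, sigma, n = 20.0, 19.0, 4.0, 25

k = mu0 - 1.645 * sigma / math.sqrt(n)          # rejection threshold for x-bar
power = Phi((k - mu) / (sigma / math.sqrt(n)))  # about 0.346
```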
Example 22. Suppose now that we have a random sample of n = 50 observations from a normal distribution with unknown mean and known variance σ² = 36. It is found that x̄ = 30.8.
(i) Test H0 : µ = 30 vs H1 : µ ≠ 30 at the 5% significance level.
Solution: here the test statistic is

    Z = (30.8 − 30.0)/√(36/50) = 0.943 .
As the alternative hypothesis is two-sided, we will now reject H0 for either small or large values of Z.
Using a 5% significance level the critical values are −z0.975 = −1.96 and z0.975 = 1.96. The observed
value of Z lies between the two critical values, thus H0 is not rejected at the 5% significance level. We
conclude that there is insufficient evidence to reject the claim that the normal distribution from which
the data arise has mean 30.
(ii) Find the probability that we reject H0 when the true value of the mean µ is 31.0.
Solution: here we require

    1 − P( −1.96 < (X̄ − 30.0)/(6/√50) < 1.96 | µ = 31.0 )
      = 1 − P( 30 − 1.96 × 6/√50 < X̄ < 30 + 1.96 × 6/√50 | µ = 31 )
      = 1 − P( (30 − (1.96 × 6/√50) − 31)/(6/√50) < (X̄ − 31)/(6/√50) < (30 + (1.96 × 6/√50) − 31)/(6/√50) )
      = 1 − [ Φ( (30 − 31)/(6/√50) + 1.96 ) − Φ( (30 − 31)/(6/√50) − 1.96 ) ]
      = 1 − [ Φ(−1.179 + 1.96) − Φ(−1.179 − 1.96) ] = 0.218 .
This probability increases as |µ0 − µ| becomes larger. When µ = µ0 it is equal to α, the significance level.
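The two-sided power calculation in Example 22(ii) can likewise be verified. A Python sketch using the error-function form of Φ:

```python
import math

def Phi(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Example 22(ii) recomputed: power of the two-sided z-test when mu = 31.
mu0, mu, sigma, n = 30.0, 31.0, 6.0, 50
se = sigma / math.sqrt(n)

lo, hi = mu0 - 1.96 * se, mu0 + 1.96 * se      # acceptance region for x-bar
power = 1 - (Phi((hi - mu) / se) - Phi((lo - mu) / se))   # about 0.218
```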
9.3 Inference about the mean of a normal distribution when the variance is unknown
Let X_1, . . . , X_n be a random sample from the N(µ, σ²) distribution, where both µ and σ² are unknown. We want to test the following hypotheses:
H0 : µ = µ0
H1 : µ > µ0
at significance level α. Based on the discussion in the previous section, an appropriate test statistic which measures the discrepancy between µ0 and the sample estimator X̄ is given by

    T = (X̄ − µ0)/(S/√n) ,
where S is the sample standard deviation. This is an estimate of the standardized difference between X and µ0 .
As we have discussed previously, because the statistic T involves the random quantities X and S, its sampling
distribution is no longer N (0, 1). We have seen in Chapter 7 that T ∼ t(n − 1), under the assumption that H0
is true, i.e. T has a Student t-distribution with n − 1 degrees of freedom.
Assuming that the significance level of the test is α, we use one of the following rejection regions, depending
on the alternative hypothesis:
• For the one-sided alternative hypothesis H1 : µ > µ0, we reject H0 if T > t_{1−α}, where t_{1−α} is the 1 − α point of the t(n − 1) distribution.
• For the one-sided alternative hypothesis H1 : µ < µ0, we reject H0 if T < −t_{1−α}.
• For the two-sided alternative hypothesis H1 : µ ≠ µ0, we reject H0 if T > t_{1−α/2} or T < −t_{1−α/2}.
Example 23. The drug 6-mP is used to treat leukaemia. A random sample of 21 patients using 6-mP were
found to have an average remission time of x = 17.1 weeks with a sample standard deviation of s = 10.00
weeks. A previously used drug treatment had a known mean remission time of µ0 = 12.5 weeks. Assuming
that the remission times of patients taking 6-mP are normally distributed with both the mean µ and variance
σ 2 being unknown, test at the 5% significance level whether the mean remission time of patients taking 6-mP
is greater than µ0 = 12.5 weeks.
Solution: We want to test H0 : µ = 12.5 vs H1 : µ > 12.5 at the 5% significance level.
The test statistic is

    T = (x̄ − µ0)/(s/√n) = (17.1 − 12.5)/(10/√21) = 2.108 .
Under H0 , T ∼ t(20). For a one-tailed test at the 5% significance level we will reject H0 if T > 1.725 (from
tables). Our observed value of T is greater than 1.725 and so we reject the null hypothesis that µ = 12.5 at
the 5% significance level and conclude that µ > 12.5, i.e. the drug 6-mP improves remission times compared
to the previous drug treatment.
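The Example 23 test statistic is quick to recompute. This Python sketch takes the t(20) critical value 1.725 from tables, as in the text:

```python
import math

# Example 23 recomputed from the summary statistics in the text.
n, xbar, s, mu0 = 21, 17.1, 10.0, 12.5

t_stat = (xbar - mu0) / (s / math.sqrt(n))   # about 2.108
reject = t_stat > 1.725                      # 0.95 point of t(20), from tables
```

Since the observed statistic exceeds the critical value, H0 is rejected at the 5% level, matching the conclusion above.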
H0 : µ = µ0
H1 : µ > µ0

using the statistic

    Y = (X̄ − µ0) / (S/√n)

defined above, which, by asymptotic (large n) results, has an approximate N (0, 1) distribution when H0
is true (n ≥ 30). Aside from the choice of test statistic, the rejection regions for the various versions of
H1 are otherwise identical to those defined in the case of normal data with a known variance.
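A quick numerical illustration of why this approximation is reasonable for n ≥ 30 (Python's scipy here; R's `qt` and `qnorm` give the same values): the upper 5% point of the t-distribution approaches the N(0, 1) point z0.95 ≈ 1.645 as the degrees of freedom grow.

```python
from scipy import stats

# Upper 5% point of N(0,1)
z = stats.norm.ppf(0.95)  # ≈ 1.645

# Upper 5% points of t(n-1) shrink towards z as n grows,
# which justifies the normal approximation for n >= 30
for n in (10, 30, 100):
    t = stats.t.ppf(0.95, df=n - 1)
    print(n, round(t, 3))  # 10 → 1.833, 30 → 1.699, 100 → 1.660
```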
want to test the following hypotheses:
H0 : p = p0
H1 : p > p0
at significance level α. As we have seen earlier in this module, an unbiased sample estimator of the
parameter p is given by
    p̂ = (1/n) Σ_{i=1}^{n} Xi = X̄n .
By the central limit theorem, p̂ ∼ N (p, p(1 − p)/n) approximately for large n. As a rule of thumb,
n ≥ 9 max{p/(1 − p), (1 − p)/p} guarantees this approximation has a good degree of accuracy. A suitable
test statistic is

    Y = (p̂ − p0) / √(p0 (1 − p0)/n) .

Here we have estimated the standard error of p̂ by √(p0 (1 − p0)/n), which uses the value of p specified
under H0. If H0 is true then Y has an approximate N (0, 1) distribution for large n. Thus, to achieve an
approximate significance level of α, we reject H0 in favour of the above H1 if Y > z1−α .
• For the one-sided alternative hypothesis H1 : p < p0 , to achieve an approximate significance level of
α, we reject H0 if Y < −z1−α .
• For the two-sided alternative hypothesis H1 : p ≠ p0 , to achieve an approximate significance level of
α, we reject H0 if
Y < −z1−α/2 or Y > z1−α/2 .
Example 24. A team of eye surgeons has developed a new technique for an eye operation to restore
the sight of patients blinded by a particular disease. It is known that 30% of patients who undergo an
operation using the old method recover their eyesight.
A total of 225 operations are performed by surgeons in various hospitals using the new method and it
is found that 88 of them are successful in that the patients recover their sight. Can we justify the claim
that the new method is better than the old one? (Use a 1% level of significance).
Solution: Let p be the probability that a patient recovers their eyesight following an operation using
the new technique. We wish to test H0 : p = 0.30 vs H1 : p > 0.30 at the 1% significance level.
Our test statistic is

    Y = (88/225 − 0.30) / √(0.30 × 0.70 / 225) = 2.9823 .
As a check for the approximate normality of the distribution of Y under H0 , we require n ≥ 9 max{0.429, 2.333} ≈ 21, which is true since n = 225.
The approximate 1% critical value, taken from standard normal tables, is 2.3263 which is less than the
observed value of Y . Hence, we reject the null hypothesis at the 1% significance level and conclude that
p > 0.30.
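The calculation in Example 24 can be checked as follows (a sketch in Python's scipy; in R, `qnorm(0.99)` gives the same critical value).

```python
from math import sqrt
from scipy import stats

# Summary data from Example 24 (eye surgery)
n, r, p0 = 225, 88, 0.30
p_hat = r / n

# Y = (p̂ − p0)/√(p0(1 − p0)/n)
y = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# One-sided 1% critical value from N(0,1)
z_crit = stats.norm.ppf(0.99)

print(round(y, 4))       # 2.9823
print(round(z_crit, 4))  # 2.3263
print(y > z_crit)        # True: reject H0
```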
10 Hypothesis testing (Part 3)
Procedures for two independent samples
10.1 Introduction
In this chapter we will extend hypothesis testing to the scenario in which there are two independent samples
of data, and the aim is to make an inference about the difference in the means of the two populations from
which the data have been sampled.
To this end, let X11 , . . . , X1n1 be a random sample of size n1 from a distribution with mean µ1 and variance
σ1². Also, let X21 , . . . , X2n2 be a second random sample, independent from the first, from a distribution with
mean µ2 and variance σ2². Suppose that we wish to test
H0 : µ1 − µ2 = φ,

where φ is a constant (often φ = 0), versus one of the following alternative hypotheses at the 100α% significance
level:

(i) H1 : µ1 − µ2 > φ (one-sided)

(ii) H1 : µ1 − µ2 < φ (one-sided)

(iii) H1 : µ1 − µ2 ≠ φ (two-sided)
10.2 Both underlying distributions normal with known variances σ1² and σ2²
An unbiased estimator of µ1 − µ2 is given by X̄1 − X̄2 , where

    X̄k = (1/nk) Σ_{i=1}^{nk} Xki ,  k = 1, 2 .

The standardized test statistic is

    Z = (X̄1 − X̄2 − φ) / √(σ1²/n1 + σ2²/n2) .

Under H0 , Z ∼ N (0, 1). We again find the critical value of our test by fixing the probability of a type I
error to be α, i.e. P(reject H0 | H0 is true) = α. This idea was described in detail for single sample inference
in Chapter 9. Below we list the rejection regions corresponding to the three possible alternative hypotheses
introduced in Section 10.1.
(i) For H1 : µ1 − µ2 > φ, we reject H0 at the 100α% significance level if Z > z1−α , where z1−α satisfies
Φ(z1−α ) = 1 − α. Equivalently, we reject H0 if

    X̄1 − X̄2 > φ + z1−α √(σ1²/n1 + σ2²/n2) .
E.g. if α = 0.05 then z0.95 = 1.645.
(ii) For H1 : µ1 − µ2 < φ, we reject H0 at the 100α% significance level if Z < −z1−α . Equivalently, we reject
H0 if

    X̄1 − X̄2 < φ − z1−α √(σ1²/n1 + σ2²/n2) .
E.g. if α = 0.05 then −z0.95 = −1.645.
(iii) For H1 : µ1 − µ2 ≠ φ, we reject H0 at the 100α% significance level if |Z| > z1−α/2 . Equivalently, we
reject H0 if

    |(X̄1 − X̄2) − φ| > z1−α/2 √(σ1²/n1 + σ2²/n2) .
E.g. if α = 0.05 then z0.975 = 1.96.
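The known-variance procedure above can be sketched as a short calculation. In the Python sketch below the summary values (the sample means, variances, and sample sizes) are invented purely for illustration; only the rejection rule itself comes from the notes.

```python
from math import sqrt
from scipy import stats

# Illustrative (invented) values -- replace with real data
xbar1, xbar2, phi = 5.2, 4.6, 0.0
var1, var2 = 1.0, 1.5          # known variances σ1², σ2²
n1, n2 = 50, 40

# Z = (X̄1 − X̄2 − φ)/√(σ1²/n1 + σ2²/n2)
se = sqrt(var1 / n1 + var2 / n2)
z = (xbar1 - xbar2 - phi) / se

# Two-sided test at level α: reject H0 if |Z| > z_{1−α/2}
alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)  # 1.96
reject = abs(z) > z_crit
print(round(z, 3), reject)  # 2.502 True
```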
As the true values of σ1² and σ2² are unknown, we estimate them using the sample variances given by

    Sk² = 1/(nk − 1) Σ_{i=1}^{nk} (Xki − X̄k)² ,  k = 1, 2 .
Considering the estimated standardized difference between X̄1 − X̄2 and φ we have that, under H0 ,

    Y = (X̄1 − X̄2 − φ) / √(S1²/n1 + S2²/n2) ∼ N (0, 1) approximately

when n1 and n2 are large, e.g. n1 > 30 and n2 > 30. To achieve an approximate significance level of 100α%,
the rejection regions for the three alternative hypotheses introduced in Section 10.1 are:

(i) For H1 : µ1 − µ2 > φ, reject H0 if Y > z1−α .

(ii) For H1 : µ1 − µ2 < φ, reject H0 if Y < −z1−α .

(iii) For H1 : µ1 − µ2 ≠ φ, reject H0 if |Y | > z1−α/2 .
If we are prepared to assume that the unknown variances of the two normal distributions are equal, i.e.
σ1² = σ2² = σ², then the common variance σ² may be estimated using the pooled estimator described in Chapter 7, i.e.

    Sp² = ((n1 − 1)S1² + (n2 − 1)S2²) / (n1 + n2 − 2) .

The test statistic is then

    T = (X̄1 − X̄2 − φ) / (Sp √(1/n1 + 1/n2)) ,

which can be shown to have a Student t-distribution with (n1 + n2 − 2) degrees of freedom when H0 is true.
The rejection regions for the three alternative hypotheses in Section 10.1 are:
(i) For H1 : µ1 − µ2 > φ, we reject H0 if T > t1−α , where t1−α is the 1 − α point of a t distribution on
n1 + n2 − 2 degrees of freedom.
(ii) For H1 : µ1 − µ2 < φ, we reject H0 if T < −t1−α .

(iii) For H1 : µ1 − µ2 ≠ φ, we reject H0 if |T | > t1−α/2 .

Each rejection region above defines a test with an exact significance level of 100α%.
Example 25. An investigation was carried out comparing a new drug with a placebo. A random sample of
n1 = 40 patients was treated with the new drug, while an independent sample of n2 = 36 patients was given
the placebo. A response was measured for each patient. Under the new drug, the response had sample mean
x1 = 10.13 and sample variance s21 = 4.721. Under placebo, the response had sample mean x2 = 12.16 and
sample variance s22 = 3.368.
Supposing that the responses in both groups are normally distributed, test at the 5% significance level
whether the population mean response under the new drug is the same as that under placebo. Conduct your
analysis assuming that (i) σ1² ≠ σ2² and (ii) σ1² = σ2².
Solution: we are required to test H0 : µ1 = µ2 vs H1 : µ1 ≠ µ2 , where µ1 denotes the (population) mean
response under the new drug, and µ2 denotes the (population) mean response under placebo.
(i) In the case where we assume that σ1² ≠ σ2² , the test statistic is

    Y = (10.13 − 12.16 − 0) / √(4.721/40 + 3.368/36) = −4.413 .
For a two-sided test at the approximate 5% significance level we will reject H0 if |Y | > z0.975 = 1.96. The
observed value of |Y | is 4.413 and so we reject H0 at the approximate 5% level. Hence, we conclude that
the mean response for those receiving the new drug is not equal to the mean response for those receiving
the placebo.
(ii) In the second case, where we assume that σ1² = σ2² , we need to estimate the common variance σ² by

    σ̂² = (39 × 4.721 + 35 × 3.368) / (40 + 36 − 2) = 4.081 .
This time, for a two-sided test at the 5% significance level, we will reject H0 if |T | > t0.975 = 1.993 on 74
degrees of freedom. We have |T | = 4.374 > 1.993 and so we reject H0 at the 5% level and conclude that
the two population means are not equal.
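Both analyses in Example 25 can be reproduced from the summary statistics. The sketch below is in Python's scipy; in R, `t.test(..., var.equal = TRUE)` performs the pooled analysis (ii) from raw data.

```python
from math import sqrt
from scipy import stats

# Summary data from Example 25
n1, xbar1, s2_1 = 40, 10.13, 4.721   # new drug
n2, xbar2, s2_2 = 36, 12.16, 3.368   # placebo

# (i) Unequal variances: large-sample statistic Y
y = (xbar1 - xbar2) / sqrt(s2_1 / n1 + s2_2 / n2)
print(round(y, 3))  # -4.413

# (ii) Equal variances: pooled variance estimate and t statistic
s2_pooled = ((n1 - 1) * s2_1 + (n2 - 1) * s2_2) / (n1 + n2 - 2)
t = (xbar1 - xbar2) / (sqrt(s2_pooled) * sqrt(1 / n1 + 1 / n2))
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)

print(round(s2_pooled, 3))            # 4.081
print(round(t, 3), round(t_crit, 3))  # -4.374 1.993
```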
Below we give a rejection region resulting in an approximate significance level of 100α% for each of the three
alternative hypotheses listed in Section 10.1:
(i) For H1 : µ1 − µ2 > φ, we reject H0 at the approximate 100α% significance level if Y > z1−α .
(ii) For H1 : µ1 − µ2 < φ, we reject H0 at the approximate 100α% significance level if Y < −z1−α .
(iii) For H1 : µ1 − µ2 ≠ φ, we reject H0 at the approximate 100α% significance level if |Y | > z1−α/2 .
If the variances of the two distributions are unknown then we substitute the sample estimators S1² and S2²
and proceed as just described for the case of known variances.
H0 : p1 − p2 = φ,

where φ is a constant (often set equal to zero) against one of the three alternative hypotheses given by

(i) H1 : p1 − p2 > φ (one-sided)

(ii) H1 : p1 − p2 < φ (one-sided)

(iii) H1 : p1 − p2 ≠ φ (two-sided)
at the approximate 100α% significance level. Here we are making an inference about the difference in the
proportions of ‘successes’ in the two underlying populations. When n1 and n2 are both large we have that
    p̂1 − p̂2 ∼ N ( p1 − p2 , p1(1 − p1)/n1 + p2(1 − p2)/n2 )  approximately .

A suitable test statistic is

    Y = (p̂1 − p̂2 − φ) / ŝ.e.(p̂1 − p̂2) ,

where in the denominator the following sample estimate of the standard error of p̂1 − p̂2 has been used:

    ŝ.e.(p̂1 − p̂2) = √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ) .
Provided n1 and n2 are both reasonably large, under H0 the test statistic Y ∼ N (0, 1) approximately by
asymptotic results. Note that
    p̂k = (1/nk) Σ_{i=1}^{nk} Xki = X̄k ,  k = 1, 2 .
(i) For H1 : p1 − p2 > φ, we reject H0 at the approximate 100α% significance level if Y > z1−α
(ii) For H1 : p1 − p2 < φ, we reject H0 at the approximate 100α% significance level if Y < −z1−α
(iii) For H1 : p1 − p2 ≠ φ, we reject H0 at the approximate 100α% significance level if |Y | > z1−α/2
The case H0 : p1 = p2
If φ = 0, then under H0 we have p1 = p2 = p, say. An estimate of the common probability p is given by the
‘pooled estimate’
    p̄ = (r1 + r2) / (n1 + n2) ,

where rk denotes the number of successes in sample k.
In this case it makes sense to use the estimate p̄ when forming the estimated standard error of p̂1 − p̂2 that
appears in the denominator of Y . The revised test statistic for the case when H0 : p1 = p2 is thus

    Y = (p̂1 − p̂2) / √( p̄(1 − p̄)/n1 + p̄(1 − p̄)/n2 ) .
Example 26. In a random sample of n1 = 120 voters from Town I, r1 = 56 indicated that they would support
Labour in a general election. In a second independent random sample of size n2 = 110 from Town II, taken
on the same day as the sample from Town I, r2 = 63 indicated that they would support Labour in a general
election. Carry out an appropriate test at the approximate 5% significance level to examine whether the
proportions of voters supporting Labour are the same in the two towns.
Solution. Let p1 denote the (population) proportion of Labour voters in Town I and p2 denote the
(population) proportion of Labour voters in Town II. We wish to test H0 : p1 − p2 = 0 vs H1 : p1 − p2 ≠ 0 at
the approximate 5% significance level. We have that p̂1 = r1 /n1 = 56/120 = 0.467 and p̂2 = r2 /n2 = 63/110 =
0.573.
Under H0 , we have that p1 = p2 . An estimate of the common value of p is given by

    p̄ = (r1 + r2) / (n1 + n2) = (56 + 63) / (120 + 110) = 119/230 = 0.517 .

The test statistic is then

    Y = (0.467 − 0.573 − 0) / √( (0.517 × 0.483)/120 + (0.517 × 0.483)/110 ) = −1.607 .
We would reject H0 at the approximate 5% level if |Y | > z0.975 = 1.96. The observed value of |Y | = 1.607 <
1.96. Hence, there is insufficient evidence to reject H0 at the approximate 5% level. In other words, there is
insufficient evidence to reject the claim that the proportions supporting Labour in the two towns are equal.
(Note that both n1 , n2 > 9 × max{0.517/0.483, 0.483/0.517} = 9.634, which justifies the normal
approximations for p̂1 and p̂2 under H0 .)
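Example 26 can also be checked numerically. The sketch below works from the exact counts rather than rounded intermediates, so it gives Y ≈ −1.608 where the notes, using rounded values, give −1.607; the conclusion is the same.

```python
from math import sqrt
from scipy import stats

# Summary data from Example 26 (two towns)
n1, r1 = 120, 56
n2, r2 = 110, 63
p1_hat, p2_hat = r1 / n1, r2 / n2

# Pooled estimate of the common proportion under H0: p1 = p2
p_bar = (r1 + r2) / (n1 + n2)

# Y = (p̂1 − p̂2)/√(p̄(1−p̄)/n1 + p̄(1−p̄)/n2)
y = (p1_hat - p2_hat) / sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
z_crit = stats.norm.ppf(0.975)

print(round(p_bar, 3))    # 0.517
print(round(y, 3))        # -1.608 (notes: -1.607 with rounded intermediates)
print(abs(y) > z_crit)    # False: do not reject H0
```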