Subject Guide (2019)
J.S. Abdey
ST104b
2019
Undergraduate study in
Economics, Management,
Finance and the Social Sciences
This subject guide is for a 100 course offered as part of the University of London
undergraduate study in Economics, Management, Finance and the Social
Sciences. This is equivalent to Level 4 within the Framework for Higher Education
Qualifications in England, Wales and Northern Ireland (FHEQ). For more information,
see: www.london.ac.uk
This guide was prepared for the University of London by:
James S. Abdey, BA (Hons), MSc, PGCertHE, PhD, Department of Statistics,
London School of Economics and Political Science.
This is one of a series of subject guides published by the University. We
regret that due to pressure of work the author is unable to enter into any
correspondence relating to, or arising from, the guide. If you have any
comments on this subject guide, favourable or unfavourable, please use
the form at the back of this guide.
University of London
Publications Office
Stewart House
32 Russell Square
London WC1B 5DN
United Kingdom
www.london.ac.uk
Contents
1 Introduction
1.1 Route map to the guide
1.2 Introduction to the subject area
1.3 Syllabus
1.4 Aims of the course
1.5 Learning outcomes for the course
1.6 Overview of learning resources
1.6.1 The subject guide
1.6.2 Essential reading
1.6.3 Further reading
1.6.4 Online study resources (the Online Library and the VLE)
1.7 Examination advice
2 Probability theory
2.1 Synopsis of chapter
2.2 Learning outcomes
2.3 Introduction
2.4 Set theory: the basics
2.5 Axiomatic definition of probability
2.5.1 Basic properties of probability
2.6 Classical probability and counting rules
2.6.1 Combinatorial counting methods
2.7 Conditional probability and Bayes' theorem
2.7.1 Independence of multiple events
2.7.2 Independent versus mutually exclusive events
2.7.3 Conditional probability of independent events
2.7.4 Chain rule of conditional probabilities
2.7.5 Total probability formula
2.7.6 Bayes' theorem
2.8 Overview of chapter
3 Random variables
3.1 Synopsis of chapter
3.2 Learning outcomes
3.3 Introduction
3.4 Discrete random variables
3.4.1 Probability distribution of a discrete random variable
3.4.2 The cumulative distribution function (cdf)
3.4.3 Properties of the cdf for discrete distributions
3.4.4 General properties of the cdf
3.4.5 Properties of a discrete random variable
3.4.6 Expected value versus sample mean
3.5 Continuous random variables
3.5.1 Median of a random variable
3.6 Overview of chapter
3.7 Key terms and concepts
3.8 Sample examination questions
Chapter 1
Introduction
By successfully completing this half course, you will understand the ideas of
randomness and variability, and the way in which they link to probability theory. This
will allow the use of a systematic and logical collection of statistical techniques of great
practical importance in many applied areas. The examples in this subject guide will
concentrate on the social sciences, but the methods are important for the physical
sciences too. This subject aims to provide a grounding in probability theory and some
of the most common statistical methods.
The material in ST104b Statistics 2 is necessary as preparation for other subjects
you may study later on in your degree. The full details of the ideas discussed in this
subject guide will not always be required in these other subjects, but you will need to
have a solid understanding of the main concepts. This can only be achieved by seeing
how the ideas emerge in detail.
For statistics, you need some familiarity with abstract mathematical ideas, as well as
the ability and common sense to apply these to real-life problems. The concepts you will
encounter in probability and statistical inference are hard to absorb by just reading
about them in a book. You need to read, then think a little, then try some problems,
and then read and think some more. This procedure should be repeated until the
problems are easy to do; you should not spend a long time reading and forget about
solving problems.
1.3 Syllabus
The syllabus of ST104b Statistics 2 is as follows:
Point estimation: Estimation criteria: bias, variance and mean squared error;
Method of moments estimation; Least squares estimation; Maximum likelihood
estimation.
apply and be competent users of standard statistical operators and be able to recall
a variety of well-known distributions and their respective moments
topics through an alternative tutorial voice – please see the suggested ‘Further reading’
below.
The subject guide is divided into chapters which should be worked through in the order
in which they appear. There is little point in rushing past material you only partly
understand to get to later chapters, as the presentation is somewhat sequential and not
a series of self-contained topics. You should be familiar with the earlier chapters and
have a solid understanding of them before moving on to the later ones.
The following procedure is recommended:
The last step is the most important. It is easy to think that you have understood the
material after reading it, but working through problems is the crucial test of
understanding. Problem-solving should take up most of your study time.
To prepare for the examination, you will only need to read the material in the subject
guide, but it may be helpful from time to time to look at the suggested ‘Further
reading’ below.
Basic notation
We often use the symbol □ to denote the end of a proof, where we have finished
explaining why a particular result is true. This is just to make it clear where the proof
ends and the following text begins.
Time management
About one-third of your self-study time should be spent reading and the rest should be
spent solving problems. An internal student would expect maybe 15 hours of formal
teaching and another 50 hours of private study to be enough to cover the subject. Of
the 50 hours of private study, about 17 hours should be spent on the initial study of the
subject guide. The remaining 33 hours should be spent on attempting problems
contained in the subject guide.
Calculators
A calculator may be used when answering questions on the examination paper for
ST104b Statistics 2. It must comply in all respects with the specification given in the
Regulations. You should also refer to the admission notice you will receive when
entering the examination and the ‘Notice on permitted materials’.
Make sure you accustom yourself to using your chosen calculator and feel comfortable
with it. Specifically, calculators must be:

• hand held
• quiet in operation
• non-programmable.
The Regulations state: ‘The use of a calculator that communicates or displays textual
messages, graphical or algebraic information is strictly forbidden. Where a calculator is
permitted in the examination, it must be a non-scientific calculator. Where calculators
are permitted, only calculators limited to performing just basic arithmetic operations
may be used. This is to encourage candidates to show the examiners the steps taken in
arriving at the answer.’
Computers
If you are aiming to carry out serious statistical analysis (which is beyond the level of
this course) you will probably want to use some statistical software package such as R.
It is not necessary for this course to have such software available, but if you do have
access to it you may benefit from using it in your study of the material.
Statistical tables
As relevant extracts of the recommended statistical tables are the same as those distributed
for use in the examination, it is advisable that you become familiar with them, rather than
with the tables at the end of a textbook.
Newbold, P., W.L. Carlson and B.M. Thorne, Statistics for Business and
Economics. (London: Prentice–Hall, 2012) eighth edition [ISBN 9780273767060].
Johnson, R.A. and G.K. Bhattacharyya, Statistics: Principles and Methods. (New
York: John Wiley and Sons, 2010) sixth edition [ISBN 9780470505779].
Larsen, R.J. and M.L. Marx, Introduction to Mathematical Statistics and Its
Applications (Pearson, 2013) fifth edition [ISBN 9781292023557].
While Newbold et al. is the main recommended textbook for this course, there are many
which are just as good. You are encouraged to look at those listed above and at any
others you may find. It may be necessary to look at several textbooks for a single topic,
as you may find that the approach of one textbook suits you better than that of another.
1.6.4 Online study resources (the Online Library and the VLE)
In addition to the subject guide and the Essential reading, it is crucial that you take
advantage of the study resources that are available online for this course, including the
virtual learning environment (VLE) and the Online Library.
You can access the VLE, the Online Library and your University of London email
account via the Student Portal at:
http://my.londoninternational.ac.uk
You should have received your login details for the Student Portal with your official
offer, which was emailed to the address that you gave on your application form. You
have probably already logged in to the Student Portal in order to register! As soon as
you have registered, you will automatically have been granted access to the VLE, the Online
Library and your fully functional University of London email account.
If you forget your login details, please click on the ‘Forgotten your password’ link on the
login page.
The VLE
The VLE, which complements this subject guide, has been designed to enhance your
learning experience, providing additional support and a sense of community. It forms an
important part of your study experience with the University of London and you should
access it regularly.
The VLE provides a range of resources for EMFSS courses:
Self-testing activities: Doing these allows you to test your own understanding of the
subject material.
Electronic study materials: The printed materials that you receive from the
University of London are available to download, including updated reading lists
and references.
A student discussion forum: This is an open space for you to discuss interests and
experiences, seek support from your peers, work collaboratively to solve problems
and discuss subject material.
Videos: There are recorded academic introductions to the subject, interviews and
debates and, for some courses, audio-visual tutorials and conclusions.
Recorded lectures: For some courses, where appropriate, the sessions from previous
years’ Study Weekends have been recorded and made available.
Study skills: Expert advice on preparing for examinations and developing your
digital literacy skills.
Feedback forms.
Some of these resources are available for certain courses only, but we are expanding our
provision all the time and you should check the VLE regularly for updates.
The Online Library contains a huge array of journal articles and other resources to help
you read widely and extensively.
To access the majority of resources via the Online Library you will either need to use
your University of London Student Portal login details, or you will be required to
register and use an Athens login:
http://tinyurl.com/ollathens
The easiest way to locate relevant content and journal articles in the Online Library is
to use the Summon search engine.
If you are having trouble finding an article listed in a reading list, try removing any
punctuation from the title, such as single quotation marks, question marks and colons.
For further advice, please see the online help pages:
www.external.shl.lon.ac.uk/summon/about.php
Additional material
There is a lot of computer-based teaching material available freely over the web. A
fairly comprehensive list can be found in the ‘Books & Manuals’ section of
http://statpages.org
Unless otherwise stated, all websites in this subject guide were accessed in August 2019.
We cannot guarantee, however, that they will stay current and you may need to
perform an internet search to find the relevant pages.
1.7 Examination advice

Where available, past examination papers and Examiners' commentaries for the
course, which give advice on how each question might best be answered, are also provided.
The examination is by a two-hour unseen question paper. No books may be taken into
the examination, but the use of calculators is permitted, and statistical tables and a
formula sheet are provided (the formula sheet can be found in past examination papers
available on the VLE).
The examination paper has a variety of questions, some quite short and others longer.
All questions must be answered correctly for full marks. You may use your calculator
whenever you feel it is appropriate, always remembering that the examiners can give
marks only for what appears on the examination script. Therefore, it is important to
always show your working.
In terms of the examination, as always, it is important to manage your time carefully
and not to dwell on one question for too long – move on and focus on solving the easier
questions, coming back to harder ones later.
Chapter 2
Probability theory
2.2 Learning outcomes

By the end of this chapter, you should be able to:

• explain the fundamental ideas of random experiments, sample spaces and events

• list the axioms of probability and be able to derive all the common probability
rules from them

• list the formulae for the number of combinations and permutations of k objects out
of n, and be able to routinely use such results in problems

• explain conditional probability and the concept of independent events

• prove the law of total probability and apply it to problems where there is a
partition of the sample space

• prove Bayes' theorem and apply it to find conditional probabilities.
2.3 Introduction
Consider the following hypothetical example. A country will soon hold a referendum
about whether it should leave the European Union (EU). An opinion poll of a random
sample of people in the country is carried out.
950 respondents say that they plan to vote in the referendum. They answer the question
‘Will you vote ‘Yes’ or ‘No’ to leaving the EU?’ as follows:
Answer     Yes     No     Total
Count      513     437      950
%          54%     46%     100%
However, we are not interested in just this sample of 950 respondents, but in the
population which they represent, that is, all likely voters.
Statistical inference will allow us to say things like the following about the
population.
In short, the opinion poll gives statistically significant evidence that ‘Yes’ voters are in
the majority among likely voters. Such methods of statistical inference will be discussed
later in the course.
The inferential statements about the opinion poll rely on the following assumptions and
results.
In the next few chapters, we will learn about the terms in bold, among others.
In statistical inference, the data we have observed are regarded as a sample from a
broader population, selected with a random process.
A preview of probability
Experiment: for example, rolling a single die and recording the outcome.
B = {1, 2, 3, 4, 5}.
Activity 2.1 Why is S = {1, 1, 2} not a sensible way to try to define a sample
space?
Solution
Because there is no need to list the elementary outcome ‘1’ twice. It is much clearer
to write S = {1, 2}.
Activity 2.2 Write out all the events for the sample space S = {a, b, c}. (There are
eight of them.)
Solution
The possible events are {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c} (the sample space
S) and ∅.
(Strictly speaking, not all subsets are events.)
1 ∈ A and 2 ∈ A.
6 ∉ A and 1.5 ∉ A.
The familiar Venn diagrams help to visualise statements about sets. However, Venn
diagrams are not formal proofs of results in set theory.
Example 2.3 In Figure 2.1, the darkest area in the middle is A ∩ B, the total
shaded area is A ∪ B, and the white area is (A ∪ B)c = Ac ∩ Bc.
A ⊂ B when x ∈ A ⇒ x ∈ B.
Example 2.4 An example of the distinction between subsets and non-subsets is:
{1, 2, 3} ⊂ {1, 2, 3, 4}, because all elements appear in the larger set
{1, 2, 5} ⊄ {1, 2, 3, 4}, because the element 5 does not appear in the larger set.
Two sets A and B are equal (A = B) if they have exactly the same elements. This
implies that A ⊂ B and B ⊂ A.
A ∪ B = {x | x ∈ A or x ∈ B}.
That is, the set of those elements which belong to A or B (or both). An example is
shown in Figure 2.3.
For example, if A = {1, 2, 3, 4}, B = {2, 3} and C = {4, 5, 6}, then:

A ∪ B = {1, 2, 3, 4}
A ∪ C = {1, 2, 3, 4, 5, 6}
B ∪ C = {2, 3, 4, 5, 6}.
A ∩ B = {x | x ∈ A and x ∈ B}.
That is, the set of those elements which belong to both A and B. An example is
shown in Figure 2.4.
A ∩ B = {2, 3}
A ∩ C = {4}
B ∩ C = ∅.
Both set operators can also be applied to more than two sets, such as A ∩ B ∩ C.
Concise notation for the unions and intersections of sets A1, A2, . . . , An is:

⋃ᵢ₌₁ⁿ Ai = A1 ∪ A2 ∪ · · · ∪ An and ⋂ᵢ₌₁ⁿ Ai = A1 ∩ A2 ∩ · · · ∩ An.
These can also be used for an infinite number of sets, i.e. when n is replaced by ∞.
Complement (‘not’)
Suppose S is the set of all possible elements which are under consideration. In
probability, S will be referred to as the sample space.
It follows that A ⊂ S for every set A we may consider. The complement of A with
respect to S is:
Ac = {x | x ∈ S and x ∉ A}.
That is, the set of those elements of S that are not in A. An example is shown in
Figure 2.5.
We now consider some useful properties of set operators. In proofs and derivations
about sets, you can use the following results without proof.
Commutativity:
A ∩ B = B ∩ A and A ∪ B = B ∪ A.
Associativity:
A ∩ (B ∩ C) = (A ∩ B) ∩ C and A ∪ (B ∪ C) = (A ∪ B) ∪ C.
Distributive laws:
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) and A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).
De Morgan's laws:
(A ∩ B)c = Ac ∪ Bc and (A ∪ B)c = Ac ∩ Bc.
If S is the sample space and A and B are any sets in S, you can also use the following
results without proof:
∅c = S.
∅ ⊂ A, A ⊂ A and A ⊂ S.
A ∩ A = A and A ∪ A = A.
A ∩ Ac = ∅ and A ∪ Ac = S.
If B ⊂ A, A ∩ B = B and A ∪ B = A.
A ∩ ∅ = ∅ and A ∪ ∅ = A.
A ∩ S = A and A ∪ S = S.
∅ ∩ ∅ = ∅ and ∅ ∪ ∅ = ∅.
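These identities are easy to check on small finite sets using software (entirely optional for this course; see the 'Computers' note in Chapter 1). A minimal Python sketch, with an assumed sample space S = {1, . . . , 6} and sets A, B and C of my own choosing:

# Verify the distributive and De Morgan's laws on one concrete example.
S = {1, 2, 3, 4, 5, 6}                     # assumed sample space
A, B, C = {1, 2, 3}, {2, 3, 4}, {4, 5, 6}

def complement(X):
    return S - X                           # the complement of X with respect to S

# Distributive laws
assert A & (B | C) == (A & B) | (A & C)
assert A | (B & C) == (A | B) & (A | C)

# De Morgan's laws
assert complement(A & B) == complement(A) | complement(B)
assert complement(A | B) == complement(A) & complement(B)

print("All identities hold for this example.")

Of course, checking particular sets is no more a formal proof than a Venn diagram is, but it is a useful sanity check.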
Disjoint ('mutually exclusive') sets

Two sets A and B are disjoint, or mutually exclusive, if:

A ∩ B = ∅.

Sets A1, A2, . . . , An are pairwise disjoint if all pairs of sets from them are disjoint,
i.e. Ai ∩ Aj = ∅ for all i ≠ j.
Partition

The sets A1, A2, . . . , An form a partition of the set A if they are pairwise disjoint
and if ⋃ᵢ₌₁ⁿ Ai = A, that is, A1, A2, . . . , An are collectively exhaustive of A.

Therefore, a partition divides the entire set A into non-overlapping pieces Ai, as
shown in Figure 2.6 for n = 3. Similarly, an infinite collection of sets A1, A2, . . . form
a partition of A if they are pairwise disjoint and their union, over all i = 1, 2, . . ., is A.

[Figure 2.6: a Venn diagram showing A partitioned into A1, A2 and A3.]
Suppose A ⊂ B, so that A ∪ B = B. We have:
A ∩ (B ∩ Ac ) = (A ∩ Ac ) ∩ B = ∅ ∩ B = ∅
and:
A ∪ (B ∩ Ac ) = (A ∪ B) ∩ (A ∪ Ac ) = B ∩ S = B.
Hence A and B ∩ Ac are mutually exclusive and collectively exhaustive of B, and so
they form a partition of B.
Activity 2.3 For an event A, work out a simpler way to express the events A ∩ S,
A ∪ S, A ∩ ∅ and A ∪ ∅.
Solution
We have:
A ∩ S = A, A ∪ S = S, A ∩ ∅ = ∅ and A ∪ ∅ = A.
Activity 2.4 Use the rules of set operators to prove that the following represents a
partition of set A:
A = (A ∩ B) ∪ (A ∩ B c ). (*)
In other words, prove that (*) is true, and also that (A ∩ B) ∩ (A ∩ B c ) = ∅.
Solution
We have:
(A ∩ B) ∩ (A ∩ B c ) = (A ∩ A) ∩ (B ∩ B c ) = A ∩ ∅ = ∅.
This uses the results of commutativity, associativity, A ∩ A = A, A ∩ Ac = ∅ and
A ∩ ∅ = ∅.
Similarly:
(A ∩ B) ∪ (A ∩ B c ) = A ∩ (B ∪ B c ) = A ∩ S = A
using the results of the distributive laws, A ∪ Ac = S and A ∩ S = A.
Solution
(a) We have:
A1 ∪ A2 = {0, 1, 2, 3, 4} and A1 ∩ A2 = {2}.
(b) We have:
(c) We have:
Activity 2.6 Let A, B and C be events in a sample space, S. Using only the
symbols ∪, ∩, () and c , find expressions for the following events:
Solution
There is more than one way to answer this question, because the sets can be
expressed in different, but logically equivalent, forms. One way to do so is the
following.
Activity 2.7 Let A and B be events in a sample space S. Use Venn diagrams to
convince yourself that the two De Morgan’s laws:
(A ∩ B)c = Ac ∪ Bc   (1)
and:
(A ∪ B)c = Ac ∩ Bc   (2)
are correct. For each of them, draw two Venn diagrams – one for the expression on
the left-hand side of the equation, and one for the right-hand side. Shade the areas
corresponding to each expression, and hence show that for both (1) and (2) the
left-hand and right-hand sides describe the same set.
Solution
For (A ∩ B)c = Ac ∪ Bc we have:
Example 2.8 If the experiment is ‘select a trading day at random and record the
% change in the FTSE 100 index from the previous trading day’, then the outcome
is the % change in the FTSE 100 index.
S = [−100, +∞) for the % change in the FTSE 100 index (in principle).
An event of interest might be A = {x | x > 0} – the event that the daily change is
positive, i.e. the FTSE 100 index gains value from the previous trading day.
The sample space and events are represented as sets. For two events A and B, set
operations are then interpreted as follows:
Axioms of probability
‘Probability’ is formally defined as a function P (·) from subsets (events) of the sample
space S to real numbers. (The precise definition also requires a careful statement of
which subsets of S are allowed as events, which we can skip on this course.) Such a
function is a probability function if it satisfies
the following axioms (‘self-evident truths’).
Axiom 1: P (A) ≥ 0 for all events A.

Axiom 2: P (S) = 1.

Axiom 3: If A1, A2, . . . is a collection of pairwise disjoint (mutually exclusive)
events, then:

P (A1 ∪ A2 ∪ · · ·) = P (A1) + P (A2) + · · · .
The axioms require that a probability function must always satisfy these requirements.
Axiom 2 requires that the outcome is some element from the sample space with
certainty (that is, with probability 1). In other words, the experiment must have
some outcome.
All other properties of the probability function can be derived from the axioms. We
begin by showing that a result like Axiom 3 also holds for finite collections of mutually
exclusive sets.
Probability property

If A1, A2, . . . , An are a finite collection of pairwise disjoint (mutually exclusive)
events, then:

P (A1 ∪ A2 ∪ · · · ∪ An) = P (A1) + P (A2) + · · · + P (An).
In pictures, the previous result means that in a situation like the one shown in Figure
2.7, the probability of the combined event A = A1 ∪ A2 ∪ A3 is simply the sum of the
probabilities of the individual events:
P (A) = P (A1 ) + P (A2 ) + P (A3 ).
That is, we can simply sum probabilities of mutually exclusive sets. This is very useful
for deriving further results.
Figure 2.7: Venn diagram depicting three mutually exclusive sets, A1 , A2 and A3 . Note
although A2 and A3 have touching boundaries, there is no actual intersection and hence
they are (pairwise) mutually exclusive.
Probability property

For any event A, P (Ac) = 1 − P (A).

Probability property

For any event A, P (A) ≤ 1.
Proof (by contradiction): If it was true that P (A) > 1 for some A, then we would have:
P (Ac ) = 1 − P (A) < 0.
This violates Axiom 1, so cannot be true. Therefore, it must be that P (A) ≤ 1 for all A.
Putting this and Axiom 1 together, we get:
0 ≤ P (A) ≤ 1
for all events A.
Probability property

If A ⊂ B, then P (A) ≤ P (B). This follows because A and B ∩ Ac form a partition
of B, so P (B) = P (A) + P (B ∩ Ac) ≥ P (A), since P (B ∩ Ac) ≥ 0.
Probability property

P (A ∪ B) = P (A ∩ Bc) + P (A ∩ B) + P (Ac ∩ B)
P (A) = P (A ∩ Bc) + P (A ∩ B)
P (B) = P (Ac ∩ B) + P (A ∩ B)

and hence:

P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
These show that the probability function has the kinds of values we expect of something
called a ‘probability’.
P (Ac ) = 1 − P (A).
86% spend at least 1 hour watching television (event A, with P (A) = 0.86)
19% spend at least 1 hour reading newspapers (event B, with P (B) = 0.19)
15% spend at least 1 hour watching television and at least 1 hour reading
newspapers (P (A ∩ B) = 0.15).
Activity 2.8
(a) A, B and C are any three events in the sample space S. Prove that:
P (A ∪ B ∪ C) = P (A) + P (B) + P (C) − P (A ∩ B) − P (B ∩ C) − P (A ∩ C) + P (A ∩ B ∩ C).
(b) Use the result that if X ⊂ Y then P (X) ≤ P (Y ) for events X and Y .
Since A ⊂ A ∪ B and B ⊂ A ∪ B, we have P (A) ≤ P (A ∪ B) and
P (B) ≤ P (A ∪ B).
Adding these inequalities, P (A) + P (B) ≤ 2 × P (A ∪ B), so:

(P (A) + P (B))/2 ≤ P (A ∪ B).

Similarly, A ∩ B ⊂ A and A ∩ B ⊂ B, so P (A ∩ B) ≤ P (A) and
P (A ∩ B) ≤ P (B).

Adding, 2 × P (A ∩ B) ≤ P (A) + P (B), so:

P (A ∩ B) ≤ (P (A) + P (B))/2.
Probability theory tells us how to work with the probability function and derive
‘probabilities of events’ from it. However, it does not tell us what ‘probability’ really
means.
There are several alternative interpretations of the real-world meaning of ‘probability’
in this sense. One of them is outlined below. The mathematical theory of probability
and calculations on probabilities are the same whichever interpretation we assign to
‘probability’. So, in this course, we do not need to discuss the matter further.
Example 2.10 How should we interpret the following, as statements about the real
world of coins and babies?
‘The probability that a tossed coin comes up heads is 0.5.’ If we tossed a coin a
large number of times, and the proportion of heads out of those tosses was 0.5,
the ‘probability of heads’ could be said to be 0.5, for that coin.
‘The probability is 0.51 that a child born in the UK today is a boy.’ If the
proportion of boys among a large number of live births was 0.51, the
‘probability of a boy’ could be said to be 0.51.
A key question is how to determine appropriate numerical values of P (A) for the
probabilities of particular events.
This is usually done empirically, by observing actual realisations of the experiment and
using them to estimate probabilities. In the simplest cases, this basically applies the
frequency definition to observed data.
If I toss a coin 10,000 times, and 5,023 of the tosses come up heads, it seems
that, approximately, P (heads) = 0.5, for that coin.
Of the 7,098,667 live births in England and Wales in the period 1999–2009,
51.26% were boys. So we could assign the value of about 0.51 to the probability
of a boy in this population.
Standard illustrations of classical probability are devices used in games of chance, such
as tossing a coin, rolling a die, or drawing a card from a deck of playing cards.
We will use these often, not because they are particularly important but because they
provide simple examples for illustrating various results in probability.
Suppose that the sample space, S, contains m equally likely outcomes, and that event A
consists of k ≤ m of these outcomes. Therefore:

P (A) = k/m = (number of outcomes in A)/(total number of outcomes in the sample space, S).
That is, the probability of A is the proportion of outcomes which belong to A out of all
possible outcomes.
In the classical case, the probability of any event can be determined by counting the
number of outcomes which belong to the event, and the total number of possible
outcomes.
Example 2.12 Rolling two dice, what is the probability that the sum of the two
scores is 5?
S = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
(2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6),
(3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6),
(4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
(5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}.

The event of interest is A = {(1, 4), (2, 3), (3, 2), (4, 1)}, so P (A) = 4/36 = 1/9.
Now that we have a way of obtaining probabilities for events in the classical case, we
can use it together with the rules of probability.
The formula P (A) = 1 − P (Ac ) is convenient when we want P (A) but the probability of
the complementary event Ac , i.e. P (Ac ), is easier to find.
Example 2.13 When rolling two fair dice, what is the probability that the sum of
the dice is greater than 3?

The complement is that the sum is at most 3, i.e. the complementary event is
Ac = {(1, 1), (1, 2), (2, 1)}. Therefore, P (A) = 1 − P (Ac) = 1 − 3/36 = 33/36 = 11/12.
The formula:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
says that the probability that A or B happens (or both happen) is the sum of the
probabilities of A and B, minus the probability that both A and B happen.
Example 2.14 When rolling two fair dice, what is the probability that the two
scores are equal (event A) or that the total score is greater than 10 (event B)?
Here P (A) = 6/36, P (B) = 3/36 and P (A ∩ B) = P ({(6, 6)}) = 1/36, so
P (A ∪ B) = 6/36 + 3/36 − 1/36 = 8/36 = 2/9.
Activity 2.9 Assume that a calculator has a ‘random number’ key and that when
the key is pressed an integer between 0 and 999 inclusive is generated at random, all
numbers being generated independently of one another.
(a) What is the probability that the number generated is less than 300?
(b) If two numbers are generated, what is the probability that both are less than
300?
(c) If two numbers are generated, what is the probability that the first number
exceeds the second number?
(d) If two numbers are generated, what is the probability that the first number
exceeds the second number, and their sum is exactly 300?
(e) If five numbers are generated, what is the probability that at least one number
occurs more than once?
Solution

(a) 300 of the 1,000 equally likely numbers are less than 300, so the probability is
300/1000 = 0.3.

(b) By independence, the probability that both are less than 300 is (0.3)^2 = 0.09.

(c) By symmetry, P(first exceeds second) = P(second exceeds first), and the two
numbers are equal with probability 1/1000. So the required probability is
(1 − 1/1000)/2 = 0.4995.

(d) The following cases apply: {300, 0}, {299, 1}, . . . , {151, 149}, i.e. there are 150
possibilities out of the 10^6 equally likely ordered pairs. So the required probability is:

150/1000000 = 0.00015.

(e) The probability that they are all different is (noting that the first number can
be any number):

1 × (999/1000) × (998/1000) × (997/1000) × (996/1000).

Subtracting from 1 gives the required probability, i.e. 0.009965.
Activity 2.10 A box contains r red balls and b blue balls. One ball is selected at
random and its colour is observed. The ball is then returned to the box and k
additional balls of the same colour are also put into the box. A second ball is then
selected at random, its colour is observed, and it is returned to the box together
with k additional balls of the same colour. Each time another ball is selected, the
process is repeated. If four balls are selected, what is the probability that the first
three balls will be red and the fourth ball will be blue?
Hint: Your answer should be a function of r, b and k.
Solution
Let Ri be the event that a red ball is drawn on the ith draw, and let Bi be the event
that a blue ball is drawn on the ith draw, for i = 1, . . . , 4. Therefore, we have:

P (R1) = r/(r + b)
P (R2 | R1) = (r + k)/(r + b + k)
P (R3 | R1 ∩ R2) = (r + 2k)/(r + b + 2k)
P (B4 | R1 ∩ R2 ∩ R3) = b/(r + b + 3k)

where ‘|’ means ‘given’, notation which will be formally introduced later in the
chapter with conditional probability. The required probability is the product of these
four probabilities, namely:

r(r + k)(r + 2k)b / ((r + b)(r + b + k)(r + b + 2k)(r + b + 3k)).
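A hedged translation of this calculation into a Python function (the function name and the example values of r, b and k are my own, not from the activity):

from fractions import Fraction

def prob_three_red_then_blue(r, b, k):
    # P(first three balls red, fourth blue) in the reinforcing urn scheme:
    # each drawn ball is returned with k extra balls of the same colour.
    p = Fraction(r, r + b)                      # P(R1)
    p *= Fraction(r + k, r + b + k)             # P(R2 | R1)
    p *= Fraction(r + 2*k, r + b + 2*k)         # P(R3 | R1, R2)
    p *= Fraction(b, r + b + 3*k)               # P(B4 | R1, R2, R3)
    return p

print(prob_three_red_then_blue(r=2, b=3, k=1))  # example values only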
2.6.1 Combinatorial counting methods

When counting the number of ways of selecting k objects out of n, it matters:

• whether the selection is with replacement (an object can be selected more than
once) or without replacement (an object can be selected only once)

• whether the selected objects are treated as an ordered sequence or as an
unordered set.

Suppose first that the selection of k objects out of n is treated as an ordered sequence,
with replacement, so that each of the n objects may appear several times in the
selection.
Therefore:
n objects are available for selection for the 1st object in the sequence
n objects are available for selection for the 2nd object in the sequence
. . . and so on, until n objects are available for selection for the kth object in the
sequence.
Therefore, the number of possible ordered sequences is:

n × n × · · · × n = n^k   (k terms).
Suppose that the selection of k objects out of n is again treated as an ordered sequence,
but that selection is now:
Now:
n objects are available for selection for the 1st object in the sequence
. . . and so on, until n − k + 1 objects are available for selection for the kth object.
Therefore, the number of possible ordered sequences is:

n × (n − 1) × · · · × (n − k + 1).   (2.2)
Factorials
The number of ordered sets of n objects, selected without replacement from n objects,
is:
n! = n × (n − 1) × · · · × 2 × 1.
The number n! (read ‘n factorial’) is the total number of different ways in which
n objects can be arranged in an ordered sequence. This is known as the number of
permutations of n objects.
We also define 0! = 1.
n × (n − 1) × · · · × (n − k + 1) = n!/(n − k)!.
Suppose now that the identities of the objects in the selection matter, but the order
does not.
For example, the sequences (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1) are
now all treated as the same, because they all contain the elements 1, 2 and 3.
Example 2.15 Suppose we have k = 3 people (Amy, Bob and Sam). How many
different sets of birthdays can they have (day and month, ignoring the year, and
pretending February 29th does not exist, so that n = 365) in the following cases?
1. It makes a difference who has which birthday (ordered ), i.e. Amy (January 1st),
Bob (May 5th) and Sam (December 5th) is different from Amy (May 5th), Bob
(December 5th) and Sam (January 1st), and different people can have the same
birthday (with replacement). The number of different sets of birthdays is:
365^3 = 48,627,125.
2. It makes a difference who has which birthday (ordered ), and different people
must have different birthdays (without replacement). The number of different
sets of birthdays is:
365!/(365 − 3)! = 365 × 364 × 363 = 48,228,180.
3. Only the dates matter, but not who has which one (unordered ), i.e. Amy
(January 1st), Bob (May 5th) and Sam (December 5th) is treated as the same
as Amy (May 5th), Bob (December 5th) and Sam (January 1st), and different
people must have different birthdays (without replacement). The number of
different sets of birthdays is:
365C3 = 365!/((365 − 3)! × 3!) = (365 × 364 × 363)/(3 × 2 × 1) = 8,038,030.
Example 2.16 Consider a room with r people in it. What is the probability that
at least two of them have the same birthday (call this event A)? In particular, what
is the smallest r for which P (A) > 1/2?
Assume that all days are equally likely.
Label the people 1 to r, so that we can treat them as an ordered list and talk about
person 1, person 2 etc. We want to know how many ways there are to assign
birthdays to this list of people. We note the following.
1. The number of all possible sequences of birthdays, allowing repeats (i.e. with
replacement), is 365^r.

2. The number of sequences where all birthdays are different (i.e. without
replacement) is 365!/(365 − r)!.
Here ‘1.’ is the size of the sample space, and ‘2.’ is the number of outcomes which
satisfy Ac , the complement of the case in which we are interested.
Therefore:

P (Ac) = (365!/(365 − r)!)/365^r = (365 × 364 × · · · × (365 − r + 1))/365^r

and:

P (A) = 1 − P (Ac) = 1 − (365 × 364 × · · · × (365 − r + 1))/365^r.
Probabilities P (A) of at least two people sharing a birthday, for different values of the
number of people r, can be computed from this formula, as in the sketch below.
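A short Python sketch to regenerate such a table (the chosen values of r are mine):

def p_shared_birthday(r):
    # P(at least two of r people share a birthday), with 365 equally likely days.
    p_all_different = 1.0
    for i in range(r):
        p_all_different *= (365 - i) / 365
    return 1 - p_all_different

for r in (5, 10, 20, 23, 30, 50):
    print(r, round(p_shared_birthday(r), 4))

# The smallest r with P(A) > 1/2 is r = 23, where P(A) is approximately 0.507.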
Activity 2.11 A box contains 18 light bulbs, of which two are defective. If a person
selects 7 bulbs at random, without replacement, what is the probability that both
defective bulbs will be selected?
Solution
The sample space consists of all (unordered) subsets of 7 out of the 18 light bulbs in
the box. There are 18C7 such subsets. The number of subsets which contain the two
defective bulbs is the number of subsets of size 5 out of the other 16 bulbs, 16C5, so
the probability we want is:

16C5 / 18C7 = (7 × 6)/(18 × 17) = 0.1373.
2.7 Conditional probability and Bayes' theorem

This section covers:

• independence
• conditional probability
• Bayes' theorem
• updating probabilities of events, after we learn that some other event has happened.
Independence

Two events A and B are (statistically) independent if:

P (A ∩ B) = P (A) P (B).

Intuitively, independence means that:

• if A happens, this does not affect the probability of B happening (and vice versa)

• if you are told that A has happened, this does not give you any new information
about the value of P (B) (and vice versa).
For example, independence is often a reasonable assumption when A and B
correspond to physically separate experiments.
Example 2.17 Suppose we roll two dice. We assume that all combinations of the
values of them are equally likely. Define the events:
Therefore:
Solution
Using the probability property P (A ∪ B) = P (A) + P (B) − P (A ∩ B), and the
definition of independent events P (A ∩ B) = P (A) P (B), we have:
Activity 2.13 A and B are events such that P (A | B) > P (A). Prove that:

P (B | A) > P (B).
Solution

From the definition of conditional probability:

P (A | B) = P (A ∩ B)/P (B) > P (A)

i.e. P (A ∩ B) > P (A) P (B). Hence:

P (B | A) = P (A ∩ B)/P (A) > P (A) P (B)/P (A) = P (B).
Activity 2.14 A and B are any two events in the sample space S. The binary set
operator ∨ denotes an exclusive union, such that:

A ∨ B = (A ∩ Bc) ∪ (B ∩ Ac).

Show that:

(a) P (A ∨ B) = P (A) + P (B) − 2 P (A ∩ B)

(b) P (A ∨ B | A) = 1 − P (B | A).

Solution

(a) We have:

A ∨ B = (A ∩ Bc) ∪ (B ∩ Ac).

By axiom 3, noting that (A ∩ Bc) and (B ∩ Ac) are disjoint:

P (A ∨ B) = P (A ∩ Bc) + P (B ∩ Ac).

Since P (A ∩ Bc) = P (A) − P (A ∩ B) and, similarly,
P (B ∩ Ac) = P (B) − P (A ∩ B), the result follows.
(b) We have:

P (A ∨ B | A) = P ((A ∨ B) ∩ A)/P (A)
             = P (A ∩ Bc)/P (A)
             = (P (A) − P (A ∩ B))/P (A)
             = 1 − P (A ∩ B)/P (A)
             = 1 − P (B | A).
Activity 2.15 Suppose that we toss a fair coin twice. The sample space is given by:
S = {HH, HT, T H, T T }
where the elementary outcomes are defined in the obvious way – for instance HT is
heads on the first toss and tails on the second toss. Show that if all four elementary
outcomes are equally likely, then the events ‘heads on the first toss’ and ‘heads on
the second toss’ are independent.
Solution
Note carefully here that we have equally likely elementary outcomes (due to the coin
being fair), so that each has probability 1/4, and the independence follows.
The event ‘heads on the first toss’ is A = {HH, HT } and has probability 1/2,
because it is specified by two elementary outcomes. The event ‘heads on the second
toss’ is B = {HH, T H} and has probability 1/2. The event ‘heads on the first toss
and the second toss’ is A ∩ B = {HH} and has probability 1/4. So the
multiplication property P (A ∩ B) = 1/4 = 1/2 × 1/2 = P (A) P (B) is satisfied, and
the two events are independent.
Example 2.18 It can be cold in London. Four impoverished teachers dress to feel
warm. Teacher A has a hat and a scarf and gloves, Teacher B only has a hat, Teacher
C only has a scarf and Teacher D only has gloves. One teacher out of the four is
selected at random. It is shown that although each pair of events H = ‘the teacher
selected has a hat’, S = ‘the teacher selected has a scarf’, and G = ‘the teacher
selected has gloves’ are independent, all three of these events are not independent.
Two teachers have a hat, two teachers have a scarf, and two teachers have gloves, so:

P (H) = 2/4 = 1/2, P (S) = 2/4 = 1/2 and P (G) = 2/4 = 1/2.

Only one teacher has both a hat and a scarf, so:

P (H ∩ S) = 1/4

and similarly:

P (H ∩ G) = 1/4 and P (S ∩ G) = 1/4.

From these results, we can verify that:

P (H ∩ S) = P (H) P (S)
P (H ∩ G) = P (H) P (G)
P (S ∩ G) = P (S) P (G)

and so the events are pairwise independent. However, one teacher has a hat, a scarf
and gloves, so:

P (H ∩ S ∩ G) = 1/4 ≠ P (H) P (S) P (G).
Hence the three events are not independent. If the selected teacher has a hat and a
scarf, then we know that the teacher has gloves. There is no independence for all
three events together.
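This example can be checked mechanically. A minimal Python sketch, encoding each equally likely teacher by the set of items they own:

from fractions import Fraction

teachers = [{"hat", "scarf", "gloves"}, {"hat"}, {"scarf"}, {"gloves"}]

def p(*items):
    # Probability that a randomly selected teacher owns all the given items.
    return Fraction(sum(1 for t in teachers if set(items) <= t), len(teachers))

# Pairwise independence holds...
assert p("hat", "scarf") == p("hat") * p("scarf")
assert p("hat", "gloves") == p("hat") * p("gloves")
assert p("scarf", "gloves") == p("scarf") * p("gloves")

# ...but the three events together are not independent.
assert p("hat", "scarf", "gloves") != p("hat") * p("scarf") * p("gloves")
print("Pairwise independent, but not mutually independent.")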
Activity 2.16 A, B and C are independent events. Prove that A and (B ∪ C) are
independent.
Solution
We need to show that the joint probability of A ∩ (B ∪ C) equals the product of the
probabilities of A and B ∪ C, i.e. we need to show that
P (A ∩ (B ∪ C)) = P (A) P (B ∪ C).
Using the distributive law, A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C), so:

P (A ∩ (B ∪ C)) = P (A ∩ B) + P (A ∩ C) − P (A ∩ B ∩ C)
               = P (A) P (B) + P (A) P (C) − P (A) P (B) P (C)
               = P (A) [P (B) + P (C) − P (B ∩ C)]
               = P (A) P (B ∪ C)

using the independence of A, B and C, and P (B ∩ C) = P (B) P (C).
Activity 2.17 A system consists of three components which fail independently,
with component i failing with probability πi, for i = 1, 2, 3. Find the probability
that the system fails:

(a) if the system fails only when all three components fail

(b) if the system fails when at least one component fails.

Solution
(a) Since the component failures are independent, the probability of system failure
is π1 π2 π3 .
(b) The probability that component i does not fail is 1 − πi , hence the probability
that the system does not fail is (1 − π1 )(1 − π2 )(1 − π3 ), and so the probability
that the system fails is:
1 − (1 − π1 )(1 − π2 )(1 − π3 ).
Activity 2.18 Write down the condition for three events A, B and C to be
independent.
Solution
Applying the product rule, we must have:

P (A ∩ B ∩ C) = P (A) P (B) P (C).
Therefore, since all subsets of two events from A, B and C must be independent, we
must also have:
P (A ∩ B) = P (A) P (B)
P (A ∩ C) = P (A) P (C)
and:
P (B ∩ C) = P (B) P (C).
One must check that all four conditions hold to verify independence of A, B and C.
(a) What is the probability that the device fails in a year of use?
(b) How large must π be for the probability of failure in (a) to be less than 0.05?
Solution
(a) It is often easier to evaluate the probability of the complement of the event
specified. Here, we calculate:
Activity 2.20 Show that if A and B are independent events, then A and Bc are
also independent.

Solution
P (A ∩ Bc) = P (A) − P (A ∩ B)
          = P (A) − P (A) P (B)   (due to independence of A and B)
          = P (A) [1 − P (B)]
          = P (A) P (Bc).
Activity 2.21 James A and James B take alternate throws at a target, with James
A throwing first. On each throw, James A hits the target with probability 1/4 and
James B hits it with probability 1/5, independently of all other throws.

(a) Determine the probability that the target will be hit for the first time on the
third throw of James A.
(b) Determine the probability that James A will hit the target before James B does.
Solution
(a) In order for the target to be hit for the first time on the third throw of James A,
all five of the following independent events must occur: (i) James A misses on
his first throw, (ii) James B misses on his first throw, (iii) James A misses on
his second throw, (iv) James B misses on his second throw, and (v) James A
hits the target on his third throw. The probability of all five events occurring is:
(3/4) × (4/5) × (3/4) × (4/5) × (1/4) = 9/100.
(b) Let A denote the event that James A hits the target before James B. There are
two methods of solving this problem.
1. The first method is to note that A can occur in two different ways. (i)
James A hits the target on the first throw, which occurs with probability
1/4. (ii) Both Jameses miss the target on their first throws, and then
subsequently James A hits the target before James B. The probability that
both Jameses miss on their first throws is:
3 4 3
× = .
4 5 5
When they do miss, the conditions of the game become exactly the same as
they were at the beginning of the game. In effect, it is as if the boys were
starting a new game all over again, and so the probability that James A
will subsequently hit the target before James B is again P (A). Therefore,
by considering these two ways in which the event A can occur, we have:

P (A) = 1/4 + (3/5) × P (A)   ⇒   P (A) = 5/8.

2. The second method is to sum, over the rounds, the probabilities that James A
hits the target for the first time on his ith throw (both players having missed on
all earlier throws). Hence:

P (A) = (1/4) × (1 + (3/5) + (3/5)^2 + · · ·) = (1/4) × 1/(1 − 3/5) = 5/8

which uses the sum to infinity of a geometric series (with common ratio less
than 1 in absolute value) from mathematics.
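The answer 5/8 is also easy to check by simulation. A Monte Carlo sketch under the same assumptions (hit probabilities 1/4 and 1/5, James A throwing first):

import random

def a_hits_first():
    # Simulate one game: A (p = 1/4) and B (p = 1/5) throw alternately, A first.
    while True:
        if random.random() < 1/4:
            return True      # James A hits first
        if random.random() < 1/5:
            return False     # James B hits first

n = 100_000
print(sum(a_hits_first() for _ in range(n)) / n)   # close to 5/8 = 0.625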
The idea of independent events is quite different from that of mutually exclusive
(disjoint) events, as shown in Figure 2.9.
For mutually exclusive events A ∩ B = ∅, and so, from (2.1), P (A ∩ B) = 0. For
independent events, P (A ∩ B) = P (A) P (B). So since, in general,
P (A ∩ B) = 0 ≠ P (A) P (B) (except in the uninteresting case when P (A) = 0 or
P (B) = 0), mutually exclusive events and independent events are different.
In fact, mutually exclusive events are extremely non-independent (i.e. dependent). For
example, if you know that A has happened, you know for certain that B has not
happened. There is no particularly helpful way to represent independent events using a
Venn diagram.
Conditional probability
Consider two events A and B. Suppose you are told that B has occurred. How does
this affect the probability of event A?
The answer is given by the conditional probability of A given that B has occurred,
or the conditional probability of A given B for short, defined as:
P (A | B) = P (A ∩ B)/P (B)
assuming that P (B) > 0. The conditional probability is not defined if P (B) = 0.
Example 2.19 Suppose we roll two independent fair dice again. Consider the
following events.
These are shown in Figure 2.10. Now P (A) = 11/36 ≈ 0.31, P (B) = 15/36 and
P (A ∩ B) = 2/36. Therefore, the conditional probability of A given B is:
P (A | B) = P (A ∩ B)/P (B) = (2/36)/(15/36) = 2/15 ≈ 0.13.
Example 2.20 In Example 2.19, when we are told that the conditioning event B
has occurred, we know we are within the green line in Figure 2.9. So the 15
outcomes within it become the new sample space. There are 2 outcomes which
satisfy A and which are inside this new sample space, so:
P (A | B) = 2/15 = (number of cases of A within B)/(number of cases of B).
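This 'reduced sample space' view translates directly into code: condition by restricting attention to the outcomes in B. A sketch with two dice (the events A and B below are my own illustrations, not those of Example 2.19):

from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))

def cond_prob(A, B):
    # P(A | B): the outcomes in B become the new sample space.
    in_B = [o for o in outcomes if B(o)]
    return Fraction(sum(1 for o in in_B if A(o)), len(in_B))

A = lambda o: o[0] == 6 or o[1] == 6   # at least one six (illustrative)
B = lambda o: o[0] + o[1] >= 10        # total score at least 10 (illustrative)

print(cond_prob(A, B))                 # cases of A within B / cases of B = 5/6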
Activity 2.22 If all elementary outcomes are equally likely, S = {a, b, c, d},
A = {a, b, c} and B = {c, d}, find P (A | B) and P (B | A).
Solution
S has 4 elementary outcomes which are equally likely, so each elementary outcome
has probability 1/4.
We have:
P (A | B) = P (A ∩ B)/P (B) = P ({c})/P ({c, d}) = (1/4)/(1/4 + 1/4) = 1/2

and:

P (B | A) = P (B ∩ A)/P (A) = P ({c})/P ({a, b, c}) = (1/4)/(1/4 + 1/4 + 1/4) = 1/3.
Activity 2.23 Show that if A and B are disjoint events, and are also independent,
then P (A) = 0 or P (B) = 0. (Note that independence and disjointness are not
similar ideas.)
Solution
It is important to get the logical flow in the right direction here. We are told that A
and B are disjoint events, that is:
A ∩ B = ∅.
So:
P (A ∩ B) = 0.
We are also told that A and B are independent, that is:
P (A ∩ B) = P (A) P (B).
It follows that:
0 = P (A) P (B)
and so either P (A) = 0 or P (B) = 0.
Activity 2.24 Suppose A and B are events with P (A) = p, P (B) = 2p and
P (A ∪ B) = 0.75.
Solution
Activity 2.25
(a) Show that if A and B are independent events in a sample space, then Ac and B c
are also independent.
(b) Show that if X and Y are mutually exclusive events in a sample space, then X c
and Y c are not in general mutually exclusive.
Solution

(a) Since A and B are independent, P (A ∩ B) = P (A) P (B). Using De Morgan's
laws:

P (Ac ∩ Bc) = P ((A ∪ B)c) = 1 − P (A ∪ B)
           = 1 − P (A) − P (B) + P (A) P (B)
           = (1 − P (A))(1 − P (B))
           = P (Ac) P (Bc)

so Ac and Bc are also independent.
(b) To show that X c and Y c are not necessarily mutually exclusive when X and Y
are mutually exclusive, the best approach is to find a counterexample. Attempts
to ‘prove’ the result directly are likely to be logically flawed.
Look for a simple example. Suppose we roll a die. Let X = {6} be the event of
obtaining a 6, and let Y = {5} be the event of obtaining a 5. Obviously X and
Y are mutually exclusive, but X c = {1, 2, 3, 4, 5} and Y c = {1, 2, 3, 4, 6} have
X c ∩ Y c 6= ∅, so X c and Y c are not mutually exclusive.
Activity 2.26 Let B be an event with P (B) > 0, and let A1, A2, . . . be pairwise
mutually exclusive events. Using the countable additivity axiom:

P (A1 ∪ A2 ∪ · · ·) = P (A1) + P (A2) + · · ·   (*)

show that P (A1 ∪ A2 ∪ · · · | B) = P (A1 | B) + P (A2 | B) + · · · .

You can assume that all unions and intersections of Ai and B are also events in S.
Solution

We have:

P (A1 ∪ A2 ∪ · · · | B) = P ((A1 ∪ A2 ∪ · · ·) ∩ B)/P (B)
                       = P ((A1 ∩ B) ∪ (A2 ∩ B) ∪ · · ·)/P (B)
                       = (P (A1 ∩ B) + P (A2 ∩ B) + · · ·)/P (B)
                       = P (A1 | B) + P (A2 | B) + · · ·

where the third equality follows from (*) in the question, since the Ai ∩ B are also
events in S, and they are pairwise mutually exclusive (i.e. (Ai ∩ B) ∩ (Aj ∩ B) = ∅
for all i ≠ j).
If A ⊥⊥ B, i.e. P (A ∩ B) = P (A) P (B), and P (B) > 0 and P (A) > 0, then:

P (A | B) = P (A ∩ B)/P (B) = P (A) P (B)/P (B) = P (A)

and:

P (B | A) = P (A ∩ B)/P (A) = P (A) P (B)/P (A) = P (B).
In other words, if A and B are independent, learning that B has occurred does not
change the probability of A, and learning that A has occurred does not change the
probability of B. This is exactly what we would expect under independence.
Chain rule of conditional probabilities

Multiplying the definition of conditional probability through by P (B) gives:

P (A ∩ B) = P (A | B) P (B).
That is, the probability that both A and B occur is the probability that A occurs given
that B has occurred, multiplied by the probability that B occurs. Equivalently, with
the roles of A and B swapped:

P (A ∩ B) = P (B | A) P (A)

and you can use whichever version is more convenient. Very often some version of this chain
rule is much easier than calculating P (A ∩ B) directly.
The chain rule generalises to multiple events:

P (A1 ∩ A2 ∩ · · · ∩ An) = P (A1) P (A2 | A1) P (A3 | A1, A2) · · · P (An | A1, . . . , An−1)

where, for example, P (A3 | A1, A2) is shorthand for P (A3 | A1 ∩ A2). The events can be
taken in any order, as shown in Example 2.21.
Example 2.22 Suppose you draw 4 cards from a deck of 52 playing cards. What is
the probability of A = ‘the cards are the 4 aces (cards of rank 1)’ ?
We could calculate this using counting rules. There are 52C4 = 270,725 possible
subsets of 4 different cards, and only 1 of these consists of the 4 aces. Therefore,
P (A) = 1/270725.
Let us try with conditional probabilities. Define Ai as ‘the ith card is an ace’, so
that A = A1 ∩ A2 ∩ A3 ∩ A4 . The necessary probabilities are:
P (A1 ) = 4/52 since there are initially 4 aces in the deck of 52 playing cards
P (A2 | A1 ) = 3/51. If the first card is an ace, 3 aces remain in the deck of 51
playing cards from which the second card will be drawn
P (A3 | A1 , A2 ) = 2/50
P (A4 | A1, A2, A3) = 1/49.

Multiplying these together using the chain rule gives
P (A) = (4/52) × (3/51) × (2/50) × (1/49) = 1/270725, as before.
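Both routes to P (A) are quick to verify exactly; a sketch:

import math
from fractions import Fraction

# Counting route: one favourable hand out of 52C4.
print(Fraction(1, math.comb(52, 4)))      # 1/270725

# Chain rule route: P(A1) P(A2|A1) P(A3|A1,A2) P(A4|A1,A2,A3).
p = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50) * Fraction(1, 49)
print(p)                                  # 1/270725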
We now return to probabilities of partitions like the situation shown in Figure 2.10.
Figure 2.10: On the left, a Venn diagram depicting A = A1 ∪ A2 ∪ A3 , and on the right
the ‘paths’ to A.
Both diagrams in Figure 2.10 represent the partition A = A1 ∪ A2 ∪ A3 . For the next
results, it will be convenient to use diagrams like the one on the right in Figure 2.11,
where A1 , A2 and A3 are symbolised as different ‘paths’ to A.
We now develop powerful methods of calculating sums like:

P (A) = P (A ∩ B1) + P (A ∩ B2) + · · · + P (A ∩ BK)

where B1, B2, . . . , BK form a partition of the sample space S.
Figure 2.11: On the left, a Venn diagram depicting the set A and the partition of S, and
on the right the ‘paths’ to A.
In other words, think of event A as the union of all the A ∩ Bi s, i.e. of ‘all the paths to
A via different intervening events Bi ’.
To get the probability of A, we now use the chain rule to write:

P (A ∩ Bi) = P (A | Bi) P (Bi)

for each i, and sum these over the partition to obtain the total probability formula:

P (A) = P (A | B1) P (B1) + P (A | B2) P (B2) + · · · + P (A | BK) P (BK).
Example 2.23 Any event B has the property that B and its complement B c
partition the sample space. So if we take K = 2, B1 = B and B2 = B c in the formula
of total probability, we get:
P (A) = P (A | B) P (B) + P (A | Bc) P (Bc)
     = P (A | B) P (B) + P (A | Bc) [1 − P (B)].
Example 2.24 Suppose that 1 in 10,000 people (0.01%) has a particular disease. A
diagnostic test for the disease has 99% sensitivity. If a person has the disease, the
test will give a positive result with a probability of 0.99. The test has 99% specificity.
If a person does not have the disease, the test will give a negative result with a
probability of 0.99.
Let B denote the presence of the disease, and Bc denote no disease. Let A denote a
positive test result. We want to calculate P (A).
The probabilities we need are P (B) = 0.0001, P (Bc) = 0.9999, P (A | B) = 0.99 and
P (A | Bc) = 0.01. Therefore:

P (A) = P (A | B) P (B) + P (A | Bc) P (Bc) = 0.99 × 0.0001 + 0.01 × 0.9999 = 0.010098.
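The total probability formula is a one-line weighted sum in code. A generic sketch (the function name and argument names are mine):

def total_probability(priors, likelihoods):
    # P(A) = sum over i of P(A | B_i) P(B_i), for a partition B_1, ..., B_K.
    return sum(p * l for p, l in zip(priors, likelihoods))

# Disease-test example: partition (B, B^c).
print(total_probability(priors=[0.0001, 0.9999], likelihoods=[0.99, 0.01]))
# 0.010098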
Activity 2.27 A man has two bags. Bag A contains five keys and bag B contains
seven keys. Only one of the twelve keys fits the lock which he is trying to open. The
man selects a bag at random, picks out a key from the bag at random and tries that
key in the lock. What is the probability that the key he has chosen fits the lock?
Solution
Define a partition {Ci }, such that:
C1 = key in bag A and bag A chosen  ⇒  P (C1) = (5/12) × (1/2) = 5/24
C2 = key in bag B and bag A chosen  ⇒  P (C2) = (7/12) × (1/2) = 7/24
C3 = key in bag A and bag B chosen  ⇒  P (C3) = (5/12) × (1/2) = 5/24
C4 = key in bag B and bag B chosen  ⇒  P (C4) = (7/12) × (1/2) = 7/24.

Hence we require, defining the event F = ‘key fits’:

P (F) = (1/5) × P (C1) + (1/7) × P (C4) = (1/5) × (5/24) + (1/7) × (7/24) = 1/12.
What is the probability that we got there via, say, B1 ? In other words, what is the
conditional probability P (B1 | A)? This situation is depicted in Figure 2.13.
So we need:
P (Bj | A) = P (A ∩ Bj)/P (A)
and we already know how to get this.
Bayes' theorem

Using the chain rule and the total probability formula, for a partition B1, B2, . . . , BK
of the sample space we have:

P (Bj | A) = P (A | Bj) P (Bj) / (P (A | B1) P (B1) + · · · + P (A | BK) P (BK)).
Example 2.25 Continuing with Example 2.24, let B denote the presence of the
disease, Bc denote no disease, and A denote a positive test result.
We want to calculate P (B | A), i.e. the probability that a person has the disease,
given that the person has received a positive test result.
The probabilities we need are P (B) = 0.0001, P (Bc) = 0.9999, P (A | B) = 0.99 and
P (A | Bc) = 0.01, as in Example 2.24. Therefore:

P (B | A) = P (A | B) P (B) / (P (A | B) P (B) + P (A | Bc) P (Bc))
         = (0.99 × 0.0001)/0.010098 ≈ 0.0098.
Why is this so small? The reason is because most people do not have the disease and
the test has a small, but non-zero, false positive rate P (A | B c ). Therefore, most
positive test results are actually false positives.
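Bayes' theorem follows the same pattern in code; a generic sketch reusing the partition-and-likelihood style above (the names are mine):

def bayes_posterior(priors, likelihoods, j):
    # P(B_j | A) = P(A | B_j) P(B_j) / sum over i of P(A | B_i) P(B_i).
    numerator = likelihoods[j] * priors[j]
    denominator = sum(p * l for p, l in zip(priors, likelihoods))
    return numerator / denominator

# P(disease | positive test) for Example 2.25: about 0.0098.
print(bayes_posterior(priors=[0.0001, 0.9999], likelihoods=[0.99, 0.01], j=0))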
Activity 2.28 Prove the simplest version of Bayes’ theorem from first principles.
Solution
Applying the definition of conditional probability, we have:
P (B | A) = P (B ∩ A)/P (A) = P (A ∩ B)/P (A) = P (A | B) P (B)/P (A).
Activity 2.29 Starting from the definition of conditional probability, prove the
general version of Bayes' theorem for a partition B1, B2, . . . , BK of the sample space.

Solution
Bayes' theorem is:

P (Bj | A) = P (A | Bj) P (Bj) / (P (A | B1) P (B1) + · · · + P (A | BK) P (BK)).

By definition:

P (Bj | A) = P (Bj ∩ A)/P (A) = P (A | Bj) P (Bj)/P (A).

Substituting the total probability formula
P (A) = P (A | B1) P (B1) + · · · + P (A | BK) P (BK) for the denominator gives the
result.
Activity 2.30 A statistics teacher knows from past experience that a student who
does their homework consistently has a probability of 0.95 of passing the
examination, whereas a student who does not do their homework has a probability
of 0.30 of passing.
(a) If 25% of students do their homework consistently, what percentage can expect
to pass?
(b) If a student chosen at random from the group gets a pass, what is the
probability that the student has done their homework consistently?
Solution
Here the random experiment is to choose a student at random, and to record
whether the student passes (P ) or fails (F ), and whether the student has done their
homework consistently (C) or has not (N). (Notice that F = Pc and N = Cc.) The
sample space is S = {PC, PN, FC, FN}. We use the events Pass = {PC, PN} and
Fail = {FC, FN}. We consider the sample space partitioned by Homework = {PC, FC}
and No Homework = {PN, FN}.
(a) The first part of the example asks for the denominator of Bayes' theorem, via
the total probability formula:

P (Pass) = P (Pass | Homework) P (Homework) + P (Pass | No Homework) P (No Homework)
        = 0.95 × 0.25 + 0.30 × 0.75
        = 0.4625

so 46.25% of students can expect to pass.

(b) By Bayes' theorem:

P (Homework | Pass) = P (Homework ∩ Pass)/P (Pass)
                   = P (Pass | Homework) P (Homework)/P (Pass)
                   = (0.95 × 0.25)/0.4625
                   = 0.5135.
Solution
Suppose that two phrases are the same. We use Bayes' theorem:

P (plagiarised | two the same) = (0.95 × 0.05)/(0.95 × 0.05 + 0.5 × 0.95) = 0.0909.

Finding two phrases has increased the chance the work is plagiarised from 5% to
9.1%. Did you get anywhere near 9% when guessing? Now suppose that we find
three or more phrases:

P (plagiarised | three or more the same) = (0.05 × 0.05)/(0.05 × 0.05 + 0.5 × 0.95) = 0.0052.
It seems that no plagiariser is silly enough to keep three or more phrases the same,
so if we find three or more, the chance of the work being plagiarised falls from 5% to
0.5%! How close did you get by guessing?
Activity 2.32 Continuing with Activity 2.27, suppose the first key chosen does not
fit the lock. What is the probability that the bag chosen:
(a) is bag A?

(b) is the bag which contains the key that fits the lock?
Solution
(a) Using Bayes' theorem, and noting that 'bag A chosen' = C1 ∪ C2:

P (bag A | Fc) = (P (Fc | C1) P (C1) + P (Fc | C2) P (C2)) /
(P (Fc | C1) P (C1) + P (Fc | C2) P (C2) + P (Fc | C3) P (C3) + P (Fc | C4) P (C4)).

(b) Similarly, noting that the chosen bag contains the key when C1 or C4 occurs:

P (right bag | Fc) = (P (Fc | C1) P (C1) + P (Fc | C4) P (C4)) /
(P (Fc | C1) P (C1) + P (Fc | C2) P (C2) + P (Fc | C3) P (C3) + P (Fc | C4) P (C4)).
Solution
We must assume that the game finishes with probability one (it would be proved in
a more advanced subject). If A, B and C all throw and fail to get a six, then their
respective chances of winning are as at the start of the game. We can call each
completed set of three throws a round. Let us denote the probabilities of winning by
P(A), P(B) and P(C) for A, B and C, respectively. Therefore, conditioning on the first round (each player rolls a six with probability 1/6):
P(A) = 1/6 + (5/6)^3 P(A), giving P(A) = 36/91.
P(B) = (5/6)(1/6) + (5/6)^3 P(B), giving P(B) = 30/91.
P(C) = (5/6)^2 (1/6) + (5/6)^3 P(C), giving P(C) = 25/91.
Solution
Suppose that the two players are A and B. We calculate the probability that A wins
a three-, four- or five-set match, and then, since the players are evenly matched,
double these probabilities for the final answer.
P('A wins in 3 sets') = P('A wins 1st set' ∩ 'A wins 2nd set' ∩ 'A wins 3rd set')
and, by independence of the sets:
P('A wins in 3 sets') = P('A wins 1st set') P('A wins 2nd set') P('A wins 3rd set') = \frac{1}{2} × \frac{1}{2} × \frac{1}{2} = \frac{1}{8}.
Therefore, the total probability that the game lasts three sets is:
2 × \frac{1}{8} = \frac{1}{4}.
If A wins in four sets, the possible winning patterns are BAAA, ABAA and AABA.
Each of these patterns has probability (1/2)^4 by using the same argument as in the case of 3 sets. So the probability that A wins in four sets is 3 × (1/16) = 3/16. Therefore, the total probability of a match lasting four sets is 2 × (3/16) = 3/8.
The probability of a five-set match should be 1 − 3/8 − 1/4 = 3/8, but let us check this directly. The winning patterns for A in a five-set match are BBAAA, BABAA, BAABA, ABBAA, ABABA and AABBA.
Each of these has probability (1/2)^5 because of the independence of the sets. So the probability that A wins in five sets is 6 × (1/32) = 3/16. Therefore, the total probability of a five-set match is 3/8, as before.
(c) If deuce is called, show that A’s subsequent probability of winning the game is
4/5.
Solution
(a) A will win the game without deuce if he or she wins four points, including the
last point, before B wins three points. This can occur in three ways.
• A wins four straight points, i.e. AAAA with probability (2/3)^4 = 16/81.
• B wins just one point in the game. There are ^4C_1 = 4 ways for this to happen, namely BAAAA, ABAAA, AABAA and AAABA. Each has probability (1/3) × (2/3)^4, so the probability of one of these outcomes is given by 4 × (1/3) × (2/3)^4 = 64/243.
• B wins just two points in the game. There are ^5C_2 = 10 ways for this to happen, namely BBAAAA, BABAAA, BAABAA, BAAABA, ABBAAA, ABABAA, ABAABA, AABBAA, AABABA and AAABBA. Each has probability (1/3)^2 × (2/3)^4, so the probability of one of these outcomes is given by 10 × (1/3)^2 × (2/3)^4 = 160/729.
Therefore, the probability that A wins without a deuce must be the sum of these, namely:
16/81 + 64/243 + 160/729 = 496/729.
(b) We can mimic the above argument to find the probability that B wins the game without a deuce. That is, the probability of four straight points to B is (1/3)^4 = 1/81, the probability that A wins just one point in the game is 4 × (2/3) × (1/3)^4 = 8/243, and the probability that A wins just two points is 10 × (2/3)^2 × (1/3)^4 = 40/729. So the probability of B winning without a deuce is 1/81 + 8/243 + 40/729 = 73/729 and so the probability of deuce is 1 − 496/729 − 73/729 = 160/729.
(c) Either: suppose deuce has been called. The probability that A wins the game without further deuces is the probability that the next two points go AA – with probability (2/3)^2.
The probability of exactly one further deuce is that the next four points go ABAA or BAAA – with probability (2/3)^3 × (1/3) + (2/3)^3 × (1/3) = (2/3)^4.
The probability of exactly two further deuces is that the next six points go ABABAA, ABBAAA, BAABAA or BABAAA – with probability 4 × (2/3)^4 × (1/3)^2 = (2/3)^6.
Continuing this way, the probability that A wins after three further deuces is (2/3)^8 and the overall probability that A wins after deuce has been called is:
(2/3)^2 + (2/3)^4 + (2/3)^6 + (2/3)^8 + · · · .
This is a geometric progression (GP) with first term a = (2/3)^2 and common ratio r = (2/3)^2, so the overall probability that A wins after deuce has been called is a/(1 − r) (sum to infinity of a GP) which is:
\frac{(2/3)^2}{1 − (2/3)^2} = \frac{4/9}{5/9} = \frac{4}{5}.
Or (quicker!): given a deuce, the next 2 points can yield the following results. A wins with probability (2/3)^2, B wins with probability (1/3)^2, and deuce with probability 4/9.
Hence P(A wins | deuce) = (2/3)^2 + (4/9) P(A wins | deuce) and solving immediately gives P(A wins | deuce) = 4/5.
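As a quick sanity check on the value 4/5, the post-deuce game can be simulated directly. A minimal Python sketch, assuming (as in the activity) that A wins each point independently with probability 2/3:

import random

def a_wins_from_deuce(p=2/3):
    """Play points from deuce until one player leads by two."""
    lead = 0  # A's points minus B's points
    while abs(lead) < 2:
        lead += 1 if random.random() < p else -1
    return lead == 2

trials = 100_000
wins = sum(a_wins_from_deuce() for _ in range(trials))
print(wins / trials)  # should be close to 4/5 = 0.8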
(d) We have:
Example 2.26 You are waiting for your bag at the baggage reclaim carousel of an
airport. Suppose that you know that there are 200 bags to come from your flight,
and you are counting the distinct bags which come out. Suppose that x bags have
arrived, and your bag is not among them. What is the probability that your bag will
not arrive at all, i.e. that it has been lost (or at least delayed)?
Define A = 'your bag has been lost' and x = 'your bag is not among the first x bags to arrive'. What we want to know is the conditional probability P(A | x) for any x = 0, 1, 2, . . . , 200. The conditional probabilities the other way round are as follows.
P(x | A) = 1 for all x. If your bag has been lost, it will not arrive!
P(x | A^c) = (200 − x)/200 for all x. If your bag has not been lost, it is equally likely to be anywhere among the 200 bags, so the probability that it is not among the first x is (200 − x)/200.
Bayes' theorem then gives:
P(A | x) = \frac{P(x | A) P(A)}{P(x | A) P(A) + P(x | A^c) P(A^c)} = \frac{P(A)}{P(A) + [(200 − x)/200] [1 − P(A)]}.
Obviously, P (A | 200) = 1. If the bag has not arrived when all 200 have come out, it
has been lost!
For other values of x we need P (A). This is the general probability that a bag gets
lost, before you start observing the arrival of the bags from your particular flight.
This kind of probability is known as the prior probability of an event A.
Let us assign values to P (A) based on some empirical data. Statistics by the
Association of European Airlines (AEA) show how many bags were ‘mishandled’ per
1,000 passengers the airlines carried. This is not exactly what we need (since not all
passengers carry bags, and some have several), but we will use it anyway. In
particular, we will compare the results for the best and the worst of the AEA in 2006:
For Air Malta, P (A | 199) = 0.469. So even when only 1 bag remains to arrive,
the probability is less than 0.5 that your bag has been lost.
For British Airways, P (A | 199) = 0.825. Also, we see that P (A | 197) = 0.541 is
the first probability over 0.5.
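These probabilities follow directly from the formula for P(A | x). The sketch below recomputes them in Python; the two prior loss probabilities (roughly 0.0044 and 0.023 per bag) are back-solved from the quoted values of P(A | 199) and are used purely for illustration, not as the actual AEA figures:

def p_lost(x, prior, n=200):
    """P(bag lost | not among first x arrivals), from Bayes' theorem."""
    return prior / (prior + ((n - x) / n) * (1 - prior))

for prior in (0.0044, 0.023):  # illustrative per-bag loss probabilities
    print(round(p_lost(199, prior), 3))  # about 0.469 and 0.825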
Figure 2.14: Plot of P (A | x) as a function of x for the two airlines in Example 2.26, Air
Malta and British Airways (BA).
2.10 Sample examination questions
1. For each one of the statements below say whether the statement is true or false,
explaining your answer. Throughout this question A and B are events such that
0 < P (A) < 1 and 0 < P (B) < 1.
(a) If A and B are independent, then P (A) + P (B) > P (A ∪ B).
(b) If P (A | B) = P (A | B c ) then A and B are independent.
(c) If A and B are disjoint events, then Ac and B c are disjoint.
3. A person tried by a three-judge panel is declared guilty if at least two judges cast
votes of guilty (i.e. a majority verdict). Suppose that when the defendant is in fact
guilty, each judge will independently vote guilty with probability 0.9, whereas when
the defendant is not guilty (i.e. innocent), this probability drops to 0.25. Suppose
70% of defendants are guilty.
(a) Compute the probability that judge 1 votes guilty.
(b) Given that both judge 1 and judge 2 vote not guilty, compute the probability
that judge 3 votes guilty.
Chapter 3
Random variables
3.2 Learning outcomes
By the end of this chapter, you should be able to:
define a random variable and distinguish it from the values which it takes
explain the difference between discrete and continuous random variables
find the mean and the variance of simple random variables, whether discrete or continuous
demonstrate how to use simple properties of expected values and variances.
3.3 Introduction
In ST104a Statistics 1, we considered descriptive statistics for a sample of
observations of a variable X. Here we will represent the observations as a sequence of
variables, denoted as:
X1 , X2 , . . . , Xn
where n is the sample size.
In statistical inference, the observations will be treated as a sample drawn at random
from a population. We will then think of each observation Xi of a variable X as an
outcome of an experiment.
The experiment is ‘select a unit at random from the population and record its
value of X’.
The outcome is the observed value Xi of X.
Because variables X in statistical data are recorded as numbers, we can now focus on
experiments where the outcomes are also numbers – random variables.
Random variable
A random variable is an experiment for which the outcomes are numbers.¹ This means that for a random variable:
the sample space S is a set of numbers
the outcomes are numbers in this sample space (instead of 'outcomes', we often call them the values of the random variable).
There are two main types of random variables, depending on the nature of S, i.e. the possible values of the random variable: discrete² and continuous random variables.
Notation
P (a < X < b) denotes the probability that X is between the numbers a and b.
You will notice that many of the quantities we define for random variables are
analogous to sample quantities defined in ST104a Statistics 1.
¹ This definition is a bit informal, but it is sufficient for this course.
² Strictly speaking, a discrete random variable is not just a random variable which is not continuous, as there are many others, such as mixture distributions.
3.4 Discrete random variables
Example 3.1 The following two examples will be used throughout this chapter: the number of people in a randomly selected household (a discrete random variable with finitely many possible values), and the number of failed free throws before a basketball player's first success (a discrete random variable with countably infinitely many possible values).
Example 3.2 Consider the following probability distribution for the household size, X.³
Number of people in the household, x    P(X = x)
1                                       0.3002
2                                       0.3417
3                                       0.1551
4                                       0.1336
5                                       0.0494
6                                       0.0145
7                                       0.0034
8                                       0.0021
Probability function
The probability function (pf) of a discrete random variable X is defined as:
p(x) = P(X = x).
We can talk of p(x) both as the pf of the random variable X, and as the pf of the
probability distribution of X. Both mean the same thing.
Alternative terminology: the pf of a discrete random variable is also often called the
probability mass function (pmf).
Alternative notation: instead of p(x), the pf is also often denoted by, for example, pX (x)
– especially when it is necessary to indicate clearly to which random variable the
function corresponds.
The pf is defined for all real numbers x, but p(x) = 0 for any x ∉ S, i.e. for any value x which is not one of the possible values of X.
³ Source: ONS, National report for the 2001 Census, England and Wales, Table UV51.
Example 3.3 Continuing Example 3.2, here we can simply list all the values:
        0.3002  for x = 1
        0.3417  for x = 2
        0.1551  for x = 3
        0.1336  for x = 4
p(x) =  0.0494  for x = 5
        0.0145  for x = 6
        0.0034  for x = 7
        0.0021  for x = 8
        0       otherwise.
These are clearly all non-negative, and their sum is \sum_{x=1}^{8} p(x) = 1.
A graphical representation of the pf is shown in Figure 3.1.
[Figure 3.1: bar chart of the pf p(x) for x = 1, 2, . . . , 8.]
For the next example, we need to remember the following results from mathematics, concerning sums of geometric series. If r ≠ 1, then:
\sum_{x=0}^{n−1} a r^x = \frac{a(1 − r^n)}{1 − r}
and, if |r| < 1, the sum to infinity is:
\sum_{x=0}^{∞} a r^x = \frac{a}{1 − r}.
Example 3.4 In the basketball example, the number of possible values is infinite,
so we cannot simply list the values of the pf. So we try to express it as a formula.
Suppose that each free throw is successful with probability π, independently of all other throws.
Hence the probability that the first success occurs after x failures is the probability of a sequence of x failures followed by a success, i.e. the probability is:
(1 − π)^x π.
So the pf of the random variable X (the number of failures before the first success) is:
p(x) = (1 − π)^x π for x = 0, 1, 2, . . . , and p(x) = 0 otherwise    (3.1)
where 0 ≤ π ≤ 1. Let us check that (3.1) satisfies the conditions for a pf. Clearly, p(x) ≥ 0 for all x and, using the sum to infinity of a geometric series with a = π and r = 1 − π:
\sum_{x=0}^{∞} (1 − π)^x π = \frac{π}{1 − (1 − π)} = 1.
Figure 3.2: Probability function for Example 3.4. π = 0.7 indicates a fairly good free-throw
shooter. π = 0.3 indicates a pretty poor free-throw shooter.
Activity 3.1 Suppose that a box contains 12 green balls and 4 yellow balls. If 7
balls are selected at random, without replacement, determine the probability
function of X, the number of green balls which will be obtained.
Solution
Let the random variable X denote the number of green balls. As 7 balls are selected
without replacement, the sample space of X is S = {3, 4, 5, 6, 7} because the
maximum number of yellow balls which could be obtained is 4 (all selected), hence a
minimum of 3 green balls must be obtained, up to a maximum of7 green balls. The
16
number of possible combinations of 7 balls drawn from 16 is . The x green
7
12
balls chosen from 12 can occur in ways, and the 7 − x yellow balls chosen from
x
4
4 can occur in ways. Therefore, using classical probability:
7−x
12 4
x 7−x
p(x) = .
16
7
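This pf (a hypergeometric distribution) is easy to evaluate with binomial coefficients; a minimal sketch using only Python's standard library:

from math import comb

def p(x):
    """P(X = x) for the number of green balls in Activity 3.1."""
    if 3 <= x <= 7:
        return comb(12, x) * comb(4, 7 - x) / comb(16, 7)
    return 0.0

probs = {x: p(x) for x in range(3, 8)}
print(probs)
print(sum(probs.values()))  # the probabilities sum to 1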
Activity 3.2 Consider a sequence of independent tosses of a fair coin. Let the
random variable X denote the number of tosses needed to obtain the first head.
Determine the probability function of X and verify it satisfies the necessary
conditions for a valid probability function.
Solution
The sample space is clearly S = {1, 2, 3, . . .}. If the first head appears on toss x, then
the previous x − 1 tosses must have been tails. By independence of the tosses, and
the fact it is a fair coin:
P(X = x) = (1/2)^{x−1} × (1/2) = (1/2)^x.
Therefore, the probability function is:
p(x) = 1/2^x for x = 1, 2, 3, . . . , and p(x) = 0 otherwise.
Clearly, p(x) ≥ 0 for all x and:
\sum_{x=1}^{∞} (1/2)^x = \frac{1}{2} + \left(\frac{1}{2}\right)^2 + \left(\frac{1}{2}\right)^3 + · · · = \frac{1/2}{1 − 1/2} = 1
noting the sum to infinity of a geometric series with first term a = 1/2 and common
ratio r = 1/2.
Activity 3.3 A random variable X has the probability function p(x) = 2x/(k(k + 1)) for x = 1, 2, . . . , k, and p(x) = 0 otherwise, where k is a positive integer. Verify that this is a valid probability function. (Hint: 1 + 2 + · · · + k = k(k + 1)/2.)
Solution
Since k > 0, then 2x/(k(k + 1)) ≥ 0 for x = 1, 2, . . . , k. Therefore, p(x) ≥ 0 for all real x. Also, noting the hint in the question:
\sum_{x=1}^{k} \frac{2x}{k(k + 1)} = \frac{2}{k(k + 1)} + \frac{4}{k(k + 1)} + · · · + \frac{2k}{k(k + 1)} = \frac{2}{k(k + 1)} (1 + 2 + · · · + k) = \frac{2}{k(k + 1)} × \frac{k(k + 1)}{2} = 1.
The cumulative distribution function (cdf) of a discrete random variable X is defined as:
F(x) = P(X ≤ x) = \sum_{x_i ≤ x} p(x_i)
i.e. the sum of the probabilities of the possible values of X which are less than or equal to x.
Example 3.5 Continuing with the household size example, values of F (x) at all
possible values of X are:
Number of people in the household, x    p(x)      F(x)
1                                       0.3002    0.3002
2                                       0.3417    0.6419
3                                       0.1551    0.7970
4                                       0.1336    0.9306
5                                       0.0494    0.9800
6                                       0.0145    0.9945
7                                       0.0034    0.9979
8                                       0.0021    1.0000
[Figure: the cdf F(x) of household size, a step function increasing from 0 to 1.]
For the basketball example, summing the geometric series p(0) + p(1) + · · · + p(x), we can write:
F(x) = 0 for x < 0, and F(x) = 1 − (1 − π)^{x+1} for x = 0, 1, 2, . . . .
The cdf is shown in graphical form in Figure 3.4.
Activity 3.4 Suppose that random variable X has the range {x1 , x2 , . . .}, where
x1 < x2 < · · · . Prove the following results:
\sum_{i=1}^{∞} p(x_i) = 1,   p(x_k) = F(x_k) − F(x_{k−1})   and   F(x_k) = \sum_{i=1}^{k} p(x_i).
Solution
The events X = x1 , X = x2 , . . . are disjoint, so we can write:
\sum_{i=1}^{∞} p(x_i) = \sum_{i=1}^{∞} P(X = x_i) = P(X = x_1 ∪ X = x_2 ∪ · · ·) = P(S) = 1.
In words, this result states that the sum of the probabilities of all the possible values
X can take is equal to 1.
For the second equation, we have:
F(x_k) = P(X ≤ x_k) = P(X = x_k ∪ X ≤ x_{k−1}) = p(x_k) + F(x_{k−1})
since the two events are disjoint, and rearranging gives the result. The third equation follows by applying the second repeatedly, starting from F(x_1) = p(x_1).
Activity 3.5 At a charity event, the organisers sell 100 tickets to a raffle. At the
end of the event, one of the tickets is selected at random and the person with that
number wins a prize. Carol buys ticket number 22. Janet buys tickets numbered 1–5.
What is the probability for each of them to win the prize?
Solution
Let X denote the number on the winning ticket. Since all values between 1 and 100
are equally likely, X has a discrete ‘uniform’ distribution such that:
P('Carol wins') = P(X = 22) = p(22) = \frac{1}{100} = 0.01
and:
P('Janet wins') = P(X ≤ 5) = F(5) = \frac{5}{100} = 0.05.
[Figure 3.4: the cdf F(x) of the number of failures in the basketball example, for π = 0.7 and π = 0.3.]
The cdf of a discrete random variable is a step function: it is constant between successive possible values x_i and jumps by p(x_i) at each x_i; at such an x_i, the value of F(x_i) is the value at the top of the jump (i.e. F(x) is right-continuous).
Either the pf or the cdf can be used to calculate the probabilities of any events for a
discrete random variable.
Example 3.7 Continuing with the household size example (for the probabilities,
see Example 3.5), then:
The expected value (or mean) of X is denoted E(X), and defined as:
E(X) = \sum_{x_i ∈ S} x_i p(x_i).
This can also be written more concisely as E(X) = \sum_x x p(x) or E(X) = \sum x p(x).
We can talk of E(X) as the expected value of both the random variable X, and of the
probability distribution of X.
Alternative notation: instead of E(X), the symbol µ (the lower-case Greek letter ‘mu’),
or µX , is often used.
Activity 3.6 Toward the end of the financial year, James is considering whether to
accept an offer to buy his stock option now, rather than wait until the normal
exercise time. If he sells now, his profit will be £120,000. If he waits until the
exercise time, his profit will be £200,000, provided that there is no crisis in the
markets before that time; if there is a crisis, the option will be worthless and he
would expect a net loss of £50,000. What action should he take to maximise his
expected profit if the probability of crisis is:
(a) 0.5?
(b) 0.1?
For what probability of a crisis would James be indifferent between the two courses
of action if he wishes to maximise his expected profit?
Solution
Let π = probability of crisis. Then:
S = E(profit given James sells) = £120,000
and:
W = E(profit given James waits) = £200,000 × (1 − π) + (−£50,000) × π.
(a) If π = 0.5, then W = £100,000 − £25,000 = £75,000 < £120,000, so James should sell now.
(b) If π = 0.1, then W = £180,000 − £5,000 = £175,000 > £120,000, so James should wait.
James is indifferent when S = W, i.e. when:
£120,000 = £200,000 − £250,000 π
so π = 8/25 = 0.32.
Activity 3.7 What is the expectation of the random variable X if the only possible
value it can take is c? Also, show that E(X − E(X)) = 0.
Solution
We have p(c) = 1, so X is effectively a constant, even though it is called a random
variable. Its expectation is:
X
E(X) = x p(x) = c p(x) = c p(c) = c × 1 = c. (3.2)
∀x
Activity 3.8 For the random variable X of Activity 3.2 (the number of tosses of a fair coin needed to obtain the first head), show that E(2^X) does not exist.
Solution
We have:
E(2^X) = \sum_{x=1}^{∞} 2^x p(x) = \sum_{x=1}^{∞} 2^x \left(\frac{1}{2}\right)^x = \sum_{x=1}^{∞} 1 = 1 + 1 + 1 + · · · = ∞.
Note that this is the famous ‘Petersburg paradox’, according to which a player’s
expectation is infinite (i.e. does not exist) if s/he is to receive 2X units of currency
when, in a series of tosses of a fair coin, the first head appears on the xth toss.
Activity 3.9 Suppose that on each play of a certain game James, a gambler, is
equally likely to win or to lose. Suppose that when he wins, his fortune is doubled,
and that when he loses, his fortune is cut in half. If James begins playing with a
given fortune c > 0, what is the expected value of his fortune after n independent
plays of the game?
Hint: If X_1, X_2, . . . , X_n are independent random variables, then:
E(X_1 X_2 · · · X_n) = E(X_1) E(X_2) · · · E(X_n).
That is, for independent random variables the 'expectation of the product' is the 'product of the expectations'. This will be introduced in Chapter 5: Multivariate random variables.
Solution
For i = 1, . . . , n, let Xi = 2 if James’ fortune is doubled on the ith play of the game,
and let Xi = 1/2 if his fortune is cut in half on the ith play. Hence:
E(X_i) = 2 × \frac{1}{2} + \frac{1}{2} × \frac{1}{2} = \frac{5}{4}.
After the first play of the game, James’ fortune will be cX1 , after the second play it
will be (cX1 )X2 , and by continuing in this way it is seen that after n plays James’
fortune will be cX1 X2 · · · Xn . Since X1 , . . . , Xn are independent, and noting the hint:
E(cX_1 X_2 · · · X_n) = c × E(X_1) × E(X_2) × · · · × E(X_n) = c \left(\frac{5}{4}\right)^n.
Compare this with the sample mean. If the distinct values x_1, x_2, . . . , x_K occur in a sample with frequencies f_1, f_2, . . . , f_K, the sample mean can be written as X̄ = \sum_i x_i p̂(x_i), where:
p̂(x_i) = \frac{f_i}{\sum_{i=1}^{K} f_i}
are the sample proportions of the values x_i.
So X̄ uses the sample proportions, pb(xi ), whereas E(X) uses the population
probabilities, p(xi ).
For the household size distribution, the expected value is calculated as follows:
Number of people in the household, x    p(x)      x p(x)
1                                       0.3002    0.3002
2                                       0.3417    0.6834
3                                       0.1551    0.4653
4                                       0.1336    0.5344
5                                       0.0494    0.2470
6                                       0.0145    0.0870
7                                       0.0034    0.0238
8                                       0.0021    0.0168
Sum                                               2.3579 = E(X)
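A short Python sketch reproducing this table's calculation (and the second moment, which is used below for the variance):

# pf of household size X from Example 3.2
pf = {1: 0.3002, 2: 0.3417, 3: 0.1551, 4: 0.1336,
      5: 0.0494, 6: 0.0145, 7: 0.0034, 8: 0.0021}

mean = sum(x * p for x, p in pf.items())
second_moment = sum(x**2 * p for x, p in pf.items())
print(mean)                      # 2.3579 = E(X)
print(second_moment)             # about 7.259 = E(X^2)
print(second_moment - mean**2)   # about 1.699 = Var(X)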
Example 3.9 For the basketball example, p(x) = (1 − π)^x π for x = 0, 1, 2, . . . , and 0 otherwise.
It can be shown that (see the appendix for a non-examinable proof):
E(X) = \frac{1 − π}{π}.
Hence, for example, E(X) = 0.3/0.7 ≈ 0.43 when π = 0.7, and E(X) = 0.7/0.3 ≈ 2.33 when π = 0.3.
So, before scoring a basket, a fairly good free-throw shooter (with π = 0.7) misses on average about 0.43 shots, and a pretty poor free-throw shooter (with π = 0.3) misses on average about 2.33 shots.
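The formula E(X) = (1 − π)/π can be checked numerically by truncating the infinite sum \sum x p(x); a minimal sketch:

def geometric_mean(pi, terms=5_000):
    """Truncated sum of x * (1 - pi)**x * pi over x = 0, 1, 2, ..."""
    return sum(x * (1 - pi)**x * pi for x in range(terms))

print(geometric_mean(0.7))  # about 0.4286 = 0.3/0.7
print(geometric_mean(0.3))  # about 2.3333 = 0.7/0.3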
In general:
E(g(X)) ≠ g(E(X))
when g(X) is a non-linear function of X.
Suppose X is a random variable and a and b are constants, i.e. known numbers which are not random variables. Then:
E(aX + b) = a E(X) + b.
In particular, setting a = 0 gives E(b) = b.
The variance of X is defined as Var(X) = E[(X − E(X))^2] = E(X^2) − (E(X))^2, and the standard deviation as sd(X) = \sqrt{Var(X)}.
Both Var(X) and sd(X) are always ≥ 0. Both are measures of the dispersion (variation) of the random variable X.
Alternative notation: the variance is often denoted σ^2 ('sigma squared') and the standard deviation by σ ('sigma').
For the household size distribution:
Var(X) = E[(X − E(X))^2] = E(X^2) − (E(X))^2 = 7.259 − (2.358)^2 = 1.699
and sd(X) = \sqrt{Var(X)} = \sqrt{1.699} = 1.30.
For the basketball example, it can be shown that Var(X) = (1 − π)/π^2, which gives 0.61 for π = 0.7 and 7.78 for π = 0.3. So the variation in how many free throws a pretty poor shooter misses before the first success is much higher than the variation for a fairly good shooter.
Var(aX + b) = a2 Var(X).
If a = 0, this gives:
Var(b) = 0.
That is, the variance of a constant is 0. The converse also holds – if a random variable
has a variance of 0, it is actually a constant.
Example 3.14 For further practice, let us consider a discrete random variable X
which has possible values 0, 1, 2 . . . , n, where n is a known positive integer, and X
has the following probability function:
p(x) = \binom{n}{x} π^x (1 − π)^{n−x} for x = 0, 1, 2, . . . , n, and p(x) = 0 otherwise.
The cdf of this distribution does not simplify into a simple formula, so we just calculate its values from the definition, by summation. For the values x = 0, 1, . . . , n, the value of the cdf is:
F(x) = P(X ≤ x) = \sum_{y=0}^{x} \binom{n}{y} π^y (1 − π)^{n−y}.
Activity 3.10 Show that if Var(X) = 0 then p(µ) = 1. (We say in this case that X
is almost surely equal to its mean.)
Solution
From the definition of variance, we have:
Var(X) = E((X − µ)^2) = \sum_{∀x} (x − µ)^2 p(x) ≥ 0
because the squared term (x − µ)2 is non-negative (as is p(x)). The only case where
it is equal to 0 is when x − µ = 0, that is, when x = µ. Therefore, the random
variable X can only take the value µ, and we have p(µ) = P (X = µ) = 1.
Activity 3.11 Construct suitable examples to show that, for a random variable X, in general E(X^2) ≠ (E(X))^2 and E(1/X) ≠ 1/E(X).
Solution
We require a counterexample. A simple one will suffice – there is no merit in
complexity. Let the discrete random variable X assume values 1 and 2 with
probabilities 1/3 and 2/3, respectively. (Obviously, there are many other examples
we could have chosen.) Therefore:
E(X) = 1 × (1/3) + 2 × (2/3) = 5/3
E(X^2) = 1 × (1/3) + 4 × (2/3) = 3
E(1/X) = 1 × (1/3) + (1/2) × (2/3) = 2/3
and, clearly, E(X^2) ≠ (E(X))^2 and E(1/X) ≠ 1/E(X) in this case. A single counterexample is sufficient to show that these equalities do not hold in general.
Activity 3.12 Suppose X_1, X_2, . . . , X_n are independent random variables, each with mean µ and variance σ^2. Find the expected value and variance of the sample mean X̄ = (X_1 + X_2 + · · · + X_n)/n.
Solution
We have:
E\left(\frac{X_1 + X_2 + · · · + X_n}{n}\right) = \frac{E(X_1 + X_2 + · · · + X_n)}{n}
= \frac{E(X_1) + E(X_2) + · · · + E(X_n)}{n}
= \frac{µ + µ + · · · + µ}{n}
= \frac{nµ}{n}
= µ.
Also:
Var\left(\frac{X_1 + X_2 + · · · + X_n}{n}\right) = \frac{Var(X_1 + X_2 + · · · + X_n)}{n^2}
= \frac{Var(X_1) + Var(X_2) + · · · + Var(X_n)}{n^2}   (by independence)
= \frac{σ^2 + σ^2 + · · · + σ^2}{n^2}
= \frac{nσ^2}{n^2}
= \frac{σ^2}{n}.
Activity 3.13 Let X be a random variable for which E(X) = µ and Var(X) = σ 2 ,
and let c be an arbitrary constant. Show that:
E((X − c)2 ) = σ 2 + (µ − c)2 .
Solution
We have:
E((X − c)^2) = E(X^2 − 2cX + c^2) = E(X^2) − 2c E(X) + c^2
= Var(X) + (E(X))^2 − 2cµ + c^2
= σ^2 + µ^2 − 2cµ + c^2
= σ^2 + (µ − c)^2.
Solution
hence θ = −4/7. Then:
E(X^2) = (−4/7)^2 × 0.7 + (1^2 × 0.2) + (2^2 × 0.1) = 0.8286.
Activity 3.15 James is planning to invest £1,000 for two years. He will choose between two savings accounts offered by a bank:
a 'Deposit Plus' account, paying interest at a rate of X%, where X = 8.1 if the prices of all three stocks in a named set are higher at the end of the two-year reference period (an event with probability π), and X = 1.1 otherwise
a standard account, paying interest at a guaranteed rate of Y = 5.5%.
(a) Calculate the expected value and standard deviation of X, and the expected
value and standard deviation of Y .
(b) For which values of π is E(X) > E(Y )?
(c) Which account would you choose, and why? (There is no single right answer to
this question!)
Solution
(a) Since the interest rate of the standard account is guaranteed, the 'random' variable Y is actually a constant. So E(Y) = 5.5 and Var(Y) = sd(Y) = 0. The random variable X has two values, 8.1 and 1.1, with probabilities π and 1 − π respectively. Therefore:
E(X) = 8.1π + 1.1(1 − π) = 1.1 + 7.0π
and, since X takes just two values which differ by 7.0:
Var(X) = (7.0)^2 π(1 − π) = 49π(1 − π), so sd(X) = 7\sqrt{π(1 − π)}.
(b) E(X) > E(Y ) if 1.1 + 7.0π > 5.5, i.e. if π > 0.6286. The expected interest rate
of the Deposit Plus account is higher than the guaranteed rate of the standard
account if the probability is higher than 0.6286 that all three stock prices are at
higher levels at the end of the reference period.
(c) If you focus solely on the expected interest rate, you would make your decision based on your estimate of π. You would choose the Deposit Plus account if you believe – based on whatever evidence on the companies and the world economy you choose to use – that there is a probability of at least 0.6286 that the three companies will all increase their share prices over the two years.
However, you might also consider the variances. The standard account has a
guaranteed rate, while the Deposit Plus account offers both a possibility of a
high rate and a risk of a low rate. So the choice could also depend on how
risk-averse you are.
Activity 3.16 A rat is trapped in a maze with four doors, exactly one of which leads to freedom. Let X denote the number of trials needed for the rat to escape. Find E(X) under each of the following assumptions:
(a) each door is equally likely to be chosen on each trial, and all trials are mutually independent
(b) at each trial, the rat chooses with equal probability between the doors which it has not so far tried
(c) the rat never chooses the same door on two successive trials, but otherwise chooses at random with equal probabilities.
Solution
(a) For the rat with no memory, every trial succeeds independently with probability 1/4, so X has a geometric distribution on {1, 2, 3, . . .} and E(X) = 4.
(b) For the rat with full memory, the four doors are tried in a random order, so X is equally likely to be 1, 2, 3 or 4, and E(X) = (1 + 2 + 3 + 4)/4 = 2.5.
(c) For the 'forgetful' rat (short-term, but not long-term, memory):
P(X = 1) = 1/4
P(X = 2) = (3/4) × (1/3)
P(X = 3) = (3/4) × (2/3) × (1/3)
...
P(X = r) = (3/4) × (2/3)^{r−2} × (1/3)   (for r ≥ 2).
Therefore:
E(X) = \frac{1}{4} + \frac{3}{4} \left[2 × \frac{1}{3} + 3 × \frac{2}{3} × \frac{1}{3} + 4 × \left(\frac{2}{3}\right)^2 × \frac{1}{3} + · · ·\right]
= \frac{1}{4} + \frac{1}{4} \left[2 + 3 × \frac{2}{3} + 4 × \left(\frac{2}{3}\right)^2 + · · ·\right]
= \frac{1}{4} + \frac{1}{4} × 12
= 3.25.
Note that 2.5 < 3.25 < 4, so the intelligent rat needs the fewest trials on average, while the stupid rat needs the most, as we would expect!
3.5 Continuous random variables
A continuous random variable is one which can take any value in a continuous range. In other words, the set of possible values (the sample space) is the real numbers R, or one or more intervals in R.⁴
For example, let X be the size of an insurance claim, recorded in units of £1,000. Suppose the policy has a deductible of £999, so all claims are at least £1,000 – in these units, the smallest possible value of X is 1.
Most of the concepts introduced for discrete random variables have exact or
approximate analogies for continuous random variables, and many results are the same
for both types. However, there are some differences in the details. The most obvious
difference is that wherever in the discrete case there are sums over the possible values of
the random variable, in the continuous case these are integrals.
⁴ Strictly speaking, having an uncountably infinite number of possible values does not necessarily imply that it is a continuous random variable. For example, the Cantor distribution (not covered in this course) is neither a discrete nor an absolutely continuous probability distribution, nor is it a mixture of these. However, we will not consider this matter any further in this course.
Suppose the pdf of X is:
f(x) = 0 for x < k, and f(x) = α k^α / x^{α+1} for x ≥ k
where α > 0 is a parameter, and k > 0 (the smallest possible value of X) is a known number. In our example, k = 1 (due to the deductible). A probability distribution with this pdf is known as the Pareto distribution. A graph of this pdf when α = 2.2 is shown in Figure 3.5.
Unlike for probability functions of discrete random variables, in the continuous case values of the probability density function are not probabilities of individual values, i.e. f(x) ≠ P(X = x). In fact, for a continuous random variable:
P(X = x) = 0 for all x.    (3.4)
That is, the probability that X has any particular value exactly is always 0.
Because of (3.4), with a continuous random variable we do not need to be very careful about differences between < and ≤, and between > and ≥. Therefore, the following probabilities are all equal:
P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b) = P(a ≤ X ≤ b).
Example 3.17 In Figure 3.6, the shaded area is P(1.5 < X ≤ 3) = \int_{1.5}^{3} f(x) dx.
[Figure 3.6: the Pareto pdf with the area between 1.5 and 3 shaded.]
Properties of pdfs
The pdf f (x) of any continuous random variable must satisfy the following conditions.
1. f(x) ≥ 0 for all x.
2. \int_{−∞}^{∞} f(x) dx = 1.
Example 3.18 Continuing with the insurance example, we check that the
conditions hold for the pdf:
f(x) = 0 for x < k, and f(x) = α k^α / x^{α+1} for x ≥ k.
1. Clearly, f(x) ≥ 0 for all x, since α > 0, k^α > 0 and x^{α+1} ≥ k^{α+1} > 0.
2. We have:
\int_{−∞}^{∞} f(x) dx = \int_{k}^{∞} \frac{α k^α}{x^{α+1}} dx = α k^α \int_{k}^{∞} x^{−α−1} dx = α k^α \left[\frac{x^{−α}}{−α}\right]_{k}^{∞} = (−k^α)(0 − k^{−α}) = 1.
The general properties of the cdf stated previously also hold for continuous
distributions. The cdf of a continuous distribution is not a step function, so results
on discrete-specific properties do not hold in the continuous case. A continuous cdf
is a smooth, continuous function of x.
Where the cdf F(x) is differentiable, the pdf can be obtained from it by differentiation:
f(x) = F′(x).
Activity 3.17
(a) Define the cumulative distribution function (cdf) of a random variable and state
the principal properties of such a function.
(b) Identify which, if any, of the following functions could be a cdf under suitable
choices of the constants a and b. Explain why (or why not) each function
satisfies the properties required of a cdf and the constraints which may be
required in respect of the constants a and b.
i. F(x) = a(b − x)^2 for −1 ≤ x ≤ 1.
ii. F(x) = a(1 − x^b) for −1 ≤ x ≤ 1.
iii. F(x) = a − b exp(−x/2) for 0 ≤ x ≤ 2.
Solution
For the Pareto distribution, integrating the pdf gives, for x ≥ k:
F(x) = \int_{k}^{x} α k^α t^{−α−1} dt = α k^α \left[\frac{t^{−α}}{−α}\right]_{k}^{x} = (−k^α)(x^{−α} − k^{−α}) = 1 − k^α x^{−α} = 1 − \left(\frac{k}{x}\right)^α.
Therefore:
F(x) = 0 for x < k, and F(x) = 1 − (k/x)^α for x ≥ k.    (3.5)
If we were given (3.5), we could obtain the pdf by differentiation, since F 0 (x) = 0
when x < k, and:
F′(x) = −k^α (−α) x^{−α−1} = \frac{α k^α}{x^{α+1}} for x ≥ k.
A plot of the cdf is shown in Figure 3.7.
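The consistency between the cdf (3.5) and the pdf is easy to verify numerically, since f(x) should match the derivative of F(x). A small Python sketch, using the values k = 1 and α = 2.2 from the example:

def F(x, k=1.0, alpha=2.2):
    """Pareto cdf, equation (3.5)."""
    return 0.0 if x < k else 1 - (k / x)**alpha

def f(x, k=1.0, alpha=2.2):
    """Pareto pdf."""
    return 0.0 if x < k else alpha * k**alpha / x**(alpha + 1)

x, h = 2.0, 1e-6
print((F(x + h) - F(x - h)) / (2 * h))  # numerical derivative of F at x = 2
print(f(x))                              # agrees with the pdf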
Example 3.20 Continuing with the insurance example (with k = 1 and α = 2.2),
then:
Example 3.21 Consider now a continuous random variable with the following pdf:
f(x) = λ e^{−λx} for x > 0, and f(x) = 0 for x ≤ 0    (3.6)
where λ > 0 is a parameter. This is the pdf of the exponential distribution. The
uses of this distribution will be discussed in the next chapter.
Since:
\int_{0}^{x} λ e^{−λt} dt = \left[−e^{−λt}\right]_{0}^{x} = 1 − e^{−λx}
the cdf of the exponential distribution is:
F(x) = 0 for x ≤ 0, and F(x) = 1 − e^{−λx} for x > 0.
2. Since we have just done the integration to derive the cdf F(x), we can also use it to show that f(x) integrates to one. This follows from:
\int_{−∞}^{∞} f(x) dx = P(−∞ < X < ∞) = \lim_{x→∞} F(x) − \lim_{x→−∞} F(x) = 1 − 0 = 1.
Suppose X is a continuous random variable with pdf f (x). Definitions of its expected
value, the expected value of any transformation g(X), the variance and standard
deviation are the same as for discrete distributions, except that summation is replaced
by integration:
E(X) = \int_{−∞}^{∞} x f(x) dx
E[g(X)] = \int_{−∞}^{∞} g(x) f(x) dx
Var(X) = E[(X − E(X))^2] = \int_{−∞}^{∞} (x − E(X))^2 f(x) dx = E(X^2) − (E(X))^2
sd(X) = \sqrt{Var(X)}.
For the exponential distribution, integration by parts gives E(X) = 1/λ and:
E(X^2) = \int_{0}^{∞} x^2 λ e^{−λx} dx = \left[−x^2 e^{−λx}\right]_{0}^{∞} + \frac{2}{λ} \int_{0}^{∞} x λ e^{−λx} dx = \frac{2}{λ^2}
where the last step follows because the last integral is simply E(X) = 1/λ again. Finally:
Var(X) = E(X^2) − (E(X))^2 = \frac{2}{λ^2} − \frac{1}{λ^2} = \frac{1}{λ^2}.
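These two moments can also be checked by simulation; a minimal sketch using Python's standard library exponential generator (the choice λ = 2 is an arbitrary illustration):

import random

lam = 2.0
n = 200_000
sample = [random.expovariate(lam) for _ in range(n)]

mean = sum(sample) / n
var = sum((s - mean)**2 for s in sample) / n
print(mean)  # close to 1/lam = 0.5
print(var)   # close to 1/lam**2 = 0.25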
Solution
(a) The pdf is f(x) = ax + bx^2 for 0 ≤ x ≤ 1, and 0 otherwise. For f to be a pdf:
\int_{0}^{1} f(x) dx = 1 ⇒ \int_{0}^{1} (ax + bx^2) dx = \left[\frac{ax^2}{2} + \frac{bx^3}{3}\right]_{0}^{1} = \frac{a}{2} + \frac{b}{3} = 1
and, using the given condition E(X) = 1/2:
\frac{a}{3} + \frac{b}{4} = \frac{1}{2} ⇒ a = 6 and b = −6.
Hence f(x) = 6x(1 − x) for 0 ≤ x ≤ 1, and 0 otherwise.
(b) We have:
F(x) = 0 for x < 0, F(x) = 3x^2 − 2x^3 for 0 ≤ x ≤ 1, and F(x) = 1 for x > 1.
(c) Finally:
E(X^2) = \int_{0}^{1} x^2 (6x(1 − x)) dx = \int_{0}^{1} (6x^3 − 6x^4) dx = \left[\frac{6x^4}{4} − \frac{6x^5}{5}\right]_{0}^{1} = 0.3.
(f) If a sample of five observations is drawn at random from the distribution, find
the probability that all the observations exceed 1.5.
Solution
Each observation independently exceeds 1.5 with probability P(X > 1.5) = 0.6836, so the required probability is:
(0.6836)^5 = 0.1493.
Solution
(a) Clearly, f(x) ≥ 0 for all x and \int_{−∞}^{∞} f(x) dx = 1. This can be seen geometrically, since f(x) defines two rectangles, one with base 1 and height 1/4, the other with base 1 and height 3/4, giving a total area of 1/4 + 3/4 = 1.
(b) We have:
E(X) = \int_{−∞}^{∞} x f(x) dx = \int_{0}^{1} \frac{x}{4} dx + \int_{1}^{2} \frac{3x}{4} dx = \left[\frac{x^2}{8}\right]_{0}^{1} + \left[\frac{3x^2}{8}\right]_{1}^{2} = \frac{1}{8} + \frac{3}{2} − \frac{3}{8} = \frac{5}{4}.
The median is most simply found geometrically. The area to the right of the
point x = 4/3 is 0.5, i.e. the rectangle with base 2 − 4/3 = 2/3 and height 3/4,
giving an area of 2/3 × 3/4 = 1/2. Hence the median is 4/3.
(c) Evaluate P (W > 1), P (W = 2), P (W ≤ 1.5 | W > 0.5) and E(W ).
Solution
[Sketch of the cdf G(w): G(0) = 1/3, G(w) = 1 − (2/3)e^{−w/2} for 0 ≤ w < 2, and a jump to G(w) = 1 at w = 2, from the value 1 − (2/3)e^{−1}.]
(c) We have:
P(W > 1) = 1 − G(1) = \frac{2}{3} e^{−1/2},   P(W = 2) = \frac{2}{3} e^{−1}
and:
P(W ≤ 1.5 | W > 0.5) = \frac{P(0.5 < W ≤ 1.5)}{P(W > 0.5)} = \frac{G(1.5) − G(0.5)}{1 − G(0.5)} = \frac{(1 − (2/3)e^{−1.5/2}) − (1 − (2/3)e^{−0.5/2})}{(2/3)e^{−0.5/2}} = 1 − e^{−1/2}.
(a) Show that this function has the characteristics of a probability density function.
Solution
(a) Clearly, f(x) ≥ 0 for all x since λ^2 > 0, x > 0 and e^{−λx} > 0.
To show \int_{−∞}^{∞} f(x) dx = 1, integrating by parts:
\int_{−∞}^{∞} f(x) dx = \int_{0}^{∞} λ^2 x e^{−λx} dx = λ^2 \left[\frac{x e^{−λx}}{−λ}\right]_{0}^{∞} + λ^2 \int_{0}^{∞} \frac{e^{−λx}}{λ} dx = 0 + \int_{0}^{∞} λ e^{−λx} dx = 1.
For the mean, again integrating by parts:
E(X) = \int_{0}^{∞} x λ^2 x e^{−λx} dx = \left[−x^2 λ e^{−λx}\right]_{0}^{∞} + 2 \int_{0}^{∞} x λ e^{−λx} dx = 0 + \frac{2}{λ}   (from the exponential distribution).
For the variance:
E(X^2) = \int_{0}^{∞} x^2 λ^2 x e^{−λx} dx = \left[−x^3 λ e^{−λx}\right]_{0}^{∞} + 3 \int_{0}^{∞} x^2 λ e^{−λx} dx = \frac{6}{λ^2}
using E(X^2) = 2/λ^2 for the exponential distribution, and hence:
Var(X) = \frac{6}{λ^2} − \left(\frac{2}{λ}\right)^2 = \frac{2}{λ^2}.
(b) Suppose that E(X) = 0.75 (1 − e−1 ). Evaluate the median of X and Var(X).
Solution
(a) We have:
i. P(X = 0) = F(0) = 1 − a.
ii. P(X = 1) = \lim_{x→1} (F(1) − F(x)) = 1 − (1 − a e^{−1}) = a e^{−1}.
iii. f(x) = a e^{−x} for 0 ≤ x < 1, and 0 otherwise.
iv. The mean is:
E(X) = 0 × (1 − a) + 1 × a e^{−1} + \int_{0}^{1} x a e^{−x} dx
= a e^{−1} + \left[−x a e^{−x}\right]_{0}^{1} + \int_{0}^{1} a e^{−x} dx
= a e^{−1} − a e^{−1} + \left[−a e^{−x}\right]_{0}^{1}
= a(1 − e^{−1}).
Hence, since E(X) = a(1 − e^{−1}) = 0.75(1 − e^{−1}) implies a = 0.75:
Var(X) = 2a − 4a e^{−1} − a^2 (1 + e^{−2} − 2e^{−1}) = 0.1716.
(a) Determine the constant k and derive the cumulative distribution function, F (x),
of X.
Solution
(a) We have:
\int_{−∞}^{∞} f(x) dx = \int_{0}^{π} k \sin(x) dx = 1.
Therefore:
\left[k(−\cos(x))\right]_{0}^{π} = 2k = 1 ⇒ k = \frac{1}{2}.
The cdf is hence:
F(x) = 0 for x < 0, F(x) = (1 − \cos(x))/2 for 0 ≤ x ≤ π, and F(x) = 1 for x > π.
By symmetry, E(X) = π/2. For E(X^2), integrating by parts twice:
E(X^2) = \int_{0}^{π} x^2 \frac{\sin(x)}{2} dx = \frac{π^2}{2} − \left[−\cos(x)\right]_{0}^{π} = \frac{π^2}{2} − 2.
Therefore, the variance is:
Var(X) = E(X^2) − (E(X))^2 = \frac{π^2}{2} − 2 − \frac{π^2}{4} = \frac{π^2}{4} − 2.
Solution
[Sketch of the pdf: f(x) rises linearly as x/5 on (0, 2), falls linearly as (20 − 4x)/30 on [2, 5], and is 0 elsewhere.]
(b) We determine the cdf by integrating the pdf over the appropriate range, hence:
F(x) = 0 for x ≤ 0, F(x) = x^2/10 for 0 < x < 2, F(x) = (10x − x^2 − 10)/15 for 2 ≤ x ≤ 5, and F(x) = 1 for x > 5.
For 0 < x < 2, we have:
F(x) = \int_{−∞}^{x} f(t) dt = \int_{−∞}^{0} 0 dt + \int_{0}^{x} \frac{t}{5} dt = \left[\frac{t^2}{10}\right]_{0}^{x} = \frac{x^2}{10}.
For 2 ≤ x ≤ 5, we have:
F(x) = \int_{−∞}^{x} f(t) dt = \int_{−∞}^{0} 0 dt + \int_{0}^{2} \frac{t}{5} dt + \int_{2}^{x} \frac{20 − 4t}{30} dt
= 0 + \frac{4}{10} + \left[\frac{2t}{3} − \frac{t^2}{15}\right]_{2}^{x}
= \frac{2}{5} + \left(\frac{2x}{3} − \frac{x^2}{15}\right) − \left(\frac{4}{3} − \frac{4}{15}\right)
= \frac{2x}{3} − \frac{x^2}{15} − \frac{2}{3}
= \frac{10x − x^2 − 10}{15}.
(b) Let X be a random variable with probability density function f (x). Find E(X).
Solution
(a) If one draws it on a diagram, it is simply a triangle with base length 6 (from 0
to 6 on the x-axis) and height 1 (the highest point at x = 3). Integrating this
function
R is just finding the area of it, which is (6 × 1)/2 = 3. Hence
f (x) dx = 3k, and so we must have 3k = 1, implying k = 1/3. Alternatively,
note that:
Z 6 Z 6 Z 3 Z 6
x x
k g(x) dx = k g(x) dx = k dx + − + 2 dx .
0 0 0 3 3 3
R
We must have f (x) dx = 1. Hence:
Z 3 Z 6 2 3 2 6
x x x x 3k 3k
1=k dx + k − + 2 dx = k + k − + 2x = +
0 3 3 3 6 0 6 3 2 2
giving k = 1/3.
(b) This can be done very quickly if one realises that g(x), and hence f(x), is symmetric around 3, and hence the mean, E(X), must be 3. Otherwise, you need to calculate \int_{0}^{6} x f(x) dx, which can be written as the sum of two integrals:
\int_{0}^{6} x f(x) dx = \frac{1}{3} \int_{0}^{6} x g(x) dx = \frac{1}{3} \int_{0}^{3} \frac{x^2}{3} dx + \frac{1}{3} \int_{3}^{6} \left(−\frac{x^2}{3} + 2x\right) dx.
Therefore:
E(X) = \int_{0}^{3} \frac{x^2}{9} dx + \int_{3}^{6} \left(−\frac{x^2}{9} + \frac{2x}{3}\right) dx = \left[\frac{x^3}{27}\right]_{0}^{3} + \left[−\frac{x^3}{27} + \frac{x^2}{3}\right]_{3}^{6} = 1 + (4 − 2) = 3.
Recall from ST104a Statistics 1 that the sample median is essentially the observation
‘in the middle’ of a set of data, i.e. where half of the observations in the sample are
smaller than the median and half of the observations are larger.
The median of a random variable (i.e. of its probability distribution) is similar in spirit.
For example, for the exponential distribution the median m satisfies F(m) = 1 − e^{−λm} = 1/2, i.e.:
e^{−λm} = \frac{1}{2} ⇔ −λm = −\log 2 ⇔ m = \frac{\log 2}{λ}.
3.6 Overview of chapter
Key terms and concepts introduced in this chapter include: Outcome, Parameter, Probability density function, Probability distribution, Probability (mass) function, Random variable, Standard deviation, Step function, Variance.
Sample examination questions
2. The random variable X has the probability density function given by:
f(x) = kx^2 for 0 < x < 1, and f(x) = 0 otherwise.
Chapter 4
Common distributions of random
variables
4.2 Learning outcomes
By the end of this chapter, you should be able to:
state properties of these distributions such as the expected value and variance.
4.3 Introduction
In statistical inference we will treat observations:
X1 , X2 , . . . , Xn
(the sample) as values of a random variable X, which has some probability distribution
(the population distribution).
How to choose the probability distribution?
There is a large number of such distributions, such that for most purposes we can
find a suitable standard distribution.
This part of the course introduces some of the most common standard distributions for
discrete and continuous random variables.
Probability distributions may differ from each other in a broader or narrower sense. In
the broader sense, we have different families of distributions which may have quite
different characteristics, for example:
The ‘distributions’ discussed in this chapter are really families of distributions in this
sense.
In the narrower sense, individual distributions within a family differ in having different
values of the parameters of the distribution. The parameters determine the mean and
variance of the distribution, values of probabilities from it etc.
In the statistical analysis of a random variable X we typically:
select a family of distributions based on the basic characteristics of X
use observed data to choose (estimate) the values of the parameters of the distribution.
Here we need a family of discrete distributions with only two possible values (0
and 1). The Bernoulli distribution (discussed in the next section), which has
one parameter π (the probability that Xi = 1) is appropriate.
Within the family of Bernoulli distributions, we use the one where the value of π is our best estimate based on the observed data. This is π̂ = 513/950 = 0.54.
4.4 Common discrete distributions
The discrete uniform distribution is not very common in applications, but it is useful as
a reference point for more complex distributions.
For the discrete uniform distribution on {1, 2, . . . , k}, p(x) = 1/k for x = 1, 2, . . . , k, so:
E(X) = \sum_{x=1}^{k} x p(x) = \frac{1 + 2 + · · · + k}{k} = \frac{k + 1}{2}    (4.1)
and:
E(X^2) = \sum_{x=1}^{k} x^2 p(x) = \frac{1^2 + 2^2 + · · · + k^2}{k} = \frac{(k + 1)(2k + 1)}{6}.    (4.2)
Therefore:
Var(X) = E(X^2) − (E(X))^2 = \frac{k^2 − 1}{12}.
Many variables of interest are binary, taking just two possible values, for example:
agree / disagree
male / female
For X ∼ Bernoulli(π), E(X) = π and:
E(X^2) = \sum_{x=0}^{1} x^2 p(x) = 0^2 × (1 − π) + 1^2 × π = π
and:
Var(X) = E(X^2) − (E(X))^2 = π − π^2 = π(1 − π).    (4.4)
Solution
(a) X_n = \sum_{i=1}^{n} B_i takes the values 0, 1, . . . , n. Any sequence consisting of x 1s and n − x 0s has a probability π^x (1 − π)^{n−x} and gives a value X_n = x. There are \binom{n}{x} such sequences, so:
P(X_n = x) = \binom{n}{x} π^x (1 − π)^{n−x}
and 0 otherwise. Hence E(B_i) = π and Var(B_i) = π(1 − π), which means E(X_n) = nπ and Var(X_n) = nπ(1 − π).
(b) Y = \min\{i : B_i = 1\} takes the values 1, 2, 3, . . . , hence:
P(Y = y) = (1 − π)^{y−1} π
and 0 otherwise. It follows that P(Y > y) = (1 − π)^y.
Let X denote the total number of successes in these n trials. X follows a binomial
distribution with parameters n and π, where n ≥ 1 is a known integer and 0 ≤ π ≤ 1.
This is often written as:
X ∼ Bin(n, π).
The binomial distribution was first encountered in Example 3.14.
Example 4.4 A multiple choice test has 4 questions, each with 4 possible answers.
James is taking the test, but has no idea at all about the correct answers. So he
guesses every answer and, therefore, has the probability of 1/4 of getting any
individual question correct.
Let X denote the number of correct answers in James’ test. X follows the binomial
distribution with n = 4 and π = 0.25, i.e. we have:
X ∼ Bin(4, 0.25).
For example, what is the probability that James gets 3 of the 4 questions correct?
Here it is assumed that the guesses are independent, and each has the probability
π = 0.25 of being correct. The probability of any particular sequence of 3 correct
and 1 incorrect answers, for example 1110, is π 3 (1 − π)1 , where ‘1’ denotes a correct
answer and ‘0’ denotes an incorrect answer.
However, we do not care about the order of the 1s and 0s, only about the number of
1s. So 1101 and 1011, for example, also count as 3 correct answers. Each of these
also has the probability π 3 (1 − π)1 .
The total number of sequences with three 1s (and, therefore, one 0) is the number of locations for the three 1s which can be selected in the sequence of 4 answers. This is \binom{4}{3} = 4. Therefore, the probability of obtaining three 1s is:
\binom{4}{3} π^3 (1 − π)^1 = 4 × (0.25)^3 × (0.75)^1 ≈ 0.0469.
We have already shown that (4.5) satisfies the conditions for being a probability
function in the previous chapter (see Example 3.14).
Example 4.6 Suppose a multiple choice examination has 20 questions, each with 4
possible answers. Consider again James who guesses each one of the answers. Let X
denote the number of correct answers by such a student, so that we have
X ∼ Bin(20, 0.25). For such a student, the expected number of correct answers is
E(X) = 20 × 0.25 = 5.
The teacher wants to set the pass mark of the examination so that, for such a
student, the probability of passing is less than 0.05. What should the pass mark be?
In other words, what is the smallest x such that P (X ≥ x) < 0.05, i.e. such that
P (X < x) ≥ 0.95?
Calculating the probabilities of x = 0, 1, . . . , 20 we get (rounded to 2 decimal places):
x 0 1 2 3 4 5 6 7 8 9 10
p(x) 0.00 0.02 0.07 0.13 0.19 0.20 0.17 0.11 0.06 0.03 0.01
x 11 12 13 14 15 16 17 18 19 20
p(x) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Calculating the cumulative probabilities, we find that F (7) = P (X < 8) = 0.898 and
F (8) = P (X < 9) = 0.959. Therefore, P (X ≥ 8) = 0.102 > 0.05 and also
P (X ≥ 9) = 0.041 < 0.05. The pass mark should be set at 9.
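The search for the pass mark is easy to automate. A minimal Python sketch, reproducing the calculation for X ∼ Bin(20, 0.25):

from math import comb

def binom_pmf(x, n=20, p=0.25):
    return comb(n, x) * p**x * (1 - p)**(n - x)

cum = 0.0
for x in range(21):
    cum += binom_pmf(x)          # cum = P(X <= x)
    if cum >= 0.95:              # first x with P(X > x) < 0.05
        print("pass mark:", x + 1)
        break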
More generally, consider a student who has the same probability π of the correct
answer for every question, so that X ∼ Bin(20, π). Figure 4.1 shows plots of the
probabilities for π = 0.25, 0.5, 0.7 and 0.9.
[Figure 4.1: probability functions of Bin(20, π) for π = 0.25, 0.5, 0.7 and 0.9.]
Consider this distribution in the case where n = 4 and π = 0.8. For this distribution,
calculate the expected value and variance of X.
(Note that E(X) = n π and Var(X) = n π (1 − π) for this distribution. Check that
your answer agrees with this.)
Solution
Substituting the values into the definitions we get:
E(X) = \sum_x x p(x) = 0 × 0.0016 + 1 × 0.0256 + · · · + 4 × 0.4096 = 3.2
E(X^2) = \sum_x x^2 p(x) = 0 × 0.0016 + 1 × 0.0256 + · · · + 16 × 0.4096 = 10.88
and:
Var(X) = E(X^2) − (E(X))^2 = 10.88 − (3.2)^2 = 0.64.
Note that E(X) = nπ = 4 × 0.8 = 3.2 and Var(X) = nπ(1 − π) = 4 × 0.8 × (1 − 0.8) = 0.64 for n = 4, π = 0.8, as stated by the general formulae.
Activity 4.3 A certain electronic system contains 12 components. Suppose that the
probability that each individual component will fail is 0.3 and that the components
fail independently of each other. Given that at least two of the components have
failed, what is the probability that at least three of the components have failed?
Solution
Let X denote the number of components which will fail, hence X ∼ Bin(12, 0.3).
Therefore:
P(X ≥ 3 | X ≥ 2) = \frac{P(X ≥ 3)}{P(X ≥ 2)} = \frac{1 − P(X = 0) − P(X = 1) − P(X = 2)}{1 − P(X = 0) − P(X = 1)} = \frac{1 − 0.0138 − 0.0712 − 0.1678}{1 − 0.0138 − 0.0712} = \frac{0.7472}{0.9150} = 0.8166.
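A short numeric check of this conditional probability (the helper name is ours):

from math import comb

def pmf(x, n=12, p=0.3):
    return comb(n, x) * p**x * (1 - p)**(n - x)

p_ge_2 = 1 - pmf(0) - pmf(1)
p_ge_3 = p_ge_2 - pmf(2)
print(p_ge_3 / p_ge_2)  # about 0.8166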
Activity 4.4 A greengrocer has a very large pile of oranges on his stall. The pile of
fruit is a mixture of 50% old fruit with 50% new fruit; one cannot tell which are old
and which are new. However, 20% of old oranges are mouldy inside, but only 10% of
new oranges are mouldy. Suppose that you choose 5 oranges at random. What is the
distribution of the number of mouldy oranges in your sample?
Solution
For an orange chosen at random, the event ‘mouldy’ is the union of the disjoint
events 'mouldy' ∩ 'new' and 'mouldy' ∩ 'old'. So:
P(mouldy) = P(mouldy | old) P(old) + P(mouldy | new) P(new) = 0.2 × 0.5 + 0.1 × 0.5 = 0.15.
As the pile of oranges is very large, we can assume that the results for the five
oranges will be independent, so we have 5 independent trials each with probability of
‘mouldy’ equal to 0.15. The distribution of the number of mouldy oranges will be a
binomial distribution with n = 5 and π = 0.15.
Activity 4.5 Metro trains on a particular line have a probability 0.05 of failure
between two stations. Supposing that the failures are all independent, what is the
probability that out of 10 journeys between these two stations more than 8 do not
have a breakdown?
Solution
The probability of no breakdown on one journey is π = 1 − 0.05 = 0.95, so the number of journeys without a breakdown, X, has a Bin(10, 0.95) distribution. We want P(X > 8), which is:
P(X = 9) + P(X = 10) = 10 × (0.95)^9 × 0.05 + (0.95)^{10} = 0.3151 + 0.5987 = 0.9139.
Hence find E(X) and Var(X). (The wording of the question implies that you use the
result which you have just proved. Other methods of derivation will not be accepted!)
Solution
For X ∼ Bin(n, π), P(X = x) = \binom{n}{x} π^x (1 − π)^{n−x}. So, for E(X), we have:
E(X) = \sum_{x=0}^{n} x \binom{n}{x} π^x (1 − π)^{n−x}
= \sum_{x=1}^{n} x \binom{n}{x} π^x (1 − π)^{n−x}
= \sum_{x=1}^{n} nπ \frac{(n − 1)!}{(x − 1)! [(n − 1) − (x − 1)]!} π^{x−1} (1 − π)^{n−x}
= nπ \sum_{x=1}^{n} \binom{n − 1}{x − 1} π^{x−1} (1 − π)^{n−x}
= nπ \sum_{y=0}^{n−1} \binom{n − 1}{y} π^y (1 − π)^{(n−1)−y}
= nπ × 1
= nπ
where y = x − 1, and the last summation is over all the values of the pf of another binomial distribution, this time with possible values 0, 1, . . . , n − 1 and probability parameter π.
Similarly:
E(X(X − 1)) = \sum_{x=0}^{n} x(x − 1) \binom{n}{x} π^x (1 − π)^{n−x}
= \sum_{x=2}^{n} \frac{x(x − 1) n!}{(n − x)! x!} π^x (1 − π)^{n−x}
= n(n − 1) π^2 \sum_{x=2}^{n} \frac{(n − 2)!}{(n − x)! (x − 2)!} π^{x−2} (1 − π)^{n−x}
= n(n − 1) π^2 \sum_{y=0}^{n−2} \frac{(n − 2)!}{(n − y − 2)! y!} π^y (1 − π)^{n−y−2}
= n(n − 1) π^2
where y = x − 2. Hence E(X^2) = E(X(X − 1)) + E(X) = n(n − 1)π^2 + nπ, and so:
Var(X) = E(X^2) − (E(X))^2 = n(n − 1)π^2 + nπ − n^2π^2 = nπ(1 − π).
(b) 17 animals are injected; more than 15 remain free from infection and there are 2
doubtful cases
(c) 23 animals are injected; more than 20 remain free from infection and there are three doubtful cases.
Solution
These experiments involve tests on different cattle, which one might expect to
behave independently of one another. The probability of infection without injection
with the serum might also reasonably be assumed to be the same for all cattle. So
the distribution which we need here is the binomial distribution. If the serum has no
effect, then the probability of infection for each of the cattle is 0.25.
One way to assess the evidence of the three experiments is to calculate the
probability of the result of the experiment if the serum had no effect at all. If it has
an effect, then one would expect larger numbers of cattle to remain free from
infection, so the experimental results as given do provide some clue as to whether
the serum has an effect, in spite of their incompleteness.
Let X(n) be the number of cattle infected, out of a sample of n. We are assuming
that X(n) ∼ Bin(n, 0.25).
(a) With 10 trials, the probability of 0 infected if the serum has no effect is:
P(X(10) = 0) = \binom{10}{0} (0.75)^{10} = (0.75)^{10} = 0.0563.
(b) With 17 trials, the probability of more than 15 remaining uninfected if the
serum has no effect is:
P(X(17) < 2) = P(X(17) = 0) + P(X(17) = 1)
= \binom{17}{0} (0.75)^{17} + \binom{17}{1} (0.25)^1 (0.75)^{16}
= (0.75)^{17} + 17 × (0.25)^1 × (0.75)^{16}
= 0.0075 + 0.0426
= 0.0501.
(c) With 23 trials, the probability of more than 20 remaining free from infection if
the serum has no effect is:
P(X(23) < 3) = P(X(23) = 0) + P(X(23) = 1) + P(X(23) = 2)
= \binom{23}{0} (0.75)^{23} + \binom{23}{1} (0.25)^1 (0.75)^{22} + \binom{23}{2} (0.25)^2 (0.75)^{21}
= (0.75)^{23} + 23 × 0.25 × (0.75)^{22} + \frac{23 × 22}{2} × (0.25)^2 × (0.75)^{21}
= 0.0013 + 0.0103 + 0.0376
= 0.0492.
The Poisson distribution is used for counts of occurrences of an event over time, under the following conditions.
1. The numbers of occurrences in any two disjoint intervals of time are independent of each other.
2. The probability of two or more occurrences at the same time is negligibly small.
3. The probability of one occurrence in any short time interval of length t is λt for some constant λ > 0.
Example 4.7 Examples of variables for which we might use a Poisson distribution:
Because λ is the rate per unit of time, its value also depends on the unit of time (that
is, the length of interval) we consider.
Example 4.8 If X is the number of arrivals per hour and X ∼ Poisson(1.5), then if
Y is the number of arrivals per two hours, Y ∼ Poisson(1.5 × 2) = Poisson(3).
Example 4.9 Figure 4.2 shows the probabilities p(x) for x = 0, 1, 2, . . . , 10 for
X ∼ Poisson(2) and X ∼ Poisson(4).
[Figure 4.2: probability functions p(x), x = 0, 1, 2, . . . , 10, of Poisson(2) and Poisson(4).]
Example 4.10 Suppose customers arrive at random at a rate of 1.6 per minute. Let X denote the number of arrivals in one minute, and Y the number of arrivals in five minutes. Then:
X ∼ Poisson(1.6)
and:
Y ∼ Poisson(1.6 × 5) = Poisson(8).
2. What is the probability that more than two customers arrive in a one-minute interval?
P(X > 2) = 1 − P(X ≤ 2) = 1 − [P(X = 0) + P(X = 1) + P(X = 2)] = 1 − e^{−1.6}(1 + 1.6 + 1.28) = 0.2166.
3. What is the probability that at most one customer arrives in a five-minute interval? This is:
p_Y(0) + p_Y(1) = \frac{e^{−8} 8^0}{0!} + \frac{e^{−8} 8^1}{1!} = e^{−8} + 8e^{−8} = 9e^{−8} = 0.0030.
Activity 4.8 Cars independently pass a point on a busy road at an average rate of
150 per hour.
(a) Assuming a Poisson distribution, find the probability that none passes in a
given minute.
(c) Find the probability that the expected number actually passes in a given
two-minute period.
Solution
(a) A rate of 150 cars per hour is a rate of 2.5 per minute. Using a Poisson distribution with λ = 2.5, P(none passes) = e^{−2.5} × (2.5)^0/0! = e^{−2.5} = 0.0821.
(c) In a two-minute period the expected number is 5, and the probability of 5 cars passing in two minutes is e^{−5} × 5^5/5! = 0.1755.
Activity 4.9 People entering an art gallery are counted by the attendant at the
door. Assume that people arrive in accordance with a Poisson distribution, with one
person arriving every 2 minutes. The attendant leaves the door unattended for 5
minutes.
(b) Find, to the nearest second, the length of time for which the attendant could
leave the door unattended for there to be a probability of 0.9 of no arrivals in
that time.
Solution
(b) For an interval of N minutes, the parameter is N/2. We need p(0) = 0.9, so
e−N/2 = 0.9 giving N/2 = − ln(0.9) and N = 0.21 minutes, or 13 seconds.
(c) The rate is unlikely to be constant: there will be more people at lunchtimes or early evenings etc. There are also likely to be several arrivals within a small period – couples, groups etc. So it is quite unlikely that the Poisson distribution will provide a good model.
Activity 4.10 In a large industrial plant there is an accident on average every two
days.
(a) What is the chance that there will be exactly two accidents in a given week?
(b) What is the chance that there will be two or more accidents in a given week?
(c) If James goes to work there for a four-week period, what is the probability that
no accidents occur while he is there?
Solution
Here we have counts of random events over time, which is a typical application for
the Poisson distribution. We are assuming that accidents are equally likely to occur
at any time and are independent. The mean for the Poisson distribution is 0.5 per
day.
Let X be the number of accidents in a week. The probability of exactly two
accidents in a given week is found by using the parameter λ = 5 × 0.5 = 2.5 (5
working days a week assumed).
(a) p(2) = \frac{e^{−2.5} (2.5)^2}{2!} = 0.2565.
(b) P(X ≥ 2) = 1 − P(X = 0) − P(X = 1) = 1 − e^{−2.5} − 2.5e^{−2.5} = 1 − 3.5 × e^{−2.5} = 0.7127.
(c) If James goes to the industrial plant and does not change the probability of an
accident simply by being there (he might bring bad luck, or be superbly
safety-conscious!), then over 4 weeks there are 20 working days, and the
probability of no accident comes from a Poisson random variable with mean 10.
If Y is the number of accidents while James is there, the probability of no
accidents is:
p_Y(0) = \frac{e^{−10} (10)^0}{0!} = 0.0000454.
James is very likely to be there when there is an accident!
Activity 4.11 Customers arrive at random at a rate of 84 per hour.
(a) Find:
i. the probability of exactly seven arrivals in a period of two minutes
ii. the probability of more than three arrivals in 45 seconds
iii. the probability that the time to arrival of the next customer is less than
one minute.
(b) If T is the time to arrival of the next customer (in minutes), calculate:
P (T > 2.3 | T > 1).
Solution
(a) The rate is given as 84 per hour, but it is convenient to work in numbers of
minutes, so note that this is the same as λ = 1.4 arrivals per minute.
i. For two minutes, use λ = 1.4 × 2 = 2.8. Hence:
P(X = 7) = \frac{e^{−2.8} (2.8)^7}{7!} = 0.0163.
ii. For 45 seconds, λ = 1.4 × 0.75 = 1.05. Hence:
P(X > 3) = 1 − P(X ≤ 3) = 1 − \sum_{x=0}^{3} \frac{e^{−1.05} (1.05)^x}{x!} = 0.0222.
iii. The probability that the time to arrival of the next customer is less than
one minute is 1 − P (no arrivals in one minute) = 1 − P (X = 0). For one
minute we use λ = 1.4, hence:
1 − P(X = 0) = 1 − \frac{e^{−1.4} (1.4)^0}{0!} = 1 − e^{−1.4} = 1 − 0.2466 = 0.7534.
(b) The time to the next customer is more than t if there are no arrivals in the interval from 0 to t, which means that we need to use λ = 1.4 × t. Now the conditional probability formula yields:
P(T > 2.3 | T > 1) = \frac{P(\{T > 2.3\} ∩ \{T > 1\})}{P(T > 1)}
and, as in other instances, the two events collapse to a single event, {T > 2.3}. To calculate the numerator, use λ = 1.4 × 2.3 = 3.22, hence (by the same method as in iii.):
P(T > 2.3) = \frac{e^{−3.22} (3.22)^0}{0!} = e^{−3.22} = 0.0400.
Hence:
P(T > 2.3 | T > 1) = \frac{P(T > 2.3)}{P(T > 1)} = \frac{0.0400}{0.2466} = 0.1620.
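All four of these Poisson calculations are easily verified; a minimal Python sketch:

from math import exp, factorial

def pois_pmf(x, lam):
    return exp(-lam) * lam**x / factorial(x)

print(pois_pmf(7, 2.8))                              # i.   about 0.0163
print(1 - sum(pois_pmf(x, 1.05) for x in range(4)))  # ii.  about 0.0222
print(1 - pois_pmf(0, 1.4))                          # iii. about 0.7534
print(pois_pmf(0, 3.22) / pois_pmf(0, 1.4))          # (b)  about 0.1620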
Activity 4.12 A glacier in Greenland ‘calves’ (lets fall off into the sea) an iceberg
on average twice every five weeks. (Seasonal effects can be ignored for this question,
and so the calving process can be thought of as random, i.e. the calving of icebergs
can be assumed to be independent events.)
(a) Explain which distribution you would use to estimate the probabilities of
different numbers of icebergs being calved in different periods, justifying your
selection.
(b) What is the probability that no iceberg is calved in the next three weeks?
(c) What is the probability that no iceberg is calved in the three weeks after the
next three weeks?
(d) What is the probability that exactly five icebergs are calved in the next four
weeks?
(e) If exactly five icebergs are calved in the next four weeks, what is the probability
that exactly five more icebergs will be calved in the four-week period after the
next four weeks?
(f) Comment on the relationship between your answers to (d) and (e).
Solution
(a) If we assume that the calving process is random (as the remark about
seasonality hints) then we are counting events over periods of time (with, in
particular, no obvious upper maximum), and hence the appropriate distribution
is the Poisson distribution.
(b) The rate parameter for one week is 0.4, so for three weeks we use λ = 1.2, hence:
P(X = 0) = \frac{e^{−1.2} × (1.2)^0}{0!} = e^{−1.2} = 0.3012.
0!
(c) If it is correct to use the Poisson distribution then the numbers calved in disjoint periods are independent, so the answer is the same as in (b): e^{−1.2} = 0.3012.
(d) The rate parameter for four weeks is λ = 0.4 × 4 = 1.6, hence:
P(X = 5) = \frac{e^{−1.6} × (1.6)^5}{5!} = 0.0176.
(e) We require:
P(5 in weeks 5 to 8 | 5 in weeks 1 to 4) = \frac{P(5 in weeks 5 to 8 ∩ 5 in weeks 1 to 4)}{P(5 in weeks 1 to 4)}
and, by independence of disjoint periods, this is simply P(5 in weeks 5 to 8) = 0.0176.
(f) The fact that the results are identical in the two cases is a consequence of the
independence built into the assumption that the Poisson distribution is the
appropriate one to use. A Poisson process does not ‘remember’ what happened
before the start of a period under consideration.
Activity 4.13 Let X have the Poisson probability function:
p(x) = \frac{e^{−λ} λ^x}{x!} for x = 0, 1, 2, . . . , and p(x) = 0 otherwise
where λ > 0 is a parameter. Show that E(X) = λ by determining \sum_x x p(x).
Solution
We have:
E(X) = \sum_{x=0}^{∞} x p(x) = \sum_{x=0}^{∞} x \frac{e^{−λ} λ^x}{x!} = \sum_{x=1}^{∞} x \frac{e^{−λ} λ^x}{x!}
= λ \sum_{x=1}^{∞} \frac{e^{−λ} λ^{x−1}}{(x − 1)!}
= λ \sum_{y=0}^{∞} \frac{e^{−λ} λ^y}{y!}
= λ × 1
= λ
where we replace x − 1 with y. The result follows from the fact that \sum_{y=0}^{∞} e^{−λ} λ^y / y! is the sum of all non-zero values of a probability function of this form.
For completeness, we also give here a derivation of the variance of this distribution. Consider first:
E[X(X − 1)] = \sum_{x=0}^{∞} x(x − 1) p(x) = \sum_{x=2}^{∞} x(x − 1) \frac{e^{−λ} λ^x}{x!}
= λ^2 \sum_{x=2}^{∞} \frac{e^{−λ} λ^{x−2}}{(x − 2)!}
= λ^2 \sum_{y=0}^{∞} \frac{e^{−λ} λ^y}{y!}
= λ^2
where y = x − 2. Also:
E[X(X − 1)] = E(X^2 − X) = \sum_x (x^2 − x) p(x) = \sum_x x^2 p(x) − \sum_x x p(x) = E(X^2) − E(X) = E(X^2) − λ.
Hence E(X^2) = λ^2 + λ, and so:
Var(X) = E(X^2) − (E(X))^2 = λ^2 + λ − λ^2 = λ.
Activity 4.14 The probability that James, an angler, catches no fish on a given day is π. Let X denote the number of fish he catches, assumed to follow a Poisson distribution.
(a) Using a Poisson distribution, find the mean number of fish he catches.
(b) Show that the probability that he takes home the last fish he catches is
(1 − π 2 )/2.
Solution
(a) Let X denote the number of fish caught, such that X ∼ Poisson(λ). Then P(X = x) = e^{−λ} λ^x / x!, where the parameter λ is as yet unknown, so P(X = 0) = e^{−λ} λ^0 / 0! = e^{−λ}.
However, we know P(X = 0) = π. So e^{−λ} = π, giving −λ = \ln π and hence the mean is λ = \ln(1/π).
(b) James will take home the last fish caught if he catches 1, 3, 5, 7, . . . fish. So we require:
P(X = 1) + P(X = 3) + P(X = 5) + · · · = e^{−λ} \left(λ + \frac{λ^3}{3!} + \frac{λ^5}{5!} + · · ·\right).
Now we know:
e^{λ} = 1 + λ + \frac{λ^2}{2!} + \frac{λ^3}{3!} + · · ·
and:
e^{−λ} = 1 − λ + \frac{λ^2}{2!} − \frac{λ^3}{3!} + · · · .
Subtracting gives:
e^{λ} − e^{−λ} = 2\left(λ + \frac{λ^3}{3!} + \frac{λ^5}{5!} + · · ·\right).
Hence the required probability is:
e^{−λ} \frac{e^{λ} − e^{−λ}}{2} = \frac{1 − e^{−2λ}}{2} = \frac{1 − π^2}{2}
since e^{−λ} = π.
When n is large and π is small, the distribution of X ∼ Bin(n, π) is well approximated by a Poisson(nπ) distribution. This is sometimes called the 'law of small numbers'.
Example 4.11 A classic example (from Bortkiewicz (1898) Das Gesetz der kleinen
Zahlen) helps to remember the key elements of the ‘law of small numbers’.
Figure 4.3 shows the numbers of soldiers killed by horsekick in each of 14 army corps
of the Prussian army in each of the years spanning 1875–94.
Suppose that the number of men killed by horsekicks in one corps in one year is X ∼ Bin(n, π), where:

• n is the number of soldiers in the corps (large)

• π is the probability that any individual soldier is killed by horsekick in a year (small).

The sample mean of the counts is x̄ = 0.7, which we use as λ for the Poisson distribution. X ∼ Poisson(0.7) is indeed a good fit to the data, as shown in Figure 4.4.
Figure 4.3: Numbers of soldiers killed by horsekick in each of 14 army corps of the Prussian
army in each of the years spanning 1875–94. Source: Bortkiewicz (1898) Das Gesetz der
kleinen Zahlen, Leipzig: Teubner.
Figure 4.4: Sample proportions of men killed (0 to 6) and the corresponding Poisson(0.7) probabilities.
Example 4.12 An airline is selling tickets to a flight with 198 seats. It knows that,
on average, about 1% of customers who have bought tickets fail to arrive for the
flight. Because of this, the airline overbooks the flight by selling 200 tickets. What is
the probability that everyone who arrives for the flight will get a seat?
Let X denote the number of people who fail to turn up. Using the binomial distribution, X ∼ Bin(200, 0.01). Everyone who arrives will get a seat provided at least two ticket-holders fail to turn up, so we require:

P(X ≥ 2) = 1 − P(X = 0) − P(X = 1) = 1 − (0.99)^200 − 200 × 0.01 × (0.99)^199 ≈ 0.5954.
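As a numerical check (a Python sketch, scipy assumed):

from scipy.stats import binom

# P(X >= 2) for X ~ Bin(200, 0.01): at least two no-shows
print(1 - binom.cdf(1, 200, 0.01))    # approx. 0.595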
Activity 4.15 The chance that a lottery ticket has a winning number is 0.0000001.
(a) If 10,000,000 people buy tickets which are independently numbered, what is the
probability there is no winner?
Solution
The number of winning tickets, X, will be distributed as:
X ∼ Bin(10000000, 0.0000001).
Since n is large and π is small, the Poisson distribution should provide a good approximation. The Poisson parameter is:

λ = n π = 10000000 × 0.0000001 = 1

so P(X = 0) ≈ e^{−1} = 0.3679. For comparison, the exact binomial probabilities for 0, 1 and 2 winners are:

p(0) = C(10^7, 0) × (10^{−7})^0 × (1 − 10^{−7})^{10^7} = 0.3679

p(1) = C(10^7, 1) × (10^{−7})^1 × (1 − 10^{−7})^{10^7 − 1} = 0.3679

and:

p(2) = C(10^7, 2) × (10^{−7})^2 × (1 − 10^{−7})^{10^7 − 2} = 0.1839.

Notice that, in this case, the Poisson approximation is correct to at least 4 decimal places.
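The closeness of the approximation can be confirmed directly (a Python sketch, scipy assumed):

from scipy.stats import binom, poisson

n, p = 10**7, 10**-7
for k in range(3):
    # exact binomial probability next to its Poisson(1) approximation
    print(k, binom.pmf(k, n, p), poisson.pmf(k, n * p))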
Geometric(π) distribution.
• Distribution of the number of failures in Bernoulli trials before the first success.
• π is the probability of success at each trial.
• The sample space is 0, 1, 2, . . ..
• See the basketball example in Chapter 3.
We now consider the following common continuous distributions:

• Uniform distribution.

• Exponential distribution.

• Normal distribution.
A random variable X follows a (continuous) uniform distribution on [a, b], denoted X ∼ Uniform[a, b], if its pdf is f(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise. The pdf is ‘flat’, as shown in Figure 4.5 (along with the cdf). Clearly, f(x) ≥ 0 for all x, and:

∫_{−∞}^{∞} f(x) dx = ∫_a^b 1/(b − a) dx = (1/(b − a)) [x]_a^b = (1/(b − a)) (b − a) = 1.
The cdf is:

F(x) = P(X ≤ x) = ∫_a^x f(t) dt, which equals 0 for x < a, (x − a)/(b − a) for a ≤ x ≤ b, and 1 for x > b.
Figure 4.5: Continuous uniform distribution pdf (left) and cdf (right).
Activity 4.16 Suppose that X ∼ Uniform[0, 1]. Compute P (X > 0.2), P (X ≥ 0.2)
and P (X 2 > 0.04).
Solution
We have a = 0 and b = 1, and can use the formula for P(c < X ≤ d), for constants c and d. Hence:

P(X > 0.2) = P(0.2 < X ≤ 1) = (1 − 0.2)/(1 − 0) = 0.8.
Also, since X is a continuous random variable, P(X = 0.2) = 0 and hence:

P(X ≥ 0.2) = P(X > 0.2) = 0.8.

Finally, since X only takes values in [0, 1], X² > 0.04 if and only if X > 0.2, so:

P(X² > 0.04) = P(X > 0.2) = 0.8.
Activity 4.17 A newsagent, James, has n newspapers to sell and makes £1.00
profit on each sale. Suppose the number of customers of these newspapers is a
random variable with a distribution which can be approximated by:
f(x) = 1/200 for 0 < x < 200, and 0 otherwise.
If James does not have enough newspapers to sell to all customers, he figures he
loses £5.00 in goodwill from each unhappy (non-served) customer. However, if he has
surplus newspapers (which only have commercial value on the day of print), he loses
£0.50 on each unsold newspaper. What should n be (to the nearest integer) to
maximise profit?
Hint: If X ≤ n, James’ profit (in £) is X − 0.5(n − X). If X > n, James’ profit is
n − 5(X − n). Find the expected value of profit as a function of n, and then select n
to maximise this function. (There is no need to verify it is a maximum.)
Solution
We have:

E(profit) = ∫_0^n (x − 0.5(n − x)) (1/200) dx + ∫_n^{200} (n − 5(x − n)) (1/200) dx

= (1/200) [x²/2 + (n − x)²/4]_0^n + (1/200) [6nx − 5x²/2]_n^{200}

= (1/200) (−3.25n² + 1200n − 100000).
Differentiating with respect to n, we have:

dE(profit)/dn = (1/200) (−6.5n + 1200).

Equating to zero and solving, we have:

n = 1200/6.5 ≈ 185.
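A brute-force check of this optimisation, evaluating the expected-profit function over all integer stock levels (a Python sketch with numpy):

import numpy as np

def expected_profit(n):
    # E(profit) = (1/200)(-3.25n^2 + 1200n - 100000), from the working above
    return (-3.25 * n**2 + 1200 * n - 100000) / 200

n_values = np.arange(0, 201)
print(n_values[np.argmax(expected_profit(n_values))])   # 185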
The exponential distribution with parameter λ > 0, denoted Exp(λ), has pdf f(x) = λ e^{−λx} for x ≥ 0, and 0 otherwise. It was shown in the previous chapter that this satisfies the conditions for a pdf (see Example 3.21). The general shape of the pdf is that of ‘exponential decay’, as shown in Figure 4.6 (hence the name).
Figure 4.6: Pdf of the exponential distribution, showing ‘exponential decay’ over x from 0 to 5.
Note that the median is always smaller than the mean, because the distribution is
skewed to the right.
The exponential is, among other things, a basic distribution of waiting times of
various kinds. This arises from a connection between the Poisson distribution – the
simplest distribution for counts – and the exponential.
If the number of events per unit of time has a Poisson distribution with parameter
λ, the time interval (measured in the same units of time) between two successive
events has an exponential distribution with the same parameter λ.
E(X) = λ for Poisson(λ), i.e. a large λ means many events per unit of time, on
average.
E(X) = 1/λ for Exponential(λ), i.e. a large λ means short waiting times between
successive events, on average.
Suppose, for example, that customers arrive at a rate of λ = 1.6 per minute, so that the waiting time between arrivals is Exp(1.6). The expected waiting time between arrivals of customers is then E(X) = 1/1.6 = 0.625 (minutes) and the median is calculated to be (log 2) × 0.625 = 0.433.
We can also calculate probabilities of waiting times between arrivals, using the
cumulative distribution function:
F(x) = 0 for x ≤ 0, and F(x) = 1 − e^{−1.6x} for x > 0.
For example:
Activity 4.18 Suppose that the service time for a customer at a fast food outlet
has an exponential distribution with parameter 1/3 (customers per minute). What is
the probability that a customer waits more than 4 minutes?
Solution
The distribution of X is Exp(1/3), so the probability is:

P(X > 4) = 1 − F(4) = e^{−4/3} = 0.2636.
(a) Is it reasonable to assume that such crashes are Poisson events? Briefly explain.
(b) What is the probability that two or more crashes will occur next year?
(c) What is the probability that the next two crashes will occur within six months
of one another?
Solution
(a) Yes, because the Poisson assumptions are probably satisfied – crashes are
independent events and the crash rate is likely to remain constant.
(c) Let Y = interval (in years) between the next two crashes. Therefore, we have
Y ∼ Exp(2.5). So:
Z 0.5
P (Y < 0.5) = 2.5e−2.5y dy = F (0.5) − F (0)
0
= (1 − e−2.5(0.5) ) − (1 − e−2.5(0) )
= 1 − e−1.25
= 0.7135.
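Parts (b) and (c) can be checked numerically (a Python sketch, scipy assumed; the crash rate of 2.5 per year is taken from part (c) above):

from scipy.stats import expon, poisson

rate = 2.5
print(1 - poisson.cdf(1, rate))       # (b): P(2 or more crashes next year), approx. 0.713
print(expon.cdf(0.5, scale=1/rate))   # (c): P(Y < 0.5); scipy parameterises by scale = 1/lambda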
Activity 4.20 Let the random variable X have the following pdf:

f(x) = e^{−x} for x ≥ 0, and 0 otherwise.

Find the interquartile range of X.
Solution
Note that X ∼ Exp(1). For x > 0, we have:

F(x) = ∫_0^x f(t) dt = ∫_0^x e^{−t} dt = [−e^{−t}]_0^x = 1 − e^{−x}

hence F(x) = 1 − e^{−x} for x > 0, and 0 otherwise.

Denoting the first and third quartiles by Q1 and Q3, respectively, we have:

F(Q1) = 1 − e^{−Q1} = 0.25 and F(Q3) = 1 − e^{−Q3} = 0.75.

Therefore:

Q1 = −ln(0.75) = 0.2877 and Q3 = −ln(0.25) = 1.3863

and so:

IQR = Q3 − Q1 = 1.3863 − 0.2877 = 1.0986.
(b) Comment briefly on the implications of the age-specific failure rate you have
derived in the context of the exponentially-distributed component life-spans.
Solution

(a) The age-specific failure rate is φ(y) = f(y)/S(y), where S(y) = 1 − F(y) is the survival function. For Y ∼ Exp(λ):

φ(y) = f(y)/S(y) = λ e^{−λy} / e^{−λy} = λ.
(b) The age-specific failure rate is constant, indicating it does not vary with age.
This is unlikely to be true in practice!
Many variables have distributions which are approximately normal, for example
heights of humans or animals, and weights of various products.
The normal distribution has extremely convenient mathematical properties, which
make it a useful default choice of distribution in many contexts.
Even when a variable is not itself even approximately normally distributed,
functions of several observations of the variable (‘sampling distributions’) are often
approximately normal, due to the central limit theorem. Because of this, the
normal distribution has a crucial role in statistical inference. This will be discussed
later in the course.
f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)) for −∞ < x < ∞
where π is the mathematical constant (i.e. π = 3.14159 . . .), and µ and σ 2 are
parameters, with −∞ < µ < ∞ and σ 2 > 0.
A random variable X with this pdf is said to have a normal distribution with mean
µ and variance σ 2 , denoted X ∼ N (µ, σ 2 ).
Clearly, f(x) ≥ 0 for all x. Also, it can be shown that ∫_{−∞}^{∞} f(x) dx = 1 (do not attempt to show this), so f(x) really is a pdf.
If X ∼ N (µ, σ 2 ), then:
E(X) = µ
and:
Var(X) = σ 2
and, therefore, the standard deviation is sd(X) = σ.
The mean can also be inferred from the observation that the normal pdf is symmetric
about µ. This also implies that the median of the normal distribution is µ.
The normal density is the so-called ‘bell curve’. The two parameters affect it as follows.
N (0, 1) and N (5, 1) have the same dispersion but different location: the
N (5, 1) curve is identical to the N (0, 1) curve, but shifted 5 units to the right
N (0, 1) and N (0, 9) have the same location but different dispersion: the
N (0, 9) curve is centered at the same value, 0, as the N (0, 1) curve, but spread
out more widely.
Pdfs of the N(0, 1), N(5, 1) and N(0, 9) distributions, plotted for x from −5 to 10.
We now consider one of the convenient properties of the normal distribution. Suppose
X is a random variable, and we consider the linear transformation Y = aX + b, where a
and b are constants.
Whatever the distribution of X, it is true that E(Y ) = a E(X) + b and also that
Var(Y ) = a2 Var(X).
Furthermore, if X is normally distributed, then so is Y . In other words, if
X ∼ N (µ, σ 2 ), then:
Y = aX + b ∼ N (aµ + b, a2 σ 2 ). (4.7)
This type of result is not true in general. For other families of distributions, the
distribution of Y = aX + b is not always in the same family as X.
Let us apply (4.7) with a = 1/σ and b = −µ/σ, to get:

Z = (1/σ) X − µ/σ = (X − µ)/σ ∼ N(µ/σ − µ/σ, σ²/σ²) = N(0, 1).
The transformed variable Z = (X − µ)/σ is known as a standardised variable or a
z-score.
The distribution of the z-score is N (0, 1), i.e. the normal distribution with mean µ = 0
and variance σ 2 = 1 (and, therefore, a standard deviation of σ = 1). This is known as
the standard normal distribution. Its density function is:
f(x) = (1/√(2π)) exp(−x²/2) for −∞ < x < ∞.
Statistical tables of the normal distribution appear to have two limitations:

1. It is only for N(0, 1), not for N(µ, σ²) for any other µ and σ².

2. Probabilities are tabulated only for values z ≥ 0.

We next show how these are not really limitations, starting with ‘2.’.
The key to using the tables is that the standard normal distribution is symmetric about
0. This means that for an interval in one tail, its ‘mirror image’ in the other tail has the
same probability. Another way to justify these results is that if Z ∼ N (0, 1), then also
−Z ∼ N (0, 1). See ST104a Statistics 1 for a discussion of how to use Table 4 of the
New Cambridge Statistical Tables.
which can be calculated using Table 4 of the New Cambridge Statistical Tables. (Note
that this also covers the cases of the one-sided inequalities P (X ≤ b), with a = −∞,
and P (X > a), with b = ∞.)
Example 4.15 Let X denote the diastolic blood pressure of a randomly selected
person in England. This is approximately distributed as X ∼ N (74.2, 127.87).
Suppose we want to know the probabilities of the intervals X > 90, X < 60 and 60 ≤ X ≤ 90. The standard deviation is σ = √127.87 = 11.31. First:

P(X > 90) = P((X − 74.2)/11.31 > (90 − 74.2)/11.31) = P(Z > 1.40) ≈ 1 − 0.9190 = 0.0810

and:

P(X < 60) = P((X − 74.2)/11.31 < (60 − 74.2)/11.31)
= P(Z < −1.26)
= P(Z > 1.26)
= 1 − Φ(1.26)
= 1 − 0.8962
= 0.1038.
Finally:
P (60 ≤ X ≤ 90) = P (X ≤ 90) − P (X < 60) = 0.8152.
These probabilities are shown in Figure 4.9.
Activity 4.22 Suppose that the distribution of men’s heights in London, measured
in cm, is N (175, 62 ). Find the proportion of men whose height is:
(a) under 169 cm

(b) over 190 cm

(c) between 169 cm and 190 cm.
Figure 4.9: Probabilities for diastolic blood pressure: Low (X < 60): 0.10, Mid (60 ≤ X ≤ 90): 0.82, High (X > 90): 0.08.
Solution
The values of interest are 169 and 190. The corresponding z-values are:
z1 = (169 − 175)/6 = −1 and z2 = (190 − 175)/6 = 2.5.
Using values from statistical tables, we have:

P(X < 169) = P(Z < −1) = 1 − Φ(1) = 1 − 0.8413 = 0.1587

also:
P (X > 190) = P (Z > 2.5) = 1 − Φ(2.5) = 1 − 0.9938 = 0.0062
and:
P (169 < X < 190) = P (−1 < Z < 2.5) = Φ(2.5)−Φ(−1) = 0.9938−0.1587 = 0.8351.
Solution
Suppose X ∼ N (µ, σ 2 ) is the random variable for throws. P (X > 43) = 0.15 leads
to µ = 43 − 1.035 × σ (using statistical tables).
Similarly, P(X > 45) = 0.03 leads to µ = 45 − 1.88 × σ. Solving yields µ = 40.55 and σ = 2.37.
Activity 4.24 The life, in hours, of a light bulb is normally distributed with a mean
of 175 hours. If a consumer requires at least 95% of the light bulbs to have lives
exceeding 150 hours, what is the largest value that the standard deviation can have?
Solution
Let X be the random variable representing the lifetime of a light bulb (in hours), so
that for some value σ we have X ∼ N (175, σ 2 ). We want P (X > 150) = 0.95, such
that:
P(X > 150) = P(Z > (150 − 175)/σ) = P(Z > −25/σ) = 0.95.

Note that this is the same as P(Z > 25/σ) = 1 − 0.95 = 0.05, so 25/σ = 1.645, giving σ = 15.20.
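The same calculation can be done via the inverse cdf of the standard normal distribution (a Python sketch, scipy assumed):

from scipy.stats import norm

# we need -25/sigma to be the 5th percentile of N(0, 1)
sigma = -25 / norm.ppf(0.05)
print(sigma)                       # approx. 15.2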
Activity 4.25 Two statisticians disagree about the distribution of IQ scores for a
population under study. Both agree that the distribution is normal, and that σ = 15,
but A says that 5% of the population have IQ scores greater than 134.6735, whereas
B says that 10% of the population have IQ scores greater than 109.224. What is the
difference between the mean IQ score as assessed by A and that as assessed by B?
Solution
The standardised z-value giving 5% in the upper tail is 1.6449, and for 10% it is
1.2816. So, converting to the scale for IQ scores, the values are:
µA + 24.6735 = 134.6735
so:
µA = 110
whereas:
µB + 19.224 = 109.224
so µB = 90. The difference µA − µB = 110 − 90 = 20.
Figure 4.10: Some probabilities around the mean for the normal distribution.
since P(4 < X ≤ 4.5) = 0 and P(4.5 < X < 5) = 0 due to the ‘gaps’ in the probability mass for this distribution. In contrast, if Y ∼ N(16, 9.6), then P(4 < Y < 5) > 0, since P(4 < Y < 4.5) > 0 and P(4.5 < Y < 5) > 0 because this is a continuous distribution.
The accepted way to circumvent this problem is to use a continuity correction which
corrects for the effects of the transition from a discrete Bin(n, π) distribution to a
continuous N (n π, n π (1 − π)) distribution.
Continuity correction: if X takes integer values and is approximated by a continuous normal random variable Y, we use P(X ≤ x) ≈ P(Y ≤ x + 0.5) and P(X ≥ x) ≈ P(Y ≥ x − 0.5).
Example 4.16 In the UK general election in May 2010, the Conservative Party
received 36.1% of the votes. We carry out an opinion poll in November 2014, where
we survey 1,000 people who say they voted in 2010, and ask who they would vote for
if a general election was held now. Let X denote the number of people who say they
would now vote for the Conservative Party.
Suppose we assume that X ∼ Bin(1000, 0.361). Then E(X) = n π = 361 and Var(X) = n π(1 − π) = 230.68, so X is approximated by Y ∼ N(361, 230.68). Using a continuity correction:

P(X ≥ 400) ≈ P(Y ≥ 399.5)
= P((Y − 361)/√230.68 ≥ (399.5 − 361)/√230.68)
= P(Z ≥ 2.53)
= 1 − Φ(2.53)
= 0.0057.
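The quality of the normal approximation can be checked against the exact binomial tail (a Python sketch, scipy assumed):

import numpy as np
from scipy.stats import binom, norm

n, p = 1000, 0.361
mu, sd = n * p, np.sqrt(n * p * (1 - p))
print(1 - binom.cdf(399, n, p))          # exact P(X >= 400)
print(1 - norm.cdf((399.5 - mu) / sd))   # normal approximation with continuity correction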
3. Suppose that 300 respondents in the actual survey say they would vote for the
Conservative Party now. What do you conclude from this?
From the answer to Question 2, we know that P (X ≤ 300) < 0.01, if π = 0.361.
In other words, if the Conservatives’ support remains 36.1%, we would be very
unlikely to get a random sample where only 300 (or fewer) respondents would
say they would vote for the Conservative Party.
Now X = 300 is actually observed. We can then conclude one of two things (if
we exclude other possibilities, such as a biased sample or lying by the
respondents).
(a) The Conservatives’ true level of support is still 36.1% (or even higher), but
by chance we ended up with an unusual sample with only 300 of their
supporters.
(b) The Conservatives’ true level of support is currently less than 36.1% (in
which case getting 300 in the sample would be more probable).
Here (b) seems a more plausible conclusion than (a). This kind of reasoning is
the basis of statistical significance tests.
Activity 4.26 James enjoys playing Solitaire on his laptop. One day, he plays the
game repeatedly. He has found, from experience, that the probability of success in
any game is 1/3 and is independent of the outcomes of other games.
(a) What is the probability that his first success occurs in the fourth game he
plays? What is the expected number of games he needs to play to achieve his
first success?
(b) What is the probability of three successes in ten games? What is the expected
number of successes in ten games?
(c) Use a suitable approximation to find the probability of less than 25 successes in
100 games. You should justify the use of the approximation.
(d) What is the probability that his third success occurs in the tenth game he plays?
Solution
(a) P(first success in 4th game) = (2/3)³ × (1/3) = 8/81 ≈ 0.1. This is a geometric distribution, for which E(X) = 1/π = 1/(1/3) = 3.
(b) Use X ∼ Bin(10, 1/3), such that E(X) = 10 × 1/3 = 10/3 ≈ 3.33, and:

P(X = 3) = C(10, 3) (1/3)³ (2/3)⁷ ≈ 0.2601.

(c) Use the normal approximation X ≈ Y ∼ N(n π, n π(1 − π)) = N(33.33, 22.22), which is justified since n = 100 is large and π is not extreme (n π and n(1 − π) both exceed 5). With a continuity correction:

P(X < 25) = P(X ≤ 24) ≈ P(Y ≤ 24.5) = P(Z ≤ (24.5 − 33.33)/√22.22) = P(Z ≤ −1.87) = 1 − Φ(1.87) = 0.0307.
(d) This is a negative binomial distribution (used for the trial number of the kth success) with a pf given by:

p(x) = C(x − 1, k − 1) π^k (1 − π)^{x−k} for x = k, k + 1, . . . .

With k = 3, π = 1/3 and x = 10:

p(10) = C(9, 2) (1/3)³ (2/3)⁷ ≈ 0.0780.
Activity 4.27 You may assume that 15% of individuals in a large population are
left-handed.
(a) If a random sample of 40 individuals is taken, find the probability that exactly 6
are left-handed.
(b) If a random sample of 400 individuals is taken, find the probability that exactly
60 are left-handed by using a suitable approximation. Briefly discuss the
appropriateness of the approximation.
(c) What is the smallest possible size of a randomly chosen sample if we wish to be
99% sure of finding at least one left-handed individual in the sample?
Solution
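A numerical sketch of the three parts in Python (scipy assumed): (a) is an exact binomial probability, (b) uses a normal approximation with a continuity correction, and (c) solves 1 − (0.85)^n ≥ 0.99 for the smallest integer n.

import numpy as np
from scipy.stats import binom, norm

# (a) X ~ Bin(40, 0.15)
print(binom.pmf(6, 40, 0.15))

# (b) X ~ Bin(400, 0.15), approximated by N(60, 51); P(X = 60) ~ P(59.5 < Y < 60.5)
mu, sd = 400 * 0.15, np.sqrt(400 * 0.15 * 0.85)
print(norm.cdf((60.5 - mu) / sd) - norm.cdf((59.5 - mu) / sd))

# (c) smallest n with 1 - 0.85**n >= 0.99
print(int(np.ceil(np.log(0.01) / np.log(0.85))))   # 29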
Activity 4.28 For the binomial distribution with a probability of success of 0.25 in
an individual trial, calculate the probability that, in 50 trials, there are at least 8
successes:

(a) without using a continuity correction

(b) using a continuity correction.

Compare these results with the exact probability of 0.9547 and comment.

Solution

We seek P(X ≥ 8) using the normal approximation Y ∼ N(12.5, 9.375).

(a) Without a continuity correction:

P(X ≥ 8) ≈ P(Y ≥ 8) = P(Z ≥ (8 − 12.5)/√9.375) = P(Z ≥ −1.47) = Φ(1.47) = 0.9292.

(b) With a continuity correction:

P(X ≥ 8) ≈ P(Y ≥ 7.5) = P(Z ≥ (7.5 − 12.5)/√9.375) = P(Z ≥ −1.63) = Φ(1.63) = 0.9484.

Compared to 0.9547, using the continuity correction yields the closer approximation.
Activity 4.29 We have found that the Poisson distribution can be used to
approximate a binomial distribution, and a normal distribution can be used to
approximate a binomial distribution. It should not be surprising that a normal
distribution can be used to approximate a Poisson distribution. It can be shown that
the approximation is suitable for large values of the Poisson parameter λ, and should
be adequate for practical purposes when λ ≥ 10.
(a) Which normal distribution would you use to approximate a Poisson(λ) distribution?

(b) Use this approach to estimate P(X > 12) for a Poisson random variable with
λ = 15. Use a continuity correction.
Note: The exact value of this probability, from the Poisson distribution, is
0.7323890.
Solution
(a) The Poisson distribution with parameter λ has its expectation and variance
both equal to λ, so we should take µ = λ and σ 2 = λ in a normal approximation,
i.e. use a N (λ, λ) distribution as the approximating distribution.
(b) P (X > 12) ≈ P (Y > 12.5) using a continuity correction, where Y ∼ N (15, 15).
This is:

P(Y > 12.5) = P((Y − 15)/√15 > (12.5 − 15)/√15) = P(Z > −0.65) = 0.7422.
The probability of being aged at least x0 + 1, given being aged at least x0 , is:
p = P (X > x0 + 1 | X > x0 ).
Calculate p.
Chapter 5
Multivariate random variables
5.3 Introduction
So far, we have considered univariate situations, that is, one random variable at a time. Now we will consider multivariate situations, that is, two or more random variables considered together.
In particular, we consider two somewhat different types of multivariate situations.
X = (X1, X2, . . . , Xn)′
x = (x1, x2, . . . , xn)′
p(x1 , x2 , . . . , xn ) = P (X1 = x1 , X2 = x2 , . . . , Xn = xn )
for all vectors (x1 , x2 , . . . , xn ) of n real numbers. The value p(x1 , x2 , . . . , xn ) of the joint
probability function is itself a single number, not a vector.
In the bivariate case, this is:
p(x, y) = P (X = x, Y = y)
which we sometimes write as pX,Y (x, y) to make the random variables clear.
Example 5.2 Consider a randomly selected football match in the English Premier League (EPL), and the two random variables:

X = the number of goals scored by the home team

Y = the number of goals scored by the visiting (away) team.

Suppose both variables have possible values 0, 1, 2 and 3 (to keep this example simple, we have recorded the small number of scores of 4 or greater also as 3).
Consider the joint distribution of (X, Y ). We use probabilities based on data from
the 2009–10 EPL season.
Suppose the values of pX,Y (x, y) = p(x, y) = P (X = x, Y = y) are the following:
Y =y
X=x 0 1 2 3
0 0.100 0.031 0.039 0.031
1 0.100 0.146 0.092 0.015
2 0.085 0.108 0.092 0.023
3 0.062 0.031 0.039 0.006
The joint probability function gives probabilities of values of (X, Y ), for example:
A 1–1 draw, which is the most probable single result, has probability:
P (X = 1, Y = 1) = p(1, 1) = 0.146.
where the sum is of the values of the joint pf of (X1 , X2 , X3 , X4 ) over all possible
values of X3 and X4 .
The simplest marginal distributions are those of individual variables in the multivariate
random variable.
The marginal pf is then obtained by summing the joint pf over all the other variables.
The resulting marginal distribution is univariate, and its pf is a univariate pf.
For the bivariate distribution of (X, Y ) the univariate marginal distributions are
those of X and Y individually. Their marginal pfs are:
X X
pX (x) = p(x, y) and pY (y) = p(x, y).
y x
Example 5.4 Continuing with the football example introduced in Example 5.2, the
joint and marginal probability functions are:
Y =y
X=x 0 1 2 3 pX (x)
0 0.100 0.031 0.039 0.031 0.201
1 0.100 0.146 0.092 0.015 0.353
2 0.085 0.108 0.092 0.023 0.308
3 0.062 0.031 0.039 0.006 0.138
pY (y) 0.347 0.316 0.262 0.075 1.000
Even for a multivariate random variable, expected values E(Xi ), variances Var(Xi ) and
medians of individual variables are obtained from the univariate (marginal)
distributions of Xi , as defined in Chapter 3.
Activity 5.1 Show that the marginal distributions of a bivariate distribution are
not enough to define the bivariate distribution itself.
Solution
Here we must show that there are two distinct bivariate distributions with the same
marginal distributions. It is easiest to think of the simplest case where X and Y
each take only two values, say 0 and 1.
Suppose the marginal distributions of X and Y are the same, with
p(0) = p(1) = 0.5. One possible bivariate distribution with these marginal
distributions is the one for which there is independence between X and Y . This has
pX,Y (x, y) = pX (x) pY (y) for all x, y. Writing it in full:
pX,Y (0, 0) = pX,Y (1, 0) = pX,Y (0, 1) = pX,Y (1, 1) = 0.5 × 0.5 = 0.25.
The table of probabilities for this choice of independence is shown in the first table
below.
Trying some other value for pX,Y (0, 0), like 0.2, gives the second table below.
X/Y 0 1 X/Y 0 1
0 0.25 0.25 0 0.2 0.3
1 0.25 0.25 1 0.3 0.2
The construction of these probabilities is done by making sure the row and column
totals are equal to 0.5, and so we now have a second distribution with the same
marginal distributions as the first.
This example is very simple, but one can almost always construct many bivariate
distributions with the same marginal distributions even for continuous random
variables.
Let x be one possible value of X, for which p_X(x) > 0. The conditional distribution of Y given that X = x is the discrete probability distribution with the pf:

p_{Y|X}(y | x) = p_{X,Y}(x, y) / p_X(x).
Example 5.6 Recall that in the football example the joint and marginal pfs were:
Y =y
X=x 0 1 2 3 pX (x)
0 0.100 0.031 0.039 0.031 0.201
1 0.100 0.146 0.092 0.015 0.353
2 0.085 0.108 0.092 0.023 0.308
3 0.062 0.031 0.039 0.006 0.138
pY (y) 0.347 0.316 0.262 0.075 1.000
We can now calculate the conditional pf of Y given X = x for each x, i.e. of away
goals given home goals. For example:
pY |X (y | x) when y is:
X=x 0 1 2 3 Sum
0 0.498 0.154 0.194 0.154 1.00
1 0.283 0.414 0.261 0.042 1.00
2 0.276 0.351 0.299 0.075 1.00
3 0.449 0.225 0.283 0.043 1.00
• if the home team scores 0 goals, the probability that the visiting team scores 1 goal is p_{Y|X}(1 | 0) = 0.154

• if the home team scores 1 goal, the probability that the visiting team wins the match is p_{Y|X}(2 | 1) + p_{Y|X}(3 | 1) = 0.261 + 0.042 = 0.303.
P(A | B) = P(A ∩ B)/P(B) = P(Y = y and X = x)/P(X = x) = P(Y = y | X = x) = p_{X,Y}(x, y)/p_X(x) = p_{Y|X}(y | x).
The conditional distribution and pf of X given Y = y (for any y such that pY (y) > 0) is
defined similarly, with the roles of X and Y reversed:
p_{X|Y}(x | y) = p_{X,Y}(x, y) / p_Y(y).

More generally, for random vectors X and Y:

p_{Y|X}(y | x) = p_{X,Y}(x, y) / p_X(x)

where p_{X,Y}(x, y) is the joint pf of the random vector (X, Y), and p_X(x) is the marginal pf of the random vector X.
These are known as the conditional mean and conditional variance, and are denoted, respectively, by:

E_{Y|X}(Y | x)  and  Var_{Y|X}(Y | x).
So, if the home team scores 0 goals, the expected number of goals by the visiting
team is EY |X (Y | 0) = 1.00.
EY |X (Y | x) for x = 1, 2 and 3 are obtained similarly.
Here X is the number of goals by the home team, and Y is the number of goals by
the visiting team:
pY |X (y | x) when y is:
X=x 0 1 2 3 EY |X (Y | x)
0 0.498 0.154 0.194 0.154 1.00
1 0.283 0.414 0.261 0.042 1.06
2 0.276 0.351 0.299 0.075 1.17
3 0.449 0.225 0.283 0.043 0.92
Expected away goals E(Y | x) plotted against home goals x.
If two random variables are associated (dependent), knowing the value of one (for
example, X) will help to predict the likely value of the other (for example, Y ).
We next consider two measures of association which are used to summarise the
strength of an association in a single number: covariance and correlation (scaled
covariance).
5.7.1 Covariance
Definition of covariance

The covariance of two random variables X and Y is defined as:

Cov(X, Y) = E[(X − E(X))(Y − E(Y))] = E(XY) − E(X) E(Y).

(Note that these involve expected values of products of two random variables, which have not been defined yet. We will do so later in this chapter.)
Properties of covariance

The covariance of a random variable with itself is the variance of the random variable:

Cov(X, X) = E[(X − E(X))²] = Var(X).

For constants a, b, c and d:

Cov(aX + b, cY + d) = ac Cov(X, Y).
Activity 5.2 Suppose that X and Y have a bivariate distribution. Find the
covariance of the new random variables W = aX + bY and V = cX + dY where a, b,
c and d are constants.
Solution

The covariance of W and V is:

Cov(W, V) = Cov(aX + bY, cX + dY) = ac Var(X) + bd Var(Y) + (ad + bc) Cov(X, Y)

using the bilinearity of covariance and Cov(X, X) = Var(X).
5.7.2 Correlation
Definition of correlation

The correlation of X and Y is defined as:

Corr(X, Y) = Cov(X, Y) / (sd(X) sd(Y)).

When Cov(X, Y) = 0, then Corr(X, Y) = 0. When this is the case, we say that X and Y are uncorrelated.
Correlation and covariance are measures of the strength of the linear (‘straight-line’)
association between X and Y .
The further the correlation is from 0, the stronger is the linear association. The most
extreme possible values of correlation are −1 and +1, which are obtained when Y is an
exact linear function of X.
Corr(X, Y ) = +1 when Y = aX + b with a > 0.
Corr(X, Y ) = −1 when Y = aX + b with a < 0.
Example 5.8 Recall the joint pf p_{X,Y}(x, y) in the football example. In the table below, the first number in each cell is the value of the product xy, and the number in parentheses is its probability p(x, y):

                          Y = y
X = x     0            1            2            3
0         0 (0.100)    0 (0.031)    0 (0.039)    0 (0.031)
1         0 (0.100)    1 (0.146)    2 (0.092)    3 (0.015)
2         0 (0.085)    2 (0.108)    4 (0.092)    6 (0.023)
3         0 (0.062)    3 (0.031)    6 (0.039)    9 (0.006)

From these values and their probabilities, we can derive the probability distribution of XY. For example, P(XY = 0) is the sum of the probabilities of all cells with xy = 0, which gives 0.448. The full distribution is:

XY = xy       0      1      2      3      4      6      9
P(XY = xy)    0.448  0.146  0.200  0.046  0.092  0.062  0.006
Hence:

E(XY) = 0 × 0.448 + 1 × 0.146 + 2 × 0.200 + 3 × 0.046 + 4 × 0.092 + 6 × 0.062 + 9 × 0.006 = 1.478

also, from the marginal distributions, E(X) = 1.383 and E(Y) = 1.065, with:

E(X²) = 2.827 and E(Y²) = 2.039

hence:

Cov(X, Y) = E(XY) − E(X) E(Y) = 1.478 − 1.383 × 1.065 = 0.005

and:

Corr(X, Y) = 0.005 / √((2.827 − 1.383²)(2.039 − 1.065²)) = 0.0056.

The numbers of goals scored by the home and visiting teams are very nearly uncorrelated (i.e. not linearly associated).
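These expectations can also be computed directly from the joint pf table (a Python sketch with numpy; the matrix holds the probabilities p(x, y) above):

import numpy as np

p = np.array([[0.100, 0.031, 0.039, 0.031],
              [0.100, 0.146, 0.092, 0.015],
              [0.085, 0.108, 0.092, 0.023],
              [0.062, 0.031, 0.039, 0.006]])   # rows x = 0..3, columns y = 0..3
v = np.arange(4)
E_X = (p.sum(axis=1) * v).sum()                # from the marginal pf of X
E_Y = (p.sum(axis=0) * v).sum()                # from the marginal pf of Y
E_XY = (np.outer(v, v) * p).sum()              # sum of xy * p(x, y)
print(E_XY - E_X * E_Y)                        # Cov(X, Y), very close to 0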
Activity 5.3 X and Y are independent random variables with the following marginal distributions:

X = x      0    1    2          Y = y     1    2
p_X(x)     0.4  0.2  0.4        p_Y(y)    0.4  0.6

Define W = 2X and Z = Y − X. Construct the table of the joint distribution of W and Z, and find P(W = 2 | Z = 1), E(WZ) and Cov(W, Z).
Solution
W =w
0 2 4 pZ (z)
−1 0.00 0.00 0.16 0.16
Z=z 0 0.00 0.08 0.24 0.32
1 0.16 0.12 0.00 0.28
2 0.24 0.00 0.00 0.24
pW (w) 0.40 0.20 0.40 1.00
P(W = 2 | Z = 1) = P(W = 2 ∩ Z = 1)/P(Z = 1) = 0.12/0.28 = 3/7.

Also:

E(WZ) = Σ_w Σ_z w z p(w, z) = −4 × 0.16 + 2 × 0.12 = −0.4

hence, with E(W) = 2 and E(Z) = 0.6 from the marginals:

Cov(W, Z) = E(WZ) − E(W) E(Z) = −0.4 − 2 × 0.6 = −1.6.
Activity 5.4 The joint probability distribution of the random variables X and Y is:
X=x
−1 0 1
−1 0.05 0.15 0.10
Y =y 0 0.10 0.05 0.25
1 0.10 0.05 0.15
(a) Identify the marginal distributions of X and Y and the conditional distribution of X given Y = 1.

(b) Find Corr(X, Y).

(c) Are X and Y independent random variables? Justify your answer.
Solution
X=x −1 0 1 Y =y −1 0 1
pX (x) 0.25 0.25 0.50 pY (y) 0.30 0.40 0.30
X = x|Y = 1 −1 0 1
pX|Y =1 (x | Y = 1) 1/3 1/6 1/2
(b) From the marginal distributions, E(X) = −1 × 0.25 + 0 × 0.25 + 1 × 0.50 = 0.25 and E(Y) = −1 × 0.30 + 0 × 0.40 + 1 × 0.30 = 0. (Note that Var(X) and Var(Y) are not strictly necessary here!)

Next:

E(XY) = Σ_x Σ_y x y p(x, y) = 1 × 0.05 − 1 × 0.10 − 1 × 0.10 + 1 × 0.15 = 0.

So:

Cov(X, Y) = E(XY) − E(X) E(Y) = 0 ⇒ Corr(X, Y) = 0.
(c) X and Y are not independent random variables since, for example:

p_{X,Y}(−1, −1) = 0.05 ≠ 0.075 = 0.25 × 0.30 = p_X(−1) p_Y(−1).
Activity 5.5 The random variables X1 and X2 are independent and have the common distribution given in the table below:

X = x      0    1    2    3
p_X(x)     0.2  0.4  0.3  0.1

Define W = max(X1, X2) and Y = min(X1, X2).

(a) Calculate the table of probabilities which defines the joint distribution of W and Y.
(b) Find:
i. the marginal distribution of W
ii. the conditional distribution of Y given W = 2
iii. E(Y | W = 2) and Var(Y | W = 2)
iv. Cov(W, Y ).
Solution

(a) Since X1 and X2 are independent, each cell with y < w has probability 2 p(y) p(w), and each cell with y = w has probability p(w)²:

                              W = w
              0         1             2             3
       0      (0.2)²    2(0.2)(0.4)   2(0.2)(0.3)   2(0.2)(0.1)
Y = y  1      0         (0.4)(0.4)    2(0.4)(0.3)   2(0.4)(0.1)
       2      0         0             (0.3)(0.3)    2(0.3)(0.1)
       3      0         0             0             (0.1)(0.1)
p_W(w)        (0.2)²    (0.8)(0.4)    (1.5)(0.3)    (1.9)(0.1)
which is:
W =w
0 1 2 3
0 0.04 0.16 0.12 0.04
Y =y 1 0.00 0.16 0.24 0.08
2 0.00 0.00 0.09 0.06
3 0.00 0.00 0.00 0.01
0.04 0.32 0.45 0.19
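Part (b) can then be worked through numerically from this table (a Python sketch with numpy):

import numpy as np

# joint probabilities p(y, w): rows y = 0..3 (the min), columns w = 0..3 (the max)
p = np.array([[0.04, 0.16, 0.12, 0.04],
              [0.00, 0.16, 0.24, 0.08],
              [0.00, 0.00, 0.09, 0.06],
              [0.00, 0.00, 0.00, 0.01]])
v = np.arange(4)
p_W = p.sum(axis=0)                         # i. marginal distribution of W
p_Y_given = p[:, 2] / p_W[2]                # ii. conditional pf of Y given W = 2
E_Y_given = (v * p_Y_given).sum()           # iii. conditional mean E(Y | W = 2)
Var_Y_given = (v**2 * p_Y_given).sum() - E_Y_given**2
E_W, E_Y = (v * p_W).sum(), (v * p.sum(axis=1)).sum()
print(p_W, p_Y_given, E_Y_given, Var_Y_given)
print((np.outer(v, v) * p).sum() - E_W * E_Y)   # iv. Cov(W, Y)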
Activity 5.6 Consider two random variables X and Y. X can take the values −1, 0 and 1, and Y can take the values 0, 1 and 2. The joint probabilities for each pair are given by the following table:

           X = −1   X = 0   X = 1
Y = 0      0.10     0.20    0.10
Y = 1      0.10     0.05    0.10
Y = 2      0.10     0.05    0.20

Define U = X + Y and V = X − Y.

(a) Find the marginal distributions of X and Y, and hence E(X) and E(Y).

(b) Find Cov(U, V).

(c) Find E(V | U = 1).
Solution
(a) The marginal distributions are:

X = x      −1    0    1
p_X(x)     0.3   0.3  0.4

Y = y      0     1     2
p_Y(y)     0.40  0.25  0.35
Hence:
E(X) = −1 × 0.3 + 0 × 0.3 + 1 × 0.4 = 0.1
and:
E(Y ) = 0 × 0.40 + 1 × 0.25 + 2 × 0.35 = 0.95.
(b) We have:

Cov(U, V) = Cov(X + Y, X − Y)
          = E((X + Y)(X − Y)) − E(X + Y) E(X − Y)
          = E(X² − Y²) − (E(X) + E(Y))(E(X) − E(Y))

hence:

Cov(U, V) = E(X²) − E(Y²) − ((E(X))² − (E(Y))²) = Var(X) − Var(Y).

Here E(X²) = 0.7 and E(Y²) = 1.65, so:

Cov(U, V) = (0.7 − 0.01) − (1.65 − 0.9025) = 0.69 − 0.7475 = −0.0575.
(c) U = 1 is achieved for (X, Y) pairs (−1, 2), (0, 1) or (1, 0), so P(U = 1) = 0.10 + 0.05 + 0.10 = 0.25. The corresponding values of V are −3, −1 and 1. We have:

P(V = −3 | U = 1) = 0.10/0.25 = 2/5
P(V = −1 | U = 1) = 0.05/0.25 = 1/5
P(V = 1 | U = 1) = 0.10/0.25 = 2/5

hence:

E(V | U = 1) = −3 × (2/5) + (−1) × (1/5) + 1 × (2/5) = −1.
Activity 5.7 Two refills for a ballpoint pen are selected at random from a box containing three blue refills, two red refills and three green refills. Define the following random variables:

X = the number of blue refills selected

Y = the number of red refills selected.

(b) Form the table showing the joint probability distribution of X and Y.

(e) Are X and Y independent random variables? Give a reason for your answer.
Solution
(b) We have:
X=x
0 1 2
0 3/28 9/28 3/28
Y =y 1 3/14 3/14 0
2 1/28 0 0
The marginal distribution of X is:

X = x      0      1      2
p_X(x)     10/28  15/28  3/28

Hence:

E(X) = 0 × (10/28) + 1 × (15/28) + 2 × (3/28) = 21/28 = 3/4.

The marginal distribution of Y is:

Y = y      0      1      2
p_Y(y)     15/28  12/28  1/28

Hence:

E(Y) = 0 × (15/28) + 1 × (12/28) + 2 × (1/28) = 14/28 = 1/2.

The conditional distribution of X given Y = 1 is:

X = x | Y = 1            0    1
p_{X|Y=1}(x | y = 1)     1/2  1/2

Hence:

E(X | Y = 1) = 0 × (1/2) + 1 × (1/2) = 1/2.
(d) The distribution of XY is:
XY = xy 0 1
pXY (xy) 22/28 6/28
Hence:

E(XY) = 0 × (22/28) + 1 × (6/28) = 6/28 = 3/14

and:

Cov(X, Y) = E(XY) − E(X) E(Y) = 3/14 − (3/4) × (1/2) = −9/56.
(e) Since Cov(X, Y ) 6= 0, a necessary condition for independence fails to hold. The
random variables are not independent.
Activity 5.8 A fair coin is tossed four times. Let X be the number of heads obtained on the first three tosses of the coin. Let Y be the number of heads on all four tosses of the coin.

(a) Tabulate the joint probability distribution of X and Y.

(b) Find E(X) and Var(X).

(c) Find the conditional probability distribution of Y given that X = 2.

(d) Find the mean of the conditional probability distribution of Y given that X = 2.
Solution
Y =y \X=x 0 1 2 3
0 1/16 0 0 0
1 1/16 3/16 0 0
2 0 3/16 3/16 0
3 0 0 3/16 1/16
4 0 0 0 1/16
X=x 0 1 2 3
p(x) 1/8 3/8 3/8 1/8
Hence:

E(X) = Σ_x x p(x) = 0 × (1/8) + · · · + 3 × (1/8) = 3/2

E(X²) = Σ_x x² p(x) = 0² × (1/8) + · · · + 3² × (1/8) = 3

and:

Var(X) = E(X²) − (E(X))² = 3 − 9/4 = 3/4.
(c) We have:

P(Y = 0 | X = 2) = p(2, 0)/p_X(2) = 0/(3/8) = 0
P(Y = 1 | X = 2) = p(2, 1)/p_X(2) = 0/(3/8) = 0
P(Y = 2 | X = 2) = p(2, 2)/p_X(2) = (3/16)/(3/8) = 1/2
P(Y = 3 | X = 2) = p(2, 3)/p_X(2) = (3/16)/(3/8) = 1/2
P(Y = 4 | X = 2) = p(2, 4)/p_X(2) = 0/(3/8) = 0.
Hence:
Y = y|X = 2 2 3
p(y | X = 2) 1/2 1/2
(d) We have:

E(Y | X = 2) = 2 × (1/2) + 3 × (1/2) = 5/2.
Activity 5.9 X and Y are discrete random variables which can assume values 0, 1 and 2 only, with joint probability function p_{X,Y}(x, y) = A(x + y) for some constant A.

(a) Draw up a table to describe the joint distribution of X and Y and find the value of the constant A.
Solution
X=x
0 1 2
0 0 A 2A
Y =y 1 A 2A 3A
2 2A 3A 4A
Since Σ_x Σ_y p_{X,Y}(x, y) = 18A = 1, we have A = 1/18.
X=x 0 1 2
P (X = x) 3A = 1/6 6A = 1/3 9A = 1/2
X = x|y = 1 0 1 2
PX|Y =1 (X = x | y = 1) A/6A = 1/6 2A/6A = 1/3 3A/6A = 1/2
Hence:

E(X | Y = 1) = 0 × (1/6) + 1 × (1/3) + 2 × (1/2) = 4/3.
(d) Even though the distributions of X and X | Y = 1 are the same, X and Y are
not independent. For example, P (X = 0, Y = 0) = 0 although P (X = 0) 6= 0
and P (Y = 0) 6= 0.
Activity 5.10 X and Y are discrete random variables with the following joint
probability function:
X=x
−1 0 1
Y =y 0 0.15 0.05 0.15
1 0.30 0.25 0.10
Solution
(a) The marginal distributions are found by adding across rows and columns:
X=x −1 0 1
pX (x) 0.45 0.30 0.25
and:
Y =y 0 1
pY (y) 0.35 0.65
(b) We have:
E(X) = −1 × 0.45 + 0 × 0.30 + 1 × 0.25 = −0.20
and:
E(X 2 ) = (−1)2 × 0.45 + 02 × 0.30 + 12 × 0.25 = 0.70
so Var(X) = 0.70 − (−0.20)² = 0.66. Also:

E(Y) = 0 × 0.35 + 1 × 0.65 = 0.65

and:

E(Y²) = 0² × 0.35 + 1² × 0.65 = 0.65

so Var(Y) = 0.65 − (0.65)² = 0.2275.
Y = y | X = −1 0 1
pY |X=−1 (y | x = −1) 0.15/0.45 = 0.3̇ 0.30/0.45 = 0.6̇
and:
X = x|Y = 0 −1 0 1
pX|Y =0 (x | y = 0) 0.15/0.35 = 0.4286 0.05/0.35 = 0.1429 0.15/0.35 = 0.4286
Here E(XY) = −1 × 0.30 + 1 × 0.10 = −0.20, so Cov(X, Y) = E(XY) − E(X) E(Y) = −0.20 − (−0.20) × 0.65 = −0.07, and:

Corr(X, Y) = Cov(X, Y)/√(Var(X) Var(Y)) = −0.07/√(0.66 × 0.2275) = −0.1807.
(g) Since X and Y are (weakly) negatively correlated (as determined in (e)), they
cannot be independent.
While the non-zero correlation is a sufficient explanation in this case, for other
such bivariate distributions which are uncorrelated, i.e. when Corr(X, Y ) = 0, it
becomes necessary to check whether pX,Y (x, y) = pX (x) pY (y) for all pairs of
values of (x, y). Here, for example, pX,Y (0, 0) = 0.05, pX (0) = 0.30 and
pY (0) = 0.35. We then have that pX (0) pY (0) = 0.105, which is not equal to
pX,Y (0, 0) = 0.05. Hence X and Y cannot be independent.
Activity 5.11 A box contains 4 red balls, 3 green balls and 3 blue balls. Two balls
are selected at random without replacement. Let X represent the number of red
balls in the sample and Y the number of green balls in the sample.
(a) Arrange the different pairs of values of (X, Y ) as the cells in a table, each cell
being filled with the probability of that pair of values occurring, i.e. provide the
joint probability distribution.
Solution
(a) We have:

P(X = 0, Y = 0) = (3/10) × (2/9) = 6/90 = 1/15
P(X = 0, Y = 1) = 2 × (3/10) × (3/9) = 18/90 = 3/15
P(X = 0, Y = 2) = (3/10) × (2/9) = 6/90 = 1/15
P(X = 1, Y = 0) = 2 × (4/10) × (3/9) = 24/90 = 4/15
P(X = 1, Y = 1) = 2 × (4/10) × (3/9) = 24/90 = 4/15
P(X = 2, Y = 0) = (4/10) × (3/9) = 12/90 = 2/15.
All other values have probability 0. We then construct the table of joint
probabilities:
Y =0 Y =1 Y =2
X = 0 1/15 3/15 1/15
X = 1 4/15 4/15 0
X = 2 2/15 0 0
(c) We have:

E(X) = 1 × (4/15 + 4/15) + 2 × (2/15) = 12/15 = 4/5

E(Y) = 1 × (3/15 + 4/15) + 2 × (1/15) = 9/15 = 3/5

and:

E(XY) = 1 × 1 × (4/15) = 4/15.

So:

Cov(X, Y) = E(XY) − E(X) E(Y) = 4/15 − (4/5) × (3/5) = −16/75.

(d) We have:

P(X = 1 | |X − Y| < 2) = (4/15 + 4/15)/(1/15 + 3/15 + 4/15 + 4/15) = (8/15)/(12/15) = 2/3.
Activity 5.12 Suppose that Var(X) = Var(Y ) = 1, and that X and Y have
correlation coefficient ρ. Show that it follows from Var(X − ρY ) ≥ 0 that ρ2 ≤ 1.
Solution

We have Cov(X, Y) = ρ sd(X) sd(Y) = ρ, and so:

Var(X − ρY) = Var(X) + ρ² Var(Y) − 2ρ Cov(X, Y) = 1 + ρ² − 2ρ² = 1 − ρ² ≥ 0.

Hence 1 − ρ² ≥ 0, and so ρ² ≤ 1.
Activity 5.13 Suppose that X has the probability distribution:

X = x        −1   0   1
P(X = x)     a    b   a

where 2a + b = 1, and let Y = X². Show that X and Y are uncorrelated, even though they are clearly dependent.
Solution
This is an example of two random variables X and Y = X 2 which are uncorrelated,
but obviously dependent. The bivariate distribution of (X, Y ) in this case is singular
because of the complete functional dependence between them.
We have:

E(X) = −1 × a + 0 × b + 1 × a = 0
E(X²) = 1 × a + 0 × b + 1 × a = 2a
E(X³) = −1 × a + 0 × b + 1 × a = 0

and hence:

Cov(X, Y) = E(XY) − E(X) E(Y) = E(X³) − E(X) E(X²) = 0 − 0 × 2a = 0

so X and Y are uncorrelated. There are many possible choices for a and b which give a valid probability distribution, for instance a = 0.25 and b = 0.5.
Activity 5.14 A fair coin is thrown n times, each throw being independent of the
ones before. Let R = ‘the number of heads’, and S = ‘the number of tails’. Find the
covariance of R and S. What is the correlation of R and S?
Solution
One can go about this in a straightforward way. If Xi is the number of heads and Yi
is the number of tails on the ith throw, then the distribution of Xi and Yi is given by:
X/Y 0 1
0 0 0.5
1 0.5 0
Then Cov(Xi, Yi) = E(Xi Yi) − E(Xi) E(Yi) = 0 − 0.5 × 0.5 = −0.25 for each throw, and throws are independent of one another, so:

Cov(R, S) = Σ_{i=1}^n Cov(Xi, Yi) = −0.25n.

Also, Var(R) = Var(S) = 0.25n (add the variances of the Xi s or Yi s). The correlation between R and S works out as −0.25n/0.25n = −1. This makes sense: S = n − R exactly, an exact linear function of R with negative slope.
Activity 5.15 Suppose that X and Y are random variables, and a, b, c and d are constants.

(a) Show that Cov(aX + b, cY + d) = ac Cov(X, Y).

(b) Show that Corr(aX + b, cY + d) equals Corr(X, Y) if ac > 0, and −Corr(X, Y) if ac < 0.

(c) Suppose that Z = cX + d, where c and d are constants. Using the result you obtained in (b), or in some other way, show that:

Corr(X, Z) = +1 for c > 0

and:

Corr(X, Z) = −1 for c < 0.
Solution

(a) Writing µX = E(X) and µY = E(Y), we have:

Cov(aX + b, cY + d) = E[((aX + b) − (aµX + b)) ((cY + d) − (cµY + d))] = E[ac (X − µX)(Y − µY)] = ac Cov(X, Y)

as required.

(b) Since sd(aX + b) = |a| sd(X) and sd(cY + d) = |c| sd(Y), we have:

Corr(aX + b, cY + d) = Cov(aX + b, cY + d)/(sd(aX + b) sd(cY + d)) = ac Cov(X, Y)/(|ac| sd(X) sd(Y)) = (ac/|ac|) Corr(X, Y).
(c) First, note that the correlation of a random variable with itself is 1, since:

Corr(X, X) = Cov(X, X)/√(Var(X) Var(X)) = Var(X)/Var(X) = 1.

Applying (b) with Y = X, a = 1 and b = 0 then gives Corr(X, Z) = (c/|c|) Corr(X, X), which is +1 for c > 0 and −1 for c < 0.
Sample covariance

For sample pairs (x1, y1), . . . , (xn, yn), the sample covariance is Σ_{i=1}^n (xi − x̄)(yi − ȳ)/(n − 1).

Sample correlation

The sample correlation r is the sample covariance divided by the product of the sample standard deviations of the two variables.
Plot (f) shows that r can be 0 even if two variables are clearly related, if that
relationship is not linear.
Discrete random variables X1, . . . , Xn are independent if and only if their joint pf factorises as:

p(x1, x2, . . . , xn) = p1(x1) p2(x2) · · · pn(xn)

for all numbers x1, x2, . . . , xn, where p1(x1), . . . , pn(xn) are the univariate marginal pfs of X1, . . . , Xn, respectively.
Similarly, continuous random variables X1, . . . , Xn are independent if and only if their joint pdf factorises as:

f(x1, x2, . . . , xn) = f1(x1) f2(x2) · · · fn(xn)

for all x1, x2, . . . , xn, where f1(x1), . . . , fn(xn) are the univariate marginal pdfs of X1, . . . , Xn, respectively.
If two random variables are independent, they are also uncorrelated, i.e. we have:

Cov(X, Y) = 0 and Corr(X, Y) = 0.
The reverse is not true, i.e. two random variables can be dependent even when their
correlation is 0. This can happen when the dependence is non-linear.
f(xi) = (1/√(2πσ²)) exp(−(xi − µ)²/(2σ²))
where:

πi = e^{iθ}/(1 + e^{iθ})

for i = 1, 2, . . . , n. Derive the joint probability function, p(x1, x2, . . . , xn).
Solution

Since the Xi s are independent (but not identically distributed) random variables, we have:

p(x1, x2, . . . , xn) = Π_{i=1}^n p(xi).
Solution

Since the Xi s are independent (and identically distributed) random variables, we have:

f(x1, x2, . . . , xn) = Π_{i=1}^n f(xi).
p(x) = C(m, x) θ^x/(1 + θ)^m for x = 0, 1, 2, . . . , m
Solution

Since the Xi s are independent (and identically distributed) random variables, we have:

p(x1, x2, . . . , xn) = Π_{i=1}^n p(xi).
P (X ≤ x ∩ Y ≤ y) = (1 − e−x ) (1 − e−2y )
for all x, y > 0, then X and Y are independent random variables, each with an
exponential distribution.
Solution

The right-hand side of the result given is the product of the cdf of an exponential random variable X with parameter 1 (mean 1) and the cdf of an exponential random variable Y with parameter 2 (mean 1/2). So the result follows from the definition of independent random variables.
Activity 5.20 The random variable X has a discrete uniform distribution with
values 1, 2 and 3, i.e. P (X = i) = 1/3 for i = 1, 2, 3. The random variable Y has a
discrete uniform distribution with values 1, 2, 3 and 4, i.e. P (Y = i) = 1/4 for
i = 1, 2, 3, 4. X and Y are independent.
Solution

(a) The possible values of the sum are 2, 3, 4, 5, 6 and 7. Since X and Y are independent, the probabilities of the different sums are:

P(X + Y = 2) = P(X = 1) P(Y = 1) = (1/3) × (1/4) = 1/12
P(X + Y = 3) = P(X = 1) P(Y = 2) + P(X = 2) P(Y = 1) = 2/12 = 1/6
P(X + Y = 4) = P(X = 1) P(Y = 3) + P(X = 2) P(Y = 2) + P(X = 3) P(Y = 1) = 3/12 = 1/4
P(X + Y = 5) = P(X = 1) P(Y = 4) + P(X = 2) P(Y = 3) + P(X = 3) P(Y = 2) = 3/12 = 1/4
P(X + Y = 6) = P(X = 2) P(Y = 4) + P(X = 3) P(Y = 3) = 2/12 = 1/6
P(X + Y = 7) = P(X = 3) P(Y = 4) = 1/12

and 0 for all other real numbers.
(b) You could find the expectation and variance directly from the distribution of X + Y above. However, it is easier to use the expected value and variance of the discrete uniform distribution for both X and Y, and then the results on the expectation and variance of sums of independent random variables to get:

E(X + Y) = E(X) + E(Y) = (1 + 3)/2 + (1 + 4)/2 = 4.5

and:

Var(X + Y) = Var(X) + Var(Y) = (3² − 1)/12 + (4² − 1)/12 = 23/12 ≈ 1.92.
Solution

(a) We have:

E(Σ_{i=1}^k ai Xi) = Σ_{i=1}^k E(ai Xi) = Σ_{i=1}^k ai E(Xi).

(b) We have:

Var(Σ_{i=1}^k ai Xi) = E[(Σ_{i=1}^k ai Xi − Σ_{i=1}^k ai E(Xi))²]

= E[(Σ_{i=1}^k ai (Xi − E(Xi)))²]

= Σ_{i=1}^k ai² E[(Xi − E(Xi))²] + Σ_{1≤i≠j≤k} ai aj E[(Xi − E(Xi))(Xj − E(Xj))]

= Σ_{i=1}^k ai² Var(Xi) + Σ_{1≤i≠j≤k} ai aj E(Xi − E(Xi)) E(Xj − E(Xj))

= Σ_{i=1}^k ai² Var(Xi)

where the cross-product expectations factorise by independence, and each factor E(Xi − E(Xi)) = 0.
Additional note: remember there are two ways to compute the variance:
Var(X) = E((X − µ)2 ) and Var(X) = E(X 2 ) − (E(X))2 . The former is more
convenient for analytical derivations/proofs (see above), while the latter should be
used to compute variances for common distributions such as Poisson or exponential
distributions. Actually it is rather difficult to compute the variance for a Poisson
distribution using the formula Var(X) = E((X − µ)2 ) directly.
Σ_{i=1}^n ai Xi + b = a1 X1 + a2 X2 + · · · + an Xn + b   (5.2)
and:

Π_{i=1}^n ai Xi = (a1 X1) (a2 X2) · · · (an Xn).
Example 5.13 In the football example, the sum Z = X + Y is the total number of
goals scored in a match.
Its probability function is obtained from the joint pf pX,Y (x, y), that is:
Z=z 0 1 2 3 4 5 6
pZ (z) 0.100 0.131 0.270 0.293 0.138 0.062 0.006
However, what can we say about such distributions in general, in cases where we cannot
derive them as easily?
General results for the distributions of sums and products of random variables are available as follows:

                       Sums                             Products
Mean                   Yes                              Only for independent random variables
Variance               Yes                              No
Distributional form    Normal: Yes. Some other          No
                       distributions: only for
                       independent random variables
In particular, for n = 2, if X1 and X2 are independent then:

E(X1 + X2) = E(X1) + E(X2) and Var(X1 + X2) = Var(X1) + Var(X2).

These results also hold whenever Cov(Xi, Xj) = 0 for all i ≠ j, even if the random variables are not independent.
Activity 5.22 Cars pass a point on a busy road at an average rate of 150 per hour.
Assume that the number of cars in an hour follows a Poisson distribution. Other
motor vehicles (lorries, motorcycles etc.) pass the same point at the rate of 75 per
hour. Assume a Poisson distribution for these vehicles too, and assume that the
number of other vehicles is independent of the number of cars.
(a) What is the probability that one car and one other motor vehicle pass in a
two-minute period?
(b) What is the probability that two motor vehicles of any type (cars, lorries,
motorcycles etc.) pass in a two-minute period?
Solution
(a) Let X denote the number of cars, and Y denote the number of other motor
vehicles in a two-minute period. We need the probability given by
P (X = 1, Y = 1), which is P (X = 1) P (Y = 1) since X and Y are independent.
A rate of 150 cars per hour is a rate of 5 per two minutes, so X ∼ Poisson(5).
The probability of one car passing in two minutes is
P(X = 1) = e^{−5} (5)¹/1! = 0.0337. The rate for other vehicles over two minutes is 2.5, so Y ∼ Poisson(2.5) and so P(Y = 1) = e^{−2.5} (2.5)¹/1! = 0.2052. Hence the probability for one vehicle of each type is 0.0337 × 0.2052 = 0.0069.
(b) Here we require P (Z = 2), where Z = X + Y . Since the sum of two independent
Poisson variables is again Poisson (see Section 5.10.5), then
Z ∼ Poisson(5 + 2.5) = Poisson(7.5). Therefore, the required probability is:
P(Z = 2) = (e^{−7.5} × (7.5)²) / 2! = 0.0156.
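Both parts are quick to verify numerically (a Python sketch, scipy assumed):

from scipy.stats import poisson

print(poisson.pmf(1, 5) * poisson.pmf(1, 2.5))   # (a): approx. 0.0069
print(poisson.pmf(2, 7.5))                       # (b): Z = X + Y ~ Poisson(7.5), approx. 0.0156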
An easy proof that the mean and variance of X ∼ Bin(n, π) are E(X) = n π and Var(X) = n π (1 − π) is as follows. Write X = X1 + X2 + · · · + Xn, where the Xi s are independent Bernoulli(π) random variables with E(Xi) = π and Var(Xi) = π(1 − π). Then, by the results for sums of independent random variables, E(X) = Σ_{i=1}^n E(Xi) = n π and Var(X) = Σ_{i=1}^n Var(Xi) = n π (1 − π).
All sums (linear combinations) of normally distributed random variables are also
normally distributed.
That is, if Xi ∼ N(µi, σi²) for i = 1, . . . , n, then:

Σ_{i=1}^n ai Xi + b ∼ N(µ, σ²)

where:

µ = Σ_{i=1}^n ai µi + b and σ² = Σ_{i=1}^n ai² σi² + 2 ΣΣ_{i<j} ai aj Cov(Xi, Xj).

If the Xi s are independent (or just uncorrelated), i.e. if Cov(Xi, Xj) = 0 for all i ≠ j, the variance simplifies to σ² = Σ_{i=1}^n ai² σi².
Example 5.15 Suppose that in the population of English people aged 16 or over:
the heights of men (in cm) follow a normal distribution with mean 174.9 and
standard deviation 7.39
the heights of women (in cm) follow a normal distribution with mean 161.3 and
standard deviation 6.85.
Suppose we select one man and one woman at random and independently of each
other. Denote the man’s height by X and the woman’s height by Y . What is the
probability that the man is at most 10 cm taller than the woman?
In other words, what is the probability that the difference between X and Y is at
most 10?
Since X and Y are independent we have:

D = X − Y ∼ N(µX − µY, σX² + σY²) = N(174.9 − 161.3, (7.39)² + (6.85)²) = N(13.6, (10.08)²).

Therefore:

P(D ≤ 10) = P(Z ≤ (10 − 13.6)/10.08) = P(Z ≤ −0.36) = 1 − Φ(0.36) = 1 − 0.6406 = 0.3594.
If pairs of pistons and cylinders are selected at random for assembly, for what
proportion will the piston not fit into the cylinder (i.e. for which the piston diameter
exceeds the cylinder diameter)?
Solution

Let P ∼ N(10.42, (0.03)²) for the pistons, and C ∼ N(10.52, (0.04)²) for the cylinders. It follows that the difference D = C − P ∼ N(0.1, (0.05)²) (adding the variances, assuming independence). The piston will fit if D > 0. We require:
P(D > 0) = P(Z > (0 − 0.1)/0.05) = P(Z > −2) = 0.9772

so the proportion 1 − 0.9772 = 0.0228 will not fit.
The number of pistons, N , failing to fit out of 100 will be a binomial random
variable such that N ∼ Bin(100, 0.0228).
(b) Using the Poisson approximation with λ = 100 × 0.0228 = 2.28, we have the
following.
i. P (N = 0) ≈ e−2.28 = 0.1023.
ii. P (N ≤ 2) ≈ e−2.28 + e−2.28 × 2.28 + e−2.28 × (2.28)2 /2! = 0.6013.
The approximations are good (note there will be some rounding error, but the
values are close with the two methods). It is not surprising that there is close
agreement since n is large, π is small and n π < 5.
1. Consider two random variables X and Y taking the values 0 and 1. The joint
probabilities for the pair are given by the following table
X=0 X=1
Y =0 1/2 − α α
Y =1 α 1/2 − α
(a) What are the values α can take? Explain your answer.
Now let α = 1/4, and:

U = max(X, Y)/3 and V = min(X, Y)

where max(X, Y) means the larger of X and Y, and min(X, Y) means the smaller of X and Y. For example, max(0, 1) = 1, min(0, 1) = 0, and min(0, 0) = max(0, 0) = 0.
(b) Compute the mean of U and the mean of V .
(c) Are U and V independent? Explain your answer.
2. The amount of coffee dispensed into a coffee cup by a coffee machine follows a
normal distribution with mean 150 ml and standard deviation 10 ml. The coffee is
sold at the price of £1 per cup. However, the coffee cups are marked at the 137 ml
level, and any cup with coffee below this level will be given away free of charge.
The amounts of coffee dispensed in different cups are independent of each other.
(a) Find the probability that the total amount of coffee in 5 cups exceeds 700 ml.
(b) Find the probability that the difference in the amounts of coffee in 2 cups is
smaller than 20 ml.
(c) Find the probability that one cup is filled below the level of 137 ml.
(d) Find the expected income from selling one cup of coffee.
3. There are six houses on Station Street, numbered 1 to 6. The postman has six
letters to deliver, one addressed to each house. As he is sloppy and in a hurry he
does not look at which letter he puts in which letterbox (one per house).
(a) Explain in words why the probability that the people living in the first house
receive the correct letter is equal to 1/6.
(b) Let Xi (for i = 1, . . . , 6) be the random variable which is equal to 1 if the
people living in house number i receive the correct letter, and equal to 0
otherwise. Show that E(Xi ) = 1/6.
(c) Show that X1 and X2 are not independent.
(d) Calculate Cov(X1 , X2 ).
Chapter 6
Sampling distributions of statistics
By the end of this chapter, and having completed the essential reading and activities, you should be able to:

• prove and apply the results for the mean and variance of the sampling distribution of the sample mean when a random sample is drawn with replacement

• state the central limit theorem and recall when the limit is likely to provide a good approximation to the distribution of the sample mean.
6.3 Introduction
Suppose we have a sample of n observations of a random variable X:
{X1 , X2 , . . . , Xn }.
For simplicity, we may often use phrases like ‘distribution f (x; θ)’ or ‘distribution
F (x; θ)’ when we mean ‘distribution with the pf/pdf f (x; θ)’ and ‘distribution with the
cdf F (x; θ)’, respectively.
The simplest assumptions about the joint distribution of the sample are as follows.
1. {X1 , X2 , . . . , Xn } are independent random variables.
2. {X1 , X2 , . . . , Xn } are identically distributed random variables. Each Xi has the
same distribution f (x; θ), with the same value of the parameter(s) θ.
We will assume this most of the time from now. So you will see many examples and
questions which begin something like:
‘Let {X1 , . . . , Xn } be a random sample from a normal distribution with mean
µ and variance σ 2 . . . ’.
Not all problems can be seen as IID random samples of a single random variable. There
are other possibilities, which you will see more of in the future.
the sample variance S² = Σ_{i=1}^n (Xi − X̄)²/(n − 1) and standard deviation S = √(S²)
Here we focus on single (univariate) statistics. More generally, we could also consider
vectors of statistics, i.e. multivariate statistics.
Here is one such random sample (with values rounded to 2 decimal places):
6.28 5.22 4.19 3.56 4.15 4.11 4.03 5.81 5.43 6.09
4.98 4.11 5.55 3.95 4.97 5.68 5.66 3.37 4.98 6.58
For this random sample, the values of our statistics are:
x̄ = 4.94
s2 = 0.90
max x = 6.58.
Here is another such random sample (with values rounded to 2 decimal places):
5.44 6.14 4.91 5.63 3.89 4.17 5.79 5.33 5.09 3.90
5.47 6.62 6.43 5.84 6.19 5.63 3.61 5.49 4.55 4.27
For this sample, the values of our statistics are:

x̄ = 5.22

s² = 0.80

max x = 6.62.
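Statistics like these are computed directly from the sample values. A Python sketch with numpy, using the second sample above:

import numpy as np

sample = np.array([5.44, 6.14, 4.91, 5.63, 3.89, 4.17, 5.79, 5.33, 5.09, 3.90,
                   5.47, 6.62, 6.43, 5.84, 6.19, 5.63, 3.61, 5.49, 4.55, 4.27])
print(sample.mean())         # x-bar
print(sample.var(ddof=1))    # s^2, with divisor n - 1
print(sample.max())          # the sample maximum, 6.62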
Activity 6.1 Let Y denote the largest value of a random sample {X1, . . . , Xn}, i.e. Y = max(X1, . . . , Xn).

(a) Write down the formula for the cumulative distribution function FY(y) of Y, i.e.
for the probability that all observations in the sample are ≤ y.
(b) From the result in (a), derive the probability density function fY (y) of Y .
(c) The heights (in cm) of men aged over 16 in England are approximately
normally distributed with a mean of 174.9 and a standard deviation of 7.39.
What is the probability that in a random sample of 60 men from this
population at least one man is more than 1.92 metres tall?
Solution
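A numerical sketch of part (c) in Python (scipy assumed), using the result from parts (a) and (b) that the cdf of the maximum of n independent observations is FY(y) = [FX(y)]^n:

from scipy.stats import norm

# P(at least one of 60 men is taller than 192 cm) = 1 - P(all 60 are <= 192)
p_single = norm.cdf(192, loc=174.9, scale=7.39)   # P(X <= 192) for one man
print(1 - p_single**60)                           # approx. 0.46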
The sampling distribution of a statistic is the distribution of the values of the statistic
in (infinitely) many repeated samples. However, typically we only have one sample
which was actually observed. Therefore, the sampling distribution seems like an
essentially hypothetical concept.
Nevertheless, it is possible to derive the forms of sampling distributions of statistics
under different assumptions about the sampling schemes and population distribution
f (x; θ).
There are two main ways of doing this.
Example 6.3 Consider again a random sample of size n = 20 from the population X ∼ N(5, 1), and the statistics X̄, S² and maxX. One way to approximate their sampling distributions is by simulation: we simulate 10,000 independent random samples of size 20 from N(5, 1) and calculate the statistics for each sample. Figures 6.1, 6.2 and 6.3 show histograms of the statistics for these 10,000 random samples.

We now consider deriving the exact sampling distribution. Here this is possible. For a random sample of size n from N(µ, σ²) we have, for example, X̄ ∼ N(µ, σ²/n), and the pdf of maxX is n [FX(x)]^{n−1} fX(x), where FX(x) and fX(x) are the cdf and pdf of X ∼ N(µ, σ²), respectively.
Curves of the densities of these distributions are also shown in Figures 6.1, 6.2 and
6.3.
Figure 6.1: Histogram of the sample mean for the 10,000 simulated samples.
Recall that for random variables X1, . . . , Xn and constants a1, . . . , an:

E(Σ_{i=1}^n ai Xi) = Σ_{i=1}^n ai E(Xi)

and, when the Xi s are independent:

Var(Σ_{i=1}^n ai Xi) = Σ_{i=1}^n ai² Var(Xi).

For a random sample, all Xi s are independent and E(Xi) = E(X) is the same for all of them, since the Xi s are identically distributed. X̄ = Σ_i Xi/n is of the form Σ_i ai Xi, with ai = 1/n for all i = 1, . . . , n.

Therefore:

E(X̄) = Σ_{i=1}^n (1/n) E(X) = n × (1/n) E(X) = E(X)

and:

Var(X̄) = Σ_{i=1}^n (1/n²) Var(X) = n × (1/n²) Var(X) = Var(X)/n.
Figure 6.2: Histogram of the sample variance for the 10,000 simulated samples.

Figure 6.3: Histogram of the sample maximum (values from about 5 to 9) for the 10,000 simulated samples.
So the mean and variance of X̄ are E(X) and Var(X)/n, respectively, for a random
sample from any population distribution of X. What about the form of the sampling
distribution of X̄?
This depends on the distribution of X, and is not generally known. However, when the
distribution of X is normal, we do know that the sampling distribution of X̄ is also
normal.
Suppose that {X1 , . . . , Xn } is a random sample from a normal distribution with mean µ
and variance σ 2 , then:
X̄ ∼ N(µ, σ²/n).
For example, the pdf drawn on the histogram in Figure 6.1 is that of N (5, 1/20).
We have E(X̄) = E(X) = µ.
More interestingly, the sampling variance gets smaller when the sample size n
increases.
Example 6.4 Suppose that the heights (in cm) of men (aged over 16) in a
population follow a normal distribution with some unknown mean µ and a known
standard deviation of 7.39.
We plan to select a random sample of n men from the population, and measure their
heights. How large should n be so that there is a probability of at least 0.95 that the
sample mean X̄ will be within 1 cm of the population mean µ?
Here X ∼ N(µ, (7.39)²), so X̄ ∼ N(µ, (7.39/√n)²). What we need is the smallest n such that:

P(|X̄ − µ| ≤ 1) ≥ 0.95.
Pdfs of X̄ for samples of size n = 5, 20 and 100.
So:

P(|X̄ − µ| ≤ 1) ≥ 0.95

P(−1 ≤ X̄ − µ ≤ 1) ≥ 0.95

P(−1/(7.39/√n) ≤ (X̄ − µ)/(7.39/√n) ≤ 1/(7.39/√n)) ≥ 0.95

P(−√n/7.39 ≤ Z ≤ √n/7.39) ≥ 0.95

P(Z > √n/7.39) < 0.05/2 = 0.025
where Z ∼ N (0, 1). From Table 4 of the New Cambridge Statistical Tables, we see
that the smallest z which satisfies P (Z > z) < 0.025 is z = 1.97. Therefore:
√n/7.39 ≥ 1.97 ⇔ n ≥ (7.39 × 1.97)² = 211.9.

Therefore, n should be at least 212.
Activity 6.2 Suppose that the heights of students are normally distributed with a
mean of 68.5 inches and a standard deviation of 2.7 inches. If 200 random samples of
size 25 are drawn from this population with means recorded to the nearest 0.1 inch,
find:
(a) the expected mean and standard deviation of the sampling distribution of the
mean
(b) the expected number of recorded sample means which fall between 67.9 and
69.2 inclusive
(c) the expected number of recorded sample means falling below 67.0.
Solution
(a) The sampling distribution of the mean of 25 observations has the same mean as
the population, which is√68.5 inches. The standard deviation (standard error) of
the sample mean is 2.7/ 25 = 0.54.
(b) Notice that the samples are random, so we cannot be sure exactly how many
will have means between 67.9 and 69.2 inches. We can work out the probability
that the sample mean will lie in this interval using the sampling distribution:
X̄ ∼ N (68.5, (0.54)2 ).
We need to make a continuity correction, to account for the fact that the
recorded means are rounded to the nearest 0.1 inch. For example, the probability
that the recorded mean is ≥ 67.9 inches is the same as the probability that the
sample mean is > 67.85. Therefore, the probability we want is:
P(67.85 < X̄ < 69.25) = P((67.85 − 68.5)/0.54 < Z < (69.25 − 68.5)/0.54)
= P(−1.20 < Z < 1.39)
= Φ(1.39) − Φ(−1.20)
= 0.9177 − 0.1151
= 0.8026.
Since there are 200 independent random samples drawn, we can now think of
each as a single trial. The recorded mean lies between 67.9 and 69.2 with
probability 0.8026 at each trial. We are dealing with a binomial distribution
with n = 200 trials and probability of success π = 0.8026. The expected number
of successes is:
nπ = 200 × 0.8026 = 160.52.
(c) The probability that the recorded mean is < 67.0 inches is:
P(X̄ < 66.95) = P(Z < (66.95 − 68.5)/0.54) = P(Z < −2.87) = Φ(−2.87) = 0.00205
so the expected number of recorded means below 67.0 out of a sample of 200 is 200 × 0.00205 = 0.41.
Activity 6.3 Suppose that we plan to take a random sample of size n from a
normal distribution with mean µ and standard deviation σ = 2.
Solution
(a) Let {X1 , . . . , Xn } denote the random sample. We know that the sampling
distribution of X̄ is N (µ, σ 2 /n), here N (4, 22 /20) = N (4, 0.2).
i. The probability we need is:
P(X̄ > 5) = P((X̄ − 4)/√0.2 > (5 − 4)/√0.2) = P(Z > 2.24) = 0.0126
where, as usual, Z ∼ N (0, 1).
ii. P (X̄ < 3) is obtained similarly. Note that this leads to
P (Z < −2.24) = 0.0126, which is equal to the P (X̄ > 5) = P (Z > 2.24)
result obtained above. This is because 5 is one unit above the mean µ = 4,
and 3 is one unit below the mean, and because the normal distribution is
symmetric around its mean.
iii. One way of expressing this is:
P (X̄ − µ > 1) = P (X̄ − µ < −1) = 0.0126
for µ = 4. This also shows that:
P (X̄ − µ > 1) + P (X̄ − µ < −1) = P (|X̄ − µ| > 1) = 2 × 0.0126 = 0.0252
and hence:
P (|X̄ − µ| ≤ 1) = 1 − 2 × 0.0126 = 0.9748.
In other words, the probability is 0.9748 that the sample mean is within
one unit of the true population mean, µ = 4.
(b) We can use the same ideas as in (a). Since X̄ ∼ N(µ, 4/n) we have:
P(|X̄ − µ| ≤ 0.5) = 1 − 2 × P(X̄ − µ > 0.5)
= 1 − 2 × P((X̄ − µ)/√(4/n) > 0.5/√(4/n))
= 1 − 2 × P(Z > 0.25√n)
≥ 0.95
which requires P(Z > 0.25√n) ≤ 0.025, i.e. 0.25√n ≥ 1.96, giving n ≥ (1.96/0.25)² = 61.5. Hence n must be at least 62.
(c) Here n > 62, yet x̄ is further than 0.5 units from the claimed mean of µ = 5.
Based on the result in (b), this would be quite unlikely if µ is really 5. One
explanation of this apparent contradiction is that µ is not really equal to 5.
This kind of reasoning will be the basis of statistical hypothesis testing, which
will be discussed later in the course.
6.7 The central limit theorem

For a random sample {X1, . . . , Xn} from a population with mean µ and variance σ², the central limit theorem (CLT) says that, for large n, approximately:
X̄ ∼ N(µ, σ²/n).
It may appear that the CLT is still somewhat limited, in that it applies only to sample
means calculated from random (IID) samples. However, this is not really true, for two
main reasons.
There are more general versions of the CLT which do not require the observations
Xi to be IID.
Even the basic version applies very widely, when we realise that the ‘X’ can also be
a function of the original variables in the data. For example, if X and Y are
random variables in the sample, we can also apply the CLT to:
Σ_{i=1}^{n} log(Xi)/n or Σ_{i=1}^{n} XiYi/n.
Therefore, the CLT can also be used to derive sampling distributions for many statistics
which do not initially look at all like X̄ for a single random variable in an IID sample.
You may get to do this in future courses.
The larger the sample size n, the better the normal approximation provided by the CLT
is. In practice, we have various rules-of-thumb for what is ‘large enough’ for the
approximation to be ‘accurate enough’. This also depends on the population
distribution of Xi . For example:
Figure 6.5: Sampling distributions of X̄ for various n (n = 1, 5, 10, 30, 100, 1000) when sampling from the Exponential(0.25) distribution.
Example 6.6 In the second case, we simulate 10,000 independent random samples
of sizes:
n = 1, 10, 30, 50, 100 and 1000
from the Bernoulli(0.2) distribution (for which µ = 0.2 and σ 2 = 0.16).
Here the distribution of Xi itself is not even continuous, and has only two possible
values, 0 and 1. Nevertheless, the sampling distribution of X̄ can be very
well-approximated by the normal distribution, when n is large enough.
Note that since here Xi = 1 or Xi = 0 for all i, X̄ = Σ_{i=1}^{n} Xi/n = m/n, where m is the number of observations for which Xi = 1. In other words, X̄ is the sample proportion of the value X = 1.
The normal approximation is clearly very bad for small n, but reasonably good
already for n = 50, as shown by the histograms in Figure 6.6.
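A minimal R sketch of this kind of simulation (10,000 replications per sample size, as in the example; the seed and plotting layout are arbitrary choices):

set.seed(1)                                 # for reproducibility
par(mfrow = c(2, 3))                        # 2-by-3 grid of histograms
for (n in c(1, 10, 30, 50, 100, 1000)) {
  xbars <- replicate(10000, mean(rbinom(n, size = 1, prob = 0.2)))
  hist(xbars, main = paste("n =", n), xlab = "sample proportion")
}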
(a) Explain what is meant by the sampling distribution of this average and discuss
its relationship to the population mean.
(c) If the population of all audits has a mean of £54 and a standard deviation of
£10, find the probability that:
Figure 6.6: Sampling distributions of X̄ for various n when sampling from the
Bernoulli(0.2) distribution.
Solution
(a) The sample average is composed of 25 randomly sampled data which are
subject to sampling variability, hence the average is also subject to this
variability. Its sampling distribution describes its probability properties. If a
large number of such averages were independently sampled, then their
histogram would be the sampling distribution.
(b) It is reasonable to assume that this sampling distribution is normal due to the
CLT, although the sample size is rather small. If n = 25 and µ = 54 and σ = 10,
then the CLT says that:
X̄ ∼ N(µ, σ²/n) = N(54, 100/25).
(c) i. We have:
P(X̄ > 60) = P(Z > (60 − 54)/√(100/25)) = P(Z > 3) = 0.0013.
which would help detect whether too many objects are being put in a box. Explain
how you would calculate the (approximate?) value of this probability. Mention any
relevant theorems or assumptions needed.
Solution
Let Xi denote the weight of the ith object, for i = 1, . . . , 200. The Xi s are assumed
to be independent and identically distributed with E(Xi ) = 1 and Var(Xi ) = (0.03)2 .
We require:
P(Σ_{i=1}^{200} Xi > 200.5) = P(Σ_{i=1}^{200} Xi/200 > 1.0025) = P(X̄ > 1.0025).
If the weights are not normally distributed, then by the central limit theorem:
P(X̄ > 1.0025) ≈ P(Z > (1.0025 − 1)/(0.03/√200)) = P(Z > 1.18) = 0.1190.
If the weights are normally distributed, then this is the exact (rather than
approximate) probability.
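As a quick check, the approximation can be evaluated directly in R:

# P(Xbar > 1.0025) with mu = 1, sigma = 0.03, n = 200
pnorm(1.0025, mean = 1, sd = 0.03 / sqrt(200), lower.tail = FALSE)   # about 0.119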
Activity 6.6
(b) Write down the sampling distribution of X̄ = Σ_{i=1}^{n} Xi/n for the sample considered in (a). In other words, write down the possible values of X̄ and their probabilities.
Hint: what are the possible values of Σᵢ Xi, and their probabilities?
(c) Suppose we have a random sample of size n = 100 from the Bernoulli(0.2)
distribution. What is the approximate sampling distribution of X̄ suggested by
the central limit theorem in this case? Use this distribution to calculate an
approximate value for the probability that X̄ > 0.3. (The true value of this
probability is 0.0061.)
Solution
(a) The sum of n independent Bernoulli random variables, each with success probability π, is Bin(n, π). Here n = 4 and π = 0.2, so Σ_{i=1}^{4} Xi ∼ Bin(4, 0.2).
(b) The possible values of Σᵢ Xi are 0, 1, 2, 3 and 4, and their probabilities can be calculated from the binomial distribution. For example:
P(Σᵢ Xi = 1) = (4!/(1! 3!)) (0.2)¹(0.8)³ = 4 × 0.2 × 0.512 = 0.4096.
(c) By the central limit theorem, X̄ is approximately N(0.2, 0.16/100), so:
P(X̄ > 0.3) ≈ P(Z > (0.3 − 0.2)/0.04) = P(Z > 2.5) = 0.0062.
This is very close to the probability obtained from the exact sampling distribution, which is about 0.0061.
Activity 6.7 A country is about to hold a referendum about leaving the European
Union. A survey of a random sample of adult citizens of the country is conducted. In
the sample, n respondents say that they plan to vote in the referendum. These n
respondents are then asked whether they plan to vote ‘Yes’ or ‘No’. Define X = 1 if
such a person plans to vote ‘Yes’, and X = 0 if such a person plans to vote ‘No’.
Suppose that in the whole population 49% of those people who plan to vote are currently planning to vote Yes, and hence the referendum result would show a (very slim) majority against leaving the European Union.
(b) If there are n = 50 likely voters in the sample, what is the probability that
X̄ > 0.5? (Such an opinion poll would suggest a majority supporting leaving the
European Union in the referendum.)
(c) How large should n be so that there is less than a 1% chance that X̄ > 0.5 in
the random sample? (This means less than a 1% chance of the opinion poll
incorrectly predicting a majority supporting leaving the European Union in the
referendum.)
Solution
(a) Here the individual responses, the Xi s, follow a Bernoulli distribution with
probability parameter π = 0.49. The mean of this distribution is 0.49, and the
variance is 0.49 × 0.51. Therefore, the central limit theorem (CLT)
approximation of the sampling distribution of X̄ is:
X̄ ∼ N(0.49, 0.49 × 0.51/n) = N(0.49, (0.4999/√n)²).
(b) When n = 50, the CLT approximation from (a) is X̄ ∼ N (0.49, (0.0707)2 ).
With this, we get:
P(X̄ > 0.5) = P((X̄ − 0.49)/0.0707 > (0.5 − 0.49)/0.0707) = P(Z > 0.14) = 0.4443.
(c) We now require:
P(X̄ > 0.5) = P((X̄ − 0.49)/(0.4999/√n) > (0.5 − 0.49)/(0.4999/√n)) = P(Z > 0.0200√n) < 0.01.
Using Table 4 of the New Cambridge Statistical Tables, the smallest z such that
P (Z > z) < 0.01 is z = 2.33. Therefore, we need:
0.0200√n ≥ 2.33 ⇔ n ≥ (2.33/0.0200)² = 13572.25
which means that we need at least n = 13,573 likely voters in the sample –
which is a very large sample size! Of course, the reason for this is that the
population of likely voters is almost equally split between those supporting
leaving the European Union, and those opposing. Hence such a large sample
size is necessary to be very confident of obtaining a representative sample.
(b) Write down the sampling distribution of X̄ = Σ_{i=1}^{n} Xi/n. In other words, write down the possible values of X̄ and their probabilities. (Assume n is not large.)
Hint: What are the possible values of Σ_{i=1}^{n} Xi and their respective probabilities?
(c) What are the mean and variance of the sampling distribution of X̄ when λ = 5
and n = 100?
Solution
(a) The sum of n independent Poisson(λ) random variables follows the Poisson(nλ)
distribution.
(b) Since Σᵢ Xi has possible values 0, 1, 2, . . ., the possible values of X̄ = Σᵢ Xi/n are 0/n, 1/n, 2/n, . . .. The probabilities of these values are determined by the probabilities of the values of Σᵢ Xi, which are obtained from the Poisson(nλ) distribution. Therefore, the probability function of X̄ = Σᵢ Xi/n is:
p(x̄) = e^{−nλ}(nλ)^{nx̄}/(nx̄)! for x̄ = 0, 1/n, 2/n, . . .
and p(x̄) = 0 otherwise.
(c) For Xi ∼ Poisson(λ) we have E(Xi ) = Var(Xi ) = λ, so the general results for X̄
give E(X̄) = λ and Var(X̄) = λ/n. When λ = 5 and n = 100, E(X̄) = 5 and
Var(X̄) = 5/100 = 0.05.
Solution
By the central limit theorem we have:
X̄ ∼ N(µ, σ²/n)
approximately, as n → ∞. Hence:
√n(X̄n − µ)/σ = √n(X̄n − 4)/2 → Z ∼ N(0, 1).
Therefore:
P(|X̄n − µ| < 0.2) ≈ P(|Z| < 0.1√n) = 2 × Φ(0.1√n) − 1.
However, 2 × Φ(0.1√n) − 1 ≥ 0.95 if and only if:
Φ(0.1√n) ≥ (1 + 0.95)/2 = 0.975
which is satisfied if:
0.1√n ≥ 1.96 ⇒ n ≥ 384.16.
Hence the smallest possible value of n is 385.
6.8 Some common sampling distributions

(n − 1)S²_X/σ² ∼ χ²_{n−1} and (m − 1)S²_Y/σ² ∼ χ²_{m−1}
√((n + m − 2)/(1/n + 1/m)) × (X̄ − Ȳ)/√((n − 1)S²_X + (m − 1)S²_Y) ∼ t_{n+m−2}
and:
S²_X/S²_Y ∼ F_{n−1, m−1}.
Here ‘χ²’, ‘t’ and ‘F’ refer to three new families of probability distributions:
the χ² distribution
the t distribution
the F distribution.
These are not often used as distributions of individual variables. Instead, they are used
as sampling distributions for various statistics. Each of them arises from the normal
distribution in a particular way.
We will now briefly introduce their main properties. This is in preparation for statistical
inference, where the uses of these distributions will be discussed at length.
The χ²_k distribution is a continuous distribution, taking values x ≥ 0. Its mean and variance are:
E(X) = k and Var(X) = 2k.
Its pdf is:
f(x) = x^{k/2−1} e^{−x/2}/(2^{k/2} Γ(k/2)) for x ≥ 0
where:
Γ(α) = ∫₀^∞ x^{α−1} e^{−x} dx
is the gamma function, which is defined for all α > 0. (Note the formula of the pdf of X ∼ χ²_k is not examinable.)
The shape of the pdf depends on the degrees of freedom k, as illustrated in Figure 6.7.
In most applications of the χ2 distribution the appropriate value of k is known, in which
case it does not need to be estimated from data.
If X1, X2, . . . , Xm are independent random variables and Xi ∼ χ²_{ki}, then their sum is also χ²-distributed with the individual degrees of freedom added, such that:
X1 + X2 + · · · + Xm ∼ χ²_{k1+k2+···+km}.
The uses of the χ2 distribution will be discussed later. One example though is if
{X1 , X2 , . . . , Xn } is a random sample from the population N (µ, σ 2 ), and S 2 is the
sample variance, then:
(n − 1)S²/σ² ∼ χ²_{n−1}.
This result is used to derive basic tools of statistical inference for both µ and σ 2 for the
normal distribution.
Figure 6.7: pdfs of the χ²_k distribution, for k = 1, 2, 4, 6 (left) and k = 10, 20, 30, 40 (right).
In the examination, you will need a table of some probabilities for the χ2 distribution.
Table 8 of the New Cambridge Statistical Tables shows the following information.
The numbers in the table are values of z such that P (X > z) = P/100 for the k
and P in that row and column, respectively.
Example 6.7 Consider the ‘ν = 5’ row, the 9.236 in the ‘P = 10’ column and the 11.07 in the ‘P = 5’ column. These mean, for X ∼ χ²5, that:
P(X > 9.236) = 0.10 and P(X > 11.07) = 0.05.
These also provide bounds for probabilities of other values. For example, since 10.00 is between 9.236 and 11.07, we can conclude that:
0.05 < P(X > 10.00) < 0.10.
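The same tail probabilities can be checked in R with pchisq, for example:

pchisq(9.236, df = 5, lower.tail = FALSE)   # 0.10
pchisq(11.07, df = 5, lower.tail = FALSE)   # 0.05
pchisq(10.00, df = 5, lower.tail = FALSE)   # about 0.075, indeed between 0.05 and 0.10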
Solution
We can compute the probability in two different ways. Working with the standard
normal distribution, we have:
P(Z² < 3.841) = P(−√3.841 < Z < √3.841)
= P(−1.96 < Z < 1.96)
= Φ(1.96) − Φ(−1.96)
= 0.9750 − (1 − 0.9750)
= 0.95.
Alternatively, we can use the fact that Z 2 follows a χ21 distribution. Using Table 8 of
the New Cambridge Statistical Tables we can see that 3.841 is the 5% right-tail value
for this distribution, and so P (Z 2 < 3.84) = 0.95, as before.
Activity 6.11 Suppose that X1 and X2 are independent N (0, 4) random variables.
Compute P (X12 < 36.84 − X22 ).
Solution
Rearrange the inequality to obtain:
P(X1² < 36.84 − X2²) = P(X1² + X2² < 36.84) = P((X1/2)² + (X2/2)² < 9.21).
Since X1 /2 and X2 /2 are independent N (0, 1) random variables, the sum of their
squares will follow a χ22 distribution. Using Table 8 of the New Cambridge Statistical
Tables, we see that 9.210 is the 1% right-tail value, so the probability we are looking
for is 0.99.
Solution
(a) P (B < 12) ≈ 0.9, directly from Table 8 of the New Cambridge Statistical Tables,
where B ∼ χ27 .
(c) A chi-squared random variable only assumes non-negative values. Hence each of
A, B and C is non-negative, so A3 + B 3 + C 3 ≥ 0, and:
P (A3 + B 3 + C 3 < 0) = 0.
Suppose Z ∼ N (0, 1), X ∼ χ2k , and Z and X are independent. The distribution of
the random variable:
T = Z/√(X/k)
is the t distribution with k degrees of freedom. This is denoted T ∼ tk or
T ∼ t(k). The distribution is also known as ‘Student’s t distribution’.
[Figure: pdfs of the t_k distribution for k = 1, 3, 8 and 20, with the N(0, 1) pdf for comparison.]
In the examination, you will need a table of some probabilities for the t distribution.
Table 10 of the New Cambridge Statistical Tables shows the following information.
Example 6.8 Consider the ‘ν = 4’ row, and the ‘P = 5’ column, which contains 2.132. This means, where T ∼ t4, that:
P(T > 2.132) = 0.05.
The table also provides bounds for other probabilities. For example, the number in the ‘P = 2.5’ column is 2.776, so P(T > 2.776) = 0.025. Since 2.132 < 2.5 < 2.776, we know that 0.025 < P(T > 2.5) < 0.05.
Results for left-tail probabilities P(T < z) = P/100, where T ∼ tk, can also be obtained, because the t distribution is symmetric around 0. This means that P(T < t) = P(T > −t). Using T ∼ t4, for example:
P(T < −2.132) = P(T > 2.132) = 0.05
and P(T < −2.5) < 0.05 [since P(T > 2.5) < 0.05].
This is the same trick that we used for the standard normal distribution.
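In R, qt and pt give the same information as Table 10, for example:

qt(0.975, df = 4)                       # 2.776, the 'P = 2.5' column for nu = 4
pt(2.5, df = 4, lower.tail = FALSE)     # about 0.033, between 0.025 and 0.05
pt(-2.5, df = 4)                        # the same value, by symmetry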
Activity 6.13 The independent random variables X1 , X2 and X3 are each normally
distributed with a mean of 0 and a variance of 4. Find:
Solution
(a) We have:
X1 − X2 − X3 ∼ N(0, 12).
So:
P(X1 > X2 + X3) = P(X1 − X2 − X3 > 0) = P(Z > 0) = 0.5
using Table 3 of the New Cambridge Statistical Tables.
(b) We have:
P(X1 > 5(X2² + X3²)^{1/2}) = P(X1/2 > 5(X2²/4 + X3²/4)^{1/2})
= P(X1/2 > 5√2 × ((X2²/4 + X3²/4)/2)^{1/2})
i.e. P(Y1 > 5√2 Y2), where Y1 ∼ N(0, 1) and Y2 = √(V/2) with V ∼ χ²2, or P(Y3 > 7.07), where Y3 ∼ t2. From Table 10 of the New Cambridge Statistical Tables, this is approximately 0.01.
Solution
(a) We have Xi ∼ N(0, 4), for i = 1, 2, 3, 4, hence 3X1 ∼ N(0, 36) and 4X2 ∼ N(0, 64). Therefore:
(3X1 + 4X2)/10 = Z ∼ N(0, 1).
So, P(3X1 + 4X2 > 5) = k = P(Z > 0.5) = 0.3085, using Table 4 of the New Cambridge Statistical Tables.
(b) We have Xi/2 ∼ N(0, 1), for i = 1, 2, 3, 4, hence (X3² + X4²)/4 ∼ χ²2. So:
P(X1 > k√(X3² + X4²)) = 0.025 = P(T > k√2)
where T ∼ t2 and hence k√2 = 4.303, so k = 3.04268, using Table 10 of the New Cambridge Statistical Tables.
where X ∼ χ23 . Therefore, k/4 = 6.251 using Table 8 of the New Cambridge
Statistical Tables. Hence k = 25.004.
Activity 6.15 Suppose Xi ∼ N (0, 4), for i = 1, 2, 3, 4. Assume all these random
variables are independent. Derive the value of k in each of the following.
Solution
(b) Xi²/4 ∼ χ²1 for i = 1, 2, 3, 4, hence (X1² + X2² + X3² + X4²)/4 ∼ χ²4, so:
P(X1² + X2² + X3² + X4² < k) = P(X < k/4) = 0.99
where X ∼ χ²4. Therefore, k/4 = 13.28, using Table 8 of the New Cambridge Statistical Tables, so k = 53.12.
For the part involving the t2 distribution we require:
P(T < √2 × k) = 0.01
where T ∼ t2. Hence √2 × k = −6.965, so k = −4.925.
Let U and V be two independent random variables, where U ∼ χ2p and V ∼ χ2k . The
distribution of:
F = (U/p)/(V/k)
is the F distribution with degrees of freedom (p, k), denoted F ∼ Fp, k or
F ∼ F (p, k).
[Figure: pdfs f(x) of the F_{p,k} distribution for (p, k) = (10, 3), (10, 10) and (10, 50).]
For F ∼ Fp, k , E(F ) = k/(k − 2), for k > 2. If F ∼ Fp, k , then 1/F ∼ Fk, p . If T ∼ tk ,
then T 2 ∼ F1, k .
Tables of F distributions will be needed for some purposes. They will be available in the
examination. We will postpone practice with them until later in the course.
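These properties can nonetheless be verified numerically in R, for instance:

qf(0.95, df1 = 10, df2 = 50)     # upper 5% point of F_{10,50}
k <- 8
qt(0.975, df = k)^2              # equals the next line, since T^2 ~ F_{1,k}
qf(0.95, df1 = 1, df2 = k)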
(a) χ23 .
(b) t2 .
(c) F1, 2 .
Solution
The following are possible, but not exhaustive, solutions.
(a) Z1²
(b) Z1²/Z2²
(c) Z1/√(Z2²)
(d) Σ_{i=1}^{k} Zi/k
(e) Σ_{i=1}^{k} Zi²
Solution
(e) Σ_{i=1}^{k} Zi² ∼ χ²_k
Solution
(a) We have X1 ∼ N(0, 9) and X2 ∼ N(0, 9). Hence 2X2 ∼ N(0, 36) and X1 + 2X2 ∼ N(0, 45). So:
P(X1 + 2X2 > 9) = P(Z > 9/√45) = P(Z > 1.34) = 0.0901
using Table 3 of the New Cambridge Statistical Tables.
(b) We have X1/3 ∼ N(0, 1) and X2/3 ∼ N(0, 1). Hence X1²/9 ∼ χ²1 and X2²/9 ∼ χ²1. Therefore, X1²/9 + X2²/9 ∼ χ²2, from which the required probability can be found using Table 8.
(c) We have X1²/9 + X2²/9 ∼ χ²2 and also X3²/9 + X4²/9 ∼ χ²2, and these two quantities are independent.
6.9 Prelude to statistical inference

Questions like these are difficult to study in a laboratory, and admit no self-evident axioms. Statistics provides a way of answering such questions using data.
What should we learn in ‘Statistics’? The basic ideas, methods and theory. Some guidelines for learning/applying statistics are the following.
Understand what data say in each specific context. All the methods are just tools
to help us to understand data.
It may take a while to catch the basic idea of statistics – keep thinking!
Example 6.9 A new type of tyre was designed to increase its lifetime. The
manufacturer tested 120 new tyres and obtained the average lifetime (over these 120
tyres) of 35,391 miles. So the manufacturer claims that the mean lifetime of new
tyres is 35,391 miles.
Example 6.10 A newspaper sampled 1,000 potential voters, and 350 of them were
supporters of Party X. It claims that the proportion of Party X voters in the whole
country is 350/1000 = 0.35, i.e. 35%.
In both cases, the conclusion is drawn on a population (i.e. all the objects concerned)
based on the information from a sample (i.e. a subset of the population).
In Example 6.9, it is impossible to measure the whole population. In Example 6.10, it is
not economical to measure the whole population. Therefore, errors are inevitable!
The population is the entire set of objects concerned, and these objects are typically
represented by some numbers. We do not know the entire population in practice.
In Example 6.9, the population consists of the lifetimes of all tyres, including those to
be produced in the future. For the opinion poll in Example 6.10, the population consists
of many ‘1’s and ‘0’s, where each ‘1’ represents a voter for Party X, and each ‘0’
represents a voter for other parties.
Why do we need to model the population with a probability distribution? Two reasons are as follows.
Because the questions we ask concern the entire population, not just the data we
have. Having a model for the population tells us that the remaining population is
not much different from our data or, in other words, that the data are
representative of the population.
Because the process of drawing a sample from a population is a bit like the process
of generating random variables. A different sample would produce different values.
Therefore, the population from which we draw a random sample is represented as a
probability distribution.
Example 6.11 Continuing with Example 6.9, the population may be assumed to
be N (µ, σ 2 ) with θ = (µ, σ 2 ), where µ is the ‘true’ lifetime.
For the opinion poll in Example 6.10, let:
P(X = 1) = P(a Party X voter) = π
and:
P(X = 0) = P(a non-Party X voter) = 1 − π
where π ∈ [0, 1] is the unknown parameter: the population proportion of Party X voters.
Example 6.13 For the tyre lifetime in Example 6.9, suppose the realised sample
(of size n = 120) gives the sample mean:
x̄ = (1/n) Σ_{i=1}^{n} xi = 35391.
Is the sample mean X̄ a good estimator of the unknown ‘true’ lifetime µ? Obviously,
we cannot use the real number 35,391 to assess how good this estimator is, as a different
sample may give a different average value, such as 36,721.
By treating {X1 , . . . , Xn } as random variables, X̄ is also a random variable. If the
distribution of X̄ concentrates closely around (unknown) µ, X̄ is a good estimator of µ.
Definition of a statistic
Any known function of a random sample is called a statistic. Statistics are used for
statistical inference such as estimation and testing.
An observed random sample is often denoted as {x1 , . . . , xn }, indicating that they are n
real numbers. They are seen as a realisation of n IID random variables {X1 , . . . , Xn }.
The connection between a population and a sample is shown in Figure 6.10, where θ is
a parameter. A known function of {X1 , . . . , Xn } is called a statistic.
P(X = x) = (20!/(x! (20 − x)!)) π^x (1 − π)^{20−x} for x = 0, 1, . . . , 20
and 0 otherwise.
Some probability questions are as follows. Treating π as known:
what is P (X < 10) (the proportion of students attending fewer than half of the
lectures)?
1. Let X be the amount of money won or lost in betting $5 on red in roulette, such
that:
P(X = 5) = 18/38 and P(X = −5) = 20/38.
If a gambler bets on red 100 times, use the central limit theorem to estimate the
probability that these wagers result in less than $50 in losses.
(c) Z1²/(Σ_{i=2}^{5} Zi²/4).
Chapter 7
Point estimation
By the end of this chapter, you should be able to:
find estimators using the method of moments, least squares and maximum likelihood.
7.3 Introduction
The basic setting is that we assume a random sample {X1 , . . . , Xn } is observed from a
population F (x; θ). The goal is to make inference (i.e. estimation or testing) for the
unknown parameter(s) θ.
Example 7.1 Let {X1 , . . . , Xn } be a random sample from a population with mean
µ = E(Xi ). Find an estimator of µ.
Since µ is the mean of the population, a natural estimator would be the sample
mean µb = X̄, where:
X̄ = (1/n) Σ_{i=1}^{n} Xi = (X1 + · · · + Xn)/n.
We call µ̂ = X̄ a point estimator (or simply an estimator) of µ.
For example, if we have an observed sample of 9, 16, 15, 4 and 12, hence of size
n = 5, the sample mean is:
µ̂ = (9 + 16 + 15 + 4 + 12)/5 = 11.2.
The value 11.2 is a point estimate of µ. For an observed sample of 15, 16, 10, 8 and 9, we obtain µ̂ = 11.6.
Bias of an estimator
The bias of an estimator θ̂ of θ is defined as Bias(θ̂) = E(θ̂) − θ. An estimator is:
unbiased if E(θ̂) − θ = 0.
In words, the expected value of the estimator is the true parameter being estimated, i.e.
on average, under repeated sampling, an unbiased estimator correctly estimates θ.
We view bias as a ‘bad’ thing, so, other things being equal, the smaller an estimator’s
bias the better.
Variance of an estimator
For the sample mean:
Var(X̄) = σ²/n.  (7.2)
It is clear that in (7.2) increasing the sample size n decreases the estimator’s variance
(and hence the standard error, i.e. the square root of the estimator’s variance), therefore
increasing the precision of the estimator.2 We conclude that variance is also a ‘bad’
thing so, other things being equal, the smaller an estimator’s variance the better.
The mean squared error (MSE) of an estimator is the average squared error.
Formally, this is defined as:
MSE(θ̂) = E[(θ̂ − θ)²].  (7.3)
² Remember, however, that this increased precision comes at a cost – namely the increased expenditure on data collection.
It is possible to decompose the MSE into components involving the bias and variance of an estimator. Recall that:
Var(X) = E(X²) − (E(X))².
Also, note that for any constant k, Var(X ± k) = Var(X), that is, adding or subtracting a constant has no effect on the variance of a random variable. Noting that the true parameter θ is some (unknown) constant,³ it immediately follows, by setting X = θ̂ − θ, that:
MSE(θ̂) = E[(θ̂ − θ)²]
= Var(θ̂ − θ) + (E(θ̂ − θ))²
= Var(θ̂) + (Bias(θ̂))².  (7.4)
E(T1) = E(X̄) = µ
and:
Var(T1) = Var(X̄) = σ²/n.
Hence T1 is an unbiased estimator of µ. So the MSE of T1 is just the variance of T1, since the bias is 0. Therefore, MSE(T1) = σ²/n.
³ Even though θ is an unknown constant, it is known to be a constant!
⁴ Or, for that matter, a ‘very bad’ thing!
Moving to T2, note:
E(T2) = E((X1 + Xn)/2) = (E(X1) + E(Xn))/2 = (µ + µ)/2 = µ
and:
Var(T2) = (1/2²)(Var(X1) + Var(Xn)) = (1/4) × 2σ² = σ²/2.
So T2 is also an unbiased estimator of µ, hence MSE(T2) = σ²/2.
Finally, consider T3, noting:
E(T3) = E(X̄ + 3) = E(X̄) + 3 = µ + 3
and:
Var(T3) = Var(X̄ + 3) = Var(X̄) = σ²/n.
So T3 is a positively-biased estimator of µ, with a bias of 3. Hence we have MSE(T3) = σ²/n + 3² = σ²/n + 9.
We seek the estimator with the smallest MSE. Clearly, MSE(T1) < MSE(T3) so we can eliminate T3. Now comparing T1 with T2, we note that MSE(T1) = σ²/n < σ²/2 = MSE(T2) for any n > 2, so T1 is the preferred estimator.
i. µ̂ = X̄ is a better estimator of µ than X1 as:
MSE(µ̂) = σ²/n < MSE(X1) = σ².
ii. As n → ∞, MSE(X̄) → 0, i.e. when the sample size tends to infinity, the error in
estimation goes to 0. Such an estimator is called a (mean-square) consistent
estimator.
Consistency is a reasonable requirement. It may be used to rule out some silly
estimators.
For µ̃ = (X1 + X4 )/2, MSE(µ̃) = σ 2 /2 which does not converge to 0 as n → ∞.
This is due to the fact that only a small portion of information (i.e. X1 and X4 )
is used in the estimation.
iii. For any random sample {X1, . . . , Xn} from a population with mean µ and variance σ², it holds that:
E(X̄) = µ and Var(X̄) = σ²/n.
The derivation of the expected value and variance of the sample mean was covered
in Chapter 6.
Example 7.5 Bias by itself cannot be used to measure the quality of an estimator. Consider two artificial estimators of θ, θ̂1 and θ̂2, such that θ̂1 takes only the two values, θ − 100 and θ + 100, and θ̂2 takes only the two values θ and θ + 0.2, with the following probabilities:
P(θ̂1 = θ − 100) = P(θ̂1 = θ + 100) = 0.5
and:
P(θ̂2 = θ) = P(θ̂2 = θ + 0.2) = 0.5.
Then θ̂1 is unbiased, whereas θ̂2 has a bias of 0.1. However:
MSE(θ̂1) = E[(θ̂1 − θ)²] = (−100)² × 0.5 + (100)² × 0.5 = 10,000
and:
MSE(θ̂2) = E[(θ̂2 − θ)²] = 0² × 0.5 + (0.2)² × 0.5 = 0.02.
Hence θ̂2 is a much better (i.e. more accurate) estimator of θ than θ̂1.
Solution
We have:
E(X) = E(X1/2 + X2/2) = E(X1)/2 + E(X2)/2 = µ/2 + µ/2 = µ
and:
E(Y) = E(X1/3 + 2X2/3) = E(X1)/3 + 2E(X2)/3 = µ/3 + 2µ/3 = µ.
Solution
Let us consider the bias first. The estimator θ̂1 is just the sample mean, so we know that it is unbiased. The estimator θ̂2 has expectation:
E(θ̂2) = E((X1 + X2)/2) = (E(X1) + E(X2))/2 = (θ + θ)/2 = θ
so it is also unbiased. Turning to the variances:
Var(θ̂1) = Var(X̄) = σ²/n
and:
Var(θ̂2) = Var((X1 + X2)/2) = (Var(X1) + Var(X2))/4 = (σ² + σ²)/4 = σ²/2.
Since n > 2, we can see that θ̂1 has a lower variance than θ̂2, so it is a better estimator. Unsurprisingly, we obtain a better estimator of θ by considering the whole sample, rather than just the first two values.
Activity 7.3 Find the MSEs of the estimators in the previous activity. Are they
consistent estimators of θ?
Solution
The MSEs are:
2 σ 2 σ2
MSE(θb1 ) = Var(θb1 ) + Bias(θb1 ) = +0=
n n
and: 2 σ 2
σ2
MSE(θb2 ) = Var(θb2 ) + Bias(θb2 ) = +0= .
2 2
Note that the MSE of an unbiased estimator is equal to its variance.
The estimator θb1 has MSE equal to σ 2 /n, which converges to 0 as n → ∞. The
estimator θb2 has MSE equal to σ 2 /2, which stays constant as n → ∞. Therefore, θb1
is a (mean-square) consistent estimator of θ, whereas θb2 is not.
Activity 7.4 Let X1 and X2 be two independent random variables with the same mean, µ, and the same variance, σ² < ∞. Let µ̂ = aX1 + bX2 be an estimator of µ, where a and b are two non-zero constants.
(b) Find the minimum mean squared error (MSE) among all unbiased estimators of
µ.
Solution
Therefore, π̂ is an unbiased estimator of π. Furthermore, by independence:
MSE(π̂) = Var(π̂) = Var((1/n) Σ_{i=1}^{n} Xi) = (1/n²) Σ_{i=1}^{n} Var(Xi) = π(1 − π)/n.
(b) Y may only take the integer values 0, 1, . . . , n. For 0 ≤ y ≤ n, the event Y = y occurs if and only if there are exactly y 1s and (n − y) 0s among the values of X1, . . . , Xn. However, those y 1s may take any y out of the n positions. Hence:
P(Y = y) = (n!/(y! (n − y)!)) π^y (1 − π)^{n−y}.
(c) Note π̂ = Y/n. Hence π̂ has a rescaled binomial distribution on the n + 1 points {0, 1/n, 2/n, . . . , 1}.
Finding estimators
Let {X1 , . . . , Xn } be a random sample from a population F (x; θ). Suppose θ has p
components (for example, for a normal population N (µ, σ 2 ), p = 2; for a Poisson
population with parameter λ, p = 1).
Let:
µk = µk(θ) = E(X^k)
denote the kth population moment, for k = 1, 2, . . .. Therefore, µk depends on the unknown parameter θ, as everything else about the distribution F(x; θ) is known. Denote the kth sample moment by:
Mk = (1/n) Σ_{i=1}^{n} Xi^k = (X1^k + · · · + Xn^k)/n.
The method of moments estimator (MME) θ̂ is obtained by solving the p equations:
µk(θ̂) = Mk for k = 1, . . . , p.
Example 7.6 Let {X1 , . . . , Xn } be a random sample from a population with mean
µ and variance σ 2 < ∞. Find the MM estimator of (µ, σ 2 ).
There are two unknown parameters. Set:
µ̂1 = M1 and µ̂2 = M2 = (1/n) Σ_{i=1}^{n} Xi².
This gives us µ̂ = M1 = X̄.
Since σ² = µ2 − µ1² = E(X²) − (E(X))², we have:
σ̂² = M2 − M1² = (1/n) Σ_{i=1}^{n} Xi² − X̄² = (1/n) Σ_{i=1}^{n} (Xi − X̄)².
Note we have:
E(σ̂²) = E((1/n) Σ_{i=1}^{n} Xi² − X̄²)
= (1/n) Σ_{i=1}^{n} E(Xi²) − E(X̄²)
= E(X²) − E(X̄²)
= σ² + µ² − (σ²/n + µ²)
= (n − 1)σ²/n.
Since:
E(σ̂²) − σ² = −σ²/n < 0
σ̂² is a negatively-biased estimator of σ².
The sample variance, defined as:
S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²
is, by contrast, an unbiased estimator of σ².
Note the MME does not use any information on F (x; θ) beyond the moments.
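The negative bias of σ̂² (and the unbiasedness of S²) can be illustrated with a small R simulation; the population N(2, 4) and the sample size n = 10 below are arbitrary choices:

set.seed(42)
n <- 10
est <- replicate(10000, {
  x <- rnorm(n, mean = 2, sd = 2)                # population variance is 4
  c(mm = mean((x - mean(x))^2), s2 = var(x))     # var() uses the n-1 divisor
})
rowMeans(est)   # mm is close to (n-1)*4/n = 3.6; s2 is close to 4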
The idea is that Mk should be pretty close to µk when n is sufficiently large. In fact:
Mk = (1/n) Σ_{i=1}^{n} Xi^k
converges to:
µk = E(X^k)
as n → ∞. This is due to the law of large numbers (LLN). We illustrate this phenomenon by simulation using R.
Example 7.7 For N (2, 4), we have µ1 = 2 and µ2 = 8. We use the sample
moments M1 and M2 as estimators of µ1 and µ2 , respectively. Note how the sample
moments converge to the population moments as the sample size increases.
For a sample of size n = 10, we obtained m1 = 0.5145838 and m2 = 2.171881.
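A sketch of such an R simulation (the seed and sample sizes are illustrative, so the output will not reproduce the numbers quoted above):

set.seed(1)
for (n in c(10, 100, 1000, 10000)) {
  x <- rnorm(n, mean = 2, sd = 2)
  cat("n =", n, " m1 =", mean(x), " m2 =", mean(x^2), "\n")  # targets: 2 and 8
}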
Activity 7.6 Let {X1 , . . . , Xn } be a random sample from the (continuous) uniform
distribution such that X ∼ Uniform[0, θ], where θ > 0. Find the method of moments
estimator (MME) of θ.
Solution
The pdf of Xi is:
f(xi; θ) = 1/θ for 0 ≤ xi ≤ θ, and 0 otherwise.
Therefore:
E(Xi) = ∫₀^θ xi (1/θ) dxi = (1/θ)[xi²/2]₀^θ = θ/2.
Therefore, setting µ̂1 = M1, we have:
θ̂/2 = X̄ ⇒ θ̂ = 2X̄.
Solution
The mean of the Uniform[a, b] distribution is (a + b)/2. In our case, this gives E(X) = (−θ + θ)/2 = 0. The first population moment does not depend on θ, so we need to move to the next (i.e. second) population moment.
Recall that the variance of the Uniform[a, b] distribution is (b − a)²/12. Hence the second population moment is:
E(X²) = Var(X) + (E(X))² = (θ − (−θ))²/12 + 0² = θ²/3.
We set this equal to the second sample moment to obtain:
(1/n) Σ_{i=1}^{n} Xi² = θ̂²/3
and hence θ̂ = √((3/n) Σ_{i=1}^{n} Xi²).
Activity 7.8 Consider again the Uniform[−θ, θ] distribution from the previous
question. Suppose that we observe the following data:
Solution
The point estimate is:
θ̂MM = √((3/8) Σ_{i=1}^{8} xi²) ≈ 2.518
which implies that the data came from a Uniform[−2.518, 2.518] distribution.
However, this clearly cannot be true since the observation x5 = 2.8 falls outside this
range!
The method of moments does not take into account that all of the observations need
to lie in the interval [−θ, θ], and so it fails to produce a useful estimate.
Activity 7.9 Let X ∼ Bin(n, π), where n is known. Find the methods of moments
estimator (MME) of π.
Solution
The pf of the binomial distribution is:
P(X = x) = (n!/(x! (n − x)!)) π^x (1 − π)^{n−x} for x = 0, 1, . . . , n.
Since E(X) = nπ, setting the population mean equal to the observed value of X gives nπ̂ = X, i.e. the MME is π̂ = X/n.
7.6 Least squares (LS) estimation

The estimator X̄ is also the least squares estimator (LSE) of µ, defined as the value of a which minimises:
S = Σ_{i=1}^{n} (Xi − a)².
Proof: Given that S = Σ_{i=1}^{n} (Xi − a)² = Σ_{i=1}^{n} (Xi − X̄)² + n(X̄ − a)², where all terms are non-negative, the value of a for which S is minimised is when n(X̄ − a)² = 0, i.e. a = X̄. Hence µ̂ = X̄.
Activity 7.10 Suppose that you are given observations y1 , y2 , y3 and y4 such that:
y1 = α + β + ε1
y2 = −α + β + ε2
y3 = α − β + ε3
y4 = −α − β + ε4 .
The random variables εi , for i = 1, 2, 3, 4, are independent and normally distributed
with mean 0 and variance σ 2 .
Solution
(a) The least squares estimators minimise S = (y1 − α − β)² + (y2 + α − β)² + (y3 − α + β)² + (y4 + α + β)². Differentiating:
∂S/∂α = −2(y1 − α − β) + 2(y2 + α − β) − 2(y3 − α + β) + 2(y4 + α + β) = −2(y1 − y2 + y3 − y4) + 8α
and:
∂S/∂β = −2(y1 − α − β) − 2(y2 + α − β) + 2(y3 − α + β) + 2(y4 + α + β) = −2(y1 + y2 − y3 − y4) + 8β.
Setting both derivatives to zero gives α̂ = (y1 − y2 + y3 − y4)/4 and β̂ = (y1 + y2 − y3 − y4)/4.
(b) α̂ is an unbiased estimator of α since:
E(α̂) = E((y1 − y2 + y3 − y4)/4) = (α + β + α − β + α − β + α + β)/4 = α.
Its variance is:
Var(α̂) = Var((y1 − y2 + y3 − y4)/4) = 4σ²/16 = σ²/4.
Estimator accuracy
For µ̂ = X̄:
MSE(µ̂) = E[(µ̂ − µ)²] = σ²/n.
By the central limit theorem:
P(√n(X̄ − µ)/σ ≤ z) → Φ(z)
for any z, where Φ(z) is the cdf of N(0, 1), i.e. when n is large, X̄ ∼ N(µ, σ²/n) approximately.
Hence when n is large:
P(|X̄ − µ| ≤ 1.96 × σ/√n) ≈ 0.95.
Activity 7.11 A random sample of size n = 400 produced the sample sums Σᵢ xi = 983 and Σᵢ xi² = 4729.
(a) Calculate point estimates for the population mean and the population standard
deviation.
Solution
(a) As before, we use the sample mean to estimate the population mean, i.e. µ̂ = x̄ = 983/400 = 2.4575, and the sample variance to estimate the population variance, i.e. we have:
s² = (1/(n − 1)) Σ_{i=1}^{400} (xi − x̄)² = (1/(n − 1)) (Σ_{i=1}^{400} xi² − nx̄²)
= (4729 − 400 × (2.4575)²)/399
= 5.7977.
Therefore, the estimate for the population standard deviation is s = √5.7977 = 2.4078.
(b) The estimated standard error is s/√n = 2.4078/√400 = 0.1204.
Note that the estimated standard error is rather small, indicating that the estimate
of the population mean is rather accurate. This is due to two factors: (i.) the
population variance is small, as evident from the small value of s2 , and (ii.) the
sample size of n = 400 is rather large.
Note also that using the n divisor (i.e. the method of moments estimator of σ²) we have Σ_{i=1}^{n} (xi − x̄)²/n = 5.7832, which is pretty close to s².
7.7 Maximum likelihood (ML) estimation

Example 7.9 Suppose we toss a coin 10 times, and record the number of ‘heads’ as a random variable X. Therefore:
X ∼ Bin(10, π).
Suppose we observe x = 8 heads. Any π ∈ (0, 1) could have generated this outcome. Nevertheless, π = 0.8 is the most likely, or ‘maximally’ likely, value of the parameter. Why do we think ‘π = 0.8’ is most likely?
Let:
L(π) = P(X = 8) = (10!/(8! 2!)) π⁸ (1 − π)².
Since x = 8 is the event which actually occurred in the experiment, this probability ought to be relatively large. Figure 7.1 shows a plot of L(π) as a function of π.
The most likely value of π should make this probability as large as possible. This value is taken as the maximum likelihood estimate of π.
Maximising L(π) is equivalent to maximising:
l(π) = log L(π) = 8 log(π) + 2 log(1 − π) + constant
and setting dl(π)/dπ = 8/π − 2/(1 − π) = 0 gives π̂ = 0.8.
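A sketch of this maximisation in R (dbinom gives the Bin(10, π) probabilities):

L <- function(p) dbinom(8, size = 10, prob = p)          # likelihood of pi given x = 8
curve(L, from = 0, to = 1, xlab = "pi", ylab = "L(pi)")  # a plot like Figure 7.1
optimize(L, interval = c(0, 1), maximum = TRUE)$maximum  # approximately 0.8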
Let f(x1, . . . , xn; θ) be the joint probability density function (or probability function) for random variables (X1, . . . , Xn). The maximum likelihood estimator (MLE) of θ based on the observations {X1, . . . , Xn} is the statistic:
θ̂ = θ̂(X1, . . . , Xn)
which maximises the likelihood function L(θ) = f(X1, . . . , Xn; θ) over θ.
iii. It is often more convenient to use the log-likelihood function, denoted as:
l(θ) = log L(θ) = Σ_{i=1}^{n} log(f(Xi; θ))
since maximising l(θ) is equivalent to maximising L(θ):
θ̂ = arg max_θ l(θ).
iv. For a smooth likelihood function, the MLE is often the solution of the equation:
dl(θ)/dθ = 0.
Example 7.10 Let {X1 , . . . , Xn } be a random sample from a distribution with pdf:
f(x; λ) = λ²x e^{−λx} for x > 0, and 0 otherwise
The log-likelihood function is l(λ) = 2n log λ − nλX̄ + c, where c = log(Π_{i=1}^{n} Xi) is a constant not involving λ.
Note the MLE λ̂ may be obtained from maximising L(λ) directly. However, it is much easier to work with l(λ) instead. Setting:
dl(λ)/dλ = 2n/λ − nX̄ = 0
gives λ̂ = 2/X̄. By the invariance principle, the MLE of λ² would be λ̂² = 4/X̄².
Let us assume {X1 , . . . , Xn } is the sample (i.e. n observed values). Among them,
there are n1 ‘1’s, n2 ‘2’s, and n − n1 − n2 ‘3’s. The likelihood function is (where ∝
means ‘proportional to’):
L(θ) = Π_{i=1}^{n} p(Xi; θ) = p(1; θ)^{n1} p(2; θ)^{n2} p(3; θ)^{n−n1−n2}.
Setting the derivative of the log-likelihood to zero leads to:
(1 − θ̂)(2n1 + n2) = θ̂(2n − 2n1 − n2)
which gives θ̂ = (2n1 + n2)/(2n).
where:
X(n) = max_i Xi.
(b) For the given data, the maximum observation is x(3) = 1.2. Therefore, the maximum likelihood estimate is θ̂ = 1.2, at which the plotted likelihood function attains its maximum.
Solution
The probability function is:
P(X = x) = e^{−λ} λ^x/x!.
The likelihood and log-likelihood functions are, respectively:
L(λ) = Π_{i=1}^{n} e^{−λ} λ^{Xi}/Xi! = e^{−nλ} λ^{nX̄}/(Π_{i=1}^{n} Xi!)
and:
l(λ) = log L(λ) = nX̄ log(λ) − nλ + C = n(X̄ log(λ) − λ) + C
where C is a constant (i.e. it may depend on Xi but cannot depend on the
parameter). Setting:
dl(λ)/dλ = n(X̄/λ − 1) = 0
we obtain the MLE λ̂ = X̄, which is also the MME.
Solution
The likelihood function is:
L(λ) = Π_{i=1}^{n} f(xi; λ) = Π_{i=1}^{n} λ e^{−λXi} = λ^n e^{−λ Σᵢ Xi} = λ^n e^{−λnX̄}
so the log-likelihood is l(λ) = n log λ − nλX̄, and dl(λ)/dλ = n/λ − nX̄ = 0 gives λ̂ = 1/X̄. This is indeed a maximum, since:
d²l(λ)/dλ² = −n/λ² < 0.
Activity 7.14 Use the observed random sample x1 = 8.2, x2 = 10.6, x3 = 9.1 and
x4 = 4.9 to calculate the maximum likelihood estimate of λ in the exponential pdf:
f(x; λ) = λ e^{−λx} for x ≥ 0, and 0 otherwise.
Solution
We derive a general formula with a random sample {X1, . . . , Xn} first. The joint pdf is:
f(x1, . . . , xn; λ) = λ^n e^{−λnx̄} for x1, . . . , xn ≥ 0, and 0 otherwise.
Setting:
dl(λ)/dλ = n/λ − nX̄ = 0 ⇒ λ̂ = 1/X̄.
For the given sample, x̄ = (8.2 + 10.6 + 9.1 + 4.9)/4 = 8.2. Therefore, λ̂ = 1/8.2 = 0.1220.
Activity 7.15 Let {X1 , . . . , Xn } be a random sample from a population with the
probability distribution specified in (a) and (b) below, respectively. Find the MLEs
of the following parameters.
(b) π and θ = π/(1 − π), when the population has a Bernoulli (two-point)
distribution, that is p(1; π) = π = 1 − p(0; π), and 0 otherwise.
Solution
(a) Noting that Σ_{i=1}^{n} Xi = nX̄, the likelihood function is:
L(λ) = λ^n e^{−λnX̄}.
Setting:
dl(λ)/dλ = n/λ − nX̄ = 0
we obtain the MLE λ̂ = 1/X̄. The MLE of µ is:
µ̂ = µ(λ̂) = 1/λ̂ = X̄
by the invariance principle.
(b) Let Y = Σ_{i=1}^{n} Xi. The likelihood function is:
L(π) = π^Y (1 − π)^{n−Y}.
Setting:
dl(π)/dπ = Y/π − (n − Y)/(1 − π) = 0
we obtain the MLE π̂ = Y/n = X̄. The MLE of θ is:
θ̂ = θ(π̂) = π̂/(1 − π̂) = X̄/(1 − X̄)
making use of the invariance principle again.
Solution
The joint pdf of the observations is:
f(x1, . . . , xn; µ) = Π_{i=1}^{n} (1/√(2π)) exp(−(xi − µ)²/2) = (2π)^{−n/2} exp(−(1/2) Σ_{i=1}^{n} (xi − µ)²).
We write the above as a function of µ only:
L(µ) = C exp(−(1/2) Σ_{i=1}^{n} (Xi − µ)²)
where C > 0 is a constant. The MLE µ̂ maximises this function, and also maximises the function:
l(µ) = log L(µ) = −(1/2) Σ_{i=1}^{n} (Xi − µ)² + log(C).
Therefore, the MLE effectively minimises Σ_{i=1}^{n} (Xi − µ)², i.e. the MLE is also the least squares estimator (LSE), i.e. µ̂ = X̄.
(a) Find the method of moments estimator (MME) of θ. (Note you should derive
any required population moments.)
(b) If n = 3, with the observed data x1 = 0.2, x2 = 3.6 and x3 = 1.1, use the MME obtained in (a) to compute the point estimate of θ for this sample. Do you trust this estimate? Justify your answer.
Hint: You may wish to make reference to the law of large numbers.
2. Suppose that you are given independent observations y1 , y2 and y3 such that:
y1 = α + β + ε1
y2 = α + 2β + ε2
y3 = α + 4β + ε3 .
Chapter 8
Interval estimation
By the end of this chapter, you should be able to:
explain the link between confidence intervals and distribution theory, and critique the assumptions made to justify the use of various confidence intervals.
8.3 Introduction
Point estimation is simple but not informative enough, since a point estimator is
always subject to errors. A more scientific approach is to find an upper bound
U = U (X1 , . . . , Xn ) and a lower bound L = L(X1 , . . . , Xn ), and hope that the unknown
parameter θ lies between the two bounds L and U (life is not always as simple as that,
but it is a good start).
An intuitive guess for estimating the population mean would be an interval of the form:
(L, U) = (X̄ − k × S.E.(X̄), X̄ + k × S.E.(X̄))
where k > 0 is a constant and S.E.(X̄) is the standard error of the sample mean.
The (random) interval (L, U ) forms an interval estimator of θ. For estimation to be
as precise as possible, intuitively the width of the interval, U − L, should be small.
Activity 8.1 Why do we not always choose a very high confidence level for a
confidence interval?
Solution
We do not always want to use a very high confidence level because the confidence
interval would be very wide. We have a trade-off between the width of the confidence
interval and the coverage probability.
What is meant by ‘P(1.27 < µ < 3.23) = 0.95’ in Example 8.1? Taken literally, this probability does not mean anything, since µ is an unknown constant!
We treat (1.27, 3.23) as one realisation of the random interval (X̄ − 0.98, X̄ + 0.98)
which covers µ with probability 0.95.
What is the meaning of ‘with probability 0.95’ ? If one repeats the interval estimation a
large number of times, about 95% of the time the interval estimator covers the true µ.
Some remarks are the following.
i. The confidence level is often specified as 90%, 95% or 99%. Obviously the higher
the confidence level, the wider the interval.
For the normal distribution example:
0.90 = P(|X̄ − µ|/(σ/√n) ≤ 1.645) = P(X̄ − 1.645 × σ/√n < µ < X̄ + 1.645 × σ/√n)
0.95 = P(|X̄ − µ|/(σ/√n) ≤ 1.96) = P(X̄ − 1.96 × σ/√n < µ < X̄ + 1.96 × σ/√n)
0.99 = P(|X̄ − µ|/(σ/√n) ≤ 2.576) = P(X̄ − 2.576 × σ/√n < µ < X̄ + 2.576 × σ/√n).
The widths of the three intervals are 2 × 1.645 × σ/√n, 2 × 1.96 × σ/√n and 2 × 2.576 × σ/√n, corresponding to the confidence levels of 90%, 95% and 99%, respectively.
To achieve a 100% confidence level in the normal example, the width of the interval
would have to be infinite!
ii. Among all the confidence intervals at the same confidence level, the one with the
smallest width gives the most accurate estimation and is, therefore, optimal.
iii. For a distribution with a symmetric unimodal density function, optimal confidence
intervals are symmetric, as depicted in Figure 8.1.
Figure 8.1: Symmetric unimodal density function showing that a given probability is
represented by the narrowest interval when symmetric about the mean.
Activity 8.2
(a) Find the length of a 95% confidence interval for the mean of a normal
distribution with known variance σ 2 .
(b) Find the minimum sample size such that the width of a 95% confidence interval
is not wider than d, where d > 0 is a prescribed constant.
Solution
(a) With an available random sample {X1, . . . , Xn} from the normal distribution N(µ, σ²) with σ² known, a 95% confidence interval for µ is of the form:
(X̄ − 1.96 × σ/√n, X̄ + 1.96 × σ/√n)
so its length is 2 × 1.96 × σ/√n = 3.92σ/√n.
(b) Requiring 3.92σ/√n ≤ d gives n ≥ (3.92σ/d)², so the minimum sample size is the smallest integer n satisfying this inequality.
Activity 8.3 Assume that the random variable X is normally distributed and that
σ 2 is known. What confidence level would be associated with each of the following
intervals?
Solution
(a) P(−1.645 < Z < 2.326) = 0.94, hence a 94% confidence level.
(b) P (−∞ < Z < 2.576) = 0.995, hence a 99.5% confidence level.
Activity 8.4 A personnel manager has found that historically the scores on
aptitude tests given to applicants for entry-level positions are normally distributed
with σ = 32.4 points. A random sample of nine test scores from the current group of
applicants had a mean score of 187.9 points.
(a) Find an 80% confidence interval for the population mean score of the current
group of applicants.
(b) Based on these sample results, a statistician found for the population mean a
confidence interval extending from 165.8 to 210.0 points. Find the confidence
level of this interval.
Solution
(a) We have n = 9, x̄ = 187.9, σ = 32.4 and 1 − α = 0.8, hence α/2 = 0.1 and, from
Table 4 of the New Cambridge Statistical Tables, P (Z > 1.282) = 1 − Φ(1.282)
= 0.1. So an 80% confidence interval is:
187.9 ± 1.282 × 32.4/√9 ⇒ (174.05, 201.75).
(b) The half-width of the confidence interval is 210.0 − 187.9 = 22.1, which is equal
to the margin of error, i.e. we have:
22.1 = k × σ/√n = k × 32.4/√9 ⇒ k = 2.05.
P (Z > 2.05) = 1 − Φ(2.05) = 0.02018 = α/2 ⇒ α = 0.04036. Hence we have a
100(1 − α)% = 100(1 − 0.04036)% ≈ 96% confidence interval.
Activity 8.5 Five independent samples, each of size n, are to be drawn from a
normal distribution where σ 2 is known. For each sample, the interval:
(x̄ − 0.96 × σ/√n, x̄ + 1.06 × σ/√n)
will be constructed. What is the probability that at least four of the intervals will
contain the unknown µ?
Solution
The probability that the given interval will contain µ is:
P(−1.06 < Z < 0.96) = Φ(0.96) − (1 − Φ(1.06)) = 0.8315 − 0.1446 = 0.6869.
The probability of four or five such intervals is binomial with n = 5 and π = 0.6869, so let the number of such intervals be Y ∼ Bin(5, 0.6869). The required probability is:
P(Y ≥ 4) = 5 × (0.6869)⁴ × 0.3131 + (0.6869)⁵ = 0.5014.
In practice the standard deviation σ is typically unknown, and we replace it with the sample standard deviation:
S = ((1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²)^{1/2}
giving an interval of the form (X̄ − k × S/√n, X̄ + k × S/√n), where k is a constant determined by the confidence level and also by the distribution of the statistic:
(X̄ − µ)/(S/√n).  (8.1)
However, the distribution of (8.1) is no longer normal – it is the Student’s t distribution.
Here E.S.E.(X̄) = S/√n denotes the estimated standard error of the sample mean.
An accurate 100(1 − α)% confidence interval for µ, where α ∈ (0, 1), is:
(X̄ − c × S/√n, X̄ + c × S/√n) = (X̄ − c × E.S.E.(X̄), X̄ + c × E.S.E.(X̄))
where c > 0 is a constant such that P (T > c) = α/2, where T ∼ tn−1 .
Activity 8.6 Suppose that 9 bags of sugar are selected from the supermarket shelf
at random and weighed. The weights in grammes are 812.0, 786.7, 794.1, 791.6,
811.1, 797.4, 797.8, 800.8 and 793.2. Construct a 95% confidence interval for the
mean weight of all the bags on the shelf. Assume the population is normal.
Solution
Here we have a random sample of size n = 9. The mean is 798.30. The sample
variance is s2 = 72.76, which gives a sample standard deviation s = 8.53. From Table
10 of the New Cambridge Statistical Tables, the top 2.5th percentile of the t
distribution with n − 1 = 8 degrees of freedom is 2.306. Therefore, a 95% confidence
interval is:
(798.30 − 2.306 × 8.53/√9, 798.30 + 2.306 × 8.53/√9) = (798.30 − 6.56, 798.30 + 6.56)
= (791.74, 804.86).
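The interval can be reproduced in R, for example:

x <- c(812.0, 786.7, 794.1, 791.6, 811.1, 797.4, 797.8, 800.8, 793.2)
mean(x) + c(-1, 1) * qt(0.975, df = length(x) - 1) * sd(x) / sqrt(length(x))
# equivalently: t.test(x)$conf.int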
Activity 8.7 Continuing the previous activity, suppose we are now told that σ, the
population standard deviation, is known to be 8.5 g. Construct a 95% confidence
interval using this information.
Solution
From Table 10 of the New Cambridge Statistical Tables, the top 2.5th percentile of
the standard normal distribution z0.025 = 1.96 (recall t∞ = N (0, 1)) so a 95%
confidence interval for the population mean is:
(798.30 − 1.96 × 8.5/√9, 798.30 + 1.96 × 8.5/√9) = (798.30 − 5.55, 798.30 + 5.55)
= (792.75, 803.85).
Again, it may be more useful to write this as 798.30 ± 5.55. Note that this confidence interval is less wide than the one in the previous question, even though our initial estimate s turned out to be very close to the true value of σ.
Activity 8.8 A business requires an inexpensive check on the value of stock in its
warehouse. In order to do this, a random sample of 50 items is taken and valued.
The average value of these is computed to be £320.41 with a (sample) standard
deviation of £40.60. It is known that there are 9,875 items in the total stock.
Assume a normal distribution.
(a) Estimate the total value of the stock to the nearest £10,000.
(b) Construct a 95% confidence interval for the mean value of all items and hence
construct a 95% confidence interval for the total value of the stock.
(c) You are told that the confidence interval in (b) is too wide for decision-making
purposes and you are asked to assess how many more items would need to be
sampled to obtain a confidence interval with the same level of confidence, but
with half the width.
Solution
(a) The total value of the stock is 9875µ, where µ is the mean value of an item of
stock. From Chapter 7, X̄ is the obvious estimator of µ, so 9875X̄ is the obvious
estimator of 9875µ. Therefore, an estimate for the total value of the stock is
9875 × 320.41 = £3,160,000 (to the nearest £10,000).
(b) A 95% confidence interval for the mean value µ is:
320.41 ± 1.96 × 40.60/√50 = 320.41 ± 11.25 ⇒ (£309.16, £331.66).
Multiplying both endpoints by 9,875 gives a 95% confidence interval for the total value of the stock:
(£3,050,000, £3,280,000).
Activity 8.9 In a survey of students, the number of hours per week of private
study is recorded. For a random sample of 23 students, the sample mean is 18.4
hours and the sample standard deviation is 3.9 hours. Treat the data as a random
sample from a normal distribution.
(a) Find a 99% confidence interval for the mean number of hours per week of
private study in the student population.
(b) Recompute your confidence interval in the case that the sample size is, in fact,
121, but the sample mean and sample standard deviation values are unchanged.
Comment on the two intervals.
Solution
We have x̄ = 18.4 and s = 3.9, so a 99% confidence interval is of the form:
x̄ ± t_{n−1, 0.005} × s/√n.
(a) When n = 23, t22, 0.005 = 2.819. Hence a 99% confidence interval is:
18.4 ± 2.819 × 3.9/√23 ⇒ (16.11, 20.69).
(b) When n = 121, t120, 0.005 = 2.617. Hence a 99% confidence interval is:
18.4 ± 2.617 × 3.9/√121 ⇒ (17.47, 19.33).
In spite of the same sample mean and sample standard deviation, the sample of
size n = 121 offers a much more accurate estimate as the interval width is
merely 19.33 − 17.47 = 1.86 hours, in contrast to the interval width of
20.69 − 16.11 = 4.58 hours with the sample size of n = 23.
Note that to derive a confidence interval for µ with σ 2 unknown, the formula used in
the calculation involves both n and n − 1. We then refer to the Student’s t
distribution with n − 1 degrees of freedom.
Also, note that t120, α ≈ zα , where P (Z > zα ) = α for Z ∼ N (0, 1). Therefore, it
would be acceptable to use z0.005 = 2.576 as an approximation for t120, 0.005 = 2.617.
1. Use the central limit theorem to approximate the sampling distribution of X̄ by N(µ, σ²/n).
2. Approximate σ by S.
Example 8.2 The salary data of 253 graduates from a UK business school (in thousands of pounds) yield the following: n = 253, x̄ = 47.126, s = 6.843 and so s/√n = 0.43.
A point estimate of the average salary µ is x̄ = 47.126.
An approximate 95% confidence interval for µ is:
47.126 ± 1.96 × 0.43 = 47.126 ± 0.843 ⇒ (46.28, 47.97).
Activity 8.10 Suppose a random survey of 400 first-time home buyers finds that
the sample mean of annual household income is £36,000 and the sample standard
deviation is £17,000.
(a) An economist believes that the ‘true’ standard deviation is σ = £12,000. Based
on this assumption, find an approximate 90% confidence interval for µ, i.e. for
the average annual household income of all first-time home buyers.
(b) Without the assumption that σ is known, find an approximate 90% confidence
interval for µ.
(c) Are the two confidence intervals very different? Which one would you trust
more, and why?
Solution
(a) Based on the central limit theorem for the sample mean, an approximate 90%
confidence interval is:
x̄ ± z0.05 × σ/√n = 36000 ± 1.645 × 12000/√400 = 36000 ± 987 ⇒ (£35,013, £36,987).
(b) Without assuming σ known, we replace σ by s = 17000, giving:
36000 ± 1.645 × 17000/√400 = 36000 ± 1398 ⇒ (£34,602, £37,398).
Now, according to the survey results (only), we may conclude at the 90% confidence level that the average of all first-time home buyers’ incomes is between £34,602 and £37,398.
(c) The interval estimates are different. The first one gives a smaller range by £822.
This was due to the fact that the economist’s assumed σ of £12,000 is much
smaller than the sample standard deviation, s, of £17,000. With a sample size
as large as 400, we would think that we should trust the data more than an
assumption by an economist!
The key question is whether σ being £12,000 is a reasonable assumption. This
issue will be properly addressed using statistical hypothesis testing.
Activity 8.11 In a study of consumers’ views on guarantees for new products, 370
out of a random sample of 425 consumers agreed with the statement: ‘Product
guarantees are worded more for lawyers to understand than to be easily understood
by consumers.’
(a) Find an approximate 95% confidence interval for the population proportion of
consumers agreeing with this statement.
(b) Would a 99% confidence interval for the population proportion be wider or
narrower than that found in (a)? Explain your answer.
Solution
The population is a Bernoulli distribution on two points: 1 (agree) and 0 (disagree).
We have a random sample of size n = 425, i.e. {X1 , . . . , X425 }. Let π = P (Xi = 1),
hence E(Xi ) = π and Var(Xi ) = π (1 − π) for i = 1, . . . , 425. The sample mean and
variance are:
x̄ = (1/425) Σ_{i=1}^{425} xi = 370/425 = 0.8706
and:
s² = (1/424) (Σ_{i=1}^{425} xi² − 425x̄²) = (370 − 425 × (0.8706)²)/424 = 0.1129.
(a) Based on the central limit theorem for the sample mean, an approximate 95%
confidence interval for π is:
x̄ ± z0.025 × s/√n = 0.8706 ± 1.96 × √(0.1129/425)
= 0.8706 ± 0.0319
⇒ (0.8387, 0.9025).
(b) For a 99% confidence interval, we use z0.005 = 2.576 instead of z0.025 = 1.96 in
the above formula. Therefore, the confidence interval becomes wider.
Note that the width of a confidence interval is a random variable, i.e. it varies from
sample to sample. The comparison in (b) above is with the understanding that the
same random sample is used to construct the two confidence intervals.
Be sure to pay close attention to how we interpret confidence intervals in the context
of particular practical problems.
Activity 8.12
(a) A sample of 954 adults in early 1987 found that 23% of them held shares. Given
a UK adult population of 41 million and assuming a proper random sample was
taken, construct a 95% confidence interval estimate for the number of
shareholders in the UK.
(b) A ‘similar’ survey the previous year had found a total of 7 million shareholders.
Assuming ‘similar’ means the same sample size, construct a 95% confidence
interval estimate of the increase in shareholders between the two years.
Solution
(a) The sample proportion is π̂ = 0.23, so an approximate 95% confidence interval for the population proportion is:
0.23 ± 1.96 × √(0.23 × 0.77/954) = 0.23 ± 0.027 ⇒ (0.203, 0.257).
Multiplying by the population size of 41 million gives an interval of (8.3, 10.5) million. Therefore, we estimate there are about 9.4 million shareholders in the UK, with a margin of error of 1.1 million.
(b) Let us start by finding a 95% confidence interval for the difference in the two
proportions. We use the formula:
π̂1 − π̂2 ± 1.96 × √(π̂1(1 − π̂1)/n1 + π̂2(1 − π̂2)/n2).
The estimates of the proportions π1 and π2 are 0.23 and 0.171, respectively. We
know n1 = 954 and although n2 is unknown we can assume it is approximately
equal to 954 (noting the ‘similar’ in the question), so an approximate 95%
confidence interval is:
0.23 − 0.171 ± 1.96 × √(0.23 × 0.77/954 + 0.171 × 0.829/954) = 0.059 ± 0.036 ⇒ (0.023, 0.094).
We estimate that the number of shareholders has increased by about 2.4 million
in the two years. There is quite a large margin of error, i.e. 1.5 million, especially
when compared with a point estimate (i.e. interval midpoint) of 2.4 million.
8.5 Use of the chi-squared distribution

Suppose {Y1, . . . , Yn} is a random sample from N(µ, σ²), so that:
(Yi − µ)/σ ∼ N(0, 1).
Hence:
(1/σ²) Σ_{i=1}^{n} (Yi − µ)² ∼ χ²_n.
Note that:
(1/σ²) Σ_{i=1}^{n} (Yi − µ)² = (1/σ²) Σ_{i=1}^{n} (Yi − Ȳ)² + n(Ȳ − µ)²/σ².  (8.2)
Proof: We have:
Σ_{i=1}^{n} (Yᵢ − µ)² = Σ_{i=1}^{n} ((Yᵢ − Ȳ) + (Ȳ − µ))²
= Σ_{i=1}^{n} (Yᵢ − Ȳ)² + Σ_{i=1}^{n} (Ȳ − µ)² + 2 Σ_{i=1}^{n} (Yᵢ − Ȳ)(Ȳ − µ)
= Σ_{i=1}^{n} (Yᵢ − Ȳ)² + n(Ȳ − µ)² + 2(Ȳ − µ) Σ_{i=1}^{n} (Yᵢ − Ȳ)
= Σ_{i=1}^{n} (Yᵢ − Ȳ)² + n(Ȳ − µ)²
since Σ_{i=1}^{n} (Yᵢ − Ȳ) = 0.
Hence:
(1/σ²) Σ_{i=1}^{n} (Yᵢ − µ)² = (1/σ²) Σ_{i=1}^{n} (Yᵢ − Ȳ)² + n(Ȳ − µ)²/σ².
Since Ȳ ∼ N(µ, σ²/n), then n(Ȳ − µ)²/σ² ∼ χ²_1. It can be proved that:
(1/σ²) Σ_{i=1}^{n} (Yᵢ − Ȳ)² ∼ χ²_{n−1}.
For any given small α ∈ (0, 1), we can find 0 < k₁ < k₂ such that:
P(X < k₁) = P(X > k₂) = α/2
where X ∼ χ²_{n−1}. Writing M = Σ_{i=1}^{n} (Yᵢ − Ȳ)² = (n − 1)S², so that M/σ² ∼ χ²_{n−1}, we therefore have:
1 − α = P(k₁ < M/σ² < k₂) = P(M/k₂ < σ² < M/k₁).
Hence a 100(1 − α)% confidence interval for σ² is:
(M/k₂, M/k₁).
Example 8.3 Suppose n = 15 and the sample variance is s2 = 24.5. Let α = 0.05.
From Table 8 of the New Cambridge Statistical Tables, we find:
P(X < 5.629) = P(X > 26.119) = 0.025
where X ∼ χ²_{14}. Hence a 95% confidence interval for σ² is:
(M/k₂, M/k₁) = (14 × S²/26.119, 14 × S²/5.629)
= (0.536 × S², 2.487 × S²)
= (13.132, 60.934).
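The same interval can be obtained in R with qchisq; a minimal sketch under the example's assumptions (n = 15, s² = 24.5, with our own variable names):

n <- 15
s2 <- 24.5
k1 <- qchisq(0.025, df = n - 1)            # 5.629
k2 <- qchisq(0.975, df = n - 1)            # 26.119
c((n - 1) * s2 / k2, (n - 1) * s2 / k1)    # approximately (13.132, 60.934)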
Solution
For a 99% confidence interval, we need the lower and upper half-percentile values
from the χ²_{n−1} = χ²_{15} distribution. These are χ²_{0.995, 15} = 4.601 and χ²_{0.005, 15} = 32.801.
Note that this is a very wide confidence interval due to (i) a high level of confidence
(99%), and (ii) a small sample size (n = 16).
(b) Would a 99% confidence interval for this variance be wider or narrower than
that found in (a)?
Solution
(a) We have n = 10, s2 = (2.36)2 = 5.5696, χ20.975, 9 = 2.700 and χ20.025, 9 = 19.023.
Hence a 95% confidence interval for σ² is:
((n − 1)s²/χ²_{0.025, n−1}, (n − 1)s²/χ²_{0.975, n−1}) = (9 × 5.5696/19.023, 9 × 5.5696/2.700) = (2.64, 18.57).
(b) The 99% confidence interval would be wider, since:
χ²_{0.995, n−1} < χ²_{0.975, n−1} and χ²_{0.005, n−1} > χ²_{0.025, n−1}
so the lower bound decreases and the upper bound increases.
Activity 8.15 Construct a 90% confidence interval for the variance of the bags of
sugar in Activity 8.6. Does the given value of 8.5 g for the population standard
deviation seem plausible?
Solution
We have n = 9 and s² = 72.76. For a 90% confidence interval, we need the bottom
and top 5th percentiles of the chi-squared distribution on n − 1 = 8 degrees of
freedom. These are χ²_{0.95, 8} = 2.733 and χ²_{0.05, 8} = 15.507. Hence a 90% confidence
interval for σ² is:
(8 × 72.76/15.507, 8 × 72.76/2.733) = (37.536, 213.010).
The given value corresponds to a variance of (8.5)² = 72.25, which falls well within
this confidence interval, so we have no reason to doubt it.
Activity 8.16 The data below are from a random sample of size n = 9 taken from
the distribution N (µ, σ 2 ):
3.75, 5.67, 3.14, 7.89, 3.40, 9.32, 2.80, 10.34, 14.31.
(a) Assume σ 2 = 16. Find a 95% confidence interval for µ. If the width of such a
confidence interval must not exceed 2.5, at least how many observations do we
need?
(b) Suppose σ 2 is now unknown. Find a 95% confidence interval for µ. Compare the
result with that obtained in (a) and comment.
(c) Obtain a 95% confidence interval for σ 2 .
Solution
(a) We have x̄ = 6.74. For a 95% confidence interval, α = 0.05 so we need to find
the top 100α/2 = 2.5th percentile of N (0, 1), which is 1.96. Since σ = 4 and
n = 9, a 95% confidence interval for µ is:
x̄ ± 1.96 × σ/√n ⇒ (6.74 − 1.96 × 4/3, 6.74 + 1.96 × 4/3) = (4.13, 9.35).
In general, a 100(1 − α)% confidence interval for µ is:
(X̄ − z_{α/2} × σ/√n, X̄ + z_{α/2} × σ/√n)
where zα denotes the top 100αth percentile of the standard normal distribution,
i.e. such that:
P (Z > zα ) = α
where Z ∼ N (0, 1). Hence the width of the confidence interval is:
2 × z_{α/2} × σ/√n.
For this example, α = 0.05, z0.025 = 1.96 and σ = 4. Setting the width of the
confidence interval to be at most 2.5, we have:
2 × 1.96 × σ/√n = 15.68/√n ≤ 2.5.
Hence:
n ≥ (15.68/2.5)² = 39.34.
So we need a sample of at least 40 observations in order to obtain a 95%
confidence interval with a width not greater than 2.5.
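A short R sketch of this sample-size calculation (with our own variable names) is:

sigma <- 4
width <- 2.5
z <- qnorm(0.975)                    # 1.96
ceiling((2 * z * sigma / width)^2)   # 40 observations needed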
(b) Since σ² is now unknown, we replace σ by s and the standard normal percentile
by a t percentile: a 95% confidence interval for µ is x̄ ± t_{0.025, n−1} × s/√n,
where t_{α, k} is such that:
P(T > t_{α, k}) = α
for T ∼ t_k. For these data s² = 16, so with t_{0.025, 8} = 2.306 the interval is
6.74 ± 2.306 × 4/3 = 6.74 ± 3.07 = (3.67, 9.81). This is wider than the interval
in (a), reflecting the extra uncertainty from estimating σ.
(c) Note (n − 1)S²/σ² ∼ χ²_{n−1} = χ²_8. From Table 8 of the New Cambridge Statistical
Tables, for X ∼ χ²_8, we find that P(X < 2.180) = P(X > 17.535) = 0.025. Hence:
P(2.180 < 8 × S²/σ² < 17.535) = 0.95.
Therefore, the lower bound for σ² is 8 × s²/17.535 = 7.2997, and the upper
bound is 8 × s²/2.180 = 58.716. Therefore, a 95% confidence interval for σ²,
noting s² = 16, is:
(7.30, 58.72).
Note that the estimation in this example is rather inaccurate. This is due to two
reasons.
i. The sample size is small.
ii. The population variance, σ 2 , is large.
Chapter 9
Hypothesis testing
9.3 Introduction
Hypothesis testing and statistical estimation are the two most frequently used
statistical inference methods, and hypothesis testing addresses a different type of
practical question from statistical estimation.
Based on the data, a (statistical) test makes a binary decision on a hypothesis,
denoted by H0:
reject H0 or do not reject H0.
Solution
We can see immediately whether or not x̄ = 2 by calculating the sample mean.
However, inference is concerned with the population from which the sample was
taken; we are not especially interested in the sample mean in its own right.
H0 : π = 0.5.
If π̂ = 0.9, H0 is unlikely to be true.
If π̂ = 0.45, H0 may be true (and also may be untrue).
If π̂ = 0.7, what to do then?
Example 9.2 A customer complains that the amount of coffee powder in a coffee
tin is less than the advertised weight of 3 pounds.
A random sample of 20 tins is selected, resulting in an average weight of x̄ = 2.897
pounds. Is this sufficient to substantiate the complaint?
Again statistical estimation cannot provide a firm answer, due to random
fluctuations between different random samples. So we cast the problem into a
hypothesis testing problem as follows.
Let the weight of coffee in a tin be a normal random variable X ∼ N (µ, σ 2 ). We
need to test the hypothesis µ < 3. In fact, we use the data to test the hypothesis:
H0 : µ = 3.
Example 9.3 Suppose one is interested in evaluating the mean income (in £000s)
of a community. Suppose income in the population is modelled as N (µ, 25) and a
random sample of n = 25 observations is taken, yielding the sample mean x̄ = 17.
Independently of the data, three expert economists give their own opinions as
follows.
If Dr A’s claim is correct, X̄ ∼ N (16, 1). The observed value x̄ = 17 is one standard
deviation away from µ, and may be regarded as a typical observation from the
distribution. Hence there is little inconsistency between the claim and the data
evidence. This is shown in Figure 9.1.
If Ms B’s claim is correct, X̄ ∼ N (15, 1). The observed value x̄ = 17 begins to look a
bit ‘extreme’, as it is two standard deviations away from µ. Hence there is some
inconsistency between the claim and the data evidence. This is shown in Figure 9.2.
If Mr C’s claim is correct, X̄ ∼ N (14, 1). The observed value x̄ = 17 is very extreme,
as it is three standard deviations away from µ. Hence there is strong inconsistency
between the claim and the data evidence. This is shown in Figure 9.3.
Figure 9.1: Comparison of claim and data evidence for Dr A in Example 9.3.
Figure 9.2: Comparison of claim and data evidence for Ms B in Example 9.3.
Figure 9.3: Comparison of claim and data evidence for Mr C in Example 9.3.
Definition of p-values
A p-value is the probability of the event that the test statistic takes the observed
value or more extreme (i.e. more unlikely) values under H0 . It is a measure of the
discrepancy between the hypothesis H0 and the data.
H0 : θ = θ0 vs. H1 : θ ∈ Θ1
Example 9.5 Let {X1 , . . . , X20 }, taking values either 1 or 0, be the outcomes of an
experiment of tossing a coin 20 times, where:
P(Xᵢ = 1) = π = 1 − P(Xᵢ = 0).
We wish to test H0: π = 0.5 vs. H1: π ≠ 0.5.
Suppose there are 17 Xi s taking the value 1, and 3 Xi s taking the value 0. Will you
reject the null hypothesis at the 5% significance level?
Let T = X1 + · · · + X20 . Therefore, T ∼ Bin(20, π). We use T as the test statistic.
With the given sample, we observe t = 17. What are the more extreme values of T if
H0 is true?
Under H0 , E(T ) = n π0 = 10. Hence 3 is as extreme as 17, and the more extreme
values are:
0, 1, 2, 18, 19 and 20.
Therefore, the p-value is:
Σ_{i=0}^{3} P_{H0}(T = i) + Σ_{i=17}^{20} P_{H0}(T = i) = Σ_{i=0}^{3} (20!/((20 − i)! i!))(0.5)ⁱ(1 − 0.5)²⁰⁻ⁱ + Σ_{i=17}^{20} (20!/((20 − i)! i!))(0.5)ⁱ(1 − 0.5)²⁰⁻ⁱ
= 2 × (0.5)²⁰ Σ_{i=0}^{3} 20!/((20 − i)! i!)
= 2 × (0.5)²⁰ × (1 + 20 + (20 × 19)/2! + (20 × 19 × 18)/3!)
= 0.0026.
Since 0.0026 < 0.05, we reject H0 at the 5% significance level and conclude that the coin is not fair.
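This exact binomial p-value is easy to verify in R; a minimal sketch:

# P(T <= 3) + P(T >= 17) for T ~ Bin(20, 0.5)
pbinom(3, size = 20, prob = 0.5) + (1 - pbinom(16, size = 20, prob = 0.5))
# approximately 0.0026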
Activity 9.2 Let {X1 , . . . , X14 }, taking values either 1 or 0, be the outcomes of an
experiment of tossing a coin 14 times, where:
P(Xᵢ = 1) = π = 1 − P(Xᵢ = 0).
Suppose there are 4 Xi s taking the value 1, and 10 Xi s taking the value 0. Will you
reject the null hypothesis at the 5% significance level?
Solution
Let T = X1 + · · · + X14 . Therefore, T ∼ Bin(14, π). We use T as the test statistic.
With the given sample, we observe t = 4. We now determine which are the more
extreme values of T if H0 is true.
Under H0: π = 0.5, E(T) = 14 × 0.5 = 7, so 10 is as extreme as 4 and, by the
symmetry of Bin(14, 0.5), the p-value is:
P(T ≤ 4) + P(T ≥ 10) = 2 × P(T ≤ 4) = 0.1796.
Since α = 0.05 < 0.1796, we do not reject the null hypothesis of a fair coin at the 5%
significance level. The observed data are consistent with the null hypothesis of a fair
coin.
Activity 9.3 You wish to test whether a coin is fair. In 400 tosses of a coin, 217
heads and 183 tails appear. Is it reasonable to assume that the coin is fair? Justify
your answer with an appropriate hypothesis test. Calculate the p-value of the test,
and assume a 5% significance level.
Solution
Let {X1 , . . . , X400 }, taking values either 1 or 0, be the outcomes of an experiment of
tossing a coin 400 times, where:
P(Xᵢ = 1) = π = 1 − P(Xᵢ = 0)
and we test H0: π = 0.5 vs. H1: π ≠ 0.5. Let T = X₁ + · · · + X₄₀₀. Under H0,
T ∼ Bin(400, 0.5), which is approximately N(200, 100) by the central limit theorem.
We observe t = 217 and, applying a continuity correction, z = (216.5 − 200)/10 = 1.65,
so the p-value is approximately:
2 × P(Z ≥ 1.65) = 0.0990
which is far larger than α = 0.05, hence we do not reject H0 and conclude that there
is no evidence to suggest that the coin is not fair.
(Note that the test would be significant if we set H1 : π > 0.5, as the p-value would
be 0.0495 which is less than 0.05 (just). However, we have no a priori reason to
perform an upper-tailed test – we should not determine our hypotheses by observing
the sample data, rather the hypotheses should be set before any data are observed.)
Alternatively, one could apply the central limit theorem such that under H0 we have
(approximately):
X̄ ∼ N(π, π(1 − π)/n) = N(0.5, 0.000625)
and base the test on the sample proportion x̄ = 217/400 = 0.5425.
Activity 9.4 In a given city, it is assumed that the number of car accidents in a
given week follows a Poisson distribution. In past weeks, the average number of
accidents per week was 9, and this week there were 3 accidents. Is it justified to
claim that the accident rate has dropped? Calculate the p-value of the test, and
assume a 5% significance level.
Solution
Let T be the number of car accidents per week such that T ∼ Poisson(λ). We are
interested in testing:
H0 : λ = 9 vs. H1 : λ < 9.
Under H0, then T ∼ Poisson(9), and we observe t = 3. Hence the p-value is:
P(T ≤ 3) = Σ_{t=0}^{3} e⁻⁹9ᵗ/t! = e⁻⁹(1 + 9 + 9²/2! + 9³/3!) = 0.0212.
Since 0.0212 < 0.05, we reject H0 and conclude that there is evidence to suggest that
the accident rate has dropped.
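A one-line check of this p-value in R:

ppois(3, lambda = 9)   # P(T <= 3) for T ~ Poisson(9), approximately 0.0212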
Intuitively, if H0 is true, X̄ = Σᵢ Xᵢ/n should be close to µ0. Therefore, large values of
|X̄ − µ0| suggest a departure from H0.
Under H0, X̄ ∼ N(µ0, σ²/n), i.e. √n(X̄ − µ0)/σ ∼ N(0, 1). Hence the test statistic
may be defined as:
T = √n(X̄ − µ0)/σ = (X̄ − µ0)/(σ/√n) ∼ N(0, 1)
and we reject H0 for sufficiently 'large' values of |T|.
How large is ‘large’ ? This is determined by the significance level.
Suppose µ0 = 3, σ = 0.148, n = 20 and x̄ = 2.897. Therefore, the observed value of T is
t = √20 × (2.897 − 3)/0.148 = −3.112. Hence the p-value is:
P_{µ0}(|T| ≥ 3.112) = P(|Z| ≥ 3.112) = 0.0019
where Z ∼ N (0, 1). Therefore, the null hypothesis of µ = 3 will be rejected even at the
1% significance level.
Alternatively, for a given 100α% significance level we may find the critical value cα
such that Pµ0 (|T | > cα ) = α. Therefore, the p-value is ≤ α if and only if the observed
value of |T | ≥ cα .
Using this alternative approach, we do not need to compute the p-value.
For this example, cα = zα/2 , that is the top 100α/2th percentile of N (0, 1), i.e. the
z-value which cuts off α/2 probability in the upper tail of the standard normal
distribution.
For α = 0.1, 0.05 and 0.01, zα/2 = 1.645, 1.96 and 2.576, respectively. Since we observe
|t| = 3.112, the null hypothesis is rejected at all three significance levels.
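For reference, a minimal R sketch of this two-sided z test (values taken from the example; variable names are our own):

mu0 <- 3; sigma <- 0.148; n <- 20; xbar <- 2.897
t_obs <- sqrt(n) * (xbar - mu0) / sigma   # -3.112
2 * pnorm(-abs(t_obs))                    # p-value, approximately 0.0019
qnorm(1 - c(0.10, 0.05, 0.01) / 2)        # critical values 1.645, 1.960, 2.576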
i. We use a one-tailed test when we are only interested in the departure from H0 in
one direction.
ii. The distribution of a test statistic under H0 must be known in order to calculate
p-values or critical values.
iii. A test may be carried out by either computing the p-value or determining the
critical value.
iv. The probability of incorrect decisions in hypothesis testing is typically positive. For
example, the significance level is the probability of rejecting a true H0 .
9.6 t tests
t tests are one of the most frequently-used statistical tests.
Let {X1 , . . . , Xn } be a random sample from N (µ, σ 2 ), where both µ and σ 2 > 0 are
unknown. We are interested in testing the hypotheses:
H0 : µ = µ0 vs. H1 : µ < µ0
where µ0 is known.
Now we cannot use √n(X̄ − µ0)/σ as a statistic, since σ is unknown. Naturally we
replace σ by S, where:
S² = (1/(n − 1)) Σ_{i=1}^{n} (Xᵢ − X̄)².
The test statistic is then the famous t statistic:
T = √n(X̄ − µ0)/S = (X̄ − µ0)/(S/√n).
We reject H0 if t < c, where c is the critical value determined by the significance level:
PH0 (T < c) = α
where PH0 denotes the distribution under H0 (with mean µ0 and unknown σ 2 ).
Under H0 , T ∼ tn−1 . Hence:
α = PH0 (T < c)
i.e. c is the 100αth percentile of the t distribution with n − 1 degrees of freedom. By
symmetry, c = −tα, n−1 , where tα, k denotes the top 100αth percentile of the tk
distribution.
Example 9.7 To deal with the customer complaint that the average amount of
coffee powder in a coffee tin is less than the advertised 3 pounds, 20 tins were
weighed, yielding the following observations:
2.82, 3.01, 3.11, 2.71, 2.93, 2.68, 3.02, 3.01, 2.93, 2.56,
2.78, 3.01, 3.09, 2.94, 2.82, 2.81, 3.05, 3.01, 2.85, 2.79.
Although H0 does not specify the population distribution completely (σ 2 > 0), the
distribution of the test statistic, T , under H0 is completely known. This enables us
to find the critical value or p-value.
Activity 9.5 A doctor claims that the average European is more than 8.5 kg
overweight. To test this claim, a random sample of 12 Europeans were weighed, and
the difference between their actual weight and their ideal weight was calculated. The
data are:
14, 12, 8, 13, −1, 10, 11, 15, 13, 20, 7, 14.
Assuming the data follow a normal distribution, conduct a t test to infer at the 5%
significance level whether or not the doctor’s claim is true.
Solution
We have a random sample of size n = 12 from N (µ, σ 2 ), and we test H0 : µ = 8.5 vs.
H1 : µ > 8.5. The test statistic, under H0 , is:
T = (X̄ − 8.5)/(S/√n) = (X̄ − 8.5)/(S/√12) ∼ t_{11}.
We reject H0 if t > t0.05, 11 = 1.796. For the given data:
x̄ = (1/12) Σ_{i=1}^{12} xᵢ = 11.333 and s² = (1/11)(Σ_{i=1}^{12} xᵢ² − 12x̄²) = 26.606.
Hence:
t = (11.333 − 8.5)/√(26.606/12) = 1.903 > 1.796 = t_{0.05, 11}
so we reject H0 at the 5% significance level, concluding that there is evidence to
support the doctor's claim.
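The same test can be run directly in R with t.test (the data are those of the activity):

x <- c(14, 12, 8, 13, -1, 10, 11, 15, 13, 20, 7, 14)
t.test(x, mu = 8.5, alternative = "greater")
# t = 1.903 on 11 degrees of freedom, one-sided p-value approximately 0.04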
Activity 9.6 A sample of seven is taken at random from a large batch of (nominally
12-volt) batteries. These are tested and their true voltages are shown below:
Solution
(a) We are to test H0: µ = 12 vs. H1: µ ≠ 12. The key points here are that n is
small and that σ² is unknown. We can use the t test and this is valid provided
the data are normally distributed. The test statistic value is:
t = (x̄ − 12)/(s/√7) = (12.7 − 12)/(0.858/√7) = 2.16.
• If you suspected before collecting the data that the mean voltage was less than
12 volts, the one-sided test would be appropriate.
• If you had no prior reason to believe that the mean was less than 12 volts you
would perform a two-sided test.
(c) H0: µ = 9 vs. H1: µ ≠ 9.
Repeat the above exercise with the additional assumption that σ 2 = 0.375. Compare
the results with those derived without this assumption and comment.
Solution
When σ² is unknown, we use the test statistic T = √n(X̄ − 9)/S. Under H0,
T ∼ t_{15}. With α = 0.05, we reject H0 if t > t_{0.05, 15} = 1.753 against H1: µ > 9, if
t < −1.753 against H1: µ < 9, and if |t| > t_{0.025, 15} = 2.131 against H1: µ ≠ 9.
For the given sample, t = 2.02. Hence we reject H0 against the alternative
H1 : µ > 9, but we will not reject H0 against the two other alternative hypotheses.
When σ² is known, we use the test statistic T = √n(X̄ − 9)/σ. Now under H0,
T ∼ N(0, 1). With α = 0.05, we reject H0 if t > 1.645 against H1: µ > 9, if
t < −1.645 against H1: µ < 9, and if |t| > 1.96 against H1: µ ≠ 9.
For the given sample, t = 2.02. Hence we reject H0 against the alternative H1 : µ > 9
and H1 : µ 6= 9, but we will not reject H0 against H1 : µ < 9.
With σ² known, we should be able to perform inference better simply because we
have more information about the population. More precisely, for the given
significance level, we require less extreme values to reject H0. Put another way, the
p-value of the test is reduced when σ² is given. Therefore, the risk of failing to
reject a false H0 (a Type II error) is also reduced.
where Θ0 and Θ1 are two non-overlapping sets. A general approach to test the above
hypotheses at the 100α% significance level may be described as follows.
3. If the observed value of T with the given sample is in the critical region C, H0 is
rejected. Otherwise, H0 is not rejected.
In order to make a test powerful in the sense that the chance of making an incorrect
decision is small, the critical region should consist of those values of T which are least
supportive of H0 (i.e. which lie in the direction of H1 ).
                                    Decision made
                          H0 not rejected     H0 rejected
True state    H0 true     Correct decision    Type I error
of nature     H1 true     Type II error       Correct decision
i. Ideally we would like to have a test which minimises the probabilities of making
both types of error, which unfortunately is not feasible.
ii. The probability of making a Type I error is the significance level, which is under
our control.
iii. We do not have explicit control over the probability of a Type II error. For a given
significance level, we try to choose a test statistic such that the probability of a
Type II error is small.
iv. The power function of the test is defined as:
β(θ) = Pθ (H0 is rejected) for θ ∈ Θ1
i.e. β(θ) = 1 − P (Type II error).
v. The null hypothesis H0 and the alternative hypothesis H1 are not treated equally in
a statistical test, i.e. there is an asymmetric treatment. The choice of H0 is based
on the subject matter concerned and/or technical convenience.
vi. It is more conclusive to end a test with H0 rejected, as the decision of ‘not reject
H0 ’ does not imply that H0 is accepted.
Activity 9.8
(a) Of 100 clinical trials, 5 have shown that wonder-drug ‘Zap2’ is better than the
standard treatment (aspirin). Should we be excited by these results?
(b) Of the 1,000 clinical trials of 1,000 different drugs this year, 30 trials found
drugs which seem better than the standard treatments with which they were
compared. The television news reports only the results of those 30 ‘successful’
trials. Should we believe these reports?
(c) A child welfare officer says that she has a test which always reveals when a child
has been abused, and she suggests it be put into general use. What is she saying
about Type I and Type II errors for her test?
Solution
(a) If 5 clinical trials out of 100 report that Zap2 is better, this is consistent with
there being no difference whatsoever between Zap2 and aspirin if a 5% Type I
error probability is being used for tests in these clinical trials. With a 5%
significance level we expect 5 trials in 100 to show spurious significant results.
(b) If the television news reports the 30 successful trials out of 1,000, and those
trials use tests with a significance level of 5%, we may well choose to be very
cautious about believing the results. We would expect 50 spuriously significant
results in the 1,000 trial results.
(c) The welfare officer is saying that the Type II error has probability zero. The
test is always positive if the null hypothesis of no abuse is false. On the other
hand, the welfare officer is saying nothing about the probability of a Type I
error. It may well be that the probability of a Type I error is high, which would
lead to many false accusations of abuse when no abuse had taken place. One
should always think about both types of error when proposing a test.
Activity 9.9 A manufacturer has developed a new fishing line which is claimed to
have an average breaking strength of 7 kg, with a standard deviation of 0.25 kg.
Assume that the standard deviation figure is correct and that the breaking strength
is normally distributed.
Suppose that we carry out a test, at the 5% significance level, of H0 : µ = 7 vs.
H1 : µ < 7. Find the sample size which is necessary for the test to have 90% power if
the true breaking strength is 6.95 kg.
Solution
The critical value for the test is z0.95 = −1.645 and the probability of rejecting H0
with this test is:
P((X̄ − 7)/(0.25/√n) < −1.645).
Therefore, for the power at µ = 6.95 to be 0.90 we need:
(7 − 6.95)/(0.25/√n) − 1.645 = 1.282
0.2 × √n = 2.927
√n = 14.635
n = 214.1832.
So to ensure that the test power is at least 90%, we should use a sample size of 215.
Remark: We see a rather large sample size is required. Hence investigators are
encouraged to use sample sizes large enough to come to rational decisions.
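A minimal R sketch of this calculation (variable names are our own):

sigma <- 0.25; mu0 <- 7; mu1 <- 6.95
n <- ((qnorm(0.95) + qnorm(0.90)) * sigma / (mu0 - mu1))^2   # just over 214
ceiling(n)                                                   # 215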
Activity 9.10 A manufacturer has developed a fishing line that is claimed to have
a mean breaking strength of 15 kg with a standard deviation of 0.8 kg. Suppose that
the breaking strength follows a normal distribution. With a sample size of n = 30,
the null hypothesis that µ = 15 kg, against the alternative hypothesis of µ < 15 kg,
will be rejected if the sample mean x̄ < 14.8 kg.
(a) Find the probability of committing a Type I error.
(b) Find the power of the test if the true mean is 14.9 kg, 14.8 kg and 14.7 kg,
respectively.
Solution
(a) Under H0: µ = 15, we have X̄ ∼ N(15, σ²/30) where σ = 0.8. The probability
of committing a Type I error is:
P(H0 is rejected | µ = 15) = P(X̄ < 14.8 | µ = 15)
= P((X̄ − 15)/(σ/√30) < (14.8 − 15)/(σ/√30) | µ = 15)
= P(Z < (14.8 − 15)/(0.8/√30))
= P(Z < −1.37)
= 0.0853.
(b) If the true value is µ, then X̄ ∼ N(µ, σ²/30). The power of the test for a
particular µ is:
Pµ(H0 is rejected) = Pµ(X̄ < 14.8) = Pµ((X̄ − µ)/(σ/√30) < (14.8 − µ)/(σ/√30)) = P(Z < (14.8 − µ)/(0.8/√30))
which is 0.2483 for µ = 14.9, 0.5 for µ = 14.8, and 0.7517 for µ = 14.7.
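These powers can be checked in R; a quick sketch:

power <- function(mu) pnorm((14.8 - mu) / (0.8 / sqrt(30)))
power(c(14.9, 14.8, 14.7))
# approximately 0.247, 0.500, 0.753 (the guide's 0.2483 and 0.7517 round z to two decimal places)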
Activity 9.11 In a wire-based nail manufacturing process the target length for cut
wire is 22 cm. It is known that lengths vary with a standard deviation equal to 0.08
cm. In order to monitor this process, a random sample of 50 separate wires is
accurately measured and the process is regarded as operating satisfactorily (the null
hypothesis) if the sample mean length lies between 21.97 cm and 22.03 cm, so this
is the decision procedure used (i.e. if the sample mean falls within this range
then the null hypothesis is not rejected, otherwise the null hypothesis is rejected).
(a) Determine the probability of making a Type I error.
(b) Determine the probability of making a Type II error when the process is
actually cutting to a length of 22.05 cm.
(c) Find the probability of rejecting the null hypothesis when the true cutting
length is 22.01 cm. (This is the power of the test when the true mean is 22.01
cm.)
Solution
(a) We have σ/√n = 0.08/√50 = 0.0113, so under H0: µ = 22:
P(Type I error) = P(X̄ < 21.97 or X̄ > 22.03 | µ = 22) = 2 × P(Z > 0.03/0.0113) = 2 × P(Z > 2.65) = 0.008.
(b) We have:
P(Type II error) = P(21.97 < X̄ < 22.03 | µ = 22.05) = P((21.97 − 22.05)/0.0113 < Z < (22.03 − 22.05)/0.0113) ≈ P(Z < −1.77) = 0.0384.
(c) We have:
P(reject H0 | µ = 22.01) = P(Z < (21.97 − 22.01)/0.0113) + P(Z > (22.03 − 22.01)/0.0113) = P(Z < −3.54) + P(Z > 1.77) = 0.0002 + 0.0384 = 0.0386.
Activity 9.12 It may be assumed that the length of nails produced by a particular
machine is a normally distributed random variable, with a standard deviation of 0.02
cm. The lengths of a random sample of 6 nails are 4.63 cm, 4.59 cm, 4.64 cm, 4.62
cm, 4.66 cm and 4.69 cm.
(a) Test, at the 1% significance level, the hypothesis that the machine produces
nails with a mean length of 4.62 cm (a two-sided test).
(b) Find the probability of committing a Type II error when the true mean length
is 4.64 cm.
Solution
We test the null√hypothesis H0 : µ = 4.62 vs. H1 : µ 6= 4.62 with σ = 0.02. The test
statistic is T = n(X̄ − 4.62)/σ, which is N (0, 1) under H0 . For the given sample,
t = 2.25.
(a) At the 1% significance level, we reject H0 if |t| ≥ 2.576. Since t = 2.25, H0 is not
rejected.
(b) For any µ ≠ 4.62, E(T) = √n(E(X̄) − 4.62)/σ = √n(µ − 4.62)/σ ≠ 0, hence T
is not distributed as N(0, 1). When µ = 4.64, T ∼ N(√6(4.64 − 4.62)/0.02, 1) = N(2.449, 1),
so the probability of committing a Type II error is:
P(|T| < 2.576 | µ = 4.64) = P(−2.576 − 2.449 < Z < 2.576 − 2.449) ≈ P(Z < 0.13) = 0.5517.
Note:
i. The power of the test to reject H0 when µ = 4.64 is 1 − 0.5517 = 0.4483.
The power increases when µ moves further away from µ0 = 4.62.
ii. We always express probabilities to be calculated in terms of some
‘standard’ distributions such as N (0, 1), tk , etc. We can then refer to the
relevant table in the New Cambridge Statistical Tables.
Activity 9.13 A random sample of fibres is known to come from one of two
environments, A or B. It is known from past experience that the lengths of fibres
from A have a log-normal distribution so that the log-length of an A-type fibre is
normally distributed about a mean of 0.80 with a standard deviation of 1.00.
(Original units are in microns.)
The log-lengths of B-type fibres are normally distributed about a mean of 0.65 with
a standard deviation of 1.00. In order to identify the environment from which the
given sample was taken a subsample of n fibres are to be measured and the
classification is to be made on the evidence of these measurements.
Do not be put off by the log-normal distribution. This simply means that it is the
logs of the data, rather than the original data, which have a normal distribution. If
X represents the log of a fibre length for fibres from A, then X ∼ N (0.8, 1).
(b) What sample size and decision procedures should be used if it is desired to have
error probabilities such that the chance of misclassifying as A is to be 5% and
the chance of misclassifying as B is to be 10%?
(c) If the sample is classified as A if the sample mean of log-lengths exceeds 0.75,
and the misclassification as A is to have a probability of 2%, what sample size
should be used and what is the probability of a B-type misclassification?
(d) If the sample comes from neither A nor B but from an environment with a
mean log-length of 0.70, what is the probability of classifying it as type A if the
decision procedure determined in (b) is applied?
Solution
(b) To find the sample size n and the value a, we need to solve two conditions:
• α = P(X̄ > a | H0) = P(Z > (a − 0.65)/(1/√n)) = 0.05 ⇒ (a − 0.65)√n = 1.645.
• β = P(X̄ < a | H1) = P(Z < (a − 0.80)/(1/√n)) = 0.10 ⇒ (a − 0.80)√n = −1.28.
Subtracting the second equation from the first gives 0.15√n = 2.925, so √n = 19.5
and n = 380.25, which we round up to n = 381; then a = 0.65 + 1.645/19.5 = 0.734.
So solving these equations gives a = 0.734 and n = 381, remembering to round up!
(d) The rule in (b) is ‘take n = 381 and reject H0 if x̄ > 0.734’. So:
P(X̄ > 0.734 | µ = 0.7) = P(Z > (0.734 − 0.7)/(1/√381)) = P(Z > 0.66) = 0.2546.
Turning Example 9.8 into a statistical problem, we assume that the data form a random
sample from N(µ, σ²). We are interested in testing the hypotheses:
H0: σ² = σ0² vs. H1: σ² > σ0².
Under H0, the test statistic is:
T = (n − 1)S²/σ0² = Σ_{i=1}^{n} (Xᵢ − X̄)²/σ0² ∼ χ²_{n−1}.
Since we will reject H0 against an alternative hypothesis σ 2 > σ02 , we should reject H0
for large values of T .
H0 is rejected if t > χ2α, n−1 , where χ2α, n−1 denotes the top 100αth percentile of the χ2n−1
distribution, i.e. we have:
P (T ≥ χ2α, n−1 ) = α.
For any σ² > σ0², the power of the test at σ is:
β(σ) = P_σ(T > χ²_{α, n−1}) = P((n − 1)S²/σ² > χ²_{α, n−1} × σ0²/σ²)
which is greater than α, as σ0²/σ² < 1, where (n − 1)S²/σ² ∼ χ²_{n−1} when σ² is the true
variance, instead of σ0². Note that here 1 − β(σ) is the probability of a Type II error.
Suppose we choose α = 0.05. For n = 25, χ2α, n−1 = χ20.05, 24 = 36.415.
With the given sample, s2 = 0.8088 and σ02 = 1, t = 24 × 0.8088 = 19.41 < χ20.05, 24 .
Hence we do not reject H0 at the 5% significance level. There is no significant evidence
from the data against the company’s claim that the variance is not beyond 1.
With σ0² = 1, the power function is:
σ²                    1        1.5      2        3        4
χ²_{0.05, 24}/σ²      36.415   24.277   18.208   12.138   9.104
β(σ)                  0.05     0.446    0.793    0.978    0.997
Approximate β(σ)      0.05     0.40     0.80     0.975    0.995
Activity 9.14 A machine is designed to fill bags of sugar. The weight of the bags is
normally distributed with standard deviation σ. If the machine is correctly
calibrated, σ should be no greater than 20 g. We collect a random sample of 18 bags
and weigh them. The sample standard deviation is found to be equal to 32.48 g. Is
there any evidence that the machine is incorrectly calibrated?
Solution
This is a hypothesis test for the variance of a normal population, so we will use the
chi-squared distribution. Let:
X₁, . . . , X₁₈ ∼ N(µ, σ²)
be the weights of the bags in the sample. An appropriate test has hypotheses:
H0: σ² = (20)² = 400 vs. H1: σ² > 400.
Under H0, the test statistic is T = (n − 1)S²/400 ∼ χ²_{17}, and we reject H0 at the
5% significance level if t > χ²_{0.05, 17} = 27.587. For the given sample:
t = 17 × (32.48)²/400 = 44.84 > 27.587
so we reject H0 and conclude that there is strong evidence that the machine is
incorrectly calibrated.
Activity 9.15 {X1 , . . . , X21 } represents a random sample of size 21 from a normal
population with mean µ and variance σ 2 .
(a) Construct a test procedure with a 5% significance level to test the null
hypothesis that σ 2 = 8 against the alternative that σ 2 > 8.
(b) Evaluate the power of the test for the values of σ 2 given below.
Solution
(a) We test:
H0 : σ 2 = 8 vs. H1 : σ 2 > 8.
The test statistic, under H0, is:
T = (n − 1)S²/σ0² = 20 × S²/8 ∼ χ²_{20}
and we reject H0 at the 5% significance level if:
t ≥ χ²_{0.05, 20} = 31.410.
(b) To evaluate the power, we need the probability of rejecting H0 (which happens
if t ≥ 31.410) conditional on the actual value of σ 2 , that is:
P(T ≥ 31.410 | σ² = k) = P(T × 8/k ≥ 31.410 × 8/k)
where T × 8/k = 20 × S²/k ∼ χ²_{20} when the true variance is k.
H0: µX = µY.    (9.1)
Are customers willing to pay more for the new product than the old one? Working
with the differences Zᵢ = Xᵢ − Yᵢ, which have mean and variance:
µ = µX − µY and σ² = σX² + σY²
the hypothesis (9.1) becomes:
H0: µ = 0.
Therefore, we should use the test statistic T = √n Z̄/S, where Z̄ and S² denote,
respectively, the sample mean and the sample variance of {Z₁, . . . , Zₙ}. At the
100α% significance level, for α ∈ (0, 1), we reject the hypothesis µX = µY when
|t| > t_{α/2, n−1}.
For the two-sample comparisons, define:
S_X² = (1/(n − 1)) Σ_{i=1}^{n} (Xᵢ − X̄)² and S_Y² = (1/(m − 1)) Σ_{i=1}^{m} (Yᵢ − Ȳ)².
X̄, Ȳ, S_X² and S_Y² are independent.
X̄ ∼ N(µX, σX²/n) and (n − 1)S_X²/σX² ∼ χ²_{n−1}.
Hence X̄ − Ȳ ∼ N(µX − µY, σX²/n + σY²/m). If σX² = σY², then:
[(X̄ − Ȳ − (µX − µY))/√(σX²/n + σY²/m)] / √[((n − 1)S_X²/σX² + (m − 1)S_Y²/σY²)/(n + m − 2)]
= √((n + m − 2)/(1/n + 1/m)) × (X̄ − Ȳ − (µX − µY))/√((n − 1)S_X² + (m − 1)S_Y²) ∼ t_{n+m−2}.
9.12.1 Tests on µX − µY with known σX² and σY²
Suppose we are interested in testing:
H0: µX = µY vs. H1: µX ≠ µY.
Note that:
(X̄ − Ȳ − (µX − µY))/√(σX²/n + σY²/m) ∼ N(0, 1).
Under H0, µX − µY = 0, so we have:
T = (X̄ − Ȳ)/√(σX²/n + σY²/m) ∼ N(0, 1).
At the 100α% significance level, for α ∈ (0, 1), we reject H0 if |t| > zα/2 , where
P (Z > zα/2 ) = α/2, for Z ∼ N (0, 1).
A 100(1 − α)% confidence interval for µX − µY is:
X̄ − Ȳ ± z_{α/2} × √(σX²/n + σY²/m).
Activity 9.16 Two random samples {X1 , . . . , Xn } and {Y1 , . . . , Ym } from two
normally distributed populations with variances of σX² = 41 and σY² = 15,
respectively, produced the following summary statistics: n = 50 with x̄ = 63, and
m = 45 with ȳ = 60.
(a) At the 5% significance level, test if the two population means are the same.
Find a 95% confidence interval for the difference between the two means.
2
(b) Repeat (a), but now with σX = 85 and σY2 = 42. Comment on the impact of
increasing the variances.
(c) Repeat (a), but now with the sample sizes n = 20 and m = 14 (i.e. using the
original variances). Comment on the impact of decreasing the sample sizes.
(d) Repeat (a), but now with x̄ = 61.5 (i.e. using the original variances and sample
sizes), and comment.
294
9.12. Comparing two normal means
Solution
(a) We test H0: µX = µY vs. H1: µX ≠ µY. Under H0, the test statistic is:
T = (X̄ − Ȳ)/√(σX²/n + σY²/m) ∼ N(0, 1).
At the 5% significance level we reject H0 if |t| > z0.025 = 1.96. With the given
data, t = 2.79. Hence we reject H0 (the p-value is
2 × P (Z ≥ 2.79) = 0.00528 < 0.05 = α).
The 95% confidence interval for µX − µY obtained from the data is:
x̄ − ȳ ± 1.96 × √(σX²/n + σY²/m) = 3 ± 1.96 × √(41/50 + 15/45) = 3 ± 2.105 ⇒ (0.895, 5.105).
(b) With σX² = 85 and σY² = 42, now t = 1.85. So, since 1.85 < 1.96, we cannot
reject H0 at the 5% significance level (the p-value is
2 × P(Z ≥ 1.85) = 0.0644 > 0.05 = α). The confidence interval is
3 ± 3.181 = (−0.181, 6.181), which is much wider and contains 0, the
hypothesised value under H0.
Comparing with the results in (a) above, the statistical inference becomes less
conclusive. This is due to the increase in the variances of the populations: as the
'randomness' increases, we are less certain about the parameters given the same
amount of information. This also indicates that it is not enough to look only at
the sample means, even if we are only concerned with the population means.
(c) With n = 20 and m = 14, now t = 1.70. Therefore, since 1.70 < 1.96, we cannot
reject H0 at the 5% significance level (the p-value is
2 × P (Z ≥ 1.70) = 0.0892 > 0.05 = α). The confidence interval is
3 ± 3.463 = (−0.463, 6.463) which is much wider than that obtained in (a), and
contains 0 as well. This indicates that the difference of 3 units between the
sample means is significant for the sample sizes (50, 45), but is not significant
for the sample sizes (20, 14).
(d) With x̄ = 61.5, now t = 1.40. Again, since 1.40 < 1.96, we cannot reject H0 at
the 5% significance level (the p-value is 2 × P (Z ≥ 1.40) = 0.1616 > 0.05 = α).
The confidence interval is 1.5 ± 2.105 = (−0.605, 3.605). Comparing with (a),
the difference between the sample means is not significant enough to reject H0,
although everything else is unchanged.
Activity 9.17 Suppose that we have two independent samples from normal
populations with known variances. We want to test the H0 that the two population
means are equal against the alternative that they are different. We could use each
sample by itself to write down 95% confidence intervals and reject H0 if these
intervals did not overlap. What would be the significance level of this test?
Solution
Let us assume H0: µX = µY is true. The two 95% confidence intervals do not
overlap if and only if:
X̄ − 1.96 × σX/√n ≥ Ȳ + 1.96 × σY/√m or Ȳ − 1.96 × σY/√m ≥ X̄ + 1.96 × σX/√n
i.e. if and only if |X̄ − Ȳ| ≥ 1.96 × (σX/√n + σY/√m). The probability of this event,
which is the significance level of the procedure, is:
P(|X̄ − Ȳ|/√(σX²/n + σY²/m) ≥ 1.96 × (σX/√n + σY/√m)/√(σX²/n + σY²/m)).
So we have:
P(|Z| ≥ 1.96 × (σX/√n + σY/√m)/√(σX²/n + σY²/m))
where Z ∼ N(0, 1). This does not reduce in general, but if we assume n = m and
σX² = σY², then it reduces to:
P(|Z| ≥ 1.96 × √2) = 0.0056.
The significance level is about 0.6%, which is much smaller than the usual
conventions of 5% and 1%. Putting variability into two confidence intervals makes
them more likely to overlap than you might think, and so your chance of incorrectly
rejecting the null hypothesis is smaller than you might expect!
9.12.2 Tests on µX − µY with σX² = σY² but unknown
This time we consider the following hypotheses:
H0 : µX − µY = δ0 vs. H1 : µX − µY > δ0
At the 100α% significance level, for α ∈ (0, 1), we reject H0 if t > tα, n+m−2 , where
P (T > tα, n+m−2 ) = α, for T ∼ tn+m−2 .
A 100(1 − α)% confidence interval for µX − µY is:
X̄ − Ȳ ± t_{α/2, n+m−2} × √[((1/n + 1/m)/(n + m − 2)) × ((n − 1)S_X² + (m − 1)S_Y²)].
Example 9.10 Two types of razor, A and B, were compared using 100 men in an
experiment. Each man shaved one side, chosen at random, of his face using one razor
and the other side using the other razor. The times taken to shave, Xi and Yi
minutes, for i = 1, . . . , 100, corresponding to the razors A and B, respectively, were
recorded, yielding:
H0: µX = µY vs. H1: µX ≠ µY.
There are three approaches – a paired comparison method and two two-sample
comparisons based on different assumptions. Since the data are recorded in pairs,
the paired comparison is most relevant and effective to analyse these data.
(X̄ − Ȳ − (µX − µY))/√(σX²/100 + σY²/100) ∼ N(0, 1).
Hence we reject H0 when |t| > 1.96 at the 5% significance level, where:
T = (X̄ − Ȳ)/√(σX²/100 + σY²/100).
For the given data, t = −0.18/√0.009 = −1.9. Hence we cannot reject H0.
A 95% confidence interval for µX − µY is:
x̄ − ȳ ± 1.96 × √(σX²/100 + σY²/100) = −0.18 ± 0.186 ⇒ (−0.366, 0.006).
T = 10(X̄ − Ȳ)/√(S_X² + S_Y²).
For the given data, t = −1.897. Hence we cannot reject H0 at the 5% significance
level.
A 95% confidence interval for µX − µY is:
x̄ − ȳ ± t_{0.025, 198} × √((s_X² + s_Y²)/100) = −0.18 ± 0.1870 ⇒ (−0.367, 0.007)
which contains 0.
ii. The paired comparison is intuitively the most relevant, requires the least
assumptions, and leads to the most conclusive inference (i.e. rejection of H0 ). It
also produces the narrowest confidence interval.
iii. Methods II and III ignore the pairing of the data. Consequently, the inference is
less conclusive and less accurate.
iv. A general observation is that H0 is rejected at the 100α% significance level if and
only if the value hypothesised by H0 is not within the corresponding 100(1 − α)%
confidence interval.
v. It is much more challenging to compare two normal means with unknown and
unequal variances. This will not be discussed in this course.
Activity 9.18 The weights (in grammes) of a group of five-week-old chickens reared
on a high-protein diet are 336, 421, 310, 446, 390 and 434. The weights of a second
group of chickens similarly reared, except for their low-protein diet, are 224, 275,
393, 282 and 365. Is there evidence that the additional protein has increased the
average weight of the chickens? Assume normality.
Solution
Assuming normally-distributed populations with possibly different means, but the
same variance, we test:
H0 : µX = µY vs. H1 : µX > µY .
The sample means and standard deviations are x̄ = 389.5, ȳ = 307.8, sX = 55.40 and
sY = 69.45. The test statistic and its distribution under H0 are:
T = √((n + m − 2)/(1/n + 1/m)) × (X̄ − Ȳ)/√((n − 1)S_X² + (m − 1)S_Y²) ∼ t_{n+m−2}
and we obtain, for the given data, t = 2.175 > 1.833 = t0.05, 9 hence we reject H0 that
the mean weights are equal and conclude that the mean weight for the high-protein
diet is greater at the 5% significance level.
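The same pooled-variance t test can be performed in R (data as given in the activity; variable names are our own):

high <- c(336, 421, 310, 446, 390, 434)
low  <- c(224, 275, 393, 282, 365)
t.test(high, low, var.equal = TRUE, alternative = "greater")
# t = 2.175 on 9 degrees of freedom, one-sided p-value approximately 0.03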
Activity 9.19
(a) Two independent random samples, of n₁ and n₂ observations, are drawn from
normal distributions with the same variance σ². Let S₁² and S₂² be the sample
variances of the first and the second samples, respectively. Show that:
σ̂² = ((n₁ − 1)S₁² + (n₂ − 1)S₂²)/(n₁ + n₂ − 2)
is an unbiased estimator of σ².
Hint: Remember the expectation of a chi-squared variable is its degrees of
freedom.
(b) Two makes of car safety belts, A and B, have breaking strengths which are
normally distributed with the same variance. A random sample of 140 belts of
make A and a random sample of 220 belts of make B were tested. The sample
means, and the sums of squares about the means (i.e. Σᵢ (xᵢ − x̄)²), of the
breaking strengths (in lbf units) were (2685, 19000) for make A, and (2680,
34000) for make B, respectively. Is there significant evidence to support the
hypothesis that belts of make A are stronger on average than belts of make B?
Assume a 1% significance level.
Solution
(a) We first note that (nᵢ − 1)Sᵢ²/σ² ∼ χ²_{nᵢ−1}. By the definition of χ² distributions,
we have:
E[(nᵢ − 1)Sᵢ²] = (nᵢ − 1)σ² for i = 1, 2.
Hence:
E[σ̂²] = E[((n₁ − 1)S₁² + (n₂ − 1)S₂²)/(n₁ + n₂ − 2)]
= (E[(n₁ − 1)S₁²] + E[(n₂ − 1)S₂²])/(n₁ + n₂ − 2)
= ((n₁ − 1)σ² + (n₂ − 1)σ²)/(n₁ + n₂ − 2)
= σ².
(b) Denote x̄ = 2685 and ȳ = 2680, then 139s_X² = 19000 and 219s_Y² = 34000.
We test H0: µX = µY vs. H1: µX > µY. Under H0 we have:
X̄ − Ȳ ∼ N(0, σ²(1/140 + 1/220)) = N(0, 0.01169σ²)
and:
(139S_X² + 219S_Y²)/σ² ∼ χ²_{358}.
Hence:
T = [(X̄ − Ȳ)/√0.01169] / √[(139S_X² + 219S_Y²)/358] ∼ t_{358}
under H0 . We reject H0 if t > t0.01, 358 ≈ 2.326. Since we observe t = 3.801 we
reject H0 , i.e. there is significant evidence to suggest that belts of make A are
stronger on average than belts of make B.
i. ρ ∈ [−1, 1], and |ρ| = 1 if and only if Y = aX + b for some constants a and b.
Furthermore, a > 0 if ρ = 1, and a < 0 if ρ = −1.
ii. ρ measures only the linear relationship between X and Y . When ρ = 0, X and Y
are linearly independent, that is uncorrelated.
iii. If X and Y are independent (in the sense that the joint pdf is the product of the
two marginal pdfs), ρ = 0. However, if ρ = 0, X and Y are not necessarily
independent, as there may exist some non-linear relationship between X and Y .
iv. If ρ > 0, X and Y tend to increase (or decrease) together. If ρ < 0, X and Y tend
to move in opposite directions.
The sample correlation coefficient is:
ρ̂ = Σ_{i=1}^{n} (Xᵢ − X̄)(Yᵢ − Ȳ) / √(Σ_{i=1}^{n} (Xᵢ − X̄)² Σ_{i=1}^{n} (Yᵢ − Ȳ)²)
where X̄ = Σ_{i=1}^{n} Xᵢ/n and Ȳ = Σ_{i=1}^{n} Yᵢ/n.
Example 9.11 The measurements of height, X, and weight, Y , are taken from 69
students in a class. ρ should be positive, intuitively!
In Figure 9.5, the vertical line at x̄ and the horizontal line at ȳ divide the 69 points
into 4 quadrants: northeast (NE), southwest (SW), northwest (NW) and southeast
(SE). Most points are in either NE or SW.
Overall, Σ_{i=1}^{69} (xᵢ − x̄)(yᵢ − ȳ) > 0 and hence ρ̂ > 0.
i=1
Figure 9.6 shows examples of different sample correlation coefficients using scatterplots
of bivariate observations.
H0: ρ = 0 vs. H1: ρ ≠ 0.
Under H0, the test statistic is:
T = ρ̂√(n − 2)/√(1 − ρ̂²) ∼ t_{n−2}.
Hence we reject H0 at the 100α% significance level, for α ∈ (0, 1), if |t| > t_{α/2, n−2},
where:
P(T > t_{α/2, n−2}) = α/2.
Some remarks are the following.
i. |T| = |ρ̂|√((n − 2)/(1 − ρ̂²)) increases as |ρ̂| increases.
Activity 9.20 The following table shows the number of salespeople employed by a
company and the corresponding value of sales (in £000s):
Compute the sample correlation coefficient for these data and carry out a formal test
for a (linear) relationship between the number of salespeople and sales.
Note that:
Σxᵢ = 2,616, Σyᵢ = 2,520, Σxᵢ² = 571,500, Σyᵢ² = 529,746 and Σxᵢyᵢ = 550,069.
Solution
We test:
H0 : ρ = 0 vs. H1 : ρ > 0.
The corresponding test statistic and its distribution under H0 are:
T = ρ̂√(n − 2)/√(1 − ρ̂²) ∼ t_{n−2}.
We find ρb = 0.8716 and obtain t = 5.62 > 2.764 = t0.01, 10 and so we reject H0 at the
1% significance level. Since the test is highly significant, there is overwhelming
evidence of a (linear) relationship between the number of salespeople and the value
of sales.
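A minimal R sketch of this calculation from the given sums (n = 12 pairs, deduced from the use of t_{0.01, 10}; the variable names are our own):

n <- 12
sx <- 2616; sy <- 2520; sxx <- 571500; syy <- 529746; sxy <- 550069
r <- (sxy - sx * sy / n) / sqrt((sxx - sx^2 / n) * (syy - sy^2 / n))
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
c(r, t_stat)   # approximately 0.8716 and 5.62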
Activity 9.21
(a) Test the null hypothesis H0: ρ = 0 against the alternative hypothesis H1: ρ < 0
at the 5% significance level with the sample size n = 10.
(b) Repeat (a) for n = 500.
Solution
We have:
ρ̂ = Σ_{i=1}^{n} (Xᵢ − X̄)(Yᵢ − Ȳ) / √(Σ_{i=1}^{n} (Xᵢ − X̄)² Σ_{j=1}^{n} (Yⱼ − Ȳ)²) = (Σ_{i=1}^{n} XᵢYᵢ − nX̄Ȳ)/((n − 1)S_X S_Y).
(a) For n = 10, −t0.05, n−2 = −1.860, ρb = −0.124 and t = −0.355. Hence we cannot
reject H0 , so there is no evidence that X and Y are correlated.
(b) For n = 500, −t0.05, n−2 ≈ −1.645, ρb = −0.112 and t = −2.52. Hence we reject
H0 , so there is significant evidence that X and Y are correlated.
Note that the sample correlation coefficient ρb = −0.124 is not significantly different
from 0 when the sample size is 10. However, ρb = −0.112 is significantly different
from 0 when the sample size is 500!
Since (n − 1)S_X²/σX² ∼ χ²_{n−1} and (m − 1)S_Y²/σY² ∼ χ²_{m−1} independently, we have:
(σY²/σX²) × (S_X²/S_Y²) = (S_X²/σX²)/(S_Y²/σY²) ∼ F_{n−1, m−1}.
Under H0: σY²/σX² = k, T = kS_X²/S_Y² ∼ F_{n−1, m−1}. Hence H0 is rejected if:
t < F1−α/2, n−1, m−1 or t > Fα/2, n−1, m−1
where Fα, p, k denotes the top 100αth percentile of the Fp, k distribution, that is:
P (T > Fα, p, k ) = α
available from Table 12 of the New Cambridge Statistical Tables.
Since:
P(F_{1−α/2, n−1, m−1} ≤ (σY²/σX²) × (S_X²/S_Y²) ≤ F_{α/2, n−1, m−1}) = 1 − α
a 100(1 − α)% confidence interval for σY²/σX² is:
(F_{1−α/2, n−1, m−1} × S_Y²/S_X², F_{α/2, n−1, m−1} × S_Y²/S_X²).
Example 9.12 Here we practise use of Table 12 of the New Cambridge Statistical
Tables to obtain critical values for the F distribution.
Table 12 can be used to find the top 100αth percentile of the Fν1 , ν2 distribution for
α = 0.10, 0.05, 0.025, 0.01, 0.005 and 0.001 using Tables 12(a) to 12(f), respectively.
For example, for ν₁ = 3 and ν₂ = 5, Table 12(b) gives the top 5th percentile as
F_{0.05, 3, 5} = 5.41.
Example 9.13 The daily returns (in percentages) of two assets, X and Y , are
recorded over a period of 100 trading days, yielding average daily returns of x̄ = 3.21
and ȳ = 1.41. Also available from the data are the following quantities:
100
X 100
X 100
X
x2i = 1989.24, yi2 = 932.78 and xi yi = 661.11.
i=1 i=1 i=1
Assume the data are normally distributed. Are the two assets positively correlated
with each other, and is asset X riskier than asset Y ?
With n = 100 we have:
s_X² = (1/(n − 1)) Σ_{i=1}^{n} (xᵢ − x̄)² = (1/(n − 1))(Σ_{i=1}^{n} xᵢ² − nx̄²) = 9.69
and:
s_Y² = (1/(n − 1)) Σ_{i=1}^{n} (yᵢ − ȳ)² = (1/(n − 1))(Σ_{i=1}^{n} yᵢ² − nȳ²) = 7.41.
Therefore:
ρ̂ = Σ_{i=1}^{n} (xᵢ − x̄)(yᵢ − ȳ)/((n − 1)s_X s_Y) = (Σ_{i=1}^{n} xᵢyᵢ − nx̄ȳ)/((n − 1)s_X s_Y) = 0.249.
First we test:
H0 : ρ = 0 vs. H1 : ρ > 0.
Under H0, the test statistic is:
T = ρ̂√((n − 2)/(1 − ρ̂²)) ∼ t_{98}.
Setting α = 0.01, we reject H0 if t > t0.01, 98 = 2.37. With the given data, t = 2.545
hence we reject the null hypothesis of ρ = 0 at the 1% significance level. We
conclude that there is highly significant evidence indicating that the two assets are
positively correlated.
We measure the risks in terms of variances, and test:
H0: σX² = σY² vs. H1: σX² > σY².
Under H0, T = S_X²/S_Y² ∼ F_{99, 99}. Hence we reject H0 if t > F_{0.05, 99, 99} = 1.39 at the
5% significance level, using Table 12(b) of the New Cambridge Statistical Tables.
With the given data, t = 9.69/7.41 = 1.308. Therefore, we cannot reject H0 . As the
test is not significant at the 5% significance level, we may not conclude that the
variances of the two assets are significantly different. Therefore, there is no
significant evidence indicating that asset X is riskier than asset Y .
Strictly speaking, the test is valid only if the two samples are independent of each
other, which is not the case here.
Activity 9.22 Two independent samples from normal populations yield the
following results:
Sample 1:  n = 5,  Σ(xᵢ − x̄)² = 4.8
Sample 2:  m = 7,  Σ(yᵢ − ȳ)² = 37.2
Test at the 5% significance level whether the population variances are the same
based on the above data.
Solution
We test:
H0: σ₁² = σ₂² vs. H1: σ₁² ≠ σ₂².
Under H0 , the test statistic is:
T = S₁²/S₂² ∼ F_{n−1, m−1} = F_{4, 6}.
Critical values are F0.975, 4, 6 = 1/F0.025, 6, 4 = 1/9.20 = 0.11 and F0.025, 4, 6 = 6.23,
using Table 12 of the New Cambridge Statistical Tables. The test statistic value is:
4.8/4
t= = 0.1935
37.2/6
and since 0.11 < 0.1935 < 6.23 we do not reject H0 , which means there is no
evidence of a difference in the variances.
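The critical values and test statistic are quick to reproduce in R:

t_stat <- (4.8 / 4) / (37.2 / 6)        # 0.1935
qf(c(0.025, 0.975), df1 = 4, df2 = 6)   # approximately 0.109 and 6.23
# 0.109 < 0.1935 < 6.23, so H0 is not rejected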
Activity 9.23 Class A was taught using detailed PowerPoint slides. The marks in
the final examination for a random sample of Class A students were:
Students in Class B were required to read textbooks and answer questions in class
discussions. The marks in the final examination for a random sample of Class B
students were:
48, 50, 42, 53, 81, 59, 64, 45.
Assuming examination marks are normally distributed, can we infer that the
variances of the marks differ between the two classes? Test at the 5% significance
level.
Solution
We test H0: σ_A² = σ_B² vs. H1: σ_A² ≠ σ_B². Under H0 we have:
T = S_A²/S_B² ∼ F_{n_A−1, n_B−1}.
For the given data, n_A = 9, s_A² = 176.778, n_B = 8 and s_B² = 159.929. Setting
α = 0.05, the critical values are F_{0.975, 8, 7} = 0.221 and F_{0.025, 8, 7} = 4.90. Since:
t = 176.778/159.929 = 1.105 and 0.221 < 1.105 < 4.90
we do not reject H0: there is no evidence that the variances of the marks differ
between the two classes.
Activity 9.24 After the machine in Activity 9.14 is calibrated, we collect a new
sample of 21 bags. The sample standard deviation of their weights is 23.72 g. Based
on this sample, can you conclude that the calibration has reduced the variance of the
weights of the bags?
Solution
Let:
Y₁, . . . , Y₂₁ ∼ N(µY, σY²)
be the weights of the bags in the new sample, and use σX² to denote the variance of
the distribution of the previous sample, to avoid confusion. We want to test for a
reduction in variance, so we set:
H0: σX²/σY² = 1 vs. H1: σX²/σY² > 1.
If the null hypothesis is true, the test statistic T = S_X²/S_Y² will follow an
F_{18−1, 21−1} = F_{17, 20} distribution. For the given data:
t = (32.48)²/(23.72)² = 1.875.
At the 5% significance level, the upper-tail critical value of the F_{17, 20} distribution is
F_{0.05, 17, 20} = 2.17. Our test statistic does not exceed this value, so we cannot reject
the null hypothesis.
We move to the 10% significance level. The upper-tail critical value is
F0.10, 17, 20 = 1.821, so we can now reject the null hypothesis (if only barely). We
conclude that there is some evidence that the variance is reduced, but it is not very
strong evidence.
Notice the difference between the conclusions of these two tests. We have a much
more powerful test when we compare our sample standard deviation of 32.48 g to
the fixed calibration value than when we compare it to the estimated standard
deviation of 23.72 g from the second sample, even though the values involved are
similar.
The following summarises the tests for comparing two samples.
• Null hypothesis H0: µX − µY = δ (σX², σY² known). Test statistic:
T = (X̄ − Ȳ − δ)/√(σX²/n + σY²/m), which is N(0, 1) under H0.
• Null hypothesis H0: µX − µY = δ (σX² = σY², unknown). Test statistic:
T = √((n + m − 2)/(1/n + 1/m)) × (X̄ − Ȳ − δ)/√((n − 1)S_X² + (m − 1)S_Y²), which is t_{n+m−2} under H0.
• Null hypothesis H0: ρ = 0 (with n = m paired observations). Test statistic:
T = ρ̂√(n − 2)/√(1 − ρ̂²), which is t_{n−2} under H0.
• Null hypothesis H0: σY²/σX² = k. Test statistic:
T = kS_X²/S_Y², which is F_{n−1, m−1} under H0.
1. Suppose that one observation, i.e. n = 1, is taken from the geometric distribution:
p(x; π) = (1 − π)^{x−1} π for x = 1, 2, . . . and p(x; π) = 0 otherwise.
(a) What is the probability that a Type II error will be committed when the true
parameter value is π = 0.4?
2. Let X have a Poisson distribution with mean λ. We want to test the null
hypothesis that λ = 1/2 against the alternative λ = 2. We reject the null
hypothesis if and only if x > 1. Calculate the size and power of the test. You may
use the approximate value e ≈ 2.718.
Chapter 10
Analysis of variance (ANOVA)
This chapter shows how to:
• restate and interpret the models for one-way and two-way analysis of variance
• perform hypothesis tests and construct confidence intervals for one-way and
two-way analysis of variance.
10.3 Introduction
Analysis of variance (ANOVA) is a popular tool which has an applicability and power
which we can only start to appreciate in this course. The idea of analysis of variance is
to investigate how variation in structured data can be split into pieces associated with
components of that structure. We look only at one-way and two-way classifications,
providing tests and confidence intervals which are widely used in practice.
Example 10.1 To assess the teaching quality of class teachers, a random sample of
6 examination marks was selected from each of three classes. The examination marks
for each class are listed in the table below.
Can we infer from these data that there is no significant difference in the
examination marks among all three classes?
Suppose examination marks from Class j follow the distribution N (µj , σ 2 ), for
j = 1, 2, 3. So we assume examination marks are normally distributed with the same
variance in each class, but possibly different means.
We need to test the hypothesis:
H0 : µ1 = µ2 = µ3 .
The data form a 6 × 3 array. Denote the data point at the (i, j)th position as Xij .
We compute the column means first, where the jth column mean is:
X̄·j = (X_{1j} + X_{2j} + · · · + X_{n_j j})/n_j.
            Observation
            1    2    3    4    5    6    Mean
Class 1     85   75   82   76   71   85   79
Class 2     71   75   73   74   69   82   74
Class 3     59   64   62   69   75   67   66
Note that similar problems arise in many other practical situations.
If H0 is true, the three observed sample means x̄·1 , x̄·2 and x̄·3 should be very close to
each other, i.e. all of them should be close to the overall sample mean, x̄, which is:
x̄ = (x̄·1 + x̄·2 + x̄·3)/3 = (79 + 74 + 66)/3 = 73
Hence we would reject H0 for large values of T . (Note t = 0 if x̄·1 = x̄·2 = x̄·3 which
would mean that there is no variation at all between the sample means. In this case
all the sample means would equal x̄.)
It remains to determine the distribution of T under H0 .
where n = Σ_{j=1}^{k} n_j is the total number of observations across all k groups,
with n − k = Σ_{j=1}^{k} (n_j − 1) degrees of freedom.
j=1
We have already discussed the jth sample mean and overall sample mean. The total
variation is a measure of the overall (total) variability in the data from all k groups
about the overall sample mean. The ANOVA decomposition decomposes this into two
components: between-groups variation (which is attributable to the factor level) and
within-groups variation (which is attributable to the variation within each group and is
assumed to be the same σ 2 for each group).
Some remarks are the following.
ii. W/σ² = Σ_{j=1}^{k} Σ_{i=1}^{n_j} (X_{ij} − X̄·j)²/σ² ∼ χ²_{n−k}.
iii. Under H0: µ₁ = · · · = µ_k, then B/σ² = Σ_{j=1}^{k} n_j(X̄·j − X̄)²/σ² ∼ χ²_{k−1}.
p-value = P (F > f ).
It is clear that f > Fα, k−1, n−k if and only if the p-value < α, as we must reach the same
conclusion regardless of whether we use the critical value approach or the p-value
approach to hypothesis testing.
Example 10.2 Continuing with Example 10.1, for the given data, k = 3,
n1 = n2 = n3 = 6, n = n1 + n2 + n3 = 18, x̄·1 = 79, x̄·2 = 74, x̄·3 = 66 and x̄ = 73.
The sample variances are calculated to be s21 = 34, s22 = 20 and s23 = 32. Therefore:
b = Σ_{j=1}^{3} 6(x̄·j − x̄)² = 6[(79 − 73)² + (74 − 73)² + (66 − 73)²] = 516
and:
w = Σ_{j=1}^{3} Σ_{i=1}^{6} (x_{ij} − x̄·j)² = Σ_{j=1}^{3} Σ_{i=1}^{6} x_{ij}² − 6 Σ_{j=1}^{3} x̄·j²
= Σ_{j=1}^{3} 5s_j²
= 5(34 + 20 + 32)
= 430.
Hence:
f = (b/(k − 1))/(w/(n − k)) = (516/2)/(430/15) = 9.
Under H0 : µ1 = µ2 = µ3 , F ∼ Fk−1, n−k = F2, 15 . Since F0.01, 2, 15 = 6.359 < 9, using
Table 12(d) of the New Cambridge Statistical Tables, we reject H0 at the 1%
significance level. In fact the p-value (using a computer) is P (F > 9) = 0.003.
Therefore, we conclude that there is a significant difference among the mean
examination marks across the three classes.
Source DF SS MS F p-value
Class 2 516 258 9 0.003
Error 15 430 28.67
Total 17 946
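This ANOVA table can be reproduced in R; a minimal sketch using the data of Example 10.1 (variable names are our own):

marks <- c(85, 75, 82, 76, 71, 85,   # Class 1
           71, 75, 73, 74, 69, 82,   # Class 2
           59, 64, 62, 69, 75, 67)   # Class 3
class <- factor(rep(c("Class 1", "Class 2", "Class 3"), each = 6))
summary(aov(marks ~ class))
# F = 9 on (2, 15) degrees of freedom, p-value approximately 0.003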
> attach(UhAh)
> summary(UhAh)
Frequency Department
Min. : 0.00 English :100
1st Qu.: 4.00 Mathematics :100
Median : 5.00 Political Science:100
Mean : 5.48
3rd Qu.: 7.00
Max. :11.00
> xbar <- tapply(Frequency, Department, mean)
> s <- tapply(Frequency, Department, sd)
> n <- tapply(Frequency, Department, length)
> sem <- s/sqrt(n)
> list(xbar,s,n,sem)
[[1]]
English Mathematics Political Science
5.81 5.30 5.33
[[2]]
English Mathematics Political Science
2.493203 2.012587 1.974867
[[3]]
English Mathematics Political Science
100 100 100
[[4]]
English Mathematics Political Science
0.2493203 0.2012587 0.1974867
Surprisingly, professors in English say ‘uh’ or ‘ah’ more on average than those in
Mathematics and Political Science (compare the sample means of 5.81, 5.30 and
5.33), but the difference seems small. However, we need to formally test whether the
(seemingly small) differences are statistically significant.
Using the data, R produces the following one-way ANOVA table:
Response: Frequency
Df Sum Sq Mean Sq F value Pr(>F)
Department 2 16.38 8.1900 1.7344 0.1783
Residuals 297 1402.50 4.7222
Since the p-value for the F test is 0.1783, we cannot reject the following hypothesis:
H0 : µ1 = µ2 = µ3 .
An estimator of σ is:
σ̂ = S = √(W/(n − k)).
95% confidence intervals for µ_j are given by:
X̄·j ± t_{0.025, n−k} × S/√n_j for j = 1, . . . , k
where t0.025, n−k is the top 2.5th percentile of the Student’s tn−k distribution, which
can be obtained from Table 10 of the New Cambridge Statistical Tables.
Example 10.4 Assuming a common variance for each group, from the preceding
output in Example 10.3 we see that:
σ̂ = s = √(1402.50/297) = √4.72 = 2.173.
Since t0.025, 297 ≈ t0.025, ∞ = 1.96, using Table 10 of the New Cambridge Statistical
Tables, we obtain the following 95% confidence intervals for µ1 , µ2 and µ3 ,
respectively:
j = 1: 5.81 ± 1.96 × 2.173/√100 ⇒ (5.38, 6.24)
j = 2: 5.30 ± 1.96 × 2.173/√100 ⇒ (4.87, 5.73)
j = 3: 5.33 ± 1.96 × 2.173/√100 ⇒ (4.90, 5.76).
Example 10.5 In early 2001, the American economy was slowing down and
companies were laying off workers. A poll conducted during February 2001 asked a
random sample of workers how long (in months) it would be before they faced
significant financial hardship if they lost their jobs. They are classified into four
groups according to their incomes. Below is part of the R output of the descriptive
statistics of the classified data. Can we infer that income group has a significant
impact on the mean length of time before facing financial hardship?
Hardship Income.group
Min. : 0.00 $20 to 30K: 81
1st Qu.: 8.00 $30 to 50K:114
Median :15.00 Over $50K : 39
Mean :16.11 Under $20K: 67
3rd Qu.:22.00
Max. :50.00
[[2]]
$20 to 30K $30 to 50K Over $50K Under $20K
9.233260 9.507464 11.029099 8.087043
[[3]]
$20 to 30K $30 to 50K Over $50K Under $20K
81 114 39 67
[[4]]
$20 to 30K $30 to 50K Over $50K Under $20K
1.0259178 0.8904556 1.7660693 0.9879896
Inspection of the sample means suggests that there is a difference between income
groups, but we need to conduct a one-way ANOVA test to see whether the
differences are statistically significant.
We apply one-way ANOVA to test whether the means in the k = 4 groups are equal, i.e. H0: µ1 = µ2 = µ3 = µ4, where the groups are numbered from the highest to the lowest income group.
We have n1 = 39, n2 = 114, n3 = 81 and n4 = 67, hence:
n = n1 + n2 + n3 + n4 = 39 + 114 + 81 + 67 = 301.
Also x̄·1 = 22.21, x̄·2 = 18.456, x̄·3 = 15.49 and x̄·4 = 9.313, and:
x̄ = (1/n) Σj nj x̄·j = (39 × 22.21 + 114 × 18.456 + 81 × 15.49 + 67 × 9.313)/301 = 16.109.
Now:
b = Σj nj (x̄·j − x̄)² = 39(22.21 − 16.109)² + 114(18.456 − 16.109)² + 81(15.49 − 16.109)² + 67(9.313 − 16.109)² = 5205.097.
We have s1² = (11.03)² = 121.661, s2² = (9.507)² = 90.383, s3² = (9.23)² = 85.193 and s4² = (8.087)² = 65.400. Hence:
w = Σj (nj − 1) sj² = 38 × 121.661 + 113 × 90.383 + 80 × 85.193 + 66 × 65.400 = 25968.24.
Consequently:
f = [b/(k − 1)] / [w/(n − k)] = (5205.097/3) / (25968.24/(301 − 4)) = 19.84.
Under H0 , F ∼ Fk−1, n−k = F3, 297 . Since F0.01, 3, 297 ≈ 3.848 < 19.84, we reject H0 at
the 1% significance level, i.e. there is strong evidence that income group has a
significant impact on the mean length of time before facing financial hardship.
The pooled estimate of σ is:
s = √(w/(n − k)) = √(25968.24/(301 − 4)) = 9.351.
For comparison, the one-way ANOVA table produced by R is:
Response: Hardship
Df Sum Sq Mean Sq F value Pr(>F)
Income.group 3 5202.1 1734.03 19.828 9.636e-12 ***
Residuals 297 25973.3 87.45
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that minor differences are due to rounding errors in calculations.
Activity 10.1 Show that under the one-way ANOVA assumptions, for any set of constants {a1, . . . , ak}, the quantity Σj aj X̄·j is normally distributed with mean Σj aj µj and variance σ² Σj aj²/nj.
Solution
Under the one-way ANOVA assumptions, Xij ∼ IID N(µj, σ²) within each j = 1, . . . , k. Therefore, since the Xij s are independent with a common variance, σ², we have:
X̄·j ∼ N(µj, σ²/nj)   for j = 1, . . . , k.
Hence:
aj X̄·j ∼ N(aj µj, aj² σ²/nj)   for j = 1, . . . , k.
Therefore:
Σj aj X̄·j ∼ N(Σj aj µj, σ² Σj aj²/nj).
Activity 10.2 Do the following data appear to violate the assumptions underlying
one-way analysis of variance? Explain why or why not.
Treatment
A B C D
1.78 8.41 0.57 9.45
8.26 5.61 3.04 8.47
3.57 3.90 2.67 7.69
4.69 3.77 1.66 8.53
2.13 1.08 2.09 9.04
6.17 2.67 1.57 7.11
Solution
We have sA² = 6.1632, sB² = 6.4106, sC² = 0.7715 and sD² = 0.7400. Although sA² ≈ sB² and sC² ≈ sD², the first pair of sample variances is roughly eight times the size of the second pair, suggesting that the assumption that σ² is the same for all treatment levels may not be true.
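A sketch of how these sample variances can be checked in R, entering the treatment data by hand:
> # Sample variances for treatments A to D (sketch)
> y <- c(1.78, 8.26, 3.57, 4.69, 2.13, 6.17,   # A
+        8.41, 5.61, 3.90, 3.77, 1.08, 2.67,   # B
+        0.57, 3.04, 2.67, 1.66, 2.09, 1.57,   # C
+        9.45, 8.47, 7.69, 8.53, 9.04, 7.11)   # D
> treatment <- rep(c("A", "B", "C", "D"), each = 6)
> tapply(y, treatment, var)   # compare with 6.1632, 6.4106, 0.7715, 0.7400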
Activity 10.3 An indicator of the value of a stock relative to its earnings is its
price-earnings ratio: the average of a given year’s high and low selling prices divided
by its annual earnings. The following table provides the price-earnings ratios for a
sample of 27 stocks, nine each from the financial, industrial and utility sectors of the
New York Stock Exchange. Test at the 1% significance level whether the true mean
price-earnings ratios for the three market sectors are the same. Use the ANOVA
table format to summarise your calculations. You may exclude the p-value.
Solution
For these n = 27 observations and k = 3 groups, we have x̄·1 = 12.46, x̄·2 = 18.08, x̄·3 = 14.96 and x̄ = 15.16. Also:
Σj Σi xij² = 6548.3.
The between-groups sum of squares is:
b = 9 × [(12.46 − 15.16)² + (18.08 − 15.16)² + (14.96 − 15.16)²] = 142.82.
Source DF SS MS F
Sector 2 142.82 71.41 8.67
Error 24 197.76 8.24
Total 26 340.58
To test the null hypothesis that the three types of stocks have equal price-earnings ratios, on average, we reject H0 if f > F0.01, 2, 24 = 5.61. Since 5.61 < 8.67, we reject H0 and conclude that there is strong evidence of a difference in the mean price-earnings ratios across the sectors.
Activity 10.4 Three trainee salespeople were working on a trial basis. Salesperson
A went in the field for 5 days and made a total of 440 sales. Salesperson B was tried
for 7 days and made a total of 630 sales. Salesperson C was tried for 10 days and
made a total of 690 sales. Note that these figures are total sales, not daily averages. The sum of the squares of all 22 daily sales figures, Σ xi², is 146,840.
(a) Construct a one-way ANOVA table. (You may exclude the p-value.)
(b) Would you say there is a difference between the mean daily sales of the three salespeople? Justify your answer.
(c) Construct a 95% confidence interval for the mean difference between salesperson
B and salesperson C. Would you say there is a difference?
Solution
(a) The means are 440/5 = 88, 630/7 = 90 and 690/10 = 69. We will perform a
one-way ANOVA. First, we calculate the overall mean. This is:
(440 + 630 + 690)/22 = 80.
We can now calculate the sum of squares between salespeople. This is:
5 × (88 − 80)² + 7 × (90 − 80)² + 10 × (69 − 80)² = 320 + 700 + 1210 = 2230.
The total sum of squares is 146,840 − 22 × 80² = 6,040, so the error sum of squares is 6,040 − 2,230 = 3,810. Hence:
Source DF SS MS F p-value
Salesperson 2 2230 1115 5.56 ≈ 0.01
Error 19 3810 200.53
Total 21 6040
(b) As 5.56 > 3.52 = F0.05, 2, 19 , which is the top 5th percentile of the F2, 19
distribution (interpolated from Table 12 of the New Cambridge Statistical
Tables), we reject H0 : µ1 = µ2 = µ3 and conclude that there is evidence that
the means are not equal.
(c) We have:
90 − 69 ± 2.093 × √(200.53 × (1/7 + 1/10)) = 21 ± 14.61.
Here 2.093 is the top 2.5th percentile point of the t distribution with 19 degrees of freedom. A 95% confidence interval is (6.39, 35.61). As zero is not included, there is evidence of a difference.
Activity 10.5 The total times spent by three basketball players on court were
recorded. Player A was recorded on three occasions and the times were 29, 25 and 33
minutes. Player B was recorded twice and the times were 16 and 30 minutes. Player
C was recorded on three occasions and the times were 12, 14 and 16 minutes. Use
analysis of variance to test whether there is any difference in the average times the
three players spend on court.
Solution
We have x̄·A = 29, x̄·B = 23, x̄·C = 14 and x̄ = 21.875. Hence:
Source DF SS MS F p-value
Players 2 340.875 170.4375 6.175 ≈ 0.045
Error 5 138 27.6
Total 7 478.875
We test H0 : µ1 = µ2 = µ3 (i.e. the average times they play are the same) vs. H1 :
The average times they play are not the same.
As 6.175 > 5.79 = F0.05, 2, 5 , which is the top 5th percentile of the F2, 5 distribution,
we reject H0 and conclude that there is evidence of a difference between the means.
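The same conclusion can be checked in R; a sketch using the recorded times:
> # One-way ANOVA for the players' times on court (sketch)
> time <- c(29, 25, 33, 16, 30, 12, 14, 16)
> player <- factor(c("A", "A", "A", "B", "B", "C", "C", "C"))
> anova(lm(time ~ player))   # F = 6.175 on (2, 5) df, p-value approx 0.045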
Activity 10.6 Three independent random samples were taken. Sample A consists
of 4 observations taken from a normal distribution with mean µA and variance σ 2 ,
sample B consists of 6 observations taken from a normal distribution with mean µB
and variance σ 2 , and sample C consists of 5 observations taken from a normal
distribution with mean µC and variance σ 2 .
The average value of the first sample was 24, the average value of the second sample
was 20, and the average value of the third sample was 18. The sum of the squared
observations (all of them) was 6,722.4. Test the hypothesis:
H0 : µA = µB = µC
against the alternative that this is not so.
Solution
We will perform a one-way ANOVA. First we calculate the overall mean:
(4 × 24 + 6 × 20 + 5 × 18)/15 = 20.4.
We can now calculate the sum of squares between groups:
4 × (24 − 20.4)² + 6 × (20 − 20.4)² + 5 × (18 − 20.4)² = 81.6.
Source DF SS MS F p-value
Sample 2 81.6 40.8 1.229 ≈ 0.327
Error 12 398.4 33.2
Total 14 480
As 1.229 < 3.89 = F0.05, 2, 12 , which is the top 5th percentile of the F2, 12 distribution,
we see that there is no evidence that the means are not equal.
Activity 10.7
(a) Based on these data, can we infer at the 5% significance level that the
population mean expenditures on prepared frozen meals are the same for the
three different income groups?
(c) Construct 95% confidence intervals for the mean expenditures of the first (under
$15,000) and the third (over $30,000) income groups.
Solution
(a) The test statistic value is:
f = [b/(k − 1)] / [w/(n − k)] = (797.33/2) / (1120.67/12) = 4.269.
Source DF SS MS F P
Income 2 797.33 398.67 4.269 <0.05
Error 12 1120.67 93.39
Total 14 1918.00
Since 4.269 > 3.89 = F0.05, 2, 12, we reject H0 at the 5% significance level and conclude that the mean expenditures are not all the same across the three income groups.
Activity 10.8 Does the level of success of publicly-traded companies affect the way
their board members are paid? The annual payments (in $000s) of randomly selected
publicly-traded companies to their board members were recorded. The companies
were divided into four quarters according to the returns in their stocks, and the
payments from each quarter were grouped together. Some summary statistics are
provided below.
Descriptive Statistics: 1st quarter, 2nd quarter, 3rd quarter, 4th quarter
(a) Can we infer that the amount of payment differs significantly across the four
groups of companies?
(b) Construct 95% confidence intervals for the mean payment of the 1st quarter
companies and the 4th quarter companies.
Solution
(b) Using the pooled standard deviation S = 15.09 and nj = 30 observations per group, the 95% confidence intervals are:
X̄·j ± t0.025, n−k × S/√nj = X̄·j ± t0.025, 116 × 15.09/√30 = X̄·j ± 5.46.
Activity 10.9
(a) Find the missing values A1, A2, A3 and A4 in the one-way ANOVA table below.
Source DF SS MS F P
Factor A1 45496 15165 A4 0.000
Error 275 A2 A3
Total 278 329896
(b) Test whether there are differences in mean test scores between children whose
parents have different highest education levels.
(c) State the required model conditions for the inference conducted in (b).
Solution
(a) A1 = 278 − 275 = 3, A2 = 329896 − 45496 = 284400, A3 = 284400/275 = 1034.2 and A4 = 15165/1034.2 = 14.66.
(b) Since the p-value of the F test is 0.000, there exists strong evidence indicating
that the mean test scores are different for children whose parents have different
highest education levels.
(c) We need to assume that we have independent observations Xij ∼ N (µj , σ 2 ) for
i = 1, . . . , nj and j = 1, . . . , k.
Solution
We need to calculate the following:
X̄A = (1/3) Σi XiA,   X̄B = (1/4) Σi XiB,   X̄C = (1/5) Σi XiC,   X̄D = (1/3) Σi XiD
and:
X̄ = (Σi XiA + Σi XiB + Σi XiC + Σi XiD)/15.
Alternatively:
X̄ = (3X̄A + 4X̄B + 5X̄C + 3X̄D)/15.
We then need the between-groups sum of squares:
b = 3(X̄A − X̄)² + 4(X̄B − X̄)² + 5(X̄C − X̄)² + 3(X̄D − X̄)²
and the within-groups sum of squares:
w = Σi (XiA − X̄A)² + Σi (XiB − X̄B)² + Σi (XiC − X̄C)² + Σi (XiD − X̄D)².
Alternatively, we could calculate only one of the two, and calculate the total sum of squares (TSS):
TSS = Σi (XiA − X̄)² + Σi (XiB − X̄)² + Σi (XiC − X̄)² + Σi (XiD − X̄)²
and use the relationship TSS = B + W to calculate the other. We then construct the ANOVA table:
Source DF SS MS F
Factor 3 b b/3 11b/3w
Error 11 w w/11
Total 14 b+w
At the 100α% significance level, we then compare f = 11b/3w to Fα, 3, 11 using Table
12 of the New Cambridge Statistical Tables. For α = 0.01, we will reject the null
hypothesis that there is no difference if f > 6.22.
An equivalent formulation of the one-way ANOVA model is:
Xij = µ + βj + εij   for i = 1, . . . , nj and j = 1, . . . , k
where εij ∼ N(0, σ²) and the εij s are independent. µ is the average effect and βj is the factor (or treatment) effect at the jth level. Note that Σj βj = 0. The null hypothesis (i.e. that the group means are all equal) can also be expressed as:
H0: β1 = · · · = βk = 0.
10.7 Two-way analysis of variance
In two-way ANOVA, each observation is classified by two factors, a block (row) factor and a treatment (column) factor, and the model is:
Xij = µ + γi + βj + εij   for i = 1, . . . , r and j = 1, . . . , c
where the εij s are independent N(0, σ²) error terms, γi is the ith block (row) effect and βj is the jth treatment (column) effect. In total, there are n = r × c observations. We now consider the conditions to make the parameters µ, γi and βj identifiable for i = 1, . . . , r and j = 1, . . . , c. The conditions are:
γ1 + · · · + γr = 0   and   β1 + · · · + βc = 0.
The total variation is a measure of the overall (total) variability in the data and the
(two-way) ANOVA decomposition decomposes this into three components:
between-blocks variation (which is attributable to the row factor level),
between-treatments variation (which is attributable to the column factor level) and
residual variation (which is attributable to the variation not explained by the row and
column factors).
Row sample means: X̄i· = Σj Xij/c, for i = 1, . . . , r.
Column sample means: X̄·j = Σi Xij/r, for j = 1, . . . , c.
Overall sample mean: X̄ = Σi Σj Xij/n = Σi X̄i·/r = Σj X̄·j/c.
Total SS = Σi Σj Xij² − rcX̄².
Between-blocks (rows) variation: Brow = c Σi X̄i·² − rcX̄².
Between-treatments (columns) variation: Bcol = r Σj X̄·j² − rcX̄².
Residual SS = (Total SS) − Brow − Bcol = Σi Σj Xij² − c Σi X̄i·² − r Σj X̄·j² + rcX̄².
In order to test the 'no block (row) effect' hypothesis of H0: γ1 = · · · = γr = 0, the test statistic is defined as:
F = [Brow/(r − 1)] / [(Residual SS)/((r − 1)(c − 1))] = (c − 1)Brow/Residual SS.
We reject H0 at the 100α% significance level if f > Fα, r−1, (r−1)(c−1), where Fα, r−1, (r−1)(c−1) is the top 100αth percentile of the Fr−1, (r−1)(c−1) distribution, i.e. P(F > Fα, r−1, (r−1)(c−1)) = α, and f is the observed test statistic value.
The p-value of the test is:
p-value = P(F > f).
In order to test the ‘no treatment (column) effect’ hypothesis of H0 : β1 = · · · = βc = 0,
the test statistic is defined as:
F = [Bcol/(c − 1)] / [(Residual SS)/((r − 1)(c − 1))] = (r − 1)Bcol/Residual SS.
As with one-way ANOVA, two-way ANOVA results are presented in a table as follows:
Source          DF               SS            MS                             F                          p-value
Row factor      r − 1            Brow          Brow/(r − 1)                   (c − 1)Brow/Residual SS    p
Column factor   c − 1            Bcol          Bcol/(c − 1)                   (r − 1)Bcol/Residual SS    p
Residual        (r − 1)(c − 1)   Residual SS   Residual SS/((r − 1)(c − 1))
Total           rc − 1           Total SS
Activity 10.11 Four suppliers were asked to quote prices for seven different building materials. The average quote of supplier A was 1,315.8. The average quotes of suppliers B, C and D were 1,238.4, 1,225.8 and 1,200.0, respectively. The following is the calculated two-way ANOVA table with some entries missing.
Source DF SS MS F p-value
Materials 17800
Suppliers
Error
Total 358700
(c) Construct a 90% confidence interval for the difference between suppliers A and
D. Would you say there is a difference?
Solution
Source DF SS MS F p-value
Materials 6 106800 17800 1.604 ≈ 0.203
Suppliers 3 52148.88 17382.96 1.567 ≈ 0.232
Error 18 199751.12 11097.28
Total 27 358700
(c) The top 5th percentile of the t distribution with 18 degrees of freedom is 1.734 and the MS value is 11097.28. So a 90% confidence interval is:
1315.8 − 1200 ± 1.734 × √(11097.28 × (1/7 + 1/7)) = 115.8 ± 97.64
giving (18.16, 213.44). Since zero is not in the interval, there appears to be a difference between suppliers A and D.
Activity 10.12 The effects of five different beers (A to E) on four different drinkers were measured. The following is the calculated two-way ANOVA table with some entries missing.
(a) Complete the ANOVA table.
Source DF SS MS F p-value
Drinker 1.56
Beer 303.5
Error 695.6
Total
(b) Is there a significant difference between the effects of different beers? What
about different drinkers?
(c) Construct a 90% confidence interval for the difference between the effects of
beers C and D. Would you say there is a difference?
Solution
(a) We have:
Source DF SS MS F p-value
Drinker 3 271.284 90.428 1.56 ≈ 0.250
Beer 4 1214 303.5 5.236 ≈ 0.011
Error 12 695.6 57.967
Total 19 2180.884
(b) For beers, the F value is 5.236 and the critical value at the 5% significance level is F0.05, 4, 12 = 3.26, so since 5.236 > 3.26 we reject H0 and conclude that there is evidence of a difference between the effects of the beers. For drinkers, we test whether there is a difference between the effects on different drinkers. The F value is 1.56 and at a 5% significance level the critical value from Table 12 is F0.05, 3, 12 = 3.49, so since 1.56 < 3.49 we fail to reject H0 and conclude that there is no evidence of a difference.
(c) The top 5th percentile of the t distribution with 12 degrees of freedom is 1.782. So a 90% confidence interval is:
95.75 − 79.25 ± 1.782 × √(57.967 × (1/4 + 1/4)) = 16.5 ± 9.59
giving (6.91, 26.09). As the interval does not contain zero, there is evidence of a difference between the effects of beers C and D.
Activity 10.13 The following table records the output of three shifts (early, late and night) at each of five plants (A to E). Use two-way ANOVA to test whether there is a shift effect and whether there is a plant effect.
A B C D E
Early shift 102 93 85 110 72
Late shift 85 87 71 92 73
Night shift 75 80 75 77 76
Solution
Here r = 3 and c = 5. We may obtain the two-way ANOVA table as follows:
Source DF SS MS F
Shift 2 652.13 326.07 5.62
Plant 4 761.73 190.43 3.28
Error 8 463.87 57.98
Total 14 1877.73
Under the null hypothesis of no shift effect, F ∼ F2, 8 . Since F0.05, 2, 8 = 4.46 < 5.62,
we can reject the null hypothesis at the 5% significance level. (Note the p-value is
0.030.)
Under the null hypothesis of no plant effect, F ∼ F4, 8 . Since F0.05, 4, 8 = 3.84 > 3.28,
we cannot reject the null hypothesis at the 5% significance level. (Note the p-value is
0.072.)
Overall, the data collected show some evidence of a shift effect but little evidence of
a plant effect.
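A sketch of the same analysis in R, entering the table row by row:
> # Two-way ANOVA for the shift/plant data (sketch)
> output <- c(102, 93, 85, 110, 72,   # early shift
+              85, 87, 71,  92, 73,   # late shift
+              75, 80, 75,  77, 76)   # night shift
> shift <- factor(rep(c("Early", "Late", "Night"), each = 5))
> plant <- factor(rep(c("A", "B", "C", "D", "E"), times = 3))
> anova(lm(output ~ shift + plant))   # reproduces the table above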
Activity 10.14 Complete the two-way ANOVA table below. For the p-values, give a bound in a form such as '< 0.01', using the closest value which you can find in the New Cambridge Statistical Tables.
Source DF SS MS F p-value
Row factor 4 ? 234.23 ? ?
Column factor 6 270.84 45.14 1.53 ?
Residual ? 708.00 ?
Total 34 1915.76
Solution
First, row factor SS = (row factor MS)×4 = 936.92.
The degrees of freedom for residual is 34 − 4 − 6 = 24. Therefore, residual MS
= 708.00/24 = 29.5.
Hence the F statistic for testing no row factor effect is 234.23/29.5 = 7.94. From
Table 12 of the New Cambridge Statistical Tables, F0.001, 4, 24 = 6.59 < 7.94.
Therefore, the corresponding p-value is smaller than 0.001.
Since F0.05, 6, 24 = 2.51 > 1.53, the p-value for testing the column factor effect is
greater than 0.05.
The complete ANOVA table is as follows:
Source DF SS MS F p-value
Row factor 4 936.92 234.23 7.94 < 0.001
Column factor 6 270.84 45.14 1.53 > 0.05
Residual 24 708.00 29.50
Total 34 1915.76
Activity 10.15 The following table shows the audience shares (in %) of three
major networks’ evening news broadcasts in five major cities, with one observation
per cell so that there are 15 observations. Construct the two-way ANOVA table for
these data (without the p-value column). Is either factor statistically significant at
the 5% significance level?
Solution
We have r = 5 and c = 3.
The row sample means are calculated using X̄i· = Σj Xij/c, which gives 19.77, 19.40, 19.87, 20.90 and 22.50 for i = 1, 2, 3, 4, 5, respectively.
The column sample means are calculated using X̄·j = Σi Xij/r, which gives 22.28, 17.34 and 21.84 for j = 1, 2, 3, respectively.
The overall sample mean is:
x̄ = Σi x̄i·/r = 20.49.
Hence:
Total SS = Σi Σj xij² − rcx̄² = 6441.99 − 15 × (20.49)² = 6441.99 − 6297.60 = 144.39.
brow = c Σi x̄i·² − rcx̄² = 3 × 2104.83 − 6297.60 = 16.88.
bcol = r Σj x̄·j² − rcx̄² = 5 × 1274.06 − 6297.60 = 72.70.
Source DF SS MS F
City 4 16.88 4.22 0.61
Network 2 72.70 36.35 5.31
Residual 8 54.81 6.85
Total 14 144.39
Since 0.61 < 3.84 = F0.05, 4, 8, the city factor is not statistically significant at the 5% significance level, whereas since 5.31 > 4.46 = F0.05, 2, 8, the network factor is statistically significant at the 5% significance level.
10.8 Residuals
Before considering an example of two-way ANOVA, we briefly consider residuals.
Recall the original two-way ANOVA model:
Xij = µ + γi + βj + εij .
We now decompose the observations as follows:
Xij = X̄ + (X̄i· − X̄) + (X̄·j − X̄) + (Xij − X̄i· − X̄·j + X̄)
for i = 1, . . . , r and j = 1, . . . , c, where we have the following point estimators.
µ̂ = X̄ is the point estimator of µ.
γ̂i = X̄i· − X̄ is the point estimator of γi, for i = 1, . . . , r.
β̂j = X̄·j − X̄ is the point estimator of βj, for j = 1, . . . , c.
Example 10.6 The following table lists the percentage annual returns (calculated
four times per annum) of the Common Stock Index at the New York Stock
Exchange during 1981–85.
Using two-way ANOVA, we test the no row effect hypothesis to answer (a), and test
the no column effect hypothesis to answer (b). We have r = 5 and c = 4.
The row sample means are calculated using X̄i· = Σj Xij/c, which gives 6.375, 6.375, 4.4, 4.6 and 4.1, for i = 1, . . . , 5, respectively.
The column sample means are calculated using X̄·j = Σi Xij/r, which gives 5.34, 5.24, 5.22 and 4.88, for j = 1, . . . , 4, respectively.
The overall sample mean is x̄ = Σi x̄i·/r = 5.17.
The sum of the squared observations is Σi Σj xij² = 559.06.
brow = c Σi x̄i·² − rcx̄² = 4 × 138.6112 − 534.578 = 19.867.
bcol = r Σj x̄·j² − rcx̄² = 5 × 107.036 − 534.578 = 0.602.
The residual sum of squares is therefore 559.06 − 534.578 − 19.867 − 0.602 = 4.013.
To test the no row effect hypothesis H0: γ1 = · · · = γ5 = 0, the test statistic value is:
f = (c − 1)brow/Residual SS = (3 × 19.867)/4.013 = 14.852.
Under H0, F ∼ Fr−1, (r−1)(c−1) = F4, 12. Using Table 12(d) of the New Cambridge Statistical Tables, since F0.01, 4, 12 = 5.412 < 14.852, we reject H0 at the 1% significance level. We conclude that there is strong evidence that the return does depend on the year. Similarly, for the no column effect hypothesis H0: β1 = · · · = β4 = 0, the test statistic value is f = (r − 1)bcol/Residual SS = (4 × 0.602)/4.013 = 0.600, which is not statistically significant.
Source DF SS MS F p-value
Year 4 19.867 4.967 14.852 < 0.01
Quarter 3 0.602 0.201 0.600 > 0.10
Residual 12 4.013 0.334
Total 19 24.482
We could also provide 95% confidence interval estimates for each block and treatment level by using the pooled estimator of σ², which is:
S² = Residual SS/((r − 1)(c − 1)) = Residual MS.
Response: Return
Df Sum Sq Mean Sq F value Pr(>F)
Year 4 19.867 4.9667 14.852 0.0001349 ***
Quarter 3 0.602 0.2007 0.600 0.6271918
Residuals 12 4.013 0.3344
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that the confidence intervals for years 1 and 2 (corresponding to 1981 and
1982) are separated from those for years 3 to 5 (that is, 1983 to 1985), which is
consistent with rejection of H0 in the no row effect test. In contrast, the confidence
intervals for each quarter all overlap, which is consistent with our failure to reject H0
in the no column effect test.
Finally, we may also look at the residuals:
ε̂ij = Xij − µ̂ − γ̂i − β̂j   for i = 1, . . . , r and j = 1, . . . , c.
If the assumed normal model (structure) is correct, the ε̂ij s should behave like independent N(0, σ²) random variables.
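A sketch of these estimators in R, for data held in a hypothetical r × c matrix X with one observation per cell:
> # Point estimators and residuals for two-way ANOVA (sketch; X is hypothetical)
> X <- matrix(rnorm(20, mean = 5), nrow = 5, ncol = 4)
> mu.hat <- mean(X)                    # estimate of mu
> gamma.hat <- rowMeans(X) - mu.hat    # block (row) effects
> beta.hat <- colMeans(X) - mu.hat     # treatment (column) effects
> resid <- X - mu.hat - outer(gamma.hat, rep(1, 4)) - outer(rep(1, 5), beta.hat)
> resid   # should resemble independent N(0, sigma^2) noise if the model is correct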
1. Three call centre workers were being monitored for the average number of calls
they answer per daily shift. Worker A answered a total of 187 calls in 4 days.
Worker B answered a total of 347 calls in 6 days. Worker C answered a total of 461
calls in 10 days. Note that these figures are total calls, not daily averages. The sum of the squares of all 20 daily call counts, Σ xi², is 50,915.
(a) Construct a one-way analysis of variance table. (You may exclude the p-value.)
(b) Would you say there is a difference between the average daily calls answered of
the three workers? Justify your answer using a 5% significance level.
2. The audience shares (in %) of three major television networks’ evening news
broadcasts in four major cities were examined. The average audience share for the
three networks (A, B and C) were 21.35%, 17.28% and 20.18%, respectively. The
following is the calculated ANOVA table with some entries missing.
Source Degrees of freedom Sum of squares Mean square F -value
City 1.95
Network
Error
Total 51.52
Appendix A
Linear regression (non-examinable)
By the end of this appendix, you should be able to:
derive from first principles the least squares estimators of the intercept and slope in
the simple linear regression model
explain how to construct confidence intervals and perform hypothesis tests for the
intercept and slope in the simple linear regression model
summarise the multiple linear regression model with several explanatory variables,
and explain its interpretation
A.3 Introduction
Regression analysis is one of the most frequently-used statistical techniques. It aims
to model an explicit relationship between one dependent variable, often denoted as y,
and one or more regressors (also called covariates, or independent variables), often
denoted as x1 , . . . , xp .
The goal of regression analysis is to understand how y depends on x1 , . . . , xp and to
predict or control the unobserved y based on the observed x1 , . . . , xp . We start with
some simple examples with p = 1.
y = β0 + β1 x + ε
where ε stands for a random error term, β0 is the intercept and β1 is the slope of the
straight line.
Example 1.2 Data were collected on the heights, x, and weights, y, of 69 students
in a class.
We plot y against x, and draw a straight line through the middle of the data cloud:
y = β0 + β1 x + ε
where ε stands for a random error term, β0 is the intercept and β1 is the slope of the
straight line.
For a given height, x, the predicted value ŷ = β̂0 + β̂1x may be viewed as a kind of 'standard weight'.
Example 1.3 Some other possible examples of y and x are shown in the following
table.
y x
Sales Price
Weight gain Protein in diet
Present FTSE 100 index Past FTSE 100 index
Consumption Income
Salary Tenure
Daughter’s height Mother’s height
In most cases, there are several x variables involved. We will consider such situations
later in this chapter.
How to draw a line through data clouds, i.e. how to estimate β0 and β1 ?
How accurate is the fitted line?
What is the error in predicting a future y?
The LSEs minimise L(β0, β1) = Σi (yi − β0 − β1xi)². Firstly:
∂L(β0, β1)/∂β0 = −2 Σi (yi − β0 − β1xi).
Secondly:
∂L(β0, β1)/∂β1 = −2 Σi xi (yi − β0 − β1xi).
Hence:
β̂1 = Σi xi (yi − ȳ) / Σi xi (xi − x̄) = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²   and   β̂0 = ȳ − β̂1x̄.
The estimator β̂1 above is based on the fact that for any constant c, we have:
Σi xi (yi − ȳ) = Σi (xi − c)(yi − ȳ)
since:
Σi c(yi − ȳ) = c Σi (yi − ȳ) = 0.
Given that Σi (xi − x̄) = 0, it follows that c Σi (xi − x̄) = 0 for any constant c.
An alternative derivation is as follows. Note L(β0, β1) = Σi (yi − β0 − β1xi)². For any β0 and β1, we have:
L(β0, β1) = Σi [yi − β̂0 − β̂1xi + (β̂0 − β0) + (β̂1 − β1)xi]²
= L(β̂0, β̂1) + Σi [(β̂0 − β0) + (β̂1 − β1)xi]² + 2B    (A.1)
where:
B = Σi [(β̂0 − β0) + (β̂1 − β1)xi] (yi − β̂0 − β̂1xi)
= (β̂0 − β0) Σi (yi − β̂0 − β̂1xi) + (β̂1 − β1) Σi xi (yi − β̂0 − β̂1xi).
Now let β̂0, β̂1 be the solution to the equations:
Σi (yi − β̂0 − β̂1xi) = 0   and   Σi xi (yi − β̂0 − β̂1xi) = 0.    (A.2)
Then B = 0 and, by (A.1), L(β0, β1) ≥ L(β̂0, β̂1) for all (β0, β1), so the solution of (A.2) does minimise L.
The first equation gives β̂0 = ȳ − β̂1x̄. Substituting this into the second equation, we have:
0 = Σi xi (yi − ȳ − β̂1(xi − x̄)) = Σi xi (yi − ȳ) − β̂1 Σi xi (xi − x̄).
Therefore:
β̂1 = Σi xi (yi − ȳ) / Σi xi (xi − x̄) = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)².
We now explore the properties of the LSEs β̂0 and β̂1, showing that their means and variances are:
E(β̂0) = β0   and   Var(β̂0) = σ² Σi xi² / (n Σi (xi − x̄)²)
and:
E(β̂1) = β1   and   Var(β̂1) = σ² / Σi (xi − x̄)².
Proof: Recall we treat the xi s as constants, and we have E(yi) = β0 + β1xi and also Var(yi) = σ². Hence:
E(ȳ) = E((1/n) Σi yi) = (1/n) Σi E(yi) = (1/n) Σi (β0 + β1xi) = β0 + β1x̄.
Therefore:
E(yi − ȳ) = β0 + β1xi − (β0 + β1x̄) = β1(xi − x̄).
Consequently, we have:
E(β̂1) = E[Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²] = Σi (xi − x̄) E(yi − ȳ) / Σi (xi − x̄)² = β1 Σi (xi − x̄)² / Σi (xi − x̄)² = β1.
Now:
E(β̂0) = E(ȳ − β̂1x̄) = β0 + β1x̄ − β1x̄ = β0.
Therefore, the LSEs β̂0 and β̂1 are unbiased estimators of β0 and β1, respectively.
To work out the variances, the key is to write β̂1 and β̂0 as linear estimators (i.e. linear combinations of the yi s):
β̂1 = Σi (xi − x̄)(yi − ȳ) / Σk (xk − x̄)² = Σi (xi − x̄)yi / Σk (xk − x̄)² = Σi ai yi
where ai = (xi − x̄)/Σk (xk − x̄)², and:
β̂0 = ȳ − β̂1x̄ = ȳ − Σi ai x̄ yi = Σi (1/n − ai x̄) yi.
Note that:
Σi ai = 0   and   Σi ai² = 1 / Σk (xk − x̄)².
Now we note the following lemma, without proof. Let y1, . . . , yn be uncorrelated random variables, and b1, . . . , bn be constants, then:
Var(Σi bi yi) = Σi bi² Var(yi).
By this lemma:
Var(β̂1) = Var(Σi ai yi) = σ² Σi ai² = σ² / Σk (xk − x̄)²
and:
Var(β̂0) = σ² Σi (1/n − ai x̄)² = σ² (1/n + x̄² Σi ai²) = (σ²/n)(1 + nx̄² / Σk (xk − x̄)²) = σ² Σk xk² / (n Σk (xk − x̄)²).
Now assume additionally that ε1, . . . , εn ∼ IID N(0, σ²). Then:
yi ∼ N(β0 + β1xi, σ²).
Since any linear combination of normal random variables is also normal, the LSEs of β0 and β1 (as linear estimators) are also normal random variables. In fact:
β̂0 ∼ N(β0, σ² Σi xi² / (n Σi (xi − x̄)²))   and   β̂1 ∼ N(β1, σ² / Σi (xi − x̄)²).
Replacing σ² by the estimator σ̂² (see below), the estimated standard errors are:
E.S.E.(β̂0) = σ̂ × √[Σi xi² / (n Σi (xi − x̄)²)]   and   E.S.E.(β̂1) = σ̂ / √[Σi (xi − x̄)²].
i=1
The following results all make use of distributional results introduced earlier in the
course. Statistical inference (confidence intervals and hypothesis testing) for the normal
simple linear regression model can then be performed.
i. We have:
(n − 2)σ̂²/σ² = Σi (yi − β̂0 − β̂1xi)²/σ² ∼ χ²n−2.
ii. We have:
(β̂0 − β0)/E.S.E.(β̂0) ∼ tn−2.
iii. We have:
(β̂1 − β1)/E.S.E.(β̂1) ∼ tn−2.
Hence a 100(1 − α)% confidence interval for β1 is β̂1 ± tα/2, n−2 × E.S.E.(β̂1), where tα, k denotes the top 100αth percentile of the Student's tk distribution, obtained from Table 10 of the New Cambridge Statistical Tables.
To test H0: β1 = 0 vs. H1: β1 ≠ 0, we use the test statistic:
T = β̂1/E.S.E.(β̂1) ∼ tn−2   under H0.
At the 100α% significance level, we reject H0 if |t| > tα/2, n−2 , where t is the observed
test statistic value.
Alternatively, we could use H1 : β1 < 0 or H1 : β1 > 0 if there was a rationale for
doing so. In such cases, we would reject H0 if t < −tα, n−2 and t > tα, n−2 for the
lower-tailed and upper-tailed t tests, respectively.
i. For testing H0: β1 = b for a given constant b, the above test still applies, but now with the following test statistic:
T = (β̂1 − b)/E.S.E.(β̂1).
ii. Tests for the regression intercept β0 may be constructed in a similar manner, replacing β1 and β̂1 with β0 and β̂0, respectively.
In the normal regression model, the LSEs β̂0 and β̂1 are also the MLEs of β0 and β1, respectively. Since εi = yi − β0 − β1xi ∼ IID N(0, σ²), the likelihood function is:
L(β0, β1, σ²) = Πi (2πσ²)^{−1/2} exp(−(yi − β0 − β1xi)²/(2σ²)) ∝ (1/σ²)^{n/2} exp(−Σi (yi − β0 − β1xi)²/(2σ²)).
For any fixed σ², maximising this over (β0, β1) amounts to minimising Σi (yi − β0 − β1xi)², which is exactly the least squares criterion. Hence (β̂0, β̂1) are the MLEs of (β0, β1).
To find the MLE of σ², set u = 1/σ²; up to an additive constant, twice the log-likelihood at (β̂0, β̂1) is:
g(u) = n log u − ub
where b = Σi (yi − β̂0 − β̂1xi)². Setting g′(u) = n/u − b = 0 gives û = n/b, so the MLE of σ² is b/n (in contrast to the unbiased estimator σ̂², which divides b by n − 2).
Example 1.4 A dataset contains the annual cigarette consumption, x, and the
corresponding mortality rate, y, due to coronary heart disease (CHD) of 21
countries. Some useful summary statistics calculated from the data are:
Σi xi = 45,110,   Σi yi = 3,042.2,   Σi xi² = 109,957,100,
Σi yi² = 529,321.58   and   Σi xi yi = 7,319,602
where all sums run over i = 1, . . . , 21.
i=1 i=1
Do these data support the suspicion that smoking contributes to CHD mortality?
(Note the assertion ‘smoking is harmful for health’ is largely based on statistical,
rather than laboratory, evidence.)
We fit the regression model y = β0 + β1 x + ε. Our least squares estimates of β1 and
β0 are, respectively:
β̂1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)² = (Σi xi yi − Σi xi Σi yi/n) / (Σi xi² − (Σi xi)²/n) = 0.06
and β̂0 = ȳ − β̂1x̄ = 15.77, with residual variance estimate:
σ̂² = Σi (yi − β̂0 − β̂1xi)²/(n − 2) = 2181.66.
We now proceed to test H0: β1 = 0 vs. H1: β1 > 0. (If indeed smoking contributes to CHD mortality, then β1 > 0.)
We have calculated β̂1 = 0.06. However, is this deviation from zero due to sampling error, or is it significantly different from zero? (The magnitude of β̂1 itself is not important in determining if β1 = 0 or not – changing the scale of x may make β̂1 arbitrarily small.)
Under H0, the test statistic is:
T = β̂1/E.S.E.(β̂1) ∼ tn−2 = t19
where E.S.E.(β̂1) = σ̂/√[Σi (xi − x̄)²] = 0.01293.
Since t = 0.06/0.01293 = 4.64 > 2.54 = t0.01, 19 , we reject the hypothesis β1 = 0 at
the 1% significance level and we conclude that there is strong evidence that smoking
contributes to CHD mortality.
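All of these figures can be reproduced in R from the summary statistics alone; a sketch:
> # CHD regression test from summary statistics (sketch)
> n <- 21
> Sxx <- 109957100 - 45110^2/n               # sum of (x - xbar)^2
> Sxy <- 7319602 - 45110*3042.2/n            # sum of (x - xbar)(y - ybar)
> b1 <- Sxy/Sxx                              # slope estimate, approx 0.06
> rss <- (529321.58 - 3042.2^2/n) - b1*Sxy   # residual sum of squares
> ese <- sqrt(rss/(n - 2))/sqrt(Sxx)         # E.S.E., approx 0.01293
> b1/ese                                     # t statistic, approx 4.64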
Total SS is Σi (yi − ȳ)² = Σi yi² − nȳ².
Regression (explained) SS is β̂1² Σi (xi − x̄)² = β̂1² (Σi xi² − nx̄²).
Residual (error) SS is Σi (yi − β̂0 − β̂1xi)² = Total SS − Regression SS.
Under H0: β1 = 0, it can be shown that:
β̂1² Σi (xi − x̄)²/σ² ∼ χ²1   and   Σi (yi − β̂0 − β̂1xi)²/σ² ∼ χ²n−2
independently, so the test statistic is:
F = Regression SS / [Residual SS/(n − 2)] ∼ F1, n−2.
i=1
We reject H0 at the 100α% significance level if f > Fα, 1, n−2 , where f is the observed
test statistic value and Fα, 1, n−2 is the top 100αth percentile of the F1, n−2 distribution,
obtained from Table 12 of the New Cambridge Statistical Tables.
A.8 Confidence intervals for E(y)
For a given value x of the explanatory variable, the fitted line provides the point prediction:
ŷ = β̂0 + β̂1x.
For the analysis to be more informative, we would like to have some 'error bars' for our prediction. We introduce two methods as follows.
Write µ(x) = E(y) = β0 + β1x for the mean response at x, estimated by µ̂(x) = β̂0 + β̂1x. Standardising gives:
(µ̂(x) − µ(x)) / √[(σ²/n) Σi (xi − x)² / Σj (xj − x̄)²] ∼ N(0, 1)
and, replacing σ² by its estimator σ̂²:
(µ̂(x) − µ(x)) / √[(σ̂²/n) Σi (xi − x)² / Σj (xj − x̄)²] ∼ tn−2.
Hence a 100(1 − α)% confidence interval for E(y) = µ(x) is:
µ̂(x) ± tα/2, n−2 × σ̂ × √[(1/n) Σi (xi − x)² / Σj (xj − x̄)²].
Such a confidence interval contains the true expectation E(y) = µ(x) with probability
1 − α over repeated samples. It does not cover y with probability 1 − α.
A.9 Prediction intervals for y
Now consider predicting the unobserved response y itself at x, again using µ̂(x). Since y is independent of the observations used to compute µ̂(x), the variance of the prediction error is:
Var(y − µ̂(x)) = Var(y) + Var(µ̂(x)) = σ² + (σ²/n) Σi (xi − x)² / Σj (xj − x̄)².
Therefore:
(y − µ̂(x)) / √[σ̂² (1 + (1/n) Σi (xi − x)² / Σj (xj − x̄)²)] ∼ tn−2.
i. It holds that:
P(y ∈ µ̂(x) ± tα/2, n−2 × σ̂ × √[1 + (1/n) Σi (xi − x)² / Σj (xj − x̄)²]) = 1 − α.
ii. The prediction interval for y is wider than the confidence interval for E(y). The
former contains the unobserved random variable y with probability 1 − α, the
latter contains the unknown constant E(y) with probability 1 − α over repeated
samples.
Example 1.5 A dataset contains the prices (y, in $000s) of 100 three-year-old Ford Tauruses together with their mileages (x, in thousands of miles) when they were sold at auction. Based on these data, a car dealer needs to make two decisions.
1. To prepare cash for bidding on one three-year-old Ford Taurus with a mileage of x = 40.
2. To set a price for several three-year-old Ford Tauruses, each with a mileage of x = 40.
For the first task, a prediction interval would be more appropriate. For the second task, the car dealer needs to know the average price and, therefore, a confidence interval is appropriate. This can be easily done using R.
Call:
lm(formula = Price ~ Mileage)
Residuals:
Min 1Q Median 3Q Max
-0.68679 -0.27263 0.00521 0.23210 0.70071
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.248727 0.182093 94.72 <2e-16 ***
Mileage -0.066861 0.004975 -13.44 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
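The two intervals quoted below can be obtained with predict(); a sketch, assuming the fitted model above has been stored as model <- lm(Price ~ Mileage):
> # Intervals at Mileage = 40 (sketch)
> new <- data.frame(Mileage = 40)
> predict(model, new, interval = "prediction")   # for bidding on one car
> predict(model, new, interval = "confidence")   # for the average price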
We predict that a Ford Taurus will sell for between $13,922 and $15,227. The
average selling price of several three-year-old Ford Tauruses is estimated to be
between $14,498 and $14,650. Because predicting the selling price for one car is more
difficult, the corresponding prediction interval is wider than the confidence interval.
To produce the plots with confidence intervals for E(y) and prediction intervals for
y, we proceed as follows:
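A sketch of one way to do this (the exact plotting code is not shown in the guide; it assumes Price, Mileage and the fitted model object are available):
> # Fitted line with confidence and prediction bands (sketch)
> grid <- data.frame(Mileage = seq(min(Mileage), max(Mileage), length = 100))
> ci <- predict(model, grid, interval = "confidence")
> pi <- predict(model, grid, interval = "prediction")
> plot(Mileage, Price, pch = 16)
> matlines(grid$Mileage, ci, lty = c(1, 2, 2), col = 1)   # fit and confidence bands
> matlines(grid$Mileage, pi[, 2:3], lty = 3, col = 1)     # prediction bands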
(Figure: scatter plot of Price against Mileage with the fitted line, 95% confidence intervals for E(y) and wider 95% prediction intervals for y.)
A.10 Multiple linear regression models
The multiple linear regression model with p explanatory variables is:
yi = β0 + β1 xi1 + · · · + βp xip + εi
where:
E(εi ) = 0, Var(εi ) = σ 2 > 0 and Cov(εi , εj ) = 0 for all i 6= j.
The multiple linear regression model is a natural extension of the simple linear
regression model, just with more parameters: β0 , β1 , . . . , βp and σ 2 .
Treating all of the xij s as constants as before, we have:
E(yi) = β0 + β1xi1 + · · · + βpxip   and   Var(yi) = σ².
Estimation of the intercept and slope parameters is still performed using least squares estimation. The LSEs β̂0, β̂1, . . . , β̂p are obtained by minimising:
Σi (yi − β0 − Σj βj xij)².
Just as with the simple linear regression model, we can decompose the total variation of y such that:
Σi (yi − ȳ)² = Σi (ŷi − ȳ)² + Σi ε̂i²
or, in words:
Total SS = Regression SS + Residual SS.
An unbiased estimator of σ² is:
σ̂² = (1/(n − p − 1)) Σi (yi − β̂0 − Σj β̂j xij)² = Residual SS/(n − p − 1).
To test the significance of an individual regression coefficient, we test:
H0: βi = 0 vs. H1: βi ≠ 0.
Under H0, the test statistic is:
T = β̂i/E.S.E.(β̂i) ∼ tn−p−1
and we reject H0 if |t| > tα/2, n−p−1. However, note the slight difference in the interpretation of the slope coefficient βj: in the multiple regression setting, βj is the effect of xj on y holding all other independent variables fixed – this is unfortunately not always practical.
It is also possible to test whether all the regression coefficients are equal to zero. This is known as a joint test of significance and can be used to test the overall significance of the regression model, i.e. whether there is at least one significant explanatory (independent) variable, by testing:
H0: β1 = · · · = βp = 0 vs. H1: at least one βj ≠ 0.
Indeed, it is preferable to perform this joint test of significance before conducting t tests of individual slope coefficients. Failure to reject H0 would render the model useless and hence the model would not warrant any further statistical investigation.
Provided εi ∼ N(0, σ²), under H0: β1 = · · · = βp = 0, the test statistic is:
F = [(Regression SS)/p] / [(Residual SS)/(n − p − 1)] ∼ Fp, n−p−1.
Here:
Regression SS = Σi (ŷi − ȳ)² = Σi [β̂1(xi1 − x̄1) + · · · + β̂p(xip − x̄p)]².
We now conclude the chapter with worked examples of linear regression using R.
Example 1.6 We illustrate the use of linear regression in R using the dataset
introduced in Example 1.1.
Call:
lm(formula = Sales ~ Student.population)
Residuals:
Min 1Q Median 3Q Max
-21.00 -9.75 -3.00 11.25 18.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.0000 9.2260 6.503 0.000187 ***
Student.population 5.0000 0.5803 8.617 2.55e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Here n = 10 observations were used, so under H0: β1 = 0 the overall F statistic is distributed as F1, 8.
Example 1.7 We apply the simple linear regression model to study the
relationship between two series of financial returns – a regression of Cisco Systems
stock returns, y, on S&P500 Index returns, x. This regression model is an example of
the capital asset pricing model (CAPM).
Stock returns are defined as:
return = (current price − previous price)/previous price ≈ log(current price/previous price)
when the difference between the two prices is small.
A dataset contains daily returns over the period 3 January – 29 December 2000 (i.e.
n = 252 observations). The dataset has 5 columns: Day, S&P500 return, Cisco
return, Intel return and Sprint return.
Daily prices are definitely not independent. However, daily returns may be seen as a
sequence of uncorrelated random variables.
> summary(S.P500)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.00451 -0.85028 -0.03791 -0.04242 0.79869 4.65458
> summary(Cisco)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-13.4387 -3.0819 -0.1150 -0.1336 2.6363 15.4151
For the S&P500, the average daily return is −0.04%, the maximum daily return is 4.65%, the minimum daily return is −6.00% and the standard deviation is 1.40%.
For Cisco, the average daily return is −0.13%, the maximum daily return is 15.42%,
the minimum daily return is −13.44% and the standard deviation is 4.23%.
We see that Cisco is much more volatile than the S&P500.
(Figure: time series plots of the daily returns on the S&P500 Index and Cisco stock.)
There is clear synchronisation between the movements of the two series of returns,
as evident from examining the sample correlation coefficient.
> cor.test(S.P500,Cisco)
Call:
lm(formula = Cisco ~ S.P500)
Residuals:
Min 1Q Median 3Q Max
-13.1175 -2.0238 0.0091 2.0614 9.9491
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.04547 0.19433 -0.234 0.815
S.P500 2.07715 0.13900 14.943 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
iii. Variance is a simple measure of risk in finance, and one of the most frequently used.
> summary(Foods)
LVOL PROMP FEAT DISP
Min. :13.83 Min. :3.075 Min. : 2.84 Min. :12.42
1st Qu.:14.08 1st Qu.:3.330 1st Qu.:15.95 1st Qu.:20.59
Median :14.24 Median :3.460 Median :22.99 Median :25.11
Mean :14.28 Mean :3.451 Mean :24.84 Mean :25.31
3rd Qu.:14.43 3rd Qu.:3.560 3rd Qu.:33.49 3rd Qu.:29.34
Max. :15.07 Max. :3.865 Max. :57.10 Max. :45.94
n = 156. The values of FEAT and DISP are much larger than LVOL.
(Figure: time series plot of LVOL.)
> plot(PROMP,LVOL,pch=16)
(Figure: scatter plot of LVOL against PROMP.)
> plot(FEAT,LVOL,pch=16)
(Figure: scatter plot of LVOL against FEAT.)
> plot(DISP,LVOL,pch=16)
(Figure: scatter plot of LVOL against DISP.)
Call:
lm(formula = LVOL ~ PROMP + FEAT)
Residuals:
Min 1Q Median 3Q Max
-0.32734 -0.08519 -0.01011 0.08471 0.30804
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.1500102 0.2487489 68.94 <2e-16 ***
PROMP -0.9042636 0.0694338 -13.02 <2e-16 ***
FEAT 0.0100666 0.0008827 11.40 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Consider now introducing DISP into the regression model to give three explanatory
variables:
y = β0 + β1 x1 + β2 x2 + β3 x3 + ε.
The reason for adding the third variable is that one would expect DISP to have an
impact on sales and we may wish to estimate its magnitude.
Call:
lm(formula = LVOL ~ PROMP + FEAT + DISP)
Residuals:
Min 1Q Median 3Q Max
-0.33363 -0.08203 -0.00272 0.07927 0.33812
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.2372251 0.2490226 69.220 <2e-16 ***
PROMP -0.9564415 0.0726777 -13.160 <2e-16 ***
FEAT 0.0101421 0.0008728 11.620 <2e-16 ***
DISP 0.0035945 0.0016529 2.175 0.0312 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
All the estimated coefficients have the right sign (according to commercial common
sense!) and are statistically significant. In particular, the relationship with DISP
seems real when the other inputs are taken into account. On the other hand, the
addition of DISP to the model has resulted in a very small reduction in σ̂, from √0.0161 = 0.1268 to √0.0157 = 0.1253, and correspondingly a slightly higher R² (0.7633, i.e. 76.33% of the variation of LVOL is explained by the model). Therefore,
DISP contributes very little to ‘explaining’ the variation of LVOL after the other
two explanatory variables, PROMP and FEAT, are taken into account.
Intuitively, we would expect a higher R2 if we add a further explanatory variable to
the model. However, the model has become more complex as a result – there is an
additional parameter to estimate. Therefore, strictly speaking, we should consider
the ‘adjusted R2 ’ statistic, although this will not be considered in this course.
Appendix B
Non-examinable proofs
First, we show that P(∅) = 0. Applying the countable additivity axiom to the sequence A1 = A2 = · · · = ∅ (whose union is also ∅) gives:
P(∅) = P(∅ ∪ ∅ ∪ · · ·) = Σi P(∅).
However, the only real number for P(∅) which satisfies this is P(∅) = 0.
Next, countable additivity implies finite additivity: for mutually exclusive events A1, . . . , An:
P(A1 ∪ · · · ∪ An) = P(A1) + · · · + P(An).
This follows by setting Ai = ∅ for all i > n, since then:
P(A1 ∪ · · · ∪ An) = P(A1 ∪ A2 ∪ · · ·) = P(A1) + · · · + P(An) + P(∅) + P(∅) + · · · = P(A1) + · · · + P(An)
using P(∅) = 0.
For the geometric distribution with pf p(x) = (1 − π)^x π for x = 0, 1, 2, . . . (counting failures before the first success), the mean is derived as follows:
E(X) = Σx x(1 − π)^x π
(starting from x = 1)  = Σx≥1 x(1 − π)^x π
= (1 − π) Σx≥1 x(1 − π)^{x−1} π
(using y = x − 1)  = (1 − π) Σy≥0 (y + 1)(1 − π)^y π
= (1 − π) [Σy≥0 y(1 − π)^y π + Σy≥0 (1 − π)^y π]
= (1 − π) [E(X) + 1]
= (1 − π) E(X) + (1 − π)
where the first sum in the penultimate line is E(X) and the second is 1. From this we can solve:
E(X) = (1 − π)/(1 − (1 − π)) = (1 − π)/π.
Suppose X is a random variable and a and b are constants, i.e. known numbers which are not random variables. Then:
E(aX + b) = a E(X) + b.
Proof: We have:
E(aX + b) = Σx (ax + b) p(x) = Σx ax p(x) + Σx b p(x) = a Σx x p(x) + b Σx p(x) = a E(X) + b
where the last step follows from:
i. Σx x p(x) = E(X), by definition of E(X)
ii. Σx p(x) = 1, by definition of the probability function.
Similarly:
Var(aX + b) = a² Var(X).
Proof:
Var(aX + b) = E[(aX + b − E(aX + b))²] = E[(aX − a E(X))²] = E[a²(X − E(X))²] = a² E[(X − E(X))²] = a² Var(X).
Recall:
Cov(X, Y) = E(XY) − E(X) E(Y).
Proof:
Cov(X, Y) = E[(X − E(X))(Y − E(Y))] = E[XY − X E(Y) − Y E(X) + E(X) E(Y)] = E(XY) − E(X) E(Y) − E(X) E(Y) + E(X) E(Y) = E(XY) − E(X) E(Y).
If X and Y are independent, then:
Cov(X, Y) = Corr(X, Y) = 0.
Proof:
Cov(X, Y ) = E(XY ) − E(X) E(Y ) = E(X) E(Y ) − E(X) E(Y ) = 0
since E(XY ) = E(X) E(Y ) when X and Y are independent.
Since Corr(X, Y ) = Cov(X, Y )/[sd(X) sd(Y )], Corr(X, Y ) = 0 whenever
Cov(X, Y ) = 0.
Appendix C
Solutions to Sample examination
questions
P(A ∩ B)/P(B) = P(A ∩ Bᶜ)/(1 − P(B))
or:
P(A ∩ B) (1 − P(B)) = (P(A) − P(A ∩ B)) P(B)
which implies P(A ∩ B) = P(A) P(B).
(c) False. Consider, for example, throwing a die and letting A be the event that
the throw results in a 1 and B be the event that the throw is 2.
2. There are (10 choose 2) = 45 ways in which A and B can choose their seats, of which 9 are pairs of adjacent seats. Therefore, using classical probability, the probability that A and B are adjacent is:
9/45 = 1/5.
3. (a) Judge 1 can either correctly vote guilty a guilty defendant, or incorrectly vote
guilty a not guilty defendant. Therefore, the probability is given by:
2. (a) Since ∫ f(x) dx = 1, we have:
∫₀¹ kx² dx = [kx³/3]₀¹ = k/3 = 1
and so k = 3.
(b) We have:
E(X) = ∫₀¹ x f(x) dx = ∫₀¹ 3x³ dx = [3x⁴/4]₀¹ = 3/4
and:
E(X²) = ∫₀¹ x² f(x) dx = ∫₀¹ 3x⁴ dx = [3x⁵/5]₀¹ = 3/5.
Hence:
Var(X) = E(X²) − (E(X))² = 3/5 − (3/4)² = 3/80 = 0.0375.
3. (a) We compute:
∫₀¹ x²(1 − x) dx = 1/3 − 1/4 = 1/12
so k = 12.
(b) We first find:
E(1/X) = 12 ∫₀¹ x(1 − x) dx = 12 × (1/2 − 1/3) = 2.
Also:
E(1/X²) = 12 ∫₀¹ (1 − x) dx = 6.
Therefore:
Var(1/X) = 6 − 2² = 2.
C.3 Chapter 4 – Common distributions of random variables
3. We have that X ∼ N(1, 4). Using the definition of conditional probability, and standardising with Z = (X − 1)/2, we have:
P(X > 3 | X < 5) = P(3 < X < 5)/P(X < 5) = P(1 < Z < 2)/P(Z < 2) = (0.9772 − 0.8413)/0.9772 = 0.1391.
(c) U and V are not independent since not all joint probabilities are equal to the
product of the respective marginal probabilities. For example, one sufficient
case to disprove independence is noting that P (U = 0, V = 0) = 0 whereas
P (U = 0) P (V = 0) > 0.
i.e. £0.9032.
3. (a) As the letters are delivered at random, the destination of letter 1 follows a
discrete uniform distribution among the six houses, i.e. the probability is equal
to 1/6.
(b) The random variable Xi is equal to 1 with probability 1/6 and 0 otherwise, hence:
E(Xi) = 1 × (1/6) + 0 × (5/6) = 1/6.
(c) If house 1 receives the correct letter, there are 5 letters still to be delivered.
Therefore, for example:
P(X1 = 1 ∩ X2 = 1) = (1/6) × (1/5) ≠ 1/36 = P(X1 = 1) P(X2 = 1)
hence X1 and X2 are not independent.
C.5 Chapter 6 – Sampling distributions of statistics
since X12 ∼ N(0, 1) and X1² + · · · + X11² ∼ χ²11.
(b) We have:
2 × x̄ = 2 × (0.2 + 3.6 + 1.1)/3 = 3.27.
The point estimate is not plausible since 3.27 < 3.6 = x2 which must be
impossible to observe if X ∼ Uniform[0, 3.27].
Due to the law of large numbers, sample moments should converge to the
corresponding population moments. Here, n = 3 is small hence poor
performance of the MME is not surprising.
2. (a) We have to minimise:
S = Σi εi² = (y1 − α − β)² + (y2 − α − 2β)² + (y3 − α − 4β)².
We have:
∂S/∂α = −2(y1 − α − β) − 2(y2 − α − 2β) − 2(y3 − α − 4β) = 2(3α + 7β − (y1 + y2 + y3))
and:
∂S/∂β = −2(y1 − α − β) − 4(y2 − α − 2β) − 8(y3 − α − 4β) = 2(7α + 21β − (y1 + 2y2 + 4y3)).
The estimators α̂ and β̂ are the solutions of the equations ∂S/∂α = 0 and ∂S/∂β = 0. Hence:
3α̂ + 7β̂ = y1 + y2 + y3   and   7α̂ + 21β̂ = y1 + 2y2 + 4y3.
Solving yields:
β̂ = (−4y1 − y2 + 5y3)/14   and   α̂ = (2y1 + y2 − y3)/2.
They are unbiased estimators since:
E(β̂) = (−4(α + β) − (α + 2β) + 5(α + 4β))/14 = 14β/14 = β
and:
E(α̂) = (2(α + β) + (α + 2β) − (α + 4β))/2 = 2α/2 = α.
(b) We have, by independence:
Var(α̂) = (1² + (1/2)² + (1/2)²) σ² = 3σ²/2.
Differentiating:
(d/dλ) l(λ) = 2 Σi Xi/λ − 2nλ = (2 Σi Xi − 2nλ²)/λ.
Setting this to zero, we re-arrange for the estimator:
Σi Xi − nλ̂² = 0   ⇒   λ̂ = √(Σi Xi/n) = X̄^{1/2}.
Hence:
θ̂ = λ̂³ = X̄^{3/2}.
1. We have:
1 − α = P(−tα/2, n−1 ≤ (X̄ − µ)/(S/√n) ≤ tα/2, n−1)
= P(−tα/2, n−1 × S/√n ≤ X̄ − µ ≤ tα/2, n−1 × S/√n)
= P(−tα/2, n−1 × S/√n < µ − X̄ < tα/2, n−1 × S/√n)
= P(X̄ − tα/2, n−1 × S/√n < µ < X̄ + tα/2, n−1 × S/√n).
Hence an accurate 100(1 − α)% confidence interval for µ, where α ∈ (0, 1), is:
(X̄ − tα/2, n−1 × S/√n, X̄ + tα/2, n−1 × S/√n).
and:
s² = (1/249) (Σi xi² − 250x̄²) = (163 − 250 × (0.652)²)/249 = 0.2278.
3. For a 90% confidence interval, we need the lower and upper 5% values from χ²n−1 = χ²9. These are χ²0.95, 9 = 3.325 (given in the question) and χ²0.05, 9 = 16.92, using Table 8 of the New Cambridge Statistical Tables. Hence we obtain:
((n − 1)s²/χ²α/2, n−1, (n − 1)s²/χ²1−α/2, n−1) = (9 × 21.05/16.92, 9 × 21.05/3.325) = (11.20, 56.98).
C.8 Chapter 9 – Hypothesis testing
= 0.784.
(b) We have:
P(Type I error) = P(reject H0 | H0) = 1 − P(X ≤ 3 | π = 0.3) = 1 − Σx=1..3 (1 − 0.3)^{x−1} × 0.3 = 0.343.
(c) The p-value is P(X ≥ 4 | π = 0.3) = 0.343 which, of course, is the same as the probability of a Type I error.
2. The size is the probability we reject the null hypothesis when it is true:
P(X > 1 | λ = 1/2) = 1 − e^{−0.5} − 0.5e^{−0.5} ≈ 1 − 1.5/√2.718 = 0.0902.
The power is the probability we reject the null hypothesis when the alternative is true:
P(X > 1 | λ = 2) = 1 − e^{−2} − 2e^{−2} ≈ 1 − 3/(2.718)² = 0.5939.
3. The power of the test at σ² is:
β(σ) = Pσ(H0 is rejected) = Pσ(T > χ²α, n−1) = Pσ((n − 1)S²/σ0² > χ²α, n−1) = Pσ((n − 1)S²/σ² > (σ0²/σ²) × χ²α, n−1) = P(X > (σ0²/σ²) × χ²α, n−1)
where X ∼ χ²n−1. Hence here, where n = 10, we have:
β(σ) = P(X > (2.00/σ²) × χ²0.01, 9) = P(X > (2.00/σ²) × 21.666).
With any given value of σ², we may compute β(σ). For the σ² values requested, we obtain the following.
σ²                  2.00     2.56
2.00 × 21.666/σ²    21.666   16.927
Approx. β(σ)        0.01     0.05
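These tail probabilities can be verified in R; a sketch:
> # Power of the chi-squared variance test at sigma^2 = 2.00 and 2.56 (sketch)
> crit <- qchisq(0.99, df = 9)                        # 21.666
> 1 - pchisq((2.00/c(2.00, 2.56)) * crit, df = 9)     # approx 0.01 and 0.05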
2. The overall average audience share is (21.35 + 17.28 + 20.18)/3 = 19.60, so the sum of squares due to networks is 4 × [(21.35 − 19.60)² + (17.28 − 19.60)² + (20.18 − 19.60)²] = 35.13, and the mean sum of squares (MS) due to networks is 35.13/(3 − 1) = 17.57.
The degrees of freedom are 4 − 1 = 3, 3 − 1 = 2, (4 − 1)(3 − 1) = 6 and 4 × 3 − 1 = 11 for cities, networks, error and total sum of squares, respectively.
The SS for cities is 3 × 1.95 = 5.85. We have that the SS due to residuals is given by 51.52 − 5.85 − 35.13 = 10.54 and the MS is 10.54/6 = 1.76.
The F-values are 1.95/1.76 = 1.11 and 17.57/1.76 = 9.98 for cities and networks, respectively.
Here is the two-way ANOVA table:
Source   DF  SS     MS     F
City     3   5.85   1.95   1.11
Network  2   35.13  17.57  9.98
Error    6   10.54  1.76
Total    11  51.52
3. (a) The average time for all batteries is 41.5, and the sum of squares for batteries works out as 57.94. Hence the mean sum of squares due to batteries is 57.94/(4 − 1) = 19.31. The degrees of freedom are 7 − 1 = 6, 4 − 1 = 3, (7 − 1)(4 − 1) = 18 and 7 × 4 − 1 = 27 for laptops, batteries, error and total sum of squares, respectively.
The sum of squares for laptops is 6 × 26 = 156. The sum of squares due to residuals is then found by subtracting these two components from the total sum of squares.
Appendix D
Examination formula sheet
Discrete distributions
Uniform:   p(x) = 1/k for x = 1, 2, . . . , k;   mean (k + 1)/2;   variance (k² − 1)/12.
Binomial:   p(x) = (n choose x) π^x (1 − π)^{n−x} for x = 0, 1, . . . , n;   mean nπ;   variance nπ(1 − π).
Geometric:   p(x) = (1 − π)^{x−1} π for x = 1, 2, 3, . . .;   mean 1/π;   variance (1 − π)/π².
Poisson:   p(x) = e^{−λ} λ^x/x! for x = 0, 1, 2, . . .;   mean λ;   variance λ.
Continuous distributions
Exponential:   pdf λe^{−λx} and cdf 1 − e^{−λx} for x > 0;   mean 1/λ;   variance 1/λ².
Normal:   pdf (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} for all x;   mean µ;   variance σ².
Sample quantities
Sample variance:   s² = (1/(n − 1)) Σi (xi − x̄)² = (1/(n − 1)) (Σi xi² − nx̄²).
Sample covariance:   (1/(n − 1)) Σi (xi − x̄)(yi − ȳ) = (1/(n − 1)) (Σi xi yi − nx̄ȳ).
Sample correlation:   (Σi xi yi − nx̄ȳ) / √[(Σi xi² − nx̄²)(Σi yi² − nȳ²)].
Inference
Variance of sample mean:   σ²/n.
One-sample t statistic:   (X̄ − µ)/(S/√n).
Two-sample t statistic:   √[(n + m − 2)/(1/n + 1/m)] × (X̄ − Ȳ − δ0)/√[(n − 1)SX² + (m − 1)SY²].