STAT261
Statistical Inference Notes
Printed at the University of New England, October 4, 2007
Contents

1 Estimation
    1.1 Statistics
        1.1.1 Examples of Statistics
    1.2 Estimation by the Method of Moments
    1.3 Estimation by the Method of Maximum Likelihood
    1.4 Properties of Estimators
    1.5 Examples of Estimators and their Properties
    1.6 Properties of Maximum Likelihood Estimators
    1.7 Confidence Intervals
        1.7.1 Pivotal quantity
    1.8 Bayesian estimation
        1.8.1 Bayes' theorem for random variables
        1.8.2 Posterior is proportional to prior times likelihood
        1.8.3 Likelihood
        1.8.4 Prior
        1.8.5 Posterior
    1.9 Normal Prior and Likelihood
    1.10 Bootstrap Confidence Intervals
        1.10.1 The empirical cumulative distribution function

2 Hypothesis Testing
    2.1 Introduction
    2.2 Terminology and Notation
        2.2.1 Hypotheses
        2.2.2 Tests of Hypotheses
        2.2.3 Size and Power of Tests
    2.3 Examples
    2.4 One-sided and Two-sided Tests
        2.4.1 Case (a): Alternative is one-sided
        2.4.2 Case (b): Two-sided Alternative
        2.4.3 Two Approaches to Hypothesis Testing
    2.5 Two-Sample Problems
    2.6
    2.7
    2.8
    2.9

3 Chi-square Distribution
    3.1 Distribution of S²
    3.2 Chi-Square Distribution
    3.3 Independence of X̄ and S²
    3.4 Confidence Intervals for σ²
    3.5 Testing Hypotheses about σ²
    3.6 χ² and Inv-χ² distributions in Bayesian inference
        3.6.1 Non-informative priors
    3.7 The posterior distribution of the Normal variance
        3.7.1 Inverse Chi-squared distribution
    3.8 Relationship between χ² and Inv-χ²
        3.8.1 Gamma and Inverse Gamma
        3.8.2 Chi-squared and Inverse Chi-squared
        3.8.3 Simulating Inverse Gamma and Inverse-χ² random variables

4 F Distribution
    4.1 Derivation
    4.2 Properties of the F distribution
    4.3 Use of F-Distribution in Hypothesis Testing
    4.4 Pooling Sample Variances
    4.5 Confidence Interval for σ₁²/σ₂²
    4.6 Comparing parametric and bootstrap confidence intervals for σ₁²/σ₂²

5 t-Distribution
    5.1 Derivation
    5.2 Properties of the t-Distribution
    5.3 Use of t-Distribution in Interval Estimation
    5.4 Use of t-distribution in Hypothesis Testing
    5.5 Paired-sample t-test
    5.6 Bootstrap T-intervals

6

7 Analysis of Variance
    7.1 Introduction
    7.2 The Basic Procedure
    7.3 Single Factor Analysis of Variance
    7.4 Estimation of Means and Confidence Intervals
    7.5 Assumptions Underlying the Analysis of Variance
        7.5.1 Tests for Equality of Variance
    7.6 Estimating the Common Mean
The notes
Material for these notes has been drawn and collated from the following sources:

Mathematical Statistics with Applications. William Mendenhall, Dennis Wackerly, Richard Scheaffer. Duxbury. ISBN 0-534-92026-8
Bayesian Statistics: An Introduction, third edition. Peter Lee. Hodder Arnold. ISBN 0-340-81405-5
Bayesian Data Analysis. Andrew Gelman, John Carlin, Hal Stern, Donald Rubin. Chapman & Hall. ISBN 1-58488-388-X
An Introduction to the Bootstrap. Bradley Efron, Robert Tibshirani. Chapman & Hall. ISBN 0-412-04231-2
Introduction to Statistics through Resampling Methods and R/S-PLUS. Phillip Good. Wiley. ISBN 0-471-71575-1
Not all topics include the three categories of inference, because the Bayesian or Nonparametric counterpart does not always align with the frequentist methods. However, where there is sufficient alignment, an alternative method of inference is introduced. It is hoped that this will stimulate students to explore the topics of Bayesian and Nonparametric statistics more fully in later units.
Parametric, Frequentist

Both systematic and random components are represented by a mathematical model, and the model is a function of parameters which are estimated from the data. For example,

    y_ij = β₀ + β₁ x_i + ε_ij,    ε_ij ~ N(0, σ²)
Bayesian

Whereas in frequentist inference the data are considered a random sample and the parameters fixed, Bayesian statistics regards the data as fixed and the parameters as random. The exercise is: given the data, what are the distributions of the parameters such that samples from those distributions could give rise to the observed data?
Non-parametric

This philosophy does not assume that a mathematical form (with parameters) should be imposed on the data; the model is determined by the data themselves. The techniques include:

- permutation tests, bootstrap, Kolmogorov-Smirnov tests, etc.
- kernel density estimation, kernel regression, smoothing splines, etc.

Not imposing any predetermined mathematical form on the data seems a good idea. However, the limitations are:

- The data are not summarized by parameters, so interpretation of the data requires whole curves etc. There is no ready formula into which values can be plugged to derive estimates.
- It requires sound computing skills and numerical methods.
- The statistical method may be appropriate only when there are sufficient data to reliably indicate associations etc. without the assistance of a parametric model.
Chapter 1
Estimation
The application of the methods of probability to the analysis and interpretation of data is known as statistical inference. In particular, we wish to make an inference about a population based on information contained in a sample. Since populations are characterized by numerical descriptive measures called parameters, the objective of many statistical investigations is to make an inference about one or more population parameters. There are two broad areas of inference: estimation (the subject of this chapter) and hypothesis-testing (the subject of the next chapter).

When we say that we have a random sample X₁, X₂, ..., Xₙ from a random variable X or from a population with distribution function F(x; θ), we mean that X₁, X₂, ..., Xₙ are identically and independently distributed random variables each with c.d.f. F(x; θ), that is, depending on some parameter θ. We usually assume that the form of the distribution, e.g., binomial, Poisson, Normal, etc., is known but the parameter θ is unknown. We wish to obtain information from the data (sample) to enable us to make some statement about the parameter. Note that θ may be a vector, e.g., θ = (μ, σ²). See WMS 2.12 for more detailed comments on random samples.

The general problem of estimation is to find out something about θ using the information in the observed values of the sample, x₁, x₂, ..., xₙ. That is, we want to choose a function H(x₁, x₂, ..., xₙ) that will give us a good estimate of the parameter θ in F(x; θ).
1.1 Statistics
We will introduce the technical meaning of the word statistic and look at some commonly
used statistics.
Definition 1.1
Any function of the elements of a random sample, which does
not depend on unknown parameters, is called a statistic.
1.1.1 Examples of Statistics
> source("SampleStats.R")
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.8     7.9     9.4     9.8    12.0    18.2
mean   = 9.9
var    = 9.5
sd     = 3.1
range  = 1.8 18
median = 9.4
Second Moment = 106
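The quantities reported by SampleStats.R follow directly from their definitions. As a cross-check, here is a minimal sketch in Python (the notes' own code is R; the data below are made up for illustration):

```python
# Sample statistics from first principles; the sample is invented.
import statistics

x = [2.0, 4.0, 6.0, 8.0]
n = len(x)

mean = sum(x) / n
var = sum((xi - mean) ** 2 for xi in x) / (n - 1)  # sample variance S^2
sd = var ** 0.5
rng = (min(x), max(x))
med = statistics.median(x)
second_moment = sum(xi ** 2 for xi in x) / n       # M_2 = (1/n) sum x_i^2
```

For x = (2, 4, 6, 8) this gives mean 5, S² = 20/3, and second moment M₂ = 30.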
If you are using Rcmdr, the menu for generating normal random variables is

    Distributions -> Normal distribution -> Sample from a normal distribution

which opens a dialog. You must then supply a name for the data set (e.g. rn) and the parameters mu = 10, sigma = 3. Make the number of rows = 100 and the number of columns = 1. When you click OK a data set containing the numbers is produced. Here the name of the data set is rn and this appears as "Data set: rn" in the top left of Rcmdr. Observe in the script window that Rcmdr uses the input through the menus (GUIs) to produce a script akin to that above. Rcmdr is an alternative way of computing, but it is not sufficiently comprehensive to do all our computing and usually requires augmenting with other scripts.

If there is an active data set, summary statistics are derived by

    Statistics -> Summaries -> Active data set
We have not previously encountered S², M_r, R, etc., but we (should) already know the following facts about X̄.

(i) It is a random variable with E(X̄) = μ, Var(X̄) = σ²/n, where μ = E(Xᵢ), σ² = Var(Xᵢ).

(ii) If X₁, X₂, ..., Xₙ is from a normal distribution, then X̄ is also normally distributed.

(iii) For large n and any distribution of the Xᵢ for which a mean (μ) and variance (σ²) exist, X̄ is distributed approximately normal with mean μ and variance σ²/n (by the Central Limit Theorem).
Next we will consider some general methods of estimation. Since different methods may
lead to different estimators for the same parameter, we will then need to consider criteria
for deciding whether one estimate is better than another.
1.2 Estimation by the Method of Moments
Recall that, for a random variable X, the rth moment about the origin is μ′_r = E(X^r), and that for a random sample X₁, X₂, ..., Xₙ, the rth sample moment about the origin is defined by

    M_r = Σᵢ₌₁ⁿ Xᵢ^r / n,   r = 1, 2, 3, ...

with observed value

    m_r = Σᵢ₌₁ⁿ xᵢ^r / n .
Note that the first sample moment is just the sample mean, X̄.
We will first prove a property of sample moments.
Theorem 1.1
Let X₁, X₂, ..., Xₙ be a random sample of X. Then

    E(M_r) = μ′_r,   r = 1, 2, 3, ...

Proof

    E(M_r) = E( (1/n) Σᵢ₌₁ⁿ Xᵢ^r ) = (1/n) Σᵢ₌₁ⁿ E(Xᵢ^r) = (1/n) Σᵢ₌₁ⁿ μ′_r = μ′_r .
This theorem provides the motivation for estimation by the method of moments (with the estimator being referred to as the method of moments estimator or MME): the sample moments are equated to the corresponding population moments and the resulting equations are solved for the parameters.

For example, suppose X is distributed uniformly on [0, θ]. Then

    μ′₁ = ∫₀^θ x (1/θ) dx = θ/2 .

Using the Method of Moments we proceed to estimate μ′₁ = θ/2 by m₁. Thus, since m₁ = x̄, we have

    θ̃/2 = x̄,  and so  θ̃ = 2x̄ .

That is, the MME of θ is 2X̄.
Computer Exercise 1.2
Generate 100 samples of size 10 from a uniform distribution, U(0, θ) with θ = 10. Estimate the value of θ from your samples using the method of moments and plot the results. Comment on the results.

In this exercise, we know a priori that θ = 10 and have generated random samples. The samples are then analysed as if θ were unknown and θ estimated by the method of moments, so that we can compare the estimates with the known value.
Solution:

theta <- 10
sampsz <- 10
nsimulations <- 100
theta.estimates <- numeric(nsimulations)
for (i in 1:nsimulations){
  ru <- runif(n=sampsz,min=0,max=theta)
  Xbar <- mean(ru)
  theta.estimates[i] <- 2*Xbar
}                               # end of the i loop
plot(density(theta.estimates))

[Figure: density(x = theta.estimates), N = 100, bandwidth = 0.605; the estimates are centred near θ = 10.]
For a sample from N(μ, σ²), the method of moments gives

    μ̃ = x̄,  and  σ̃² = Σ xᵢ²/n − x̄² .

The latter can also be written as σ̃² = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)².
Computer Exercise 1.3
Generate 100 samples of size 10 from a normal distribution with μ = 14 and σ = 4. Estimate μ and σ² from your samples using the method of moments. Plot the estimated values of μ and σ². Comment on your results.
Solution:
#_____________NormalMoments.R ___________
mu <- 14
sigma <- 4
sampsz <- 10
nsimulations <- 100
mu.estimates <- numeric(nsimulations)
var.estimates <- numeric(nsimulations)
for (i in 1:nsimulations){
  rn <- rnorm(mean=mu,sd=sigma,n=sampsz)
  mu.estimates[i] <- mean(rn)
  var.estimates[i] <- mean( (rn - mean(rn))^2 )
}                               # end of i loop
plot(density(mu.estimates))
plot(density(var.estimates))

[Figures: density plots of the 100 estimates of μ (centred near 14) and of σ² (centred below 16).]
The plot you obtain for the means should be centred around the true mean of 14. However, you will notice that the plot of the variances is not centred about the true variance of 16 as you would like. Rather, it will appear to be centred about a value less than 16. The reason for this will become evident when we study the properties of estimators in section 1.5.
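The shortfall below 16 can be previewed here: the moment estimator of the variance divides by n rather than n − 1, so its expectation is (n − 1)σ²/n = 0.9 × 16 = 14.4 when n = 10. A simulation sketch (in Python rather than the notes' R; the seed and replication count are arbitrary choices):

```python
# Average the divide-by-n variance estimator over many normal samples;
# the average settles near (n-1)/n * sigma^2 = 14.4, not 16.
import random

random.seed(1)
mu, sigma, n, nsim = 14.0, 4.0, 10, 20000
total = 0.0
for _ in range(nsim):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(x) / n
    total += sum((xi - xbar) ** 2 for xi in x) / n  # divide by n, not n-1
avg = total / nsim   # close to 14.4
```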
General Procedure
Let X₁, X₂, ..., Xₙ be a random sample from F(x; θ₁, ..., θ_k). That is, suppose that there are k parameters to be estimated. Let μ′_r, m_r (r = 1, 2, ..., k) denote the first k population and sample moments respectively, and suppose that each of these population moments is a certain known function of the parameters. That is,

    μ′₁ = g₁(θ₁, ..., θ_k)
    μ′₂ = g₂(θ₁, ..., θ_k)
    ...
    μ′_k = g_k(θ₁, ..., θ_k) .

Solving simultaneously the set of equations,

    μ′_r = g_r(θ₁, ..., θ_k) = m_r,   r = 1, 2, ..., k,

gives the required estimates, θ̃₁, ..., θ̃_k.
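As a concrete instance of this procedure with k = 2: for a N(μ, σ²) sample, μ′₁ = g₁(μ, σ²) = μ and μ′₂ = g₂(μ, σ²) = σ² + μ², so solving μ′₁ = m₁ and μ′₂ = m₂ gives μ̃ = m₁ and σ̃² = m₂ − m₁². A Python sketch with made-up data (the notes' own code is R):

```python
# Method of moments for N(mu, sigma^2): equate the first two sample
# moments to the population moments and solve. The data are invented.
x = [11.0, 15.0, 13.0, 17.0, 14.0]
n = len(x)

m1 = sum(x) / n                       # first sample moment
m2 = sum(xi ** 2 for xi in x) / n     # second sample moment
mu_tilde = m1
var_tilde = m2 - m1 ** 2              # equals (1/n) * sum (x_i - xbar)^2
```

Here μ̃ = 14 and σ̃² = 4, matching (1/n)Σ(xᵢ − x̄)² computed directly.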
1.3 Estimation by the Method of Maximum Likelihood
First the term likelihood of the sample must be defined. This has to be done separately for discrete and continuous distributions.

Definition 1.2
Let x₁, x₂, ..., xₙ be sample observations taken on the random variables X₁, X₂, ..., Xₙ. Then the likelihood of the sample, L(θ | x₁, x₂, ..., xₙ), is defined as:
(i) the joint probability of x₁, x₂, ..., xₙ if X₁, X₂, ..., Xₙ are discrete, and
(ii) the joint probability density function of X₁, ..., Xₙ evaluated at x₁, x₂, ..., xₙ if the random variables are continuous.

In general the value of the likelihood depends not only on the (fixed) sample x₁, x₂, ..., xₙ but on the value of the (unknown) parameter θ, and can be thought of as a function of θ.

The likelihood function for a set of n identically and independently distributed (iid) random variables, X₁, X₂, ..., Xₙ, can thus be written as:

    L(θ; x₁, ..., xₙ) = P(X₁ = x₁) · P(X₂ = x₂) ··· P(Xₙ = xₙ)   for X discrete,
    L(θ; x₁, ..., xₙ) = f(x₁; θ) · f(x₂; θ) ··· f(xₙ; θ)        for X continuous.    (1.1)
For the discrete case, L(θ; x₁, ..., xₙ) is the probability (or likelihood) of observing (X₁ = x₁, X₂ = x₂, ..., Xₙ = xₙ). It would then seem that a sensible approach to selecting an estimate of θ would be to find the value of θ which maximizes the probability of observing (X₁ = x₁, X₂ = x₂, ..., Xₙ = xₙ), the event which occurred.

The maximum likelihood estimate (MLE) of θ is defined as that value of θ which maximizes the likelihood. To state it more mathematically, the MLE of θ is that value of θ, say θ̂, such that

    L(θ̂; x₁, ..., xₙ) ≥ L(θ′; x₁, ..., xₙ),

where θ′ is any other value of θ.

Before we consider particular examples of MLEs, some comments about notation and technique are needed.

Comments
1. It is customary to use θ̂ to denote both estimator (random variable) and estimate (its observed value). Recall that we used θ̃ for the MME.
2. Since L(θ; x₁, x₂, ..., xₙ) is a product, and sums are usually more convenient to deal with than products, it is customary to maximize log L(θ; x₁, ..., xₙ), which we usually abbreviate to l(θ). This has the same effect: since log L is a strictly increasing function of L, it will take on its maximum at the same point.

3. In some problems, θ will be a vector, in which case L(θ) has to be maximized by differentiating with respect to 2 (or more) variables and solving simultaneously 2 (or more) equations.

4. The method of differentiation to find a maximum only works if the function concerned actually has a turning point.
Example 1.3
Given X is distributed bin(1, p) where p ∈ (0, 1), and a random sample x₁, x₂, ..., xₙ, find the maximum likelihood estimate of p.

Solution: The likelihood is

    L(p; x₁, x₂, ..., xₙ) = P(X₁ = x₁) P(X₂ = x₂) ··· P(Xₙ = xₙ)
                          = Πᵢ₌₁ⁿ C(1, xᵢ) p^xᵢ (1 − p)^(1−xᵢ)
                          = p^(x₁+x₂+···+xₙ) (1 − p)^(n−x₁−x₂−···−xₙ)
                          = p^Σxᵢ (1 − p)^(n−Σxᵢ) .

So

    log L(p) = Σxᵢ log p + (n − Σxᵢ) log(1 − p)

and

    d log L(p)/dp = Σxᵢ/p − (n − Σxᵢ)/(1 − p) .

This is equal to zero when Σxᵢ (1 − p) = p (n − Σxᵢ), that is, when p = Σxᵢ/n. This estimate is denoted by p̂.

Thus, if the random variable X is distributed bin(1, p), the MLE of p derived from a sample of size n is

    p̂ = X̄ .    (1.2)
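The stationary point found above can be checked numerically: over a fine grid of p values, log L(p) peaks at Σxᵢ/n. A sketch in Python with an invented sample (the notes' own code is R):

```python
# Grid check that log L(p) from Example 1.3 is maximised at p-hat = xbar.
import math

x = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]   # invented Bernoulli sample, n = 10
n, s = len(x), sum(x)

def loglik(p):
    return s * math.log(p) + (n - s) * math.log(1 - p)

grid = [i / 1000 for i in range(1, 1000)]   # p in (0, 1)
p_best = max(grid, key=loglik)              # grid maximiser
```

Here Σxᵢ/n = 0.7, and the grid maximiser agrees.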
Example 1.4
Given x₁, x₂, ..., xₙ is a random sample from a N(μ, σ²) distribution, where both μ and σ² are unknown, find the maximum likelihood estimates of μ and σ².
Solution: The likelihood is

    L(μ, σ²; x₁, ..., xₙ) = Πᵢ₌₁ⁿ (1/√(2πσ²)) e^(−(xᵢ−μ)²/2σ²)
                          = (2πσ²)^(−n/2) e^(−Σᵢ₌₁ⁿ (xᵢ−μ)²/2σ²) .

So

    log L(μ, σ²) = −(n/2) log(2π) − (n/2) log σ² − Σᵢ₌₁ⁿ (xᵢ−μ)²/2σ² .

To maximize this w.r.t. μ and σ² we must solve simultaneously the two equations

    ∂ log L(μ, σ²)/∂μ = 0     (1.3)
    ∂ log L(μ, σ²)/∂σ² = 0 .  (1.4)

These become

    Σᵢ₌₁ⁿ (xᵢ − μ)/σ² = 0     (1.5)
    −n/2σ² + Σᵢ₌₁ⁿ (xᵢ−μ)²/2σ⁴ = 0 .  (1.6)

From (1.5) we obtain Σᵢ₌₁ⁿ xᵢ = nμ, so that μ̂ = x̄. Using this in equation (1.6), we obtain

    σ̂² = Σᵢ₌₁ⁿ (xᵢ − x̄)²/n .

Thus

    μ̂ = X̄  and  σ̂² = Σᵢ₌₁ⁿ (Xᵢ − X̄)²/n .    (1.7)
Note that these are the same estimators as obtained by the method of moments.
Example 1.5
Given random variable X is distributed uniformly on [0, θ], find the MLE of θ based on a sample of size n.

Solution: Now f(xᵢ; θ) = 1/θ, xᵢ ∈ [0, θ], i = 1, 2, ..., n. So the likelihood is

    L(θ; x₁, x₂, ..., xₙ) = Πᵢ₌₁ⁿ (1/θ) = 1/θⁿ .
[Figure: graph of L(θ) = 1/θⁿ, decreasing in θ.]

When we come to find the maximum of this function we note that the slope is not zero anywhere, so there is no use finding dL(θ)/dθ or d log L(θ)/dθ. Note however that L(θ) increases as θ → 0. So L(θ) is maximized by setting θ equal to the smallest value it can take. If the observed values are x₁, ..., xₙ then θ can be no smaller than the largest of these. This is because xᵢ ∈ [0, θ] for i = 1, ..., n. That is, each xᵢ ≤ θ, or θ ≥ each xᵢ.

Thus, if X is distributed U(0, θ), the MLE of θ is

    θ̂ = max(Xᵢ) .    (1.8)
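It is worth putting the two estimators of θ side by side: the moment estimator 2X̄ is unbiased, while the MLE max(Xᵢ) is biased low, since E(max Xᵢ) = nθ/(n + 1). A simulation sketch (Python rather than the notes' R; seed and counts are arbitrary):

```python
# Compare the moment estimator 2*Xbar with the MLE max(X_i) for U(0, theta).
import random

random.seed(2)
theta, n, nsim = 10.0, 10, 5000
mme, mle = [], []
for _ in range(nsim):
    x = [random.uniform(0, theta) for _ in range(n)]
    mme.append(2 * sum(x) / n)   # moment estimator
    mle.append(max(x))           # maximum likelihood estimator (1.8)
avg_mme = sum(mme) / nsim        # near theta = 10
avg_mle = sum(mle) / nsim        # near n*theta/(n+1) = 9.09
```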
Comment
The Method of Moments was first proposed near the turn of the twentieth century by the British statistician Karl Pearson. The Method of Maximum Likelihood goes back much further: both Gauss and Daniel Bernoulli made use of the technique, the latter as early as 1777. Fisher though, in the early years of the twentieth century, was the first to make a thorough study of the method's properties, and the procedure is often credited to him.
1.4 Properties of Estimators
Using different methods of estimation can lead to different estimators, so criteria for deciding which are good estimators are required. Before listing the qualities of a good estimator, it is important to understand that estimators are random variables. For example, suppose that we take a sample of size 5 from a uniform distribution and calculate x̄. Each time we repeat
the experiment we will probably get a different sample of 5 and therefore a different x̄. The behaviour of an estimator for different random samples will be described by a probability distribution. The actual distribution of the estimator is not a concern here, and only its mean and variance will be considered. As a first condition, it seems reasonable to ask that the distribution of the estimator be centred around the parameter it is estimating; if not, it will tend to overestimate or underestimate θ. A second property an estimator should possess is precision: an estimator is precise if the dispersion of its distribution is small. These two concepts are incorporated in the definitions of unbiasedness and efficiency below.

In the following, X₁, X₂, ..., Xₙ is a random sample from the distribution F(x; θ) and H(X₁, ..., Xₙ) = θ̂ will denote an estimator of θ (not necessarily the MLE).
Definition 1.3 Unbiasedness
An estimator θ̂ of θ is unbiased if

    E(θ̂) = θ  for all θ.    (1.9)

There may be a large number of unbiased estimators of a parameter for any given distribution, and a further criterion for choosing between all the unbiased estimators is needed.
Definition 1.4 Efficiency
Let θ̂₁ and θ̂₂ be two unbiased estimators of θ with variances Var(θ̂₁), Var(θ̂₂) respectively. We say that θ̂₁ is more efficient than θ̂₂ if

    Var(θ̂₁) < Var(θ̂₂) .

That is, θ̂₁ is more efficient than θ̂₂ if it has a smaller variance.

Definition 1.5 Relative Efficiency
The relative efficiency of θ̂₂ with respect to θ̂₁ is defined as

    efficiency = Var(θ̂₁)/Var(θ̂₂) .    (1.11)
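As an illustration of Definition 1.5: for normal data both the sample mean and the sample median are unbiased estimators of μ, and their relative efficiency can be estimated by simulation; for large n the ratio Var(mean)/Var(median) is near 2/π ≈ 0.64. A Python sketch (not from the notes; seed and counts are arbitrary):

```python
# Estimate the relative efficiency of the sample median to the sample mean.
import random
import statistics

random.seed(3)
n, nsim = 25, 4000
means, medians = [], []
for _ in range(nsim):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    means.append(sum(x) / n)
    medians.append(statistics.median(x))
var_mean = statistics.pvariance(means)
var_median = statistics.pvariance(medians)
rel_eff = var_mean / var_median   # below 1: the mean is more efficient
```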
plot(density(moment.estimates),
     xlab=" ",ylab=" ",main=" ",ylim=c(0,0.6),las=1)
abline(v=theta,lty=3)
lines(density(ML.estimates),lty=2)
legend(11,0.5,legend=c("moment","ML"),lty=1:2,cex=0.6)

[Figure: density curves of the moment estimates (solid) and ML estimates (dashed) of θ, with a dotted vertical line at θ = 10.]
You should see that the Method of Moments gives unbiased estimates of θ, though many are not in the range space, as noted in Computer Example 1.3. The maximum likelihood estimates are all less than 10 and so are biased.
It will now be useful to indicate that the estimator is based on a sample of size n by denoting it by θ̂ₙ.
Definition 1.6 Consistency
θ̂ₙ is a consistent estimator of θ if, for every ε > 0,

    lim_{n→∞} P( |θ̂ₙ − θ| < ε ) = 1 .    (1.12)

This is a large-sample or asymptotic property. Consistency has to do only with the limiting behaviour of an estimator as the sample size increases without limit, and does not imply that the observed value of θ̂ is necessarily close to θ for any specific size of sample n. If only a relatively small sample is available, it would seem immaterial whether a consistent estimator is used or not.
The following theorem (which will not be proved) gives a method of testing for consistency.

Theorem 1.2
If lim_{n→∞} E(θ̂ₙ) = θ and lim_{n→∞} Var(θ̂ₙ) = 0, then θ̂ₙ is a consistent estimator of θ.
[Figure: plot of estimates against n from 20 to 100, showing θ̂ₙ approaching θ as n increases.]
The final concept of sufficiency requires some explanation before a formal definition is given. The random sample X₁, X₂, ..., Xₙ drawn from the distribution with F(x; θ) contains information about the parameter θ. To estimate θ, this sample is first condensed into a statistic.
1.5 Examples of Estimators and their Properties
In this section we will consider the sample mean X̄ and the sample variance S² and examine which of the above properties they have.
Theorem 1.3
Let X be a random variable with mean μ and variance σ². Let X̄ be the sample mean based on a random sample of size n. Then X̄ is an unbiased and consistent estimator of μ.

Proof
Now E(X̄) = μ, no matter what the sample size is, and Var(X̄) = σ²/n. The latter approaches 0 as n → ∞, satisfying Theorem 1.2.

It can also be shown that of all linear functions of X₁, X₂, ..., Xₙ, X̄ has minimum variance. Note that the above theorem is true no matter what distribution is sampled. Some applications are given below.

For a random sample X₁, X₂, ..., Xₙ, X̄ is an unbiased and consistent estimator of:

(i) μ when the Xᵢ are distributed N(μ, σ²);
Sample Variance
Recall that the sample variance is defined by

    S² = Σᵢ₌₁ⁿ (Xᵢ − X̄)² / (n − 1) .
Theorem 1.4
Given X₁, X₂, ..., Xₙ is a random sample from a distribution with mean μ and variance σ², then S² is an unbiased estimator of σ².

Proof

    (n − 1) E(S²) = E[ Σᵢ₌₁ⁿ (Xᵢ − X̄)² ]
                  = E[ Σᵢ₌₁ⁿ ((Xᵢ − μ) − (X̄ − μ))² ]
                  = E[ Σᵢ₌₁ⁿ (Xᵢ − μ)² − 2(X̄ − μ) Σᵢ₌₁ⁿ (Xᵢ − μ) + n(X̄ − μ)² ]
                  = E[ Σᵢ₌₁ⁿ (Xᵢ − μ)² − 2n(X̄ − μ)² + n(X̄ − μ)² ]
                  = Σᵢ₌₁ⁿ E(Xᵢ − μ)² − n E(X̄ − μ)²
                  = Σᵢ₌₁ⁿ Var(Xᵢ) − n Var(X̄)
                  = nσ² − n(σ²/n)
                  = (n − 1)σ² .    (1.13)

So E(S²) = σ².
(ii) The number in the denominator of S², that is, n − 1, is called the number of degrees of freedom. The numerator is the sum of n deviations (from the mean) squared, but the deviations are not independent. There is one constraint on them, namely the fact that Σ(Xᵢ − X̄) = 0. As soon as n − 1 of the Xᵢ − X̄ are known, the nth one is determined.

(iii) In calculating the observed value of S², s², the following form is usually convenient:

    s² = [ Σ xᵢ² − (Σ xᵢ)²/n ] / (n − 1)    (1.14)

or, equivalently,

    s² = [ Σ xᵢ² − n x̄² ] / (n − 1) .    (1.15)

The equivalence of the two forms is easily seen:

    Σ (xᵢ − x̄)² = Σ (xᵢ² − 2x̄xᵢ + x̄²) = Σ xᵢ² − 2x̄ Σ xᵢ + n x̄² = Σ xᵢ² − n x̄² = Σ xᵢ² − (Σ xᵢ)²/n .
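The equivalence of (1.14) and (1.15) with the defining formula is easy to confirm on a small made-up data set (a Python sketch; the notes' own code is R):

```python
# All three expressions for s^2 agree; the data are invented.
x = [3.0, 5.0, 7.0, 9.0]
n = len(x)
xbar = sum(x) / n

s2_def = sum((xi - xbar) ** 2 for xi in x) / (n - 1)            # definition
s2_a = (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1)   # form (1.14)
s2_b = (sum(xi ** 2 for xi in x) - n * xbar ** 2) / (n - 1)     # form (1.15)
```

Here all three give s² = 20/3.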
1.6 Properties of Maximum Likelihood Estimators

The following four properties are the main reasons for recommending the use of Maximum Likelihood Estimators.

(i) The MLE is consistent.

(ii) The MLE has a distribution that tends to normality as n → ∞.

(iii) If a sufficient statistic for θ exists, then the MLE is sufficient.
1.7 Confidence Intervals
In the earlier part of this chapter we have been considering point estimators of a parameter. By point estimator we are referring to the fact that, after the sampling has been done
and the observed value of the estimator computed, our end-product is the single number
which is hopefully a good approximation for the unknown true value of the parameter. If
the estimator is good according to some criteria, then the estimate should be reasonably
close to the unknown true value. But the single number itself does not include any indication of how high the probability might be that the estimator has taken on a value close
to the true unknown value. The method of confidence intervals gives both an idea of
the actual numerical value of the parameter, by giving it a range of possible values, and a
measure of how confident we are that the true value of the parameter is in that range. To
pursue this idea further consider the following example.
Example 1.6
Consider a random sample of size n from a normal distribution with mean μ (unknown) and known variance σ². Find a 95% confidence interval for the unknown mean, μ.

Solution: We know that the best estimator of μ is X̄, and the sampling distribution of X̄ is N(μ, σ²/n). Then from the standard normal,

    P( |X̄ − μ| / (σ/√n) < 1.96 ) = .95 .

The event |X̄ − μ|/(σ/√n) < 1.96 is equivalent to the event

    μ − 1.96 σ/√n < X̄ < μ + 1.96 σ/√n .

Hence

    P( X̄ − 1.96 σ/√n < μ < X̄ + 1.96 σ/√n ) = .95 .    (1.16)

The two statistics X̄ − 1.96 σ/√n, X̄ + 1.96 σ/√n are the endpoints of a 95% confidence interval for μ. This is reported as:

    The 95% CI for μ is ( X̄ − 1.96 σ/√n , X̄ + 1.96 σ/√n ).
Computer Exercise 1.6
Generate 100 samples of size 9 from a N(0, 1) distribution. Find the 95% CI for μ for each of these samples and count the number that do (don't) contain zero. (You could repeat this say 10 times to build up the total number of CIs generated to 1000.) You should observe that about 5% of the intervals don't contain the true value of μ (= 0).
Solution: Use the commands:
#___________ ConfInt.R __________
sampsz <- 9
nsimulations <- 100
non.covered <- 0
for (i in 1:nsimulations){
  rn <- rnorm(mean=0,sd=1,n=sampsz)
  Xbar <- mean(rn)
  s <- sd(rn)
  CI <- qnorm(mean=Xbar,sd=s/sqrt(sampsz),p=c(0.025,0.975) )
  non.covered <- non.covered + (CI[1] > 0) + (CI[2] < 0)
}
cat("Rate of non covering CIs",100*non.covered/nsimulations," % \n")
> source("ConfInt.R")
Rate of non covering CIs 8 %
This implies that 8 of the CIs don't contain 0. With a larger sample size we would expect that about 5% of the CIs would not contain zero.
We make the following definition:
Definition 1.8
An interval, at least one of whose endpoints is a random variable is called a
random interval.
In (1.16), we are saying that the probability is 0.95 that the random interval

    ( X̄ − 1.96 σ/√n , X̄ + 1.96 σ/√n )    (1.17)

will cover μ, but the statement

    x̄ − 1.96 σ/√n < μ < x̄ + 1.96 σ/√n

is either true or false. The parameter μ is a constant, and either the interval contains it, in which case the statement is true, or it does not contain it, in which case the statement is false. How then is the probability 0.95 to be interpreted? It must be considered in terms of the relative frequency with which the indicated event will occur in the long run of similar sampling experiments.

Each time we take a sample of size n, a different x̄, and hence a different interval (1.17), would be obtained. Some of these intervals will contain μ as claimed, and some will not. In fact, if we did this many times, we'd expect that 95 times out of 100 the interval obtained would contain μ. The measure of our confidence is then 0.95, because before a sample is drawn there is a probability of 0.95 that the confidence interval to be constructed will cover the true mean.

A statement such as P(3.5 < μ < 4.9) = 0.95 is incorrect and should be replaced by: A 95% confidence interval for μ is (3.5, 4.9).
We can generalize the above as follows. Let z_{α/2} be defined by

    Φ(z_{α/2}) = 1 − (α/2) .    (1.18)

That is, the area under the normal curve above z_{α/2} is α/2. Then

    P( −z_{α/2} < (X̄ − μ)/(σ/√n) < z_{α/2} ) = 1 − α .

So a 100(1 − α)% CI for μ is

    ( x̄ − z_{α/2} σ/√n , x̄ + z_{α/2} σ/√n ) .    (1.19)

Note that the asymmetric interval

    ( x̄ − z_{2α/3} σ/√n , x̄ + z_{α/3} σ/√n )
CHAPTER 1. ESTIMATION
25
is also a 100(1 − α)% CI for μ. Likewise, we could have one-sided CIs for μ. For example,

    ( −∞ , x̄ + z_α σ/√n )  or  ( x̄ − z_α σ/√n , ∞ ) .

[The second of these arises from considering P( (X̄ − μ)/(σ/√n) < z_α ) = 1 − α.]
We could also have a CI based on, say, the sample median instead of the sample mean. Methods of obtaining confidence intervals must be judged by their various statistical properties. For example, one desirable property is to have the length (or expected length) of a 100(1 − α)% CI as short as possible. Note that for the CI in (1.19), the length is constant for given n.
1.7.1 Pivotal quantity
We will describe a general method of finding a confidence interval for θ from a random sample of size n. It is known as the pivotal method as it depends on finding a pivotal quantity that has two characteristics:

(i) It is a function of the sample observations and the unknown parameter θ, say H(X₁, X₂, ..., Xₙ; θ), where θ is the only unknown quantity.

(ii) It has a probability distribution that does not depend on θ.

Any probability statement of the form

    P( a < H(X₁, X₂, ..., Xₙ; θ) < b ) = 1 − α

will give rise to a probability statement about θ.
Example 1.7
Given X₁, X₂, ..., X_{n₁} from N(μ₁, σ₁²) and Y₁, Y₂, ..., Y_{n₂} from N(μ₂, σ₂²) where σ₁², σ₂² are known, find a symmetric 95% CI for μ₁ − μ₂.

Solution: Consider μ₁ − μ₂ (= θ, say) as a single parameter. Then X̄ is distributed N(μ₁, σ₁²/n₁) and Ȳ is distributed N(μ₂, σ₂²/n₂) and further, X̄ and Ȳ are independent. It follows that X̄ − Ȳ is normally distributed, and writing it in standardized form,

    ( X̄ − Ȳ − (μ₁ − μ₂) ) / √( σ₁²/n₁ + σ₂²/n₂ )  is distributed as N(0, 1) .

So we have found the pivotal quantity, which is a function of μ₁ − μ₂ but whose distribution does not depend on μ₁ − μ₂. A 95% CI for θ = μ₁ − μ₂ is found by considering

    P( −1.96 < ( X̄ − Ȳ − θ ) / √( σ₁²/n₁ + σ₂²/n₂ ) < 1.96 ) = .95 ,
leading to

    ( x̄ − ȳ − 1.96 √( σ1²/n1 + σ2²/n2 ) , x̄ − ȳ + 1.96 √( σ1²/n1 + σ2²/n2 ) ) .        (1.20)
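The interval (1.20) is straightforward to compute. The means, variances and sample sizes below are hypothetical, chosen only to illustrate the formula.

```r
# Symmetric 95% CI for mu1 - mu2 with known variances, as in (1.20)
xbar <- 5.2;  ybar <- 4.8        # sample means (hypothetical)
v1   <- 1.0;  v2   <- 1.5        # known variances sigma1^2, sigma2^2 (hypothetical)
n1   <- 40;   n2   <- 50
se   <- sqrt(v1/n1 + v2/n2)
ci   <- c((xbar - ybar) - 1.96*se, (xbar - ybar) + 1.96*se)
ci
```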
Example 1.8
In many problems where we need to estimate proportions, it is reasonable to assume that
sampling is from a binomial population, and hence that the problem is to estimate p in
the bin(n, p) distribution, where p is unknown. Find a 100(1 − α)% CI for p, making use
of the fact that for large sample sizes, the binomial distribution can be approximated by
the normal.
Solution: Given X is distributed as bin(n, p), an unbiased estimate of p is p̂ = X/n. For n large, X/n is approximately normally distributed. Then,

    E(p̂) = E(X)/n = p ,

and

    Var(p̂) = Var(X)/n² = np(1 − p)/n² = p(1 − p)/n ,

so that

    ( p̂ − p ) / √( p(1 − p)/n )    is approximately N(0, 1) .

[Note that we have found the required pivotal quantity, whose distribution does not depend on p.]
An approximate 100(1 − α)% CI for p is obtained by considering

    P( −z_{α/2} < ( p̂ − p ) / √( p(1 − p)/n ) < z_{α/2} ) = 1 − α ,        (1.21)
where z_{α/2} is defined in (1.18).
Rearranging (1.21), the confidence limits for p are obtained as

    ( 2n p̂ + z²_{α/2} ± z_{α/2} √( 4n p̂(1 − p̂) + z²_{α/2} ) ) / ( 2( n + z²_{α/2} ) ) .        (1.22)
A simpler expression can be found by dividing both numerator and denominator of (1.22) by 2n and neglecting terms of order 1/n. That is, a 95% CI for p is

    ( p̂ − 1.96 √( p̂(1 − p̂)/n ) , p̂ + 1.96 √( p̂(1 − p̂)/n ) ) .        (1.23)

Note that this is just the expression we would have used if we replaced Var(p̂) = p(1 − p)/n in (1.21) by the estimate V̂ar(p̂) = p̂(1 − p̂)/n. In practice, confidence limits for p are generally obtained by means of specially constructed tables, which make it possible to find confidence intervals when n is small.
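The difference between the exact rearrangement (1.22) and the simpler limits (1.23) can be seen numerically; the counts below are hypothetical.

```r
# Compare the limits (1.22) with the simpler limits (1.23)
x <- 30; n <- 100                 # hypothetical: 30 successes in 100 trials
z <- qnorm(0.975)
phat <- x/n
# (1.22): the two roots of the quadratic in p
lims.exact  <- (2*n*phat + z^2 + c(-1, 1)*z*sqrt(4*n*phat*(1 - phat) + z^2)) /
               (2*(n + z^2))
# (1.23): Var(phat) replaced by the estimate phat(1 - phat)/n
lims.simple <- phat + c(-1, 1)*z*sqrt(phat*(1 - phat)/n)
rbind(lims.exact, lims.simple)
```

The two intervals agree closely here; for small n, or p̂ near 0 or 1, the limits from (1.22) are preferable because they always remain inside (0, 1).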
Example 1.9
Construct an appropriate 90% confidence interval for λ in the Poisson distribution. Evaluate this if a sample of size 30 yields Σ xi = 240.
Solution: Now X̄ is an unbiased estimator of λ for this problem, so λ can be estimated by λ̂ = x̄, with E(λ̂) = λ and Var(λ̂) = Var(X̄) = σ²/n = λ/n. By the Central Limit Theorem,

    ( λ̂ − λ ) / √( λ/n )    is approximately N(0, 1) .        (1.24)
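The printed solution breaks off at (1.24) in this copy of the notes. A sketch of the evaluation, assuming the usual step of replacing Var(λ̂) = λ/n by its estimate λ̂/n, is:

```r
# Approximate 90% CI for lambda from (1.24), with sum(x) = 240 and n = 30
n       <- 30
lam.hat <- 240/n                 # xbar = 8
z       <- qnorm(0.95)           # 1.645 for a 90% interval
ci      <- c(lam.hat - z*sqrt(lam.hat/n), lam.hat + z*sqrt(lam.hat/n))
ci                               # approximately (7.15, 8.85)
```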
1.8  Bayesian estimation

1.8.1  Bayes theorem for random variables

    P(Hn | E) = P(E | Hn) P(Hn) / P(E) .        (1.25)

The result at (1.25) is Bayes' theorem, and in this form it shows how we can invert probabilities, getting P(Hn | E) from P(E | Hn).
When the Hn consist of exclusive and exhaustive events,

    P(Hn | E) = P(Hn) P(E | Hn) / Σ_m P(Hm) P(E | Hm) .        (1.27)

1.8.2  Posterior is prior × likelihood

For random variables X and Y, Bayes' theorem becomes

    p(y | x) = p(x | y) p(y) / p(x) ,        (1.28)

where the normalizing constant is

    1/p(x) = 1 / ∫ p(x | y) p(y) dy      (continuous)
    1/p(x) = 1 / Σ_y p(x | y) p(y)       (discrete) .
From (1.28),

    p(θ|X) ∝ p(X|θ) p(θ) .

The term p(X|θ) may be considered as a function of X for fixed θ, i.e. a density of X which is parameterized by θ.
We can also consider the same term as a function of θ for fixed X, and then it is termed the likelihood function,

    ℓ(θ|X) = p(X|θ) .

These are the names given to the terms of (1.28):
p(θ) is the prior,
ℓ(θ|X) is the likelihood,
p(θ|X) is the posterior,
and Bayes' theorem is

    posterior ∝ likelihood × prior .

The function p(·) is not the same in each instance but is a generic symbol to represent the density appropriate for the prior, the density of the data given the parameters, and the posterior. The form of p is understood by considering its arguments, i.e. p(θ), p(x|θ) or p(θ|x).
A diagram depicting the relationships amongst the different densities is shown in Figure 1.2.
Figure 1.2: Posterior distribution. [The prior P(θ), the likelihood P(x|θ), and the posterior P(θ|x) are plotted on the same axes.]
The posterior is a combination of the likelihood, where information about θ comes from the data X, and the prior p(θ), where the information is knowledge of θ independent of X. This knowledge may come from previous sampling, say. The posterior represents an update of p(θ) with the new information at hand, i.e. x.
If the likelihood is weak, due to insufficient sampling or a wrong choice of likelihood function, the prior can dominate, so that the posterior is just an adaptation of the prior. Alternatively, if the sample size is large, so that the likelihood function is strong, the prior will not have much impact and the Bayesian analysis gives much the same answer as maximum likelihood.
The output is a distribution, p(θ|X), and we may interpret it using summaries such as the median and an interval where the true value of θ would lie with a certain probability. The interval that we shall use is the Highest Density Region, or HDR. This is the interval for which the density at any point within it is higher than the density at any point outside.
Figure 1.3 depicts a density with shaded areas of 0.9 in two cases. In frame (a), observe that there are quantiles outside the interval (1.37, 7.75) at which the density is greater than at some quantiles within the interval. Frame (b) depicts the HDR as (0.94, 6.96).
Figure 1.3: Comparison of two types of regions, a Confidence Interval and a Highest Density Region. [Both panels shade an area of 0.9 under the posterior density p(θ|X): (a) the confidence interval (1.37, 7.75); (b) the HDR (0.94, 6.94).]
maximum likelihood estimates were biased because we could only use the maximum value and had no information regarding future samples which might exceed the maximum.
The Bayesian philosophy attempts to address these concerns.
For this example and further work, we require the indicator function,

    I_A(x) = 1    (x ∈ A)
             0    (x ∉ A)

1.8.3  Likelihood

For a single observation x from U(0, θ),

    p(x|θ) = 1/θ    0 < x < θ
             0      otherwise        (1.29)

so that for a sample x1, x2, . . . , xn,

    ℓ(θ|x) = θ^(−n)    θ > each xi
             0          otherwise

and if M = max(x1, x2, . . . , xn),

    ℓ(θ|x) = θ^(−n)    θ > M
             0          otherwise
           = θ^(−n) I_(M,∞)(θ) .
1.8.4  Prior

1.8.5  Posterior

By Bayes' rule,

    p(θ|x) ∝ p(θ) ℓ(θ|x)
           ∝ θ^(−1) I_(δ,∞)(θ) × θ^(−n) I_(M,∞)(θ)
               (prior)              (likelihood)
           ∝ θ^(−(n+1)) I_(θ0,∞)(θ) ,

where θ0 = max(M, δ).
Thus we combine the information gained from the data with our prior beliefs to get a distribution of θ's.
In this exercise, there is a fixed lower endpoint, which is zero: X ~ U(0, θ).
The prior chosen is a Pareto distribution, with density

    p(θ) = θ^(−1) I_(δ,∞)(θ) .

This is chosen so that the prior does not change very much over the region in which the likelihood is appreciable and does not take on large values outside that region. It is said to be locally uniform. We defer the theory about this; for now you may just accept that it is appropriate for this exercise.
The posterior density, p(θ|X), is

    p(θ|X) ∝ θ^(−(n+1)) I_(θ0,∞)(θ) .

The HDR will be as in Figure 1.4.
Figure 1.4: The 90% HDR for p(θ|X). [The posterior density is plotted for θ between 9 and 15; the 90% HDR runs from the lower end-point up to about 10.8.]
The lower end-point is M = max(x1, x2, . . . , xn). This is the MLE, and in that setting there was no other information that we could use to address the point that θ ≥ M; M had to do the job, but we were aware that it was very possible that the true value of θ was greater than the maximum value of the sample.
The upper end-point is found from the distribution function. We require θ such that

    ∫_M^θ p(u|X) du = 0.9

    1 − ( M/θ )^n = 0.9

    θ = M / 0.1^(1/n) .

Likewise, we can compute the median of the posterior distribution,

    Q_0.5 = M / 0.5^(1/n) .
The following R program was used to estimate the median and HDR of p(θ|X).

#_________ UniformBayes.R _____
theta <- 10
sampsz <- 10
nsimulations <- 10
for (i in 1:nsimulations){
  xi <- max(runif(n=sampsz, min=0, max=theta))  # M, the sample maximum
  Q0.9 <- xi/(0.1^(1/sampsz))                   # upper limit of the 90% HDR
  Q0.5 <- xi/(0.5^(1/sampsz))                   # posterior median
  cat("simulation ", i, " median = ", round(Q0.5,2),
      " 90% HDR = (", round(xi,2), round(Q0.9,2), ")\n")
}
simulation  1  median =  10.65  90% HDR = ( 9.94 12.51 )
simulation  2  median =  10.09  90% HDR = ( 9.42 11.85 )
simulation  3  median =  8.92   90% HDR = ( 8.32 10.48 )
simulation  4  median =  10.64  90% HDR = ( 9.93 12.5 )
simulation  5  median =  9.86   90% HDR = ( 9.2 11.59 )
simulation  6  median =  9.88   90% HDR = ( 9.22 11.61 )
simulation  7  median =  8.4    90% HDR = ( 7.84 9.87 )
simulation  8  median =  8.66   90% HDR = ( 8.08 10.18 )
simulation  9  median =  10.41  90% HDR = ( 9.71 12.22 )
simulation 10  median =  9.19   90% HDR = ( 8.57 10.79 )
1.9  Normal Prior and Likelihood

This section is included to demonstrate the process for modelling the posterior distribution of parameters, and the notes shall refer to it in an example in Chapter 2.
Suppose

    x|θ ~ N(θ, σ²) ,    θ ~ N(θ0, σ0²) ,

so that

    p(x|θ) = (2πσ²)^(−1/2) exp{ −(x − θ)² / (2σ²) }        (1.30)

    p(θ) = (2πσ0²)^(−1/2) exp{ −(θ − θ0)² / (2σ0²) } .

Then

    p(θ|x) ∝ p(x|θ) p(θ)
           ∝ exp{ −(x − θ)²/(2σ²) − (θ − θ0)²/(2σ0²) }
           ∝ exp{ −(1/2) [ θ² ( 1/σ² + 1/σ0² ) − 2θ ( x/σ² + θ0/σ0² ) ] } .        (1.31)

Define the precisions τ = 1/σ² and τ0 = 1/σ0², and put

    τ1 = τ0 + τ        (1.32)
    σ1² = 1/τ1         (1.33)
    θ1 = ( τ0 θ0 + τ x ) / τ1 .        (1.34)

Then (1.31) can be expressed as

    p(θ|x) ∝ exp{ −(1/2) τ1 ( θ² − 2θ θ1 ) } .

Add into the exponent the term −(1/2) τ1 θ1², which is a constant as far as θ is concerned. Then

    p(θ|x) ∝ exp{ −(1/2) τ1 ( θ − θ1 )² }
           = (2πσ1²)^(−1/2) exp{ −( θ − θ1 )² / (2σ1²) } .

The last result, containing the normalising constant (2πσ1²)^(−1/2), comes from ∫ p(θ|x) dθ = 1.
Thus the posterior density is θ|x ~ N(θ1, σ1²), where

    1/σ1² = 1/σ0² + 1/σ²

    θ1 = θ0 τ0/(τ0 + τ) + x τ/(τ0 + τ) = σ1² ( θ0/σ0² + x/σ² ) .

The posterior mean is a weighted mean of the prior mean and the datum value; the weights are proportional to their respective precisions.
Example 1.10
Suppose that θ ~ N(370, 20²) and that x|θ ~ N(θ, 8²), with observed value x = 421. What is p(θ|x)?

    1/σ1² = 1/20² + 1/8² ,    so    σ1² ≈ 55

    θ1 = 55 ( 370/20² + 421/8² ) ≈ 413

    θ|x ~ N(413, 55) .
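The arithmetic can be checked in R. Small differences from the printed values arise because the notes round σ1² to 55 before computing θ1.

```r
# Posterior for Example 1.10: theta ~ N(370, 20^2), observed x = 421 with sd 8
tau0   <- 1/20^2                          # prior precision
tau    <- 1/8^2                           # data precision
s1sq   <- 1/(tau0 + tau)                  # posterior variance, about 55
theta1 <- s1sq * (370*tau0 + 421*tau)     # posterior mean, about 414
c(theta1, s1sq)
```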
1.10  Bootstrap Confidence Intervals

The Bootstrap is a Monte-Carlo method which uses (computer) simulation in lieu of mathematical theory. It is not necessarily simpler. Exercises with the bootstrap are mostly numerical, although the underlying theory follows much of the analytical methods.
1.10.1  The empirical cumulative distribution function

The sample values are ordered, x(1) ≤ x(2) ≤ . . . ≤ x(n), and the empirical cumulative distribution function (ecdf) takes the value k/(n+1) on the interval between x(k) and x(k+1):

    interval   (0, x(1))   (x(1), x(2))   . . .   (x(n−1), x(n))   (x(n), ∞)
    F(x)       0           1/(n+1)        . . .   (n−1)/(n+1)      n/(n+1)

[The accompanying figure shows F(x) as a step function, rising by 1/(n+1) at each order statistic x(1), x(2), . . . , x(n).]
7 14 20 13 12 12 15 20 17 11 14 16 12 17 17
7 14 15 16 20 17 20 15 22 26 25 27 18 23 26 18 15 20 24
25 23 20 24 20 16 21 20 18 20 18 24 27 27 21 21 22 28 38
27 26 32 19 33 23 38 30 30 27 25 33 34 16 17 22 17 26 21
30 31 27 43 40 28 31 24 15 22 31
A plot of the ecdf of these data is shown in Figure 1.6.

Figure 1.6: An empirical distribution function for typhoon data. [Fn(x) rises from 0 to 1 over the range of the data, roughly 10 to 40.]
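The R code for the plot is lost in this copy of the notes; a minimal sketch, assuming the data are held in a vector named `typhoons`, is:

```r
# ecdf plot for the typhoon data (vector name is an assumption)
typhoons <- c(7,14,20,13,12,12,15,20,17,11,14,16,12,17,17,
              7,14,15,16,20,17,20,15,22,26,25,27,18,23,26,18,15,20,24,
              25,23,20,24,20,16,21,20,18,20,18,24,27,27,21,21,22,28,38,
              27,26,32,19,33,23,38,30,30,27,25,33,34,16,17,22,17,26,21,
              30,31,27,43,40,28,31,24,15,22,31)
plot(ecdf(typhoons), las=1)    # step function of the sample
```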
We now return to the sample of Example 1.9 from the Poisson distribution, for which n = 30 and

    Σ_{i=1}^{30} xi = 240 ,    x̄ = 8 .
We shall use this example to illustrate (a) resampling, and (b) the bootstrap distribution.
The sample, x1, x2, . . . , xn, are independently and identically distributed (i.i.d.) as Poisson(λ), which means that each observation is as important as any other for providing information about the population from which this sample is drawn. That implies we can replace any number by one of the others and the new sample will still convey the same information about the population.
This is demonstrated in Figure 1.7. Three new samples have been generated by taking samples of size n = 30 with replacement from x. The ecdf of x is shown in bold and
the ecdfs of the new samples are shown with different line types. There is little change
in the empirical distributions or estimates of quantiles. If a statistic (e.g. a quantile) were
estimated from this process a large number of times, it would be a reliable estimate of the
population parameter. The new samples are termed bootstrap samples.
Figure 1.7: Resampling with replacement from original sample. [The ecdfs of the original sample and of the three bootstrap resamples are plotted together over the range of the data.]
This is the bootstrap procedure for the CI for λ in the current example.
1. Nominate the number of bootstrap samples that will be drawn, e.g. nBS=99.
2. Sample with replacement from x a bootstrap sample of size n, x*.
3. For each bootstrap sample, calculate the statistic of interest, λ̂*.
4. Take the 0.025 and 0.975 quantiles of the bootstrap estimates as the limits of the 95% CI.

[Figure 1.8: the ecdf of the bootstrap means, with the 2.5% and 97.5% quantiles marked at 7.22 and 8.73.]
The bootstrap estimate of the 95% CI for λ is (7.22, 8.73). Note that although there is a great deal of statistical theory underpinning this (the ecdf, i.i.d. samples, a thing called order statistics, etc.), there is no theoretical formula for the CI and it is determined numerically from the sample.
This is the R code used to generate the graph in Figure 1.8.

x <- c(8,6,5,10,8,12,9,9,8,11,7,3,6,7,5,8,10,7,8,8,10,8,5,10,8,6,10,6,8,14)
n <- length(x)
nBS <- 99                      # number of bootstrap simulations
BS.mean <- numeric(nBS)
i <- 1
while (i < (nBS+1) ){
  BS.mean[i] <- mean(sample(x, replace=T, size=n))
  i <- i + 1
}                              # end of the while() loop
Quantiles <- quantile(BS.mean, p = c(0.025, 0.975))
cat(" 95% CI = ", Quantiles, "\n")
plot(ecdf(BS.mean), las=1)
The boot package in R has functions for bootstrapping. The following code uses that to get the same CI as above.

library(boot)
mnz <- function(z, id){ mean(z[id]) }    # user must supply this
bs.samples <- boot(data=x, statistic=mnz, R=99)
boot.ci(bs.samples, conf=0.95, type=c("perc","bca"))

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 99 bootstrap replicates

CALL :
boot.ci(boot.out = bs.samples, conf = 0.95)

Intervals :
Level      Percentile            BCa
95%    ( 7.206,  8.882 )   ( 7.106,  8.751 )
It seems that the user must supply a function (e.g. mnz here) to generate the bootstrap samples. The variable id is recognised by R as a vector 1:length(z) so that it can draw the samples.
Chapter 2

Hypothesis Testing

2.1  Introduction
considered did not detect the difference between the actual and hypothesized values of the parameter.
Before putting hypothesis testing on a more formal basis, let us consider the following questions. What is the role of statistics in testing hypotheses? How do we decide whether the sample value disagrees with the scientist's hypothesis? When should we reject the hypothesis and when should we withhold judgement? What is the probability that we will make the wrong decision? What function of the sample measurements should be used to reach a decision? Answers to these questions form the basis of a study of statistical hypothesis testing.
2.2  Terminology and Notation

2.2.1  Hypotheses

2.2.2  Tests of Hypotheses

2.2.3  Size and Power of Tests
There are two types of errors that can occur. If we reject H0 when it is true, we commit a Type I error. If we fail to reject H0 when it is false, we commit a Type II error. You may like to think of this in tabular form.

                                        Our decision
                                do not reject H0     reject H0
    Actual     H0 is true       correct decision     Type I error
    situation  H0 is not true   Type II error        correct decision

    α = P(Type I error)         (2.1)
    β = P(Type II error)        (2.2)

The probability α is sometimes referred to as the size of the critical region or the significance level of the test, and the probability 1 − β as the power of the test.
The roles played by H0 and H1 are not at all symmetric. From consideration of potential losses due to wrong decisions, the decision-maker is somewhat conservative, holding the hypothesis as true unless there is overwhelming evidence from the data that it is false. He believes that the consequence of wrongly rejecting H0 is much more severe to him than that of wrongly accepting it.
For example, suppose a pharmaceutical company is considering the marketing of a newly developed drug for treatment of a disease for which the best available drug on the market has a cure rate of 80%. On the basis of limited experimentation, the research division claims that the new drug is more effective. If in fact it fails to be more effective, or if it has harmful side-effects, the loss sustained by the company due to the existing drug becoming obsolete, the decline of the company's image, etc., may be quite severe. On the other hand, failure to market a better product may not be considered as severe a loss. In this problem it would be appropriate to consider H0 : p = .8 and H1 : p > .8. Note that H0 is simple and H1 is composite.
Ideally, when devising a test, we should look for a decision function which makes the probabilities of Type I and Type II errors as small as possible but, as will be seen in a later example, these depend on one another. For a given sample size, altering the decision rule to decrease one error results in the other being increased. So, recalling that the Type I error is more serious, a possible procedure is to hold α fixed at a suitable level (say α = .05 or .01) and then look for a decision function which minimizes β. The first solution for this was given by Neyman and Pearson for a simple hypothesis versus a simple alternative. It's often referred to as the Neyman-Pearson fundamental lemma. While the formulation of a general theory of hypothesis testing is beyond the scope of this unit, the following examples illustrate the concepts introduced above.
2.3  Examples
Example 2.1
Suppose that the random variable X has a normal distribution with mean μ and variance 4. Test the hypothesis that μ = 1 against the alternative that μ = 2, based on a sample of size 25.
Solution: An unbiased estimate of μ is X̄, and we know that X̄ is distributed normally with mean μ and variance σ²/n, which in this example is 4/25. We note that values of x̄ close to 1 support H0 whereas values of x̄ close to 2 support H1. We could make up a decision rule as follows:
If x̄ > 1.6, claim that μ = 2.
If x̄ ≤ 1.6, claim that μ = 1.
The diagram in Figure 2.1 shows the sample space of x̄ partitioned into
(i) the critical region, R = {x̄ : x̄ > 1.6},
(ii) the acceptance region, R̄ = {x̄ : x̄ ≤ 1.6}.
Here, 1.6 is the critical value of x̄.
We will find the probabilities of Type I and Type II error.

    P( X̄ > 1.6 | μ = 1, σ_X̄ = 2/5 ) = .0668        (pnorm(q=1.6,mean=1,sd=0.4,lower.tail=F))

This is

    P(H0 is rejected | H0 is true) = P(Type I error) = α .

Also

    β = P(Type II error) = P(H0 is not rejected | H0 is false)
      = P( X̄ ≤ 1.6 | μ = 2, σ_X̄ = 2/5 )
      = .1587        (pnorm(q=1.6,mean=2,sd=0.4,lower.tail=T))
[Figure 2.1: the sampling distributions of X̄ under μ = 1 and μ = 2, with the critical region x̄ > 1.6 shaded.]
To see how the decision rule could be altered so that α = .05, let the critical value be c. We require

    P( X̄ > c | μ = 1, σ_X̄ = 2/5 ) = 0.05
    c = 1.658        (qnorm(p=0.05,mean=1,sd=0.4,lower.tail=F))

    P( X̄ < c | μ = 2, σ_X̄ = 2/5 ) = 0.196        (pnorm(q=1.658,mean=2,sd=0.4,lower.tail=T))

This value of c gives an α of 0.05 and a β of 0.196, illustrating that as one type of error (α) decreases, the other (β) increases.
Example 2.2
Suppose we have a random sample of size n from a N(μ, 4) distribution and wish to test H0 : μ = 10 against H1 : μ = 8. The decision rule is to reject H0 if x̄ < c. We wish to find n and c so that α = 0.05 and β ≈ 0.1.
Solution: In Figure 2.2 below, the left curve is f(x̄|H1) and the right curve is f(x̄|H0). The critical region is {x̄ : x̄ < c}, so α is the left shaded area and β is the right shaded area.
Figure 2.2: Critical Region Lower Tail. [The densities of X̄ under μ = 8 and μ = 10 are shown, with the critical region to the left of c.]
Now

    α = 0.05 = P( X̄ < c | μ = 10, σ_X̄ = 2/√n )        (2.3)

    β = 0.1 = P( X̄ ≥ c | μ = 8, σ_X̄ = 2/√n ) .        (2.4)

We need to solve (2.3) and (2.4) simultaneously for n and c, as shown in Figure 2.3.
[Figure 2.3: the critical value plotted against sample size for the two conditions α = 0.05 and β = 0.1; the curves cross near n = 9, c = 8.9.]

A sample size n = 9 and critical value c = 8.9 give α ≈ 0.05 and β ≈ 0.1.
2.4  One-sided and Two-sided Tests
Consider the problem where the random variable X has a binomial distribution with P(Success) = p. How do we test the hypothesis p = 0.5? Firstly, note that we have an experiment where the outcome on an individual trial is success or failure with probabilities p and q respectively. Let us repeat the experiment n times and observe the number of successes.
Before continuing with this example it is useful to note that in most hypothesis testing problems we will deal with, H0 is simple, but H1, on the other hand, is composite, indicating that the parameter can assume a range of values. Examples 2.1 and 2.2 were more straightforward in the sense that H1 was simple also.
If the range of possible parameter values lies entirely on the one side of the hypothesized value, the alternative is said to be one-sided. For example, H1 : p > .5 is one-sided but H1 : p ≠ .5 is two-sided. In a real-life problem, the decision of whether to make the alternative one-sided or two-sided is not always clear cut. As a general rule-of-thumb, if parameter values in only one direction are physically meaningful, or are the only ones that are possible, the alternative should be one-sided. Otherwise, H1 should be two-sided. Not all statisticians would agree with this rule.
The next question is what test statistic we use to base our decision on. In the above problem, since X/n is an unbiased estimator of p, that would be a possibility. We could even use X itself. In fact the latter is more suitable since its distribution is known. Recall that the principle of hypothesis testing is that we will assume H0 is correct, and our position will change only if the data show beyond all reasonable doubt that H1 is true. The problem then is to define in quantitative terms what reasonable doubt means. Let us suppose that n = 18 in our problem above. Then the range space for X is RX = {0, 1, . . . , 18} and E(X) = np = 9 if H0 is true. If the observed number of successes is close to 9 we would be inclined to think that H0 was true. On the other hand, if the observed value of X was 0 or 18 we would be fairly sure that H0 was not true. Now reasonable doubt does not have to be as extreme as 18 cases out of 18. Somewhere between x-values of 9 and 18 (or 9 and 0), there is a point, c say, where for all practical purposes the credulity of H0 ends and reasonable doubt begins. This point is called the critical value and it completely determines the decision-making process. We could make up a decision rule:
If x ≥ c, reject H0.
If x < c, conclude that H0 is probably correct.        (2.6)
2.4.1
In the above problem, suppose that the alternative is H1 : p > .5. Only values of x much larger than 9 would support this alternative, and a decision rule such as (2.6) would be appropriate. The actual value of c is chosen to make α, the size of the critical region, suitably small. For example, if c = 11, then P(X ≥ 11) = .24, and this of course is too large. Clearly we should look for a value closer to 18. If c = 15,

    P(X ≥ 15) = Σ_{x=15}^{18} (18 choose x) (.5)^18 = 0.004 ,

on calculation. We may now have gone too far in the other extreme. Requiring 15 or more successes out of 18 before we reject H0 : p = 0.5 means that only 4 times in a thousand would we reject H0 wrongly. Over the years, a reasonable consensus has been reached as to how much evidence against H0 is enough evidence. In many situations we define the beginning of reasonable doubt as the value of the test statistic that is equalled or exceeded by chance 5% of the time when H0 is true. According to this criterion, c should be chosen so that P(X ≥ c | H0 is true) = 0.05. That is, c should satisfy

    P(X ≥ c | p = 0.5) = 0.05 = Σ_{x=c}^{18} (18 choose x) (0.5)^18 .

A little trial and error shows that c = 13 is the appropriate value. Of course, because of the discrete nature of X, it will not be possible to obtain an α of exactly 0.05.
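The trial and error can be done with the binomial tail probabilities in R:

```r
# Locating the critical value for X ~ bin(18, 0.5)
pbinom(q=10, size=18, prob=0.5, lower.tail=FALSE)   # P(X >= 11), about 0.24
pbinom(q=14, size=18, prob=0.5, lower.tail=FALSE)   # P(X >= 15), about 0.004
pbinom(q=12, size=18, prob=0.5, lower.tail=FALSE)   # P(X >= 13), about 0.048
```

Since P(X ≥ 13) ≈ 0.048 ≤ 0.05 while P(X ≥ 12) ≈ 0.12, c = 13 is the smallest critical value whose tail probability does not exceed 0.05.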
Defining the critical region in terms of the x-value that is exceeded only 5% of the time when H0 is true is the most common way to quantify reasonable doubt, but there are others. The figure 1% is frequently used, and if the critical value is exceeded only 1% of the time we say there is strong evidence against H0. If the critical value is only exceeded .1% of the time we may say that there is very strong evidence against H0.
So far we have considered a one-sided alternative. Now we'll consider the other case, where the alternative is two-sided.
2.4.2
Consider now the alternative H1 : p ≠ 0.5. Values of x too large or too small would support this alternative. In this case there are two critical regions (or, more correctly, the critical region consists of two disjoint sets), one in each tail of the distribution of X. For a 5% critical region, there would be two critical values c1 and c2 such that

    P(X ≤ c1 | H0 is true) ≈ 0.025    and    P(X ≥ c2 | H0 is true) ≈ 0.025 .

This can be seen in Figure 2.4 below, where the graph is of the distribution of X when H0 is true. (It can be shown that c1 = 4 and c2 = 14 are the critical values in this case.)
Tests with a one-sided critical region are called one-tailed tests, whereas those with
a two-sided critical region are called two-tailed tests.
[Figure 2.4: the bin(18, 0.5) probability function, with the critical regions X ≤ c1 and X ≥ c2 in the tails.]
Computer Exercise 2.1  Use a simulation approach to estimate a value for c in (2.6) above.
Solution: Use the commands

#Generate 1000 random variables from a bin(18,0.5) distribution.
rb <- rbinom(n=1000, size=18, p=0.5)
#Tabulate the results
table(rb)

rb
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
  1   1   3  11  28  77 125 174 187 166 126  63  22  11   5
[An ecdf plot Fn(x) of the simulated values.]
This would indicate the one-sided critical value should be c = 13, as the estimate of P(X ≥ 13) is 0.038. For a two-sided test the estimated critical values are c1 = 4 and c2 = 13.
These results from simulation are in close agreement with the theoretical results obtained in 2.4.1 and 2.4.2.
2.4.3
occurrence. Hence, if such an event did occur, we'd doubt the hypothesis and conclude that there is evidence that p > 1/3.
Approach (ii), quantiles
Clearly, large values of X support H1, so we'd want a critical region of the form x ≥ c, where c is chosen to give the desired significance level α. That is, for α = 0.05, say, the upper-tail 5% quantile of the binomial distribution with p = 1/3 and N = 9000 is 3074.
(qbinom(size=N,prob=px,p=0.05,lower.tail=F))
The observed value 3164 exceeds this and thus lies in the critical region [c, ∞). So we reject H0 at the 5% significance level. That is, we will come to the conclusion that p > 1/3, but in so doing, we'll recognize the fact that the probability could be as large as 0.05 that we've rejected H0 wrongly.
The two methods are really the same thing. Figure 2.6 shows the distribution function for bin(9000, 1/3) with the observed quantile 3164 and, associated with it, P(X > 3164). The dashed lines show the upper α = 0.05 probability and the quantile C1. The event that X > C1 has probability p < α.
The rejection region can be defined either by the probabilities or the quantiles.
Figure 2.6: Using either quantiles or probability to test the null hypothesis. [The cdf of bin(9000, 1/3) is plotted over quantiles 2800 to 3200, with the observed quantile 3164 and the critical quantile C marked.]
In doing this sort of problem it helps to draw a diagram, or at least try to visualize the
partitioning of the sample space as suggested in Figure 2.7.
If x R it seems much more likely that the actual distribution of X is given by a curve
similar to the one on the right hand side, with mean somewhat greater than 3000.
[Figure 2.7: the sample space partitioned by H0 : p = 1/3 and H1 : p > 1/3, with the two distribution functions plotted over quantiles 2800 to 3400.]
The sample is

    18 14 23 23 18 21 22 16 21 28 12 19 22 15 18 28 24 22 18 13 18 16 24 26 35
Solution:
x <- c(18,14,23,23,18,21,22,16,21,28,12,19,22,15,18,28,24,22,18,13,18,16,24,26,35)
xbar <- mean(x)
n <- length(x)
> xbar
[1] 20.56
pnorm(q=xbar, mean=23,sd=5/sqrt(n))
[1] 0.007
qnorm(p=0.05,mean=23,sd=5/sqrt(n))
[1] 21
We can now use approach (i). For a two-sided alternative the calculated probability is P = 0.015 (= 2 × 0.00734), so that the hypothesis is unlikely to be true.
For approach (ii) with α = 0.05, the critical value is 21. The conclusion reached would therefore be the same by both approaches.
For testing μ = 23 against the one-sided alternative μ < 23, P = 0.0073.
2.5  Two-Sample Problems
In this section we will consider problems involving sampling from two populations where
the hypothesis is a statement of equality of two parameters. The two problems are:
(i) Test H0 : μ1 = μ2, where μ1 and μ2 are the means of two normal populations.
(ii) Test H0 : p1 = p2, where p1 and p2 are the parameters of two binomial populations.
Example 2.4
Given independent random samples X1, X2, . . . , Xn1 from a normal population with unknown mean μ1 and known variance σ1², and Y1, Y2, . . . , Yn2 from a normal population with unknown mean μ2 and known variance σ2², derive a test for the hypothesis H0 : μ1 = μ2 against one-sided and two-sided alternatives.
Solution: Note that the hypothesis can be written as H0 : μ1 − μ2 = 0. An unbiased estimator of μ1 − μ2 is X̄ − Ȳ, so this will be used as the test statistic. Its distribution is given by

    X̄ − Ȳ ~ N( μ1 − μ2 , σ1²/n1 + σ2²/n2 )

or, in standardized form, if H0 is true,

    ( X̄ − Ȳ ) / √( σ1²/n1 + σ2²/n2 ) ~ N(0, 1) .

H0 is rejected if

    ( x̄ − ȳ ) / √( σ1²/n1 + σ2²/n2 ) > c        for H1 : μ1 − μ2 > 0        (2.8)

    | x̄ − ȳ | / √( σ1²/n1 + σ2²/n2 ) > z_{α/2}   for H1 : μ1 − μ2 ≠ 0        (2.9)

where c = 1.645 for α = .05, c = 2.326 for α = .01, etc. Can you see what modification to make to the above rejection regions for testing H0 : μ1 − μ2 = δ, for some specified constant δ other than zero?
Example 2.5
Suppose that n1 Bernoulli trials where P(S) = p1 resulted in X successes, and that n2 Bernoulli trials where P(S) = p2 resulted in Y successes. How do we test H0 : p1 = p2 (= p, say)?
An unbiased estimator of p1 − p2 is X/n1 − Y/n2, and in standardized form

    ( X/n1 − Y/n2 ) / √( Var(X/n1 − Y/n2) )    is approximately N(0, 1) ,        (2.10)

since

    E( X/n1 − Y/n2 ) = (n1 p1)/n1 − (n2 p2)/n2 = p1 − p2 = 0 under H0 ,  and

    Var( X/n1 − Y/n2 ) = (n1 p1 q1)/n1² + (n2 p2 q2)/n2² = p(1 − p) ( 1/n1 + 1/n2 ) under H0 .
In (2.10) the variance is unknown, but we can replace it by an estimate, and it remains to decide what is the best estimate to use. For the binomial distribution, the MLE of p is

    p̂ = number of successes / number of trials = X/n .

In our case, we have two binomial distributions with the same probability of success under H0, so intuitively it seems reasonable to pool the two samples, so that we have X + Y successes in n1 + n2 trials. So we will estimate p by

    p̂ = (x + y) / (n1 + n2) .

Using this in (2.10) we can say that, to test H0 : p1 = p2 against H1 : p1 ≠ p2 at the 100α% significance level, H0 is rejected if

    | x/n1 − y/n2 | / √( ( (x+y)/(n1+n2) ) ( 1 − (x+y)/(n1+n2) ) ( (n1+n2)/(n1 n2) ) ) > z_{α/2} .        (2.11)

Of course the appropriate modification can be made for a one-sided alternative.
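A sketch of (2.11) in R, with hypothetical counts:

```r
# Test of H0: p1 = p2 using the pooled estimate, as in (2.11)
x <- 45; n1 <- 100               # hypothetical counts
y <- 35; n2 <- 120
p.pool <- (x + y)/(n1 + n2)
z.stat <- abs(x/n1 - y/n2) / sqrt(p.pool*(1 - p.pool)*(1/n1 + 1/n2))
z.stat                           # about 2.43 here
z.stat > qnorm(0.975)            # TRUE: reject H0 at the 5% level
```

The built-in prop.test(c(x, y), c(n1, n2)) carries out essentially the same comparison, though it applies a continuity correction by default.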
2.6
That is, H0 is rejected if

    | x̄ − μ0 | / ( σ/√n ) > 1.96 .        (2.12)

Or, using the P-value, if x̄ > μ0 we calculate the probability of a value as extreme or more extreme than this, in either direction. That is, calculate

    P = 2 P( X̄ > x̄ ) = 2 P( Z > ( x̄ − μ0 )/( σ/√n ) ) .

If P < .05 the result is significant at the 5% level. This will happen if

    ( x̄ − μ0 )/( σ/√n ) > 1.96 ,

as in (2.12).
(b) A symmetric 95% confidence interval for μ is x̄ ± 1.96 σ/√n, which arose from considering the inequality

    −1.96 < ( x̄ − μ )/( σ/√n ) < 1.96 ,

which is the event complementary to that in (2.12).
So, to reject H0 at the 5% significance level is equivalent to saying that the hypothesized value is not in the 95% CI. Likewise, to reject H0 at the 1% significance level is equivalent to saying that the hypothesized value is not in the 99% CI, which is equivalent to saying that the P-value is less than 1%.
If 1% < P < 5%, the hypothesized value of μ will not be within the 95% CI but it will lie in the 99% CI.
This approach is illustrated for the hypothesis-testing situation and the confidence
interval approach below.
Computer Exercise 2.3
Using the data in Computer Exercise 2.2, find a 99% CI for the true mean, .
Solution:
#Calculate the upper and lower limits for the 99% confidence interval.
CI <- qnorm(mean=xbar,sd=5/sqrt(25),p=c(0.005,0.995) )
> CI
[1] 18 23
Figure 2.9: Relationship between Significant Hypothesis Test and Confidence Interval. [Two panels mark the hypothesised mean, the critical value, and the observed value (*) against the 95% CI and the 99% CI respectively.]
2.7  Summary
We have only considered 4 hypothesis testing problems at this stage. Further problems
will be dealt with in later chapters after more sampling distributions are introduced. The
following might be helpful as a pattern to follow in doing examples in hypothesis testing.
1. State the hypothesis and the alternative. This must always be a statement about
the unknown parameter in a distribution.
2. Select the appropriate statistic (function of the data). In the problems considered
so far this is an unbiased estimate of the parameter or a function of it. State the
distribution of the statistic and its particular form when H0 is true.
Alternative Procedures

(a) Critical-region approach:
1. Find the critical region using the appropriate value of α (.05 or .01 usually).
2. Find the observed value of the statistic (using the data).
3. Draw conclusions. If the calculated value falls in the CR, this provides evidence against H0. You could say that the result is significant at the 5% (or 1%, or .1%) level.

(b) P-value approach (step 1, finding the observed value of the statistic, is the same):
2. Calculate P, the probability associated with values as extreme as, or more extreme than, that observed. For a 2-sided H1, you'll need to double a probability such as P(X ≥ k).
3. Draw conclusions. For example, if P < .1% we say that there is very strong evidence against H0. If .1% < P < 1% we say there is strong evidence. If 1% < P < 5% we say there is some evidence. For larger values of P we conclude that the event is not an unusual one if H0 is true, and say that this set of data is consistent with H0.
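The doubling in step 2 for a 2-sided alternative can be illustrated in R; the binomial setting here is a made-up example, not one from the notes:

```r
# Hypothetical example: X ~ Bin(20, 0.5) under H0, observed value k = 15.
# Two-sided P-value: double the upper-tail probability P(X >= 15).
P <- 2 * pbinom(q = 14, size = 20, prob = 0.5, lower.tail = FALSE)
round(P, 3)   # about 0.041: "some evidence" against H0 on the scale above
```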
2.8

2.8.1

2.8.2 Bayesian approach

Let π0 and π1 denote the prior probabilities of H0 and H1, and p0 and p1 the corresponding posterior probabilities after the data are observed.
Bayes factor
The Bayes factor, B, is the odds in favour of H0 against H1,
B = (p0/p1) / (π0/π1) = p0π1 / (p1π0).   (2.13)
The posterior probability p0 of H0 can be calculated from its prior probability and the Bayes factor,
p0 = 1 / [1 + (π1/π0)B⁻¹] = 1 / [1 + {(1 − π0)/π0}B⁻¹].
Simple Hypotheses
Suppose Θ0 = {θ0} and Θ1 = {θ1}. Then
p0 ∝ π0 p(x|θ0),   p1 ∝ π1 p(x|θ1),
so that
p0/p1 = π0 p(x|θ0) / (π1 p(x|θ1))  and  B = p(x|θ0)/p(x|θ1).
Example 2.6
Consider the following prior distribution, density and null hypothesis,
θ ~ N(82.4, 1.1²)
x|θ ~ N(θ, 1.7²), with observed value x = 82.1
H0: x < 83.0
From the results in section 1.9, working with precisions τ = 1/σ²,
τ0 = 1/1.1² = 0.83
τ1 = 1/1.7² = 0.35
τ0 + τ1 = 1.18
σ1² = (1.18)⁻¹ = 0.85
μ1 = 82.4 × (0.83/1.18) + 82.1 × (0.35/1.18) = 82.3
For H0: x < 83, and with π0, p0 being the prior and posterior probabilities under H0,
π0 = P(x < 83 | μ0 = 82.4, σ0 = 1.1) = 0.71     [use pnorm(mean=82.4,sd=1.1,q=83)]
so the prior odds are
π0/(1 − π0) = 0.71/0.29 = 2.45.
The corresponding posterior odds, from the posterior N(82.3, 0.85), are p0/p1 = 3.35, giving
B = (p0/p1) / (π0/π1) = 3.35/2.45 = 1.4.
Since B ≈ 1, the data have not much altered the prior beliefs about the mean.
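The whole of Example 2.6 can be reproduced with a few lines of R; this is a sketch using the numbers quoted above:

```r
# Precisions, posterior mean and sd as in section 1.9
tau0 <- 1/1.1^2                                  # prior precision (0.83)
tau1 <- 1/1.7^2                                  # data precision  (0.35)
mu1  <- (tau0*82.4 + tau1*82.1)/(tau0 + tau1)    # posterior mean (82.3)
s1   <- sqrt(1/(tau0 + tau1))                    # posterior sd
pi0  <- pnorm(q = 83, mean = 82.4, sd = 1.1)     # prior P(H0), about 0.71
p0   <- pnorm(q = 83, mean = mu1,  sd = s1)      # posterior P(H0)
B    <- (p0/(1 - p0)) / (pi0/(1 - pi0))          # Bayes factor, about 1.4
```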
2.9
Figure 2.10 shows 2 ways in which distributions differ. The difference depicted in Figure 2.10 (a) is a shift in location (mean) and in Figure 2.10 (b) there is a shift in the scale
(variance).
Figure 2.10: Distributions that differ due to shifts in (a) location and (b) scale. [Each panel plots two cdfs F(x).]

2.9.1 Kolmogorov-Smirnov (KS)
The KS test is a test of whether 2 independent samples have been drawn from the same
population or from populations with the same distribution. It is concerned with the agreement between 2 cumulative distribution functions. If the 2 samples have been drawn from
the same population, then the cdfs can be expected to be close to each other and only
differ by random deviations. If they are too far apart at any point, this suggests that the
samples come from different populations.
The KS test statistic is
D = max_x |F1(x) − F2(x)|.   (2.14)
Exact sampling distribution
The exact sampling distribution of D under H0 : F1 = F2 can be enumerated.
If H0 is true, then [(X1 , X2 , . . . , Xm ), (Y1 , Y2 , . . . , Yn )] can be regarded as a random
sample from the same population with actual realised samples
[(x1 , x2 , . . . , xm ), (y1 , y2 , . . . , yn )]
Thus (under H0 ) an equally likely sample would be
[(y1 , x2 , . . . , xm ), (x1 , y2 , . . . , yn )]
where x1 and y1 were swapped.
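For very small samples this enumeration can be done by brute force; the helper below is an illustrative sketch assumed for these notes, not code from them:

```r
# Enumerate the null distribution of D over all re-labellings of the pooled data
perm.D <- function(x, y){
  z <- c(x, y); m <- length(x)
  picks <- combn(length(z), m)          # every way of choosing the "x" labels
  apply(picks, 2, function(ix){
    F1 <- ecdf(z[ix]); F2 <- ecdf(z[-ix])
    max(abs(F1(z) - F2(z)))             # KS distance evaluated at the pooled points
  })
}
D.null <- perm.D(c(1.2, 3.4, 2.2), c(4.1, 5.0, 2.8))
length(D.null)                          # choose(6, 3) = 20 equally likely values
```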
2.9.2 Asymptotic distribution
If m and n become even moderately large, the enumeration is huge. In that case we can utilize the large-sample approximation that
χ² = 4D²mn/(m + n)
is distributed approximately as χ² on 2 degrees of freedom when H0 is true.
Table 2.1: Wavelet energies of the sway signals from normal subjects and subjects with whiplash injury.

Normal:     33   211   284   545   570   591   602   786   945   951
          1161  1420  1529  1642  1994  2329  2682  2766  3025 13537
Whiplash:  269   352   386  1048  1247  1276  1305  1538  2037  2241
          2462  2780  2890  4081  5358  6498  7542 13791 23862 34734
Figure 2.11: The ecdfs of sway signal energies for N & W groups. [Step plots of Fn(x) against energy, with points labelled n and w.]
#______ The asymptotic distribution __________
D   <- KS$statistic
Chi <- 4*(KS$statistic^2)*m*n/(m+n)
P   <- pchisq(q=Chi,df=2,lower.tail=F)
> cat("X2 = ",round(Chi,2),"P( > X2) = ",P,"\n")
X2 = 4.9 P( > X2) = 0.08629
2.9.3
The link between confidence intervals and hypothesis tests also holds in a bootstrap setting.
The bootstrap is an approximation to a permutation test and a strategic difference is that
bootstrap uses sampling with replacement.
A permutation test of whether H0 : F1 (x) = F2 (y) is true relies upon the ranking of the
combined data set (x, y). The data were ordered smallest to largest and each permutation
was an allocation of the group labels to each ordered datum. In 1 permutation, the label
x was ascribed to the first number and in another, the label y is given to that number and
so on.
The test statistic can be a function of the data (it need not be an estimate of a parameter), so denote it t(z).
The principle of bootstrap hypothesis testing is that if H0 is true, a probability atom of 1/(m + n) can be attributed to each member of the combined data z = (x, y).
The empirical distribution function of z = (x, y), call it F0 (z), is a non-parametric
estimate of the common population that gave rise to x and y, assuming that H0 is true.
Bootstrap hypothesis testing of H0 takes these steps:
1. Get the observed value of t, e.g. tobs = x̄ − ȳ.
2. Nominate how many bootstrap samples (replications) will be done, e.g. B = 499.
3. For b in 1:B, draw samples of size m + n with replacement from z. Label the first m of these x*_b and the remaining n y*_b.
4. Calculate t(z*_b) for each sample; for example, t(z*_b) = x̄*_b − ȳ*_b.
5. Approximate the probability of tobs or greater by
   (number of t(z*_b) ≥ tobs) / B.
Example
The data in Table 2.1 are used to demonstrate bootstrap hypothesis testing with the test statistic
t(z) = (ȳ − x̄) / ( s_z √(1/m + 1/n) ),
where s_z is the standard deviation of the combined data. The R code is written to show the required calculations explicitly, but a good program would minimise the variables saved in the iteration loop.
#_____________ Bootstrap Hypothesis Test ____________________
N.energy <- c(33,211,284,545,570,591,602,786,945,951,1161,1420,
              1529,1642,1994,2329,2682,2766,3025,13537)
W.energy <- c(269,352,386,1048,1247,1276,1305,1538,2037,2241,2462,2780,
              2890,4081,5358,6498,7542,13791,23862,34734)
Z <- c(N.energy,W.energy)
m <- length(N.energy)
n <- length(W.energy)
T.obs <- (mean(W.energy) - mean(N.energy))/(sd(Z)*sqrt(1/m + 1/n))
nBS <- 999
T.star <- numeric(nBS)
for (j in 1:nBS){
  z.star <- sample(Z,size=(m+n),replace=TRUE)  # with replacement: bootstrap, not permutation
  w.star <- z.star[(m+1):(m+n)]
  n.star <- z.star[1:m]
  T.star[j] <- ( mean(w.star) - mean(n.star) )/( sd(z.star) * sqrt(1/m + 1/n) )
}
p1 <- sum(T.star >= T.obs)/nBS
cat( "P(T > ",round(T.obs,1),"|H0) = ",round(p1,2),"\n")
Chapter 3
Chi-square Distribution

3.1 Distribution of S²

Consider the sum of squares Σ_{i=1}^{n} (Xi − X̄)². For n = 3,
Σ_{i=1}^{3} (Xi − X̄)² = Σ_{j=1}^{2} Yj²,
and for n = 4,
Σ_{i=1}^{4} (Xi − X̄)² = Σ_{j=1}^{3} Yj².
since they are normally distributed (being sums of normal random variables), they are
independent. Also,
E(Y1) = 0 = E(Y2) = E(Y3)
and
Var(Y1) = (1/2)(Var(X1) + Var(X2)) = σ²,
Var(Y2) = (1/6)Var(X1) + (1/6)Var(X2) + (4/6)Var(X3) = σ².
Similarly, Var(Y3) = σ².
In general the sum of n squares involving the X's can be expressed as the sum of n − 1 squares involving the Y's. Thus Σ_{i=1}^{n} (Xi − X̄)² can be expressed as
Σ_{i=1}^{n} (Xi − X̄)² = Σ_{j=1}^{n−1} Yj²
where
Yj = (X1 + X2 + ··· + Xj − jX_{j+1}) / √(j(j+1)),   j = 1, 2, ..., n − 1.
The random variables Y1, Y2, ..., Y_{n−1} each have mean zero and variance σ². So each Yj ~ N(0, σ²) and the Yj's are independent.
Now write ν = n − 1, so that S² = Σ_{j=1}^{ν} Yj² / ν, and recall that
(i) if X ~ N(μ, σ²) then (X − μ)²/(2σ²) ~ Gamma(1/2) [Statistics 260, (8.16)];
(ii) if X1, X2, ..., Xν are independent N(μ, σ²) variates, then Σ_{j=1}^{ν} (Xj − μ)²/(2σ²) is distributed as Gamma(ν/2) [Statistics 260, section 7.4].
2
Yj2
1
Applying this to the Yj where = 0,
Gamma
and
2
2
2
1 X Yj2
V =
is distributed as Gamma
.
2 j=1 2
2
1 ( 1) v
v 2 e ,
( 2 )
v (0, )
(3.1)
Since νS² = Σ_{j=1}^{ν} Yj² = 2σ²V, we have
V = νS²/(2σ²).   (3.2)
Now V is a strictly monotone function of S² so, by the change-of-variable technique, the pdf of S² is
g(s²) = f(v)|dv/ds²|
      = (1/Γ(ν/2)) (νs²/(2σ²))^{(ν/2)−1} e^{−νs²/(2σ²)} · ν/(2σ²)
      = (ν/(2σ²))^{ν/2} (1/Γ(ν/2)) (s²)^{(ν/2)−1} exp{−νs²/(2σ²)},   s² ∈ (0, ∞).   (3.3)
3.2 Chi-Square Distribution
The random variable W = νS²/σ² has pdf
h(w) = e^{−w/2} w^{(ν/2)−1} / (2^{ν/2} Γ(ν/2)),   w ∈ [0, ∞).   (3.4)
A random variable W with this pdf is said to have a chi-square distribution on ν degrees of freedom (or with parameter ν) and we write W ~ χ²_ν.
Notes: (a) W/2 ~ Gamma(ν/2).
(b) This distribution can be thought of as a special case of the generalized gamma distribution.
(c) When ν = 2, (3.4) becomes h(w) = (1/2)e^{−w/2}, w ∈ [0, ∞), which is the exponential distribution.
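Note (c) is easy to check numerically in R:

```r
# For nu = 2 the chi-square density equals the exponential density (1/2) e^{-w/2}
w <- c(0.5, 1, 3)
all.equal(dchisq(w, df = 2), 0.5*exp(-w/2))   # TRUE
```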
The χ² densities for ν = 2, 3, 4 can be plotted in R:
x <- seq(0,10,length=200)   # grid of w values (assumed; not shown in the original)
for (d in 2:4){
  fx <- dchisq(x,df=d)
  if(d==2) plot(fx ~ x,type="l",
                ylim=c(0,0.5),las=1)
  else lines(x,fx,lty=(d-1))
}   # end of d loop
legend(6,0.4,
       expression(chi[2]^2,chi[3]^2,chi[4]^2),
       lty=1:3)
Figure 3.1: Area corresponding to the 100P percentile of the χ²_ν random variable W. [Density f(w) with area P below the quantile and area 1 − P above it.]
The R function for calculating tail-area probabilities for given quantiles is
pchisq(q= , df= , lower.tail= T (or F))
and for calculating quantiles corresponding to a probability, qchisq(p= , df= ).
These functions are included in the Rcmdr menus.
The following example requires us to find a probability.
Example 3.1
A random sample of size 6 is drawn from a N (, 12) distribution.
Find P(2.76 < S 2 < 22.2).
Solution:
We wish to express this as a probability statement about the random variable W = νS²/σ². That is,
P(2.76 < S² < 22.2) = P( (5 × 2.76)/12 < 5S²/12 < (5 × 22.2)/12 )
                    = P(1.15 < W < 9.25)  where W ~ χ²_5
                    = P(W < 9.25) − P(W < 1.15).
Solution:
#___ Pint.R _______
Q <- c(2.76,22.2)*5/12
Pint <- diff( pchisq(q=Q,df=5))
cat("P(2.76 < S2 < 22.2) = ",Pint,"\n")
> source("Pint.R")
P(2.76 < S2 < 22.2) =
0.85
[Figure: cdf F(w) of χ²_5, indicating P(1.15 < W < 9.25).]
Moments
As V (defined in (3.2)) has a gamma distribution its mean and variance can be written down. That is, V ~ Gamma(ν/2), so that
E(V) = ν/2  and  Var(V) = ν/2.
Then, since W is related to V by W = 2V,
E(W) = 2(ν/2) = ν,   Var(W) = 4(ν/2) = 2ν.   (3.5)
The moment generating function of W is
M_W(t) = (1 − 2t)^{−ν/2}.   (3.6)
Exercise: Find the MGF of W directly from the pdf of W . (Hint: Use the substitution
u = w(1 2t)/2 when integrating.)
73
MW (t) = 1 + .2t +
+1
+
+1
+2
+
2
2 2
2!
2 2
2
3!
t2
t3
= 1 + t + ( + 2)
+ ( + 2)( + 4) +
2!
3!
Moments can be read off as appropriate coefficents here. Note that 01 = and 02 =
( + 2). The cumulant generating function is
22 t2 23 t3 24 t4
= 2t
2
2
3
4
2t2
8t3
48t4
= t +
+
+
+
2!
3!
4!
so the cumulants are
1 = , 2 = 2, 3 = 8, 4 = 48.
We will now use these cumulants to find measures of skewness and kurtosis for the chi-square distribution.
(i) Coefficient of skewness,
γ1 = κ3/κ2^{3/2} = 8ν/(2ν)^{3/2} = √(8/ν) → 0 as ν → ∞.
That is, the χ² distribution becomes symmetric as ν → ∞.
(ii) Coefficient of kurtosis,
γ2 = κ4/κ2²  for any distribution
   = 48ν/(2ν)² = 12/ν  for the χ² distribution
   → 0 as ν → ∞.
This is the value γ2 has for the normal distribution.
Additive Property
Let W1 ~ χ²_{ν1} and W2 (independent of W1) ~ χ²_{ν2}. Then from (3.6), W1 + W2 has moment generating function
M_{W1+W2}(t) = M_{W1}(t) M_{W2}(t) = (1 − 2t)^{−ν1/2} (1 − 2t)^{−ν2/2} = (1 − 2t)^{−(ν1+ν2)/2}.
This is also of the form (3.6); that is, we recognize it as the MGF of a χ² random variable on (ν1 + ν2) degrees of freedom.
Thus if W1 ~ χ²_{ν1} and W2 ~ χ²_{ν2} and W1 and W2 are independent, then
W1 + W2 ~ χ²_{ν1+ν2}.
The result can be extended to the sum of k independent 2 random variables.
If W1, ..., Wk are independent χ²_{ν1}, ..., χ²_{νk}, then
Σ_{i=1}^{k} Wi ~ χ²_ν   (3.7)
where ν = Σ νi. Note also that a χ²_ν variate can be decomposed into a sum of ν independent chi-squares each on 1 d.f.
In particular, if Y ~ N(0, σ²) then Y²/(2σ²) ~ Gamma(1/2), so that Z = Y/σ gives
Z² ~ χ²_1.   (3.8)
Summary
You may find the following summary of relationships between the χ², gamma, S² and normal distributions useful.
Define S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1), the Xi being independent N(μ, σ²) variates. Then
(i) W = νS²/σ² ~ χ²_ν where ν = n − 1;
(ii) (1/2)W = νS²/(2σ²) ~ Gamma(ν/2);
(iii) if Zi = (Xi − μ)/σ (that is, Zi ~ N(0, 1)), then Σ_{i=1}^{n} Zi² ~ χ²_n.
3.3 Independence of X̄ and S²
When X̄ and S² are defined for a sample from a normal distribution, X̄ and S² are statistically independent. This may seem surprising, as the expression for S² involves X̄.
Consider again the transformation from X's to Y's given in 3.1. We've seen that (n − 1)S² = Σ_{i=1}^{n} (Xi − X̄)² can be expressed as Σ_{j=1}^{n−1} Yj², where the Yj, defined by
Yj = (X1 + X2 + ··· + Xj − jX_{j+1}) / √(j(j+1)),   j = 1, 2, ..., n − 1,
have zero means and variances σ². Note also that the sample mean,
X̄ = (1/n)X1 + (1/n)X2 + ··· + (1/n)Xn,
is uncorrelated with each Yj; under normality this implies X̄ is independent of each Yj, and hence of S².
3.4
We will use the method indicated in 1.8 to find a confidence interval for σ² in a normal distribution, based on a sample of size n. The two cases (i) μ unknown and (ii) μ known must be considered separately.
Case (i)
Let X1, X2, ..., Xn be a random sample from N(μ, σ²) where both μ and σ² are unknown. It has been shown that S² is an unbiased estimate of σ² (Theorem 1.4) and we can find a confidence interval for σ² using the χ² distribution. Recall that W = νS²/σ² ~ χ²_ν. By way of notation, let w_{ν,α} be defined by P(W > w_{ν,α}) = α, where W ~ χ²_ν.
We find two values of W, w_{ν,α/2} and w_{ν,1−(α/2)}, such that
P( w_{ν,1−(α/2)} < W < w_{ν,α/2} ) = 1 − α.
Figure 3.4: Upper and lower values for w. [Density f(w) with tail area α/2 beyond each of w_{ν,1−(α/2)} and w_{ν,α/2}.]
Now w_{ν,1−(α/2)} < W < w_{ν,α/2} occurs if and only if
νs²/w_{ν,α/2} < σ² < νs²/w_{ν,1−(α/2)},
so a 100(1 − α)% CI for σ² is ( νs²/w_{ν,α/2} , νs²/w_{ν,1−(α/2)} ).   (3.9)
Example 3.2
For a sample of size n = 10 from a normal distribution s2 was calculated and found to be
6.4. Find a 95% CI for 2 .
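A sketch of the computation in R (the notes do not show the working):

```r
# 95% CI for sigma^2: ( nu s^2 / w_{nu,.025} , nu s^2 / w_{nu,.975} )
nu <- 10 - 1
s2 <- 6.4
CI <- nu*s2 / qchisq(p = c(0.975, 0.025), df = nu)  # upper quantile gives the lower limit
round(CI, 2)
```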
Case (ii)
Suppose now that X1, X2, ..., Xn is a random sample from N(μ, σ²) where μ is known and we wish to find a CI for the unknown σ². Recall (Assignment 1, Question 4) that the maximum likelihood estimator of σ² (which we'll denote by Ŝ²) is
Ŝ² = Σ_{i=1}^{n} (Xi − μ)²/n,
with
E(Ŝ²) = (1/n) Σ_{i=1}^{n} E(Xi − μ)² = (1/n)(nσ²) = σ².
The distribution of Ŝ² is found by noting that nŜ²/σ² = Σ_{i=1}^{n} (Xi − μ)²/σ² is the sum of squares of n independent N(0,1) variates and is therefore distributed as χ²_n (using (3.8) and (3.7)). Proceeding in the same way as in Case (i), we find that a 100(1 − α)% CI for σ² when μ is known is
( nŝ²/w_{n,α/2} , nŝ²/w_{n,1−(α/2)} ).   (3.10)
3.5
Again the cases (i) μ unknown and (ii) μ known are considered separately.
Case (i)
Let X1 , X2 , . . . , Xn be a random sample from a N(, 2 ) distribution where is unknown,
and suppose we wish to test the hypothesis
H: σ² = σ0²  against  A: σ² ≠ σ0².
Under H, νS²/σ0² ~ χ²_ν, and values of νs²/σ0² too large or too small would support A. For α = .05, say, and equal-tail probabilities we have as critical region
R = { s² : νs²/σ0² > w_{ν,.025}  or  νs²/σ0² < w_{ν,.975} }.
[Figure: χ²_ν density with tail area .025 beyond each of w_{ν,.975} and w_{ν,.025}.]
When testing at the 5% level, there is evidence that the standard deviation is greater
than 7.5.
Case (ii)
Let X1, X2, ..., Xn be a random sample from N(μ, σ²) where μ is known, and suppose we wish to test H: σ² = σ0². Again we use the fact that if H is true, nŜ²/σ0² ~ χ²_n where Ŝ² = Σ_{i=1}^{n} (Xi − μ)²/n, and the rejection region for a size-α 2-tailed test, for example, would be
{ s² : nŝ²/σ0² > w_{n,α/2}  or  nŝ²/σ0² < w_{n,1−(α/2)} }.
3.6

3.6.1 Non-informative priors
A prior which does not change very much over the region in which the likelihood is appreciable, and does not take very large values outside that region, is said to be locally uniform. For such a prior,
p(θ|y) ∝ p(y|θ) = ℓ(θ|y).
The term pivotal quantity was introduced in section 1.7.1 and is now defined for (i) a location parameter and (ii) a scale parameter.
(i) If the density of y, p(y|θ), is such that p(y − θ|θ) is a function free of y and θ, say f(u) where u = y − θ, then y − θ is a pivotal quantity and θ is a location parameter.
Example. If (y|θ, σ²) ~ N(θ, σ²), then (y − θ|θ, σ²) ~ N(0, σ²) and y − θ is a pivotal quantity.
(ii) If p((y/θ)|θ) is a function free of θ and y, say g(u) where u = y/θ, then u is a pivotal quantity and θ is a scale parameter.
Example. If (y|θ, σ²) ~ N(θ, σ²), then (y − θ)/σ ~ N(0, 1).
For a scale parameter, with u = y/θ,
p(u|y) ∝ p(θ) p(u|θ).   (3.11), (3.12)
(The LHS of (3.11) is the posterior of a parameter and the RHS is the density of the scaled variable u = y/θ. Both sides are free of y and θ.) Changing variables,
p(y|θ) = p(u|θ) |du/dy| = (1/θ) p(u|θ)   (3.13)
p(θ|y) = p(u|y) |du/dθ| = (y/θ²) p(u|y).   (3.14)

3.7
To get the marginal posterior distribution of the variance, integrate with respect to μ,
p(σ²|y) = ∫ p(μ, σ²|y) dμ   (3.15)
        = ∫ p(σ²|μ, y) p(μ|y) dμ.   (3.16)
Choose the prior
p(μ, σ²) ∝ p(μ) p(σ²) ∝ (σ²)^{−1}   (p(μ) ∝ const.)
Then the joint posterior is
p(μ, σ²|y) ∝ σ^{−n−2} exp{ −(1/(2σ²)) Σ_{i=1}^{n} (yi − μ)² }
           = σ^{−n−2} exp{ −(1/(2σ²)) [ Σ_{i=1}^{n} (yi − ȳ)² + n(ȳ − μ)² ] }
           = σ^{−n−2} exp{ −(1/(2σ²)) [ (n − 1)S² + n(ȳ − μ)² ] }   (3.17)
where S² = Σ (yi − ȳ)² / (n − 1).
Integrating out μ,
p(σ²|y) ∝ ∫ σ^{−n−2} exp{ −(1/(2σ²)) [ (n − 1)S² + n(ȳ − μ)² ] } dμ
        = σ^{−n−2} exp{ −(1/(2σ²)) (n − 1)S² } ∫ exp{ −(n/(2σ²)) (ȳ − μ)² } dμ
        = σ^{−n−2} exp{ −(1/(2σ²)) (n − 1)S² } √(2πσ²/n)
        ∝ (σ²)^{−(n+1)/2} exp{ −(n − 1)S²/(2σ²) }.   (3.18)
(3.18)
Compare this with the sampling density of S² at (3.3), which is proportional to
(s²)^{(ν/2)−1} exp{ −νs²/(2σ²) }
with ν = (n − 1); that is, a Gamma( (n−1)/2 , (n−1)/(2σ²) ) distribution.
2
3.7.1
Its Bayesian counterpart at (3.18) is a Scaled Inverse Chi-squared distribution. Since the
prior was uninformative, similar outcomes are expected.
The inverse χ² distribution has density function
p(σ²|ν) = (1/Γ(ν/2)) (1/2)^{ν/2} (σ²)^{−(ν/2 + 1)} exp{ −1/(2σ²) } I_{(0,∞)}(σ²).
The prior p(σ²) ∝ 1/σ² can be said to be an inverse chi-squared distribution on ν = 0 degrees of freedom, or sample size n = 1. Is there any value in it? Although uninformative, it ensures a mathematical smoothness, and numerical problems are reduced.
The posterior density is Scaled Inverse Chi-squared with degrees of freedom ν = (n − 1) and scale parameter s.
3.8
Recall that νS²/(2σ²) is Gamma(ν/2). The Inverse-Gamma distribution is also prominent in Bayesian statistics, so we examine it first.

3.8.1
If X has the Gamma density
p(x|α, β) = (β^α/Γ(α)) x^{α−1} exp(−βx),
then Y = 1/X has the Inverse-Gamma density
p(y|α, β) = (β^α/Γ(α)) y^{−(α+1)} exp(−β/y).

3.8.2

3.8.3
Example
Give a 90% HDR for the variance of the population from which the following sample
is drawn.
4.17  5.58  5.18  6.11  4.50
4.61  5.17  4.53  5.33  5.14
Here n = 10, ν = 9 and S² = 0.34, so the posterior of σ² is Scaled Inverse Chi-squared with ν = 9 and scale s² = 0.34.
The 90% CI for σ² is (0.18, 0.92). The mode of the posterior density of σ² is 0.28 and the 90% HDR for σ² is (0.13, 0.75).
The HDR was calculated numerically in this fashion:
1. Calculate the posterior density, (3.20).
2. Set an initial value for the horizon; estimate the abscissas (left and right of the mode) whose density is at the horizon. Call these xl and xr.
3. Integrate the density function over (xl, xr).
4. Adjust the horizon until this integral is 0.9. The HDR is then (xl, xr) at the current values.
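The four steps can be sketched in R as follows; the function names and search brackets are assumptions, not code from the notes:

```r
nu <- 9; s2 <- 0.34
# Scaled inverse chi-squared posterior density of sigma^2, as at (3.18)
post <- function(x) (nu*s2/2)^(nu/2)/gamma(nu/2) * x^(-(nu/2 + 1)) * exp(-nu*s2/(2*x))
pmode <- nu*s2/(nu + 2)                 # posterior mode, about 0.28
coverage <- function(h){                # steps 2 and 3: abscissas at the horizon, then integrate
  xl <- uniroot(function(x) post(x) - h, c(1e-4, pmode))$root
  xr <- uniroot(function(x) post(x) - h, c(pmode, 50))$root
  c(xl, xr, integrate(post, xl, xr)$value)
}
# step 4: adjust the horizon until the enclosed area is 0.90
h <- uniroot(function(h) coverage(h)[3] - 0.90, c(0.01, post(pmode) - 1e-6))$root
coverage(h)[1:2]                        # close to the (0.13, 0.75) quoted above
```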
[Figure: posterior density p(σ²|ν, s²) with a horizontal "horizon" line cutting the density at xl and xr so that the area between them is 0.9.]
Chapter 4
4.1
F Distribution
Derivation
Definition 4.1
Suppose S1² and S2² are the sample variances for two samples of sizes n1, n2 drawn from normal populations with variances σ1² and σ2², respectively. The random variable F is then defined as
F = (S1²/σ1²) / (S2²/σ2²)   (4.1)
  = [(ν1S1²/σ1²)/ν1] / [(ν2S2²/σ2²)/ν2],   (4.2)
where (4.2) is the ratio of 2 independent χ² variates on ν1, ν2 degrees of freedom, each divided by its degrees of freedom, or equivalently the ratio of 2 independent gamma variates with parameters ½ν1, ½ν2.
Thus Y = (ν1/ν2)F has a derived beta distribution with parameters ½ν1, ½ν2 (Statistics 260 study guide, section 7.3.1). Then (Example 7.5, from Statistics 260 study guide), Y has p.d.f.
f(y) = y^{(ν1/2)−1} / [ B(½ν1, ½ν2) (1 + y)^{(ν1+ν2)/2} ],   y ∈ [0, ∞),
and g(F) = f(y)|dy/dF|. So
g(F) = ν1^{ν1/2} ν2^{ν2/2} F^{(ν1/2)−1} / [ B(½ν1, ½ν2) (ν2 + ν1F)^{(ν1+ν2)/2} ],   F ∈ [0, ∞).   (4.3)
CHAPTER 4. F DISTRIBUTION
This is the p.d.f. of a random variable with an F-distribution. A random variable F which can be expressed as
F = (W1/ν1)/(W2/ν2)   (4.4)
where W1 ~ χ²_{ν1}, W2 ~ χ²_{ν2} and W1, W2 are independent random variables, is said to be distributed as F(ν1, ν2), or sometimes as F_{ν1,ν2}. [Note that we have departed from the procedure of using a capital letter for the random variable and the corresponding small letter for its observed value, and will use F in both cases here.]
4.2 Mean
The mean could be found in the usual way, E(F) = ∫₀^∞ F g(F) dF, but the rearrangement of the integrand to get an integral that can be recognized as unity is somewhat messy, so we will use another approach.
For W ~ χ²_ν, E(W) = ν, and we will show that E(W^{−1}) = 1/(ν − 2).
E(W^{−1}) = ∫₀^∞ w^{−1} e^{−w/2} w^{(ν/2)−1} / (2^{ν/2} Γ(ν/2)) dw
          = [ 2^{(ν/2)−1} Γ(½ν − 1) / (2^{ν/2} Γ(½ν)) ] ∫₀^∞ e^{−w/2} w^{((ν/2)−1)−1} / (2^{(ν/2)−1} Γ(½ν − 1)) dw
          = Γ(½ν − 1) / ( 2 Γ(½ν) )
          = Γ(½ν − 1) / ( 2(½ν − 1) Γ(½ν − 1) )
          = 1/(ν − 2).
Now write
F = (W1/ν1)/(W2/ν2) = (ν2/ν1)(W1/W2).
Then, using the independence of W1 and W2,
E(F) = (ν2/ν1) E(W1) E(W2^{−1}) = (ν2/ν1) ν1 · 1/(ν2 − 2) = ν2/(ν2 − 2),  for ν2 > 2.   (4.5)
Thus
E(F) = ν2/(ν2 − 2).   (4.6)
Notes:
1. The mean is independent of the value of ν1 and is always greater than 1.
2. As ν2 → ∞, E(F) → 1.

Mode
By differentiating g(F) with respect to F it can be verified that the mode of the F distribution is at
F = ν2(ν1 − 2) / (ν1(ν2 + 2))   (4.7)
which is always less than 1.
Solution:
[Figure: densities of F(5,3), F(5,5) and F(5,10) overlaid, f(x) against x.]
Now plot the density function for ν2 = 10 and ν1 = 3, 5, 10, again overlaying the plots on the same axes.
[Figure: densities of F(3,10), F(5,10) and F(10,10) overlaid, f(x) against x.]
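One way to produce such overlays (a sketch; the original plotting code is not shown):

```r
# Overlaid F densities for df2 = 10 and df1 = 3, 5, 10
x <- seq(0.01, 4, length = 200)
df1 <- c(3, 5, 10)
for (i in seq_along(df1)){
  fx <- df(x, df1 = df1[i], df2 = 10)
  if (i == 1) plot(x, fx, type = "l", ylim = c(0, 0.8), las = 1, ylab = "f(x)")
  else lines(x, fx, lty = i)
}
legend("topright", legend = paste0("F(", df1, ",10)"), lty = 1:3)
```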
Reciprocal of an F-variate
Let the random variable F ~ F(ν1, ν2) and let Y = 1/F. Then Y has p.d.f.
f(y) = g(F) |dF/dy|
     = ν2^{ν2/2} ν1^{ν1/2} y^{(ν2/2)−1} / [ B(½ν2, ½ν1) (ν1 + ν2y)^{(ν1+ν2)/2} ],   y ∈ [0, ∞).   (4.8)
Thus if F ~ F(ν1, ν2) and Y = 1/F, then Y ~ F(ν2, ν1).

4.3
Let S1² and S2² be the sample variances of 2 samples of sizes n1 and n2 drawn from normal populations with variances σ1² and σ2². Recall from (4.1), (4.2) that it is only if σ1² = σ2² (= σ², say) that S1²/S2² has an F distribution. This fact can be used to test the hypothesis
H: σ1² = σ2².
If the hypothesis H is true then
S1²/S2² ~ F(ν1, ν2) where ν1 = n1 − 1, ν2 = n2 − 1.
For the alternative
A: σ1² > σ2²
only large values of the ratio s1²/s2² would tend to support it, so a rejection region {F : F > F.01P} is used (Fig 4.2).
Since only the right-hand tail areas of the distribution are tabulated, it is convenient to always use s_i²/s_j² > 1; that is, always put the larger sample variance in the numerator.
[Figure 4.2: rejection region in the upper tail of g(F); area P/100 above the critical value F.01P.]
Example 4.1
For two samples of sizes 8 and 12, the observed variances are .064 and .024 respectively.
Let s21 = .064 and s22 = .024.
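The worked solution is not reproduced above; in R the test statistic and a one-sided P-value would be computed along these lines:

```r
s1sq <- 0.064; s2sq <- 0.024       # larger variance in the numerator
F.obs <- s1sq/s2sq                 # about 2.67
nu1 <- 8 - 1; nu2 <- 12 - 1
pf(q = F.obs, df1 = nu1, df2 = nu2, lower.tail = FALSE)
```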
[Figure: upper 100(α/2)% point of the F(ν2, ν1) distribution, and upper 100(α/2)% point of the F(ν1, ν2) distribution, each with tail area α/2.]
The lower percentage point follows from the reciprocal relationship:
F_{1−α/2}(ν1, ν2) = 1 / F_{α/2}(ν2, ν1).   (4.9)
Example 4.2
Given s1² = 3.2², n1 = 11, s2² = 3.0², n2 = 17, test the hypothesis H: σ1² = σ2² against A: σ1² ≠ σ2², at the 5% significance level.
Solution: Under H, S1²/S2² ~ F(10, 16).
From tables, F2.5%(10, 16) = 2.99 = F2.
The lower 2.5% critical point is then found by F1 = 1/F2.5%(16, 10) = 1/3.5 = .29.
The calculated value of the statistic is 3.2²/3.0² = 1.138, which does not lie in the rejection region, and so is not significant at the 5% level. Thus the evidence supports the hypothesis that σ1² = σ2².
Of course, so long as we take s1²/s2² to be greater than 1, we don't need to worry about the lower critical value. It will certainly be less than 1.
Computer Exercise 4.2
Use R to find the critical points in Example 4.2.
Solution: We use the qf command.
> qf(p=c(0.975,0.025),df1=10,df2=16)
[1] 3.0 0.29
> pf(q=1.138,df1=10,df2=16,lower.tail=F)
[1] 0.27
4.4
Given 2 unbiased estimates of σ², s1² and s2², it is often useful to be able to combine them to obtain a single unbiased estimate. Assume the new estimator, S², is a linear combination of S1² and S2², chosen so that S² has the smallest variance of all such linear, unbiased estimates (that is, it has minimum variance). Let
S 2 = a1 S12 + a2 S22 , where a1 , a2 are positive constants.
Firstly, to be unbiased,
E(S 2 ) = a1 E(S12 ) + a2 E(S22 ) = 2 (a1 + a2 ) = 2
which implies that
a1 + a2 = 1.   (4.10)
Secondly, if it is assumed that S1² and S2² are independent, then
Var(S²) = a1² Var(S1²) + a2² Var(S2²)
        = a1² Var(S1²) + (1 − a1)² Var(S2²)   using (4.10).
Minimising this with respect to a1 gives
a1 = V(S2²) / ( V(S1²) + V(S2²) ),   a2 = V(S1²) / ( V(S1²) + V(S2²) ).   (4.11)
In the case where the Xi are normally distributed, V(Sj²) = 2σ⁴/(nj − 1) (see Assignment 3, Question 1). Then the pooled sample variance is
s² = (ν1s1² + ν2s2²)/(ν1 + ν2),   (4.12)
where ν1 = n1 − 1, ν2 = n2 − 1.
The above method can be extended to pooling k unbiased estimates s_i² of σ². That is,
s² = (ν1s1² + ν2s2² + ··· + νk sk²)/(ν1 + ν2 + ··· + νk),   (4.13)
where s² is on Σ_{i=1}^{k} νi (= ν, say) degrees of freedom, and νS²/σ² is distributed as χ²_ν.
Also the theory applies more generally to pooling unbiased estimates θ̂1, θ̂2, ..., θ̂k of a parameter θ:
θ̂ = [ θ̂1/V(θ̂1) + θ̂2/V(θ̂2) + ··· + θ̂k/V(θ̂k) ] / [ 1/V(θ̂1) + 1/V(θ̂2) + ··· + 1/V(θ̂k) ].   (4.14)
The estimator thus obtained is unbiased and has minimum variance.
Note the following
(i) s2 = 12 (s21 + s22 ) if 1 = 2 ;
(ii) E(S 2 ) = 2 ;
(iii) The s² in (4.12) is on ν1 + ν2 degrees of freedom.
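As a small numerical illustration, (4.12) applied to the two sample variances of Example 4.2 is just a weighted mean:

```r
# Pooled variance (4.12) using the numbers from Example 4.2
s1sq <- 3.2^2; n1 <- 11
s2sq <- 3.0^2; n2 <- 17
nu1 <- n1 - 1; nu2 <- n2 - 1
s2.pooled <- (nu1*s1sq + nu2*s2sq)/(nu1 + nu2)   # on nu1 + nu2 = 26 degrees of freedom
s2.pooled
```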
4.5
Given s1², s2² are unbiased estimates of σ1², σ2², derived from samples of sizes n1, n2 respectively from two normal populations, find a 100(1 − α)% confidence interval for σ1²/σ2².
Now ν1S1²/σ1² and ν2S2²/σ2² are distributed as independent χ²_{ν1}, χ²_{ν2} variates, and
(S2²/σ2²)/(S1²/σ1²) = (W2/ν2)/(W1/ν1) ~ F(ν2, ν1).
So
P( F_{1−α/2}(ν2, ν1) < (S2²σ1²)/(S1²σ2²) < F_{α/2}(ν2, ν1) ) = 1 − α.
That is,
P( (S1²/S2²) F_{1−α/2}(ν2, ν1) < σ1²/σ2² < (S1²/S2²) F_{α/2}(ν2, ν1) ) = 1 − α.   (4.15)

4.6
Example 4.3
These data are sway signal energies (1000) from subjects in 2 groups, Normal and Whiplash-injured. The data, and a program to calculate the confidence interval for σ1²/σ2² defined at equation (4.15), are listed in Table 4.1.
Although the data are presented in 2 blocks, you should imagine them in a single file called test1E.txt with the W data under the N data, in columns.
We find, using (4.15), that
P( 0.52 < σ1²/σ2² < 3.6 ) = 0.95.
The bootstrap CI is calculated in R by the script in Table 4.2. By this method,
P( 0.26 < σ1²/σ2² < 7.73 ) = 0.95.
The bootstrap CI is much wider because the method has not relied upon the assumptions behind (4.15) and uses only the information contained in the data.
Table 4.1: Confidence interval for variance ratio using the F quantiles

category N (D1): 0.028 0.036 0.041 0.098 0.111 0.150 0.209 0.249 0.360 0.669
                 0.772 0.799 0.984 1.008 1.144 1.154 2.041 3.606 4.407 5.116
category W (D1): 0.048 0.057 0.113 0.159 0.214 0.511 0.527 0.635 0.702 0.823
                 0.943 1.474 1.894 2.412 2.946 3.742 3.834

E1 <- read.table("test1E.txt",header=T)
Sigmas <- tapply(E1$D1,list(E1$category),var)
nu <- table(E1$category) - 1
VR <- Sigmas[1]/Sigmas[2]
Falpha <- qf(p=c(0.975,0.025),df1=nu[1],df2=nu[2] )
CI <- VR/Falpha
> CI
[1] 0.52 3.60

Table 4.2: Bootstrap confidence interval for σ1²/σ2²
library(boot)
var.ratio <- function(E1,id){   # a user-supplied function
  yvals <- E1[[2]][id]          # to calculate the statistic of interest
  vr <- var(yvals[E1[[1]]=="N"]) / var(yvals[E1[[1]]=="W"])
  return(vr)
}   # end of the user-supplied function
doBS <- boot(E1,var.ratio,999)
bCI <- boot.ci(doBS,conf=0.95,type=c("perc","bca"))
print(bCI)
> bCI
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 999 bootstrap replicates
CALL :
boot.ci(boot.out = boot(E1, var.ratio, 999), conf = 0.95, type = c("perc","bca") )
Intervals :
Level      Percentile          BCa
95%     ( 0.26, 7.73 )   ( 0.58, 22.24 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
Chapter 5
t-Distribution

5.1 Derivation
For a random sample X1, ..., Xn from N(μ, σ²),
Z = (X̄ − μ) / (σ/√n)
is distributed N(0,1).
In practice σ² is usually not known and is replaced by its unbiased estimate S², so that in place of Z we define
T = (X̄ − μ) / (S/√n).
We need to find the probability distribution of this random variable. T can be written as
T = [ (X̄ − μ)√n/σ ] / √(S²/σ²) = Z / √(W/ν),   (5.1)
where W = νS²/σ² ~ χ²_ν and ν = n − 1.
CHAPTER 5. T-DISTRIBUTION
Definition 5.1
A random variable has a t-distribution on ν degrees of freedom (or with parameter ν) if it can be expressed as the ratio of Z to √(W/ν), where Z ~ N(0, 1) and W (independent of Z) ~ χ²_ν.
Theorem 5.1
A random variable T which has a t-distribution on ν d.f. has pdf
f(t) = Γ(½(1 + ν)) / [ √(νπ) Γ(ν/2) (1 + t²/ν)^{(1+ν)/2} ],   t ∈ (−∞, ∞).   (5.2)
Proof. Since T² = Z²/(W/ν) is the ratio of independent χ²_1 and χ²_ν variates, each divided by its degrees of freedom, T² ~ F(1, ν). For t > 0,
f(t) = (1/2) g(F) |dF/dt| = (1/2) g(t²) · 2t
     = ν^{ν/2} t^{−1} · t / [ B(½, ½ν) (ν + t²)^{(1+ν)/2} ]
     = Γ(½(ν + 1)) / [ √(νπ) Γ(½ν) (1 + t²/ν)^{(1+ν)/2} ],
using B(½, ½ν) = Γ(½)Γ(½ν)/Γ(½(ν + 1)) and Γ(½) = √π; symmetry extends this to all t.

5.2
Graph
The graph of f(t) is symmetrical about t = 0 since f(−t) = f(t), unimodal, and f(t) → 0 as t → ±∞. It resembles the graph of the normal distribution, but the tails are lower and the central peak higher than for a normal curve of the same mean and variance. This is illustrated in Figure 5.1 for ν = 4.
Note: The density functions in Figure 5.1 were found and plotted using R.
[Figure 5.1: the t4 density and the normal density of the same mean and variance, overlaid.]
Special Cases
(i) A special case occurs when ν = 1. This is called the Cauchy distribution and it has pdf
f(t) = 1 / (π(1 + t²)),   t ∈ (−∞, ∞).
Check that the mean and variance of this distribution do not exist.
(ii) As ν → ∞, the t density tends to the standard normal density:
(1 + t²/ν)^{−(1+ν)/2} → e^{−t²/2}  and  lim_{ν→∞} Γ(½(ν + 1)) / (ν^{1/2} Γ(½ν)) = 2^{−1/2},
so f(t) → (2π)^{−1/2} e^{−t²/2}.
Example 5.3
For T t8 find tc such that P (|T | > tc ) = .05.
> qt(p=0.025,df=8,lower.tail=F)
[1] 2.3
5.3
In Chapter 2 we studied the problems of getting a confidence interval for the mean μ, and testing hypotheses about μ, when σ² was assumed known. In practice σ² is usually not known and must be estimated from the data, and it is the t-distribution that must be used to find a confidence interval for μ and to test hypotheses about μ.
In this section we will derive a 100(1 − α)% confidence interval for the unknown parameter μ.
One-sample Problem
Given X1, X2, ..., Xn is a random sample from a N(μ, σ²) distribution where σ² is unknown, then
T = (X̄ − μ) / (S/√n) ~ t_{n−1}.   (5.3)
Then, defining t_{ν,α} by
P(T > t_{ν,α}) = α where T ~ t_ν,
we have
P( t_{ν,1−α/2} < (X̄ − μ)/(S/√n) < t_{ν,α/2} ) = 1 − α.   (5.4)
That is,
P( X̄ − t_{ν,α/2} S/√n < μ < X̄ − t_{ν,1−α/2} S/√n ) = 1 − α,   (5.5)
so that a 100(1 − α)% CI for μ is
( x̄ − t_{ν,α/2} s/√n , x̄ − t_{ν,1−α/2} s/√n ).   (5.6)
Note how in (5.6) the upper-tail quantile is subtracted from the sample mean to calculate the lower limit, and the lower-tail quantile is subtracted to calculate the upper limit. This arose by reversing the inequalities when solving for μ.
By the symmetry of the t-distribution, t_{ν,1−α/2} = −t_{ν,α/2}: the lower-tail quantile is a negative number of the same magnitude as the (positive) upper-tail quantile. So you would get the same result as (5.6) if you calculated
( x̄ − t_{ν,α/2} s/√n , x̄ + t_{ν,α/2} s/√n ),
which is often how we think of it. However, it is very important that the true relationship be understood and known because it will be a critical point when we examine the
bootstrap-t where the symmetry does not hold.
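The equivalence of the two forms can be seen numerically; the summary statistics here are hypothetical, not from an example in the notes:

```r
# qt(p, df) = -qt(1 - p, df), so both forms of the interval agree
xbar <- 5.53; s <- 0.45; n <- 10; nu <- n - 1   # made-up values
lower <- xbar - qt(0.975, df = nu)*s/sqrt(n)
upper <- xbar - qt(0.025, df = nu)*s/sqrt(n)    # subtracting the negative lower-tail quantile
c(lower, upper)
```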
Example 5.4
The length (in cm) of skulls of 10 fossil skeletons of an extinct species of bird were measured
with the following results.
5.22, 5.59, 5.61, 5.17, 5.27, 6.06, 5.72, 4.77, 5.57, 6.33.
Find a 95% CI for the true mean length of skulls of this species.
Computer Solution (ignore the t-test output except for the confidence interval):
skulls <- data.frame(length=c(5.22,5.59,5.61,5.17,5.27,6.06,5.72,4.77,5.57,6.33) )
t.test(skulls$length,alternative="two.sided",mu=0,conf.level=0.95)
data: skulls$length
t = 38.69, df = 9, p-value = 2.559e-11
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
5.20 5.85
sample estimates:
mean of x
5.53
Two-sample Problem
Let us now consider the two-sample problem where X1, X2, ..., X_{n1} and Y1, Y2, ..., Y_{n2} are independent random samples from N(μ1, σ²) and N(μ2, σ²) distributions respectively. The random variable X̄ − Ȳ is distributed as N( μ1 − μ2, σ²(1/n1 + 1/n2) ). That is,
( X̄ − Ȳ − (μ1 − μ2) ) / √( σ²(1/n1 + 1/n2) ) ~ N(0, 1).
Replacing σ² by the pooled unbiased estimate S², define
T = ( X̄ − Ȳ − (μ1 − μ2) ) / √( S²(1/n1 + 1/n2) ),   (5.7)
where
S² = (ν1S1² + ν2S2²)/(ν1 + ν2)  and  (ν1 + ν2)S²/σ² ~ χ²_{ν1+ν2}.   (5.8)
Rewriting T with a numerator of (X̄ − Ȳ − (μ1 − μ2))/√(σ²(1/n1 + 1/n2)) and a denominator of √(S²/σ²), we see that T can be expressed as the ratio of a N(0,1) variate to the square root of an independent chi-square variate divided by its degrees of freedom. Hence it has a t-distribution with ν1 + ν2 = n1 + n2 − 2 degrees of freedom.
We will now use (5.5) to find a confidence interval for μ1 − μ2.
Given X1, X2, ..., X_{n1} is a random sample from N(μ1, σ²) and Y1, Y2, ..., Y_{n2} is an independent sample from N(μ2, σ²), and with t_{ν,α} defined as in (5.3), we have
P( −t_{ν,α/2} < ( X̄ − Ȳ − (μ1 − μ2) ) / √( S²(1/n1 + 1/n2) ) < t_{ν,α/2} ) = 1 − α.   (5.9)
Example 5.5
The cholesterol levels of 7 male and 6 female turtles were found to be:
Male   226 228 232 215 223 216 223
Female 231 231 218 236 223 237
Find a 99% CI for μm − μf.
Solution:
It will be assumed the variances are equal. See Chapter 4 for the method of testing this using R.
x <- c(226,228,232,215,223,216,223)
y <- c(231,231,218,236,223,237)
t.test(x,y,var.equal=T,conf.level=0.99)
Two Sample t-test
data: x and y
t = -1.6, df = 11, p-value = 0.1369
alternative hypothesis: true difference in means is not equal to 0
99 percent confidence interval:
-17.8   5.7
sample estimates:
mean of x mean of y
223
229
Alternatively, in Rcmdr, make the data frame active and then use Statistics → Means → Independent samples t-test.
5.4

One-sample Problem
Given X1, X2, ..., Xn is a random sample from N(μ, σ²) where both parameters are unknown, we wish to test the hypothesis H: μ = μ0. Using (5.3) we can see that
(a) for the alternative, H1: μ ≠ μ0, values of x̄ close to μ0 support the hypothesis being true, while if |x̄ − μ0| is too large there is evidence that the hypothesis may be incorrect. That is, reject H0 at the 100α% significance level if
|x̄ − μ0| / (s/√n) > t_{ν,α/2}.
[Figure: two-tailed rejection region, area α/2 in each tail beyond ±t_{ν,α/2}.]
(b) For H1: μ > μ0, only large values of (x̄ − μ0) tend to cast doubt on the hypothesis. That is, reject H0 at the 100α% significance level if
(x̄ − μ0) / (s/√n) > t_{ν,α}.
An alternative H1: μ < μ0 would be treated similarly to (b), but with lower critical value −t_{ν,α}.
Example 5.6
A certain type of rat shows a mean weight gain of 65 gms during the first 3 months of
life. A random sample of 12 rats was fed a particular diet from birth. After 3 months the
following weight gains were recorded: 55, 62, 54, 57, 65, 64, 60, 63, 58, 67, 63, 61. Is there
any reason to believe that the diet has resulted in a change of weight gain?
Solution: Let X be the weight gain in 3 months and assume that X ~ N(μ, σ²). The
hypothesis to be tested is H : μ = 65.0 and the appropriate alternative is H1 : μ ≠ 65.0.
Then x̄ = 60.75, s² = 16.38 and

    t = (60.75 − 65.0) / √(16.38/12) = −3.64.
For a 2-tailed test with α = .05, t{11,.025} ≈ 2.20.

wt <- c(55,62,54,57,65,64,60,63,58,67,63,61)
xbar <- mean(wt); s <- sd(wt); n <- length(wt)
tT <- (xbar-65)/(s/sqrt(n)); cat("t = ",tT,"\n")
t =  -3.6
qt(p=0.025,df=(n-1))
[1] -2.2
pt(q=tT,df=(n-1))
[1] 0.0020
Our calculated value is less than −t{11,.025} and so is significant at the 5% level. Furthermore, t{11,.005} ≈ 3.11 and our calculated value lies in the 1% critical region for the two-tailed
test, so H is rejected at the 1% level. A better (and more modern) way to say this is that
if the hypothesis is true then the probability of an observed t-value as extreme (in either
direction) as the one obtained is less than 1% . Thus there is strong evidence to suggest
that the hypothesis is incorrect and that this diet has resulted in a change in the mean
weight gained.
> t.test(wt,alternative="two.sided",mu=65)
        One Sample t-test
data:  wt
t = -3.6, df = 11, p-value = 0.003909
alternative hypothesis: true mean is not equal to 65
95 percent confidence interval:
 58 63
sample estimates:
mean of x
       61
Comment
The procedure adopted in the above example is a generally accepted one in hypothesis
testing problems. That is, it is customary to start with α = .05, and if the hypothesis
is rejected at the 5% level (this is equivalent to saying that the observed value of the
statistic is significant at the 5% level), then consider α = .01. If the observed value is
right out in the tail of the distribution, it may fall in the 1% critical region (one- or two-tailed,
whichever is appropriate). A conclusion claiming significance at the 1% level carries more
weight than one claiming significance at the 5% level. This is because in the latter case we
are in effect saying that, on the basis of the data we have, we will assert that H is not
correct; in making such a statement we admit that 5 times in 100 we would reject H0
wrongly. In the former case however (significance at the 1% level), there is only 1 chance
in 100 that we have rejected H wrongly. The commonly accepted values of α to consider
are .05, .01, .001. For the t, F and χ² distributions, critical values can be read from the
tables for both 1- and 2-tailed tests for these values of α.
Two-sample Problem
Given X1, X2, . . . , Xn1 and Y1, Y2, . . . , Yn2 are independent random samples from N(μ1, σ1²)
and N(μ2, σ2²) respectively, we may wish to test H : μ1 − μ2 = δ0, say. Using (5.3) we can
see that, under H0,

    (X̄ − Ȳ − δ0) / (S √(1/n1 + 1/n2)) ~ t{n1+n2−2}.

So H0 can be tested against one- or two-sided alternatives.
Note however that we have assumed that both populations have the same variance
σ², and this in general is not known. More generally, let X1, X2, . . . , Xn1 be a random
sample from N(μ1, σ1²) and Y1, Y2, . . . , Yn2 be an independent random sample from N(μ2,
σ2²) where μ1, μ2, σ1², σ2² are unknown, and suppose we wish to test H : μ1 − μ2 = δ0. From
the samples of sizes n1, n2 we can determine x̄, ȳ, s1², s2². We first test the preliminary
hypothesis that σ1² = σ2² and, if the evidence supports this, we regard the populations as
having a common variance σ². So the procedure is:
(i) Test H0 : σ1² = σ2² (= σ²) against H1 : σ1² ≠ σ2², using the fact that under H0, S1²/S2²
~ F{ν1,ν2}. [This is often referred to as testing sample variances for compatibility.] A two-sided
alternative and a two-tailed test is always appropriate here, as we don't have any
prior information about the variances. If this test is survived (that is, if H0 is not
rejected), proceed to (ii).

(ii) Pool s1² and s2² using s² = (ν1 s1² + ν2 s2²)/(ν1 + ν2) with ν1 + ν2 degrees of freedom.
(iii) Test H0 : μ1 − μ2 = δ0 against the appropriate alternative using the fact that, under
H0,

    (X̄ − Ȳ − δ0) / (S √(1/n1 + 1/n2)) ~ t{ν1+ν2}.
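The three-step procedure can be sketched in R, using var.test for step (i); the samples below are hypothetical.

```r
# Hypothetical samples
x <- c(12.1, 11.8, 12.6, 12.3, 11.9, 12.4)
y <- c(11.5, 11.9, 11.2, 11.8, 11.4)
# (i) F test of sigma1^2 = sigma2^2 (two-sided by default)
vt <- var.test(x, y)
# (ii) and (iii): if H0 is not rejected, pool the variances and do the t-test
if (vt$p.value > 0.05) {
  tt <- t.test(x, y, var.equal = TRUE)
}
```

With var.equal = TRUE, t.test pools the variances exactly as in (ii) and reports n1 + n2 − 2 degrees of freedom.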
Example 5.7
A large corporation wishes to choose between two brands of light bulbs on the basis of
average life. Brand 1 is slightly less expensive than brand 2. The company would like to
buy brand 1 unless the average life for brand 2 is shown to be significantly greater. Samples
of 25 light bulbs from brand 1 and 17 from brand 2 were tested with the following results:
Brand 1 (X):
997, 973, 977, 1051, 1029, 934, 1007, 1020, 961, 948, 954, 939, 987, 956, 874, 1042, 1010,
942, 1011, 962, 993, 1042, 1058, 992, 979
Brand 2 (Y):
973, 970, 1018, 1019, 1004, 1009, 983, 1013, 968, 1025, 935, 1018, 1033, 992, 1037, 964, 1067
We want to test H0 : μ1 = μ2 against H1 : μ1 < μ2 where μ1, μ2 are the means from
brands 1 and 2 respectively.
Solution: For the above data, x̄ = 985.5 hours, ȳ = 1001.6 hours, s1 = 43.2, s2 = 32.9.
(i) Firstly test H0 : σ1² = σ2² against a two-sided alternative, noting that under H0,
S1²/S2² ~ F{24,16}.
Then s1²/s2² = 1.72 and, from the F-tables, the critical value for a two-tailed test with
α = .05 is F2.5%(24, 16) = 2.63. The calculated value is not significant (that is, does
not lie in the critical region) so there is no reason to doubt H0.
(ii) s² = (ν1 s1² + ν2 s2²)/(ν1 + ν2) = (24 × 1866.24 + 16 × 1082.41)/(24 + 16) = 1552.71.

(iii) t = (x̄ − ȳ) / (s √(1/n1 + 1/n2)) = (985.5 − 1001.6) / √(1552.71 × .098824) = −16.1/12.387 = −1.30.
For a 1-tailed test with α = .05 (with left-hand tail critical region), the critical value
is t{40,.95} = −t{40,.05} = −1.68. The observed value is not in the critical region, so it is
not significant at the 5% level and there is insufficient evidence to cast doubt on
the truth of the hypothesis. That is, the average life for brand 2 is not shown to be
significantly greater than that for brand 1.
Computer Solution:

x <- c(997, 973, 977, 1051, 1029, 934, 1007, 1020, 961, 948, 954, 939, 987,
       956, 874, 1042, 1010, 942, 1011, 962, 993, 1042, 1058, 992, 979)
y <- c(973, 970, 1018, 1019, 1004, 1009, 983, 1013, 968, 1025, 935, 1018, 1033,
       992, 1037, 964, 1067)
t.test(x,y,var.equal=T)

        Two Sample t-test
data:  x and y
t = -1.3, df = 40, p-value = 0.2004
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -41.2   8.9
sample estimates:
mean of x mean of y
      986      1002
Notice here the actual P-value is given as 20%. The 95% confidence interval for μ1 − μ2
is also given. Both the t-test and the confidence interval are based on the pooled standard
deviation, which is not reported and would have to be calculated separately if needed.
Comment
When the population variances are unequal and unknown, the methods above for finding
confidence intervals for μ1 − μ2 or for testing hypotheses concerning μ1 − μ2 are not
appropriate. The problem of unequal variances is known as the Behrens-Fisher problem,
and various approximate solutions have been given but are beyond the scope of this
course. One such approximation (the Welch t-test) can be obtained in R by omitting the
var.equal=T option from t.test.
5.5 Paired-sample t-test
Given n paired observations (X1, Y1), . . . , (Xn, Yn), let Di = Xi − Yi, with observed values di. Then

    d̄ = Σ{i=1..n} di/n   and   sd² = Σ{i=1..n} (di − d̄)² / (n − 1).

To test H : μD = δ0 we compare

    (d̄ − δ0) / (sd/√n)

with the appropriate critical value from t{n−1} tables.
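As a sketch (with made-up paired data), the statistic can be computed directly and checked against t.test with paired=TRUE.

```r
# Hypothetical paired observations (e.g. two methods on the same subjects)
x <- c(10.2, 9.8, 11.1, 10.5, 9.9, 10.8)
y <- c( 9.6, 9.9, 10.4, 10.0, 9.5, 10.1)
d <- x - y                                 # paired differences
n <- length(d)
t.stat <- (mean(d) - 0) / (sd(d)/sqrt(n))  # compare with a t(n-1) critical value
```

This reproduces t.test(x, y, paired = TRUE)$statistic exactly.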
One important reason for using paired observations is to eliminate effects in which
there is no interest. Suppose that two teaching methods are to be compared by using 50
students divided into classes with 25 in each. One way to conduct the experiment is to
assign randomly 25 students to each class and then compare average scores. But if one
group happened to have the better students, the results may not give a fair comparison
of the two methods. A better procedure is to pair the students according to ability (as
measured by IQ from some previous test) and assign at random one of each pair to each
class. The conclusions then reached are based on differences of paired scores which measure
the effect of the different teaching methods.
When extraneous effects (for example, students' ability) are eliminated, the scores on
which the test is based are less variable. If the scores measure both ability and difference
in teaching methods, the variance will be larger than if the scores reflect only teaching
method, as each score then has two sources of variation instead of one.
Example 5.8
The following table gives the yield (in kg) of two varieties of apple, planted in pairs at eight
(8) locations. Let Xi and Yi represent the yield for varieties 1, 2 respectively at location
i = 1, 2, . . . , 8.

 i     1    2    3    4    5    6    7    8
 xi  114   94   64   75  102   89   95   80
 yi  107   86   70   70   90   91   86   77
 di    7    8   -6    5   12   -2    9    3

Test the hypothesis that there is no difference in mean yields between the two varieties,
that is test H0 : μX − μY = 0 against H1 : μX − μY > 0.
Solution:

x <- c(114,94,64,75,102,89,95,80)
y <- c(107,86,70,70,90,91,86,77)
t.test(x,y,paired=T,alternative="greater")

        Paired t-test
data:  x and y
t = 2.1, df = 7, p-value = 0.03535
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 0.5 Inf
sample estimates:
mean of the differences
                    4.5
The probability of observing T > 2.1 under H0 is 0.035, which is a sufficiently small
probability to reject H0 and conclude that the observed difference is due to variety and
not just random sampling.
In Rcmdr, the data are organised in a data frame in pairs:

> apples <- data.frame(yld1=x,yld2=y)
> apples
  yld1 yld2
1  114  107
2   94   86
3   64   70
4   75   70
5  102   90
6   89   91
7   95   86
8   80   77

Make the data set active and then use Statistics → Means → Paired t-test.
Comments
1. Note that for all the confidence intervals and tests in this chapter, it is assumed the
samples are drawn from populations that have a normal distribution.
2. Note that the violation of the assumptions underlying a test can lead to incorrect
conclusions being made.
5.6 Bootstrap T-intervals
We can make accurate intervals without depending upon the assumption of normality
made at (5.1) by using the bootstrap. The method is named bootstrap-t.
The procedure is as follows:

1. Estimate the statistic θ̂ (e.g. the sample mean) and its standard error, se, and
   determine the sample size, n.
2. Nominate the number of bootstrap samples B, e.g. B = 199.
3. Loop B times:
   - Generate a bootstrap sample x*(b) by taking a sample of size n with replacement.
   - Calculate the bootstrap sample statistic θ̂*(b) and its standard error se*(b).
   - Calculate T*(b) = (θ̂*(b) − θ̂) / se*(b).
4. Estimate the bootstrap-t quantiles from T*(1), T*(2), . . . , T*(B). Denote these as t{α/2} and
   t{1−α/2}.
5. The confidence interval is ( θ̂ − t{1−α/2} se, θ̂ − t{α/2} se ).

The point made at equation (5.6) about selecting the correct quantiles to make the
confidence limits now becomes important because the symmetry no longer holds.
Example 5.9
In Example 5.4 the 95% CI for the mean was calculated to be (5.2, 5.9). We now calculate
the CI using bootstrap-t.

x <- c(5.22,5.59,5.61,5.17,5.27,6.06,5.72,4.77,5.57,6.33)
n <- length(x)
skulls <- data.frame(length=x)
theta <- mean(skulls$length)
se.theta <- sd(skulls$length)/sqrt(n)
nBS <- 199
Tstar <- numeric(nBS)
i <- 1
while( i < (nBS+1) ){    # looping 1 to nBS
  x.star <- sample(skulls$length,size=n,replace=T)
  Tstar[i] <- (mean(x.star) - theta) / ( sd(x.star)/sqrt(n) )
  i <- i+1
}                        # end of the while loop
bootQuantiles <- round(quantile(Tstar,p=c(0.025,0.975)),2)
cat("Bootstrap T quantiles = ",bootQuantiles,"\n")
CI <- theta - se.theta*rev(bootQuantiles)
cat("CI = ",CI,"\n")

Bootstrap T quantiles =  -2.37 2.41
CI =  5.2 5.83
Note in the code the use of the rev() function to reverse the quantiles for calculating
the CI. Also observe that the quantiles are not of the same magnitude: symmetry is absent.
However, the asymmetry is not great.
Example 5.10
In Example 5.7 the 95% CI for the mean difference was determined as
−16.1 ∓ t{40,.025} s √(1/25 + 1/17) = −16.1 ∓ 2.02 × 12.387 = (−41, 9).
What is the bootstrap-t CI for the mean difference?
The mndiff function in the code below computes the mean and variance of the difference.
The user must supply the variance calculations for boot.ci to calculate the
bootstrap-t (or Studentized) CIs.
x <- c(997, 973, 977, 1051, 1029, 934, 1007, 1020, 961, 948, 954, 939, 987,
       956, 874, 1042, 1010, 942, 1011, 962, 993, 1042, 1058, 992, 979)
y <- c(973, 970, 1018, 1019, 1004, 1009, 983, 1013, 968, 1025, 935, 1018, 1033,
       992, 1037, 964, 1067)
Intervals :
Level      Studentized          Percentile
95%    (-55.81,  -7.99 )   (-24.34,  24.42 )
Calculations and Intervals on Original Scale
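Since the listing above is abbreviated, here is a hand-rolled bootstrap-t for the difference of means in the style of Example 5.9. The unpooled standard error used here is an assumption, so the resulting interval will differ somewhat from the boot.ci output.

```r
x <- c(997, 973, 977, 1051, 1029, 934, 1007, 1020, 961, 948, 954, 939, 987,
       956, 874, 1042, 1010, 942, 1011, 962, 993, 1042, 1058, 992, 979)
y <- c(973, 970, 1018, 1019, 1004, 1009, 983, 1013, 968, 1025, 935, 1018, 1033,
       992, 1037, 964, 1067)
theta <- mean(x) - mean(y)
se.theta <- sqrt(var(x)/length(x) + var(y)/length(y))  # unpooled SE (assumption)
set.seed(42)
Tstar <- replicate(999, {
  xs <- sample(x, replace = TRUE)    # resample each group separately
  ys <- sample(y, replace = TRUE)
  ((mean(xs) - mean(ys)) - theta) /
    sqrt(var(xs)/length(xs) + var(ys)/length(ys))
})
q <- quantile(Tstar, p = c(0.025, 0.975))
CI <- theta - se.theta * rev(q)      # reverse the quantiles, as in Example 5.9
```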
Collating the results for the CI of the mean difference (and including the CI for variances
not equal):

    t with variances equal      (−41, 9)
    t with variances unequal    (−40, 7.7)
    bootstrap-t                 (−56, −8)
    percentile-t                (−24, 24)

Although the findings from each technique are the same, namely that there is insufficient
evidence to conclude that the means are different, nevertheless there are disparities which we
may attempt to understand.
Density plots of the two samples give some clues as to why the results might differ
(Figure 5.3). Although an F-test supported the null hypothesis H0 : σ1² = σ2², the plots
do indicate that the difference in variances might be an issue. Further, although the densities
seem to be approximately normal, each could also be viewed as displaying skewness. The
assumptions of normality and equal variances may not be justified.
[Figure 5.3: density plots of the two samples (bandwidths 19.4 for N = 25 and 16.81 for N = 17).]
Example 5.11
The densities of the data used in Example 2.7 are used to demonstrate that the assumption
of normality is a strong condition for parametric t-tests.
The densities of the energies from the two groups (N & W) are plotted in Figure 5.4.
Figure 5.4: Densities of energies of sway signals from Normal and whiplash subjects
The confidence intervals of the mean difference are:

    parametric t with variances unequal  (−1385, 224)
    bootstrap-t                          (−2075, −302)
    percentile-t                         (−815, 791)
Only the bootstrap-t suggests that the mean difference is unlikely to be zero.
Parametric-t loses out because the underlying assumptions of normality do not hold.
The percentile bootstrap is unreliable for small sample sizes (< 100).
Chapter 6

6.1 Introduction
This chapter deals with hypothesis testing problems where the data collected is in the form
of frequencies or counts.
In section (6.2) we study a method for testing the very general hypothesis that a
probability distribution takes on a certain form, for example, normal, Poisson, exponential,
etc. The hypothesis may or may not completely prescribe the distribution. That is, it may
specify the value of the parameter(s), or it may not. These are called Goodness-of-Fit
Tests. Section (6.3) is then concerned with the analysis of data that is classified according
to two attributes in a Contingency Table. Of interest here is whether the two attributes
are associated.
6.2 Goodness-of-Fit Tests
Consider the problem of testing if a given die is unbiased. The first step is to conduct
an experiment such as throwing the die n times and counting how many 1s, 2s, . . . , 6s
occur. If Y is the number that shows when the die is thrown once and if the die is unbiased
then Y has a rectangular distribution. That is,
P (Y = i) = pi = 1/6, i = 1, 2, 3, 4, 5, 6.
Testing the hypothesis the die is unbiased is then equivalent to testing the hypothesis
H0 : p1 = p2 = p3 = p4 = p5 = p6 = 1/6.
Let Ai be the event: i occurs (on a given throw). Then P (Ai ) = pi = 1/6, under H0
(that is, if and only if, H0 is true). The random variables Yi , i = 1, . . . , 6 are now defined
as follows. Let Yi be the number of times in the n throws that Ai occurs. If the die is
unbiased the distribution of (Y1 , . . . , Y6 ) is then multinomial with parameters p1 , . . . , p6 all
equal to 1/6.
Example 6.1
Suppose that the die is thrown 120 times and the observed frequencies in each category
(denoted by o1, . . . , o6) are

 i    1   2   3   4   5   6
 oi  15  27  18  12  25  23
In general, if the random variables Y1, . . . , Yk have the multinomial distribution

    P(Y1 = y1, . . . , Yk = yk) = (n! / (y1! . . . yk!)) p1^{y1} p2^{y2} . . . pk^{yk},

where Yi is the number of times in n trials that event Ai (which has probability pi) occurs,
Σ{i=1..k} pi = 1 and Σ{i=1..k} yi = n, then the random variable X² defined by

    X² = Σ{i=1..k} (Yi − npi)² / (npi)                (6.1)

has approximately a chi-square distribution with k − 1 degrees of freedom. For the case
k = 2,

    X² = (Y1 − np1)²/(np1) + (Y2 − np2)²/(np2)
       = (Y1 − np1)²/(np1) + (n − Y1 − n(1 − p1))²/(n(1 − p1))
       = ((Y1 − np1)²/n) (1/p1 + 1/(1 − p1))
       = (Y1 − np1)²/(np1 q1)  ~  χ²{1} approximately,

since (Y1 − np1)/√(np1 q1) is approximately N(0, 1).
Comments
1. Note that the multinomial distribution only arises in the above problem in a secondary
way, when we count the number of occurrences of the various events A1, A2, etc.,
where {A1, . . . , Ak} is a partition of the sample space.
2. If the underlying distribution is discrete, the event Ai usually corresponds to the
random variable taking on a particular value in the range space (see MSW, example
14.2). When the underlying distribution is continuous the events {Ai} have to be
defined by subdividing the range space (see Examples 6.3 and 6.4 that follow). The
method of subdivision is not unique, but in order for the chi-square approximation to
be reasonable the cell boundaries should be chosen so that npi ≥ 5 for all i. We
want enough categories to be able to see what's happening, but not so many that
npi < 5 for any i. A case can be made for choosing equal-probability categories, but
this is only one possibility.
3. The fact that the X² defined in (6.1) has approximately a chi-square distribution with
k − 1 degrees of freedom under H0 is only correct if the values of the parameters
are specified in stating the hypothesis. (See MSW, end of §14.2.) If this is not so, a
modification has to be made to the degrees of freedom. In fact, if any unspecified
parameters are replaced by their maximum likelihood estimates, giving estimated cell
probabilities p̂i, we can still say that

    X² = Σ{i=1..k} (Yi − np̂i)² / (np̂i)

is distributed approximately as χ², but the degrees of freedom are reduced by one for
each parameter estimated.
Example 6.1 (cont.) Let us return to the first example and test the hypothesis
H0 : p1 = p2 = . . . = p6 = 1/6 against the alternative H1 that pi ≠ 1/6 for some i.
Solution: The observed frequencies (oi) and those expected under H (ei) are

 i    1   2   3   4   5   6
 oi  15  27  18  12  25  23
 ei  20  20  20  20  20  20

    x² = Σ{i=1..6} (oi − ei)²/ei = 176/20 = 8.8.

Now the parameters p1, . . . , p6 are postulated by the hypothesis as 1/6, so the df for χ² is
6 − 1 = 5. Under H0, X² ≈ χ²{5} and the hypothesis would be rejected for large values of x².
The upper 5% critical value is χ²{5,.05} = 11.1 ( qchisq(df=5,p=0.05,lower.tail=F) ), so the
calculated value is not significant at the 5% level. There is insufficient evidence to cast
doubt on the hypothesis, so we conclude the die is most likely unbiased.
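The hand calculation can be checked with chisq.test, which uses equal cell probabilities by default:

```r
o <- c(15, 27, 18, 12, 25, 23)
res <- chisq.test(o)    # H0: all six faces equally likely
res$statistic           # 8.8 on 5 df
res$p.value             # about 0.12
```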
Example 6.2
Merchant vessels of a certain type were exposed to risk of accident through heavy weather,
ice, fire, grounding, breakdown of machinery, etc. for a period of 400 days. The number
of accidents to each vessel, say Y, may be considered as a random variable. For the data
reported below, is the assumption that Y has a Poisson distribution justified?

 Number of accidents (y)              0     1    2   3  4  5  6
 Number of vessels with y accidents  1448  805  206  34  4  2  1
Solution: Note that the parameter λ in the Poisson distribution is not specified and we
have to estimate it by its mle, λ̂ = ȳ, which is the average number of accidents per vessel.
Thus,

    ȳ = [ (0 × 1448) + (1 × 805) + . . . + (5 × 2) + (6 × 1) ] / (1448 + 805 + . . . + 2 + 1)
      = 1351/2500 = .5404.

We now evaluate P(Y = y) = e^{−.5404} (.5404)^y / y! for y = 0, 1, . . . , to obtain

    p0 = P(Y = 0) = e^{−.5404} = .5825
    p1 = P(Y = 1) = .5404 × e^{−.5404} = .3149

Similarly, p2 = .0851, p3 = .0153, p4 = .0021, p5 = .00022, p6 = .00002.
Recall that the χ² approximation is poor if the expected frequency of any cell is less than
about 5. In our example, E(Y5) = 2500 × 0.00022 = 0.55 and E(Y6) = 2500 × 0.00002 = 0.05.
This means that the last 3 categories should be grouped into a category called Y ≥ 4, for
which p4′ = P(Y ≥ 4) = .0022.
The expected frequencies (under H0) are then given by E(Yi) = 2500 pi and are tabulated
below.

 observed   1448      805      206      34     7
 expected   1456.25   787.25   212.75   38.25  5.50

    x² = Σ (o − e)²/e = (1448 − 1456.25)²/1456.25 + . . . + (1.5)²/5.5 = 1.54.

Since there are 5 categories and we estimated one parameter, the random variable X² is
distributed approximately as χ²{3}. The upper 5% critical value is 7.81,
> qchisq(p=0.05,df=3,lower.tail=F)
[1] 7.81
so there is no reason to doubt the truth of H0 and we would conclude that a Poisson
distribution does provide a reasonable fit to the data.
Computer Solution: First enter the number of accidents into x and the observed frequencies
into counts.

x <- 0:6
counts <- c(1448,805,206,34,4,2,1)
# Calculate rate
lambda <- sum(x*counts)/sum(counts)
# Merge cells with E(X) < 5
counts[5] <- sum(counts[5:7])
# Poisson probabilities
probs <- dpois(x=0:4,lambda=lambda)
# ensure that the probabilities sum to 1, no rounding error
probs[5] <- 1 - sum(probs[1:4])
# Chi-square test of frequencies against the Poisson probabilities
chisq.test(counts[1:5],p=probs)

        Chi-squared test for given probabilities
data:  counts[1:5]
X-squared = 1.4044, df = 4, p-value = 0.8434
Notice the value for x² is slightly different. R uses more accurate values for the probabilities
and also retains more decimal places in its calculations, and so has less rounding
error than we managed with a calculator. Note also that chisq.test reports df = 4: it does
not know that λ was estimated from the data, so the p-value should really be based on 3
degrees of freedom. The conclusions reached are however the same.
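A p-value on the correct 3 degrees of freedom can be obtained by recomputing the statistic and calling pchisq directly, the same idea as the corrected test shown in Example 6.4:

```r
counts <- c(1448, 805, 206, 34, 7)      # last three cells merged
lambda <- 1351/2500                     # mle of the Poisson mean
probs <- dpois(0:4, lambda = lambda)
probs[5] <- 1 - sum(probs[1:4])         # cell 5 covers Y >= 4
stat <- sum((counts - 2500*probs)^2 / (2500*probs))
# 5 - 1 - 1 = 3 df, since one parameter was estimated
p <- pchisq(stat, df = 3, lower.tail = FALSE)
```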
Example 6.3
Let the random variable T be the length of life of a certain brand of light bulbs. It is
hypothesised that T has a distribution with pdf

    f(t) = λe^{−λt},  t > 0.

Suppose that 160 bulbs are selected at random and tested, the time to failure being
recorded for each. That is, we have t1, t2, . . . , t160. Show how to test that the data come
from an exponential distribution.
Solution: A histogram of the data might be used to give an indication of the distribution.
Suppose this is as shown below.
Figure 6.1: Time to Failure
The time axis is divided into 10 categories, with cell boundaries at 50, 100, . . . , 500,
and we might ask what are the expected frequencies associated with these categories, if T
does have an exponential distribution. Let p1, p2, . . . , p10 denote the probabilities of the
categories, where

    p1 = ∫{0..50} λe^{−λt} dt = 1 − e^{−50λ}
    p2 = ∫{50..100} λe^{−λt} dt = e^{−50λ} − e^{−100λ}, etc.
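These category probabilities can be computed in R with pexp; the rate used below is hypothetical, since the raw failure times are not reproduced here.

```r
lambda.hat <- 1/180                     # hypothetical estimate, lambda-hat = 1/t-bar
breaks <- c(seq(0, 450, by = 50), Inf)  # cells 0-50, 50-100, ..., 400-450, >450
probs <- diff(pexp(breaks, rate = lambda.hat))
```

Multiplying probs by n = 160 would give the expected frequencies for each cell.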
If λ is not known it has to be estimated from the data. The mle of λ is λ̂ = 1/t̄ and
we use this value to calculate the p̂i and hence the ei. The degrees of freedom now
become k − 2 since one parameter, λ, has been estimated.

 interval    0-50  50-100  100-150  150-200  200-250  250-300  300-350  350-400  400-450  >450
 frequency     60      31       19       18        9       10        3        5        2     3

Test the hypothesis that the failure times follow an exponential distribution.
With λ̂ = 1/t̄ used as the parameter for the exponential distribution, the calculated
statistic is x² = 4.2. Since λ was estimated, X² has a chi-square distribution on 5 df and
P(X² > 4.2) = 0.52. Hence it is likely the length of life of the light bulbs is distributed
exponentially.
Example 6.4
Show how to test the hypothesis that a sample comes from a normal distribution.
Solution: If the parameters are not specified they must be estimated by

    μ̂ = x̄,    σ̂² = Σ{i=1..n} (xi − x̄)² / n.

For example, Figure 6.2 depicts a histogram, observed frequencies and the postulated
normal distribution. The bins are chosen such that the expected probability under the
normal distribution for each bin interval is 1/8.
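The equal-probability boundaries can be found with qnorm; mu.hat and sigma.hat below are hypothetical fitted values.

```r
mu.hat <- 10; sigma.hat <- 3    # hypothetical estimates from a sample
# 7 interior boundaries give 8 cells of probability 1/8 each
breaks <- c(-Inf, qnorm((1:7)/8, mean = mu.hat, sd = sigma.hat), Inf)
probs <- diff(pnorm(breaks, mean = mu.hat, sd = sigma.hat))
```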
Figure 6.2: Partition with Equal Probabilities in Each Category
Since two parameters are estimated, the degrees of freedom for the χ² are k − r − 1 =
8 − 2 − 1 = 5.
The above output has not taken into account that 2 parameters have been estimated, so
the corrected χ² test needs to be done:

corrected.df <- length(obsv.freq)-3
pchisq(q=CHI.test$statistic,df=corrected.df,lower.tail=F)
X-squared
     0.52
6.3 Contingency Tables
            light  medium  dark
 sweet        115      55    30
 not sweet     35      45    20
6.3.1 Method
Assume that in the population there is a probability pij that an individual selected at
random will fall in both categories Ai and Bj. The probabilities are shown in the following
table.

              B1            B2            B3           Sum
 A1           p11           p12           p13          p1. = P(A1)
 A2           p21           p22           p23          p2. = P(A2)
 Sums   p.1 = P(B1)   p.2 = P(B2)   p.3 = P(B3)        p.. = 1

The hypothesis that the two classifications are independent is

    H0 : pij = pi. × p.j  for all i, j.                (6.2)

Suppose the observed frequencies are nij:

        B1    B2    B3    Sum
 A1    n11   n12   n13    n1.
 A2    n21   n22   n23    n2.
 Sums  n.1   n.2   n.3    n.. = n
Now the set of random variables {Nij} have a multinomial distribution, and if the pij
are postulated, the expected frequencies will be eij = E(Nij) = npij and

    X² = Σ{i=1..2} Σ{j=1..3} (oij − eij)²/eij = Σ{i=1..2} Σ{j=1..3} (Nij − npij)²/(npij).
If the pij are not specified, the marginal probabilities are estimated by p̂i. = ni./n and
p̂.j = n.j/n, so that under H0 the estimated expected frequencies are eij = n p̂i. p̂.j = ni. n.j/n
and

    X² = Σ{i=1..2} Σ{j=1..3} (Nij − ni. n.j/n)² / (ni. n.j/n).                (6.3)
Now consider the degrees of freedom. With the marginal totals regarded as fixed,
estimating the marginal probabilities accounts for (2 − 1) + (3 − 1) = 3 parameters, so the
degrees of freedom are 6 − 1 − 3 = 2; once two of the expected frequencies have been
determined, the others follow from the marginal totals.
In the more general case of r rows and c columns, the number of parameters to be
estimated is (r − 1) + (c − 1), so the degrees of freedom are rc − 1 − (r − 1 + c − 1) = (r − 1)(c − 1).
Example 6.5 (cont.)
Test the hypothesis H0: colour and sweetness are independent.
Solution: The expected frequencies are:

            light  medium   dark
 sweet        100   66.67  33.33
 not sweet     50   33.33  16.67

The hypothesis can be stated as: P(sweet orange) is the same whether the orange is light,
medium or dark.
Then,

    x² = Σ (oi − ei)²/ei = (115 − 100)²/100 + . . . + (20 − 16.67)²/16.67 = 13.88.
The probability of getting a χ² value at least as large as 13.9 is

> pchisq(q=13.9,df=2,lower.tail=F)
[1] 0.00096

which indicates that the data suggest strongly (p < 0.001) that colour and sweetness
are not independent.
Computer Solution: The observed values are entered into a data frame and the xtabs
command used to make the 2-way table.
In making this 2-way table, the χ² test of independence of rows and columns is also
calculated and saved in the summary.
#__________ Oranges.R __________
Oranges <- expand.grid(sweet=c("Y","N"), colour=c("light","medium","dark"))
Oranges$frequencies <- c(115,35, 55,45, 30,20)
orange.tab <- xtabs(frequencies ~ sweet + colour ,data=Oranges)
print(summary(orange.tab))

     colour
sweet light medium dark
    Y   115     55   30
    N    35     45   20

Call: xtabs(formula = frequencies ~ sweet + colour, data = Oranges)
Number of cases in table: 300
Number of factors: 2
Test for independence of all factors:
        Chisq = 14, df = 2, p-value = 0.001
The Rcmdr menus are also very convenient for getting the χ² test of independence
for factors in contingency tables.
Choose Statistics → Contingency tables → Enter and analyze two-way table.
Change the numbers of rows and columns and provide row and column names.
A script similar to the above is generated and the output is the χ² test.

            light medium dark
 sweet        115     55   30
 not sweet     35     45   20

> .Test <- chisq.test(.Table, correct=FALSE)
        Pearson's Chi-squared test
data:  .Table
X-squared = 13.875, df = 2, p-value = 0.0009707
6.4 The 2 × 2 Contingency Table
While the 2 × 2 table can be dealt with as indicated in §6.3 for an r × c contingency table, it
is sometimes treated as a separate case because the x² statistic can be expressed in a simple
form without having to make up a table of expected frequencies. Suppose the observed
frequencies are as follows:
        B1     B2
 A1      a      b    a + b
 A2      c      d    c + d
       a + c  b + d    n
Under the hypothesis that the methods of classification are independent, the expected
frequencies are

        B1                  B2
 A1   (a + b)(a + c)/n    (a + b)(b + d)/n
 A2   (a + c)(c + d)/n    (c + d)(b + d)/n
Each cell deviation satisfies o − e = ±(ad − bc)/n, so

    x² = [(ad − bc)²/n²] [ n/((a + b)(a + c)) + n/((a + b)(b + d)) + n/((a + c)(c + d)) + n/((c + d)(b + d)) ]
       = [(ad − bc)²/n] [ 1/((a + b)(a + c)) + 1/((a + b)(b + d)) + 1/((a + c)(c + d)) + 1/((c + d)(b + d)) ]
       = (ad − bc)² n / [ (a + b)(a + c)(b + d)(c + d) ],  on simplification.        (6.4)
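The shortcut (6.4) can be checked against chisq.test with the continuity correction turned off. The 2 × 2 counts below are hypothetical, and cc is used as a variable name to avoid confusion with R's c function.

```r
a <- 30; b <- 10; cc <- 15; d <- 25    # hypothetical cell counts
n <- a + b + cc + d
# shortcut formula (6.4)
x2 <- (a*d - b*cc)^2 * n / ((a + b)*(a + cc)*(b + d)*(cc + d))
tab <- matrix(c(a, cc, b, d), nrow = 2)  # rows A1, A2; columns B1, B2
x2.r <- unname(chisq.test(tab, correct = FALSE)$statistic)
```

The two values agree exactly, since (6.4) is an algebraic rearrangement of the Pearson statistic.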
When the continuity correction is used, each cell count is moved half a unit closer to its
expected value (e.g. b − 1/2, d + 1/2), giving

    x²c = n (|ad − bc| − n/2)² / [ (a + b)(a + c)(b + d)(c + d) ].

Example 6.6
The results of 200 students in an examination, classified by gender, were:
          Passed  Failed
 Male         70      75   145
 Female       35      20    55
             105      95   200
Solution: Now

    x² = (ad − bc)² n / [(a + b)(c + d)(a + c)(b + d)]
       = (1400 − 2625)² × 200 / (145 × 55 × 105 × 95) = 3.77.

But if the continuity correction is used, we get x²c = 3.2. Since P(χ²{1} > 3.2) = 0.07, our
result is not significant and we conclude that there is no significant difference between the
proportions of male and female students passing the examination.
Computer Solution

#______________ Class.R _________
Class <- expand.grid(Gender=c("M","F"),Grade=c("P","F") )
Class$freq <- c(70,35, 75,20)
Two.way <- xtabs(freq ~ Gender + Grade,data=Class)
print(chisq.test(Two.way,correct=F))
print(chisq.test(Two.way,correct=T))

        Pearson's Chi-squared test
data:  Two.way
X-squared = 3.8, df = 1, p-value = 0.05209

        Pearson's Chi-squared test with Yates' continuity correction
data:  Two.way
X-squared = 3.2, df = 1, p-value = 0.07446
Note that in all this we assume that n individuals are chosen at random, or we have n
independent trials, and then we observe in each trial which of the r c events has occurred.
6.5 Fisher's Exact Test
The method for 2 × 2 contingency tables in §6.4 is really only appropriate for large n. The
method described in this section, known as Fisher's exact test, should be used for
smaller values of n, particularly if a number of the expected frequencies are less than 5.
(A useful rule of thumb is that no more than 10% of expected frequencies in a table should
be less than 5 and no expected frequency should be less than 1.)
Consider now all possible 2 × 2 contingency tables with the same set of marginal totals,
say a + c, b + d, a + b and c + d, where a + b + c + d = n.

        B1     B2
 A1      a      b    a + b
 A2      c      d    c + d
       a + c  b + d    n
We can think of this problem in terms of the hypergeometric distribution as follows. Given
n observations which result in (a + c) of type B1 [and (b + d) of type B2], and (a + b) of type
A1 [and (c + d) of type A2], what is the probability that the frequencies in the 4 cells will
be

    a  b
    c  d  ?

This is equivalent to considering a population of size n consisting of 2 types: (a + c)
B1's and (b + d) B2's. If we choose a sample of size a + b, we want to find the probability
that the sample will consist of a B1's and b B2's. That is,

    P(a B1's, b B2's) = C(a+c, a) C(b+d, b) / C(n, a+b)
                      = (a + c)!(b + d)!(a + b)!(c + d)! / (a! b! c! d! n!).        (6.5)

Now if the methods of classification are independent, the expected number of type A1B1
is (a + b)(a + c)/n. Fisher's exact test involves calculating the probability of the observed set
of frequencies and of others more extreme, that is, further from the expected value. The
hypothesis H is rejected if the sum of these probabilities is significantly small. Due to the
calculations involved it is really only feasible to use this method when the numbers in the
cells are small.
Example 6.7
Two batches of experimental animals were exposed to infection under comparable conditions.
One batch of 7 was inoculated and the other batch of 13 was not. Of the
inoculated group 2 died, and of the other group 10 died. Does this provide evidence of the
value of inoculation in increasing the chances of survival when exposed to infection?
Solution: The table of observed frequencies is

                  Died  Survived
 Not inoculated     10         3   13
 Inoculated          2         5    7
                    12         8   20

The expected frequencies, under the hypothesis that inoculation has no effect, are

                  Died  Survived
 Not inoculated    7.8       5.2   13
 Inoculated        4.2       2.8    7
                    12         8   20
The tables more extreme in the one direction (with the same margins) are

    11  2        12  1
     1  6         0  7

Using (6.5) to find the probability of the observed frequencies or others more extreme in
the one direction, we have

    P = Σ{x=10..12} C(12, x) C(8, 13 − x) / C(20, 13)
      = [ C(12,10)C(8,3) + C(12,11)C(8,2) + C(12,12)C(8,1) ] / C(20,13)
      = 4040/77520
      ≈ .052.
Thus, if H0 is true, the probability of getting the observed frequencies, or others more
extreme in the one direction, is about 5 in 100. If we wished to consider the alternative
as two-sided, we would need to double this probability.
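The hand calculation can be checked with dhyper, taking X as the number of deaths among the 13 not-inoculated animals:

```r
# X ~ hypergeometric: 12 deaths ("successes"), 8 survivors, sample of 13
# P(X >= 10) = one-sided tail probability
p.one.sided <- sum(dhyper(10:12, m = 12, n = 8, k = 13))
```

This gives 4040/77520 ≈ .052, matching the calculation above.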
Note that if the chi-square approximation (6.4) is used for a 2 × 2 contingency table,
then it accounts for deviations from expectation in both directions, since the deviations
are squared. If we had used (6.4) in the above example we would expect to get a probability
of about .10. Carrying out the calculations we get x² = 2.65 and, from chi-square tables,
P(W > 2.65) is slightly more than 0.10 (where W ~ χ²{1}).
Computer Solution
#____________ Fisher.R ___________
Infection <- expand.grid(Inoculated=c("N","Y"),Survive=c("N","Y") )
Infection$freq <- c(10,2, 3,5)
Ftab <- xtabs(freq ~ Inoculated + Survive,data=Infection)
print(fisher.test(Ftab))
Fisher's Exact Test for Count Data
data: Ftab
p-value = 0.06233
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.74 117.26
sample estimates:
odds ratio
7.3
6.6
Parametric Bootstrap X²

The theme of the previous sections was whether the distribution of observed counts could be considered as random samples from a multinomial distribution with known probabilities and total sample size. The test statistic was a measure of the difference between observed counts and the counts expected from the hypothesised multinomial distribution. This statistic was regarded as coming from a χ² distribution,
$$X^2 = \sum \frac{(O - E)^2}{E} \sim \chi^2.$$
Example 6.8
Revisit Example 6.1 where observed counts of faces from 120 throws were:

  i     1    2    3    4    5    6
  oi   15   27   18   12   25   23

$$H_0: p_1 = p_2 = p_3 = p_4 = p_5 = p_6 = \tfrac{1}{6}$$
The results are shown in Figure 6.3. The plot indicates that P(X² > 8.8 | H0) = 0.15, compared with 0.12 for the χ² test.
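A sketch of the parametric bootstrap used here: simulate count vectors from the hypothesised multinomial, compute X² for each, and compare with the observed value (the seed and number of replicates are arbitrary choices).

```r
set.seed(1)                          # arbitrary seed, for reproducibility
obs <- c(15, 27, 18, 12, 25, 23)     # observed counts from Example 6.1
E <- 120 * rep(1/6, 6)               # expected counts under H0
X2.obs <- sum((obs - E)^2 / E)       # 8.8
sim <- rmultinom(5000, size = 120, prob = rep(1/6, 6))
X2.sim <- colSums((sim - E)^2 / E)   # bootstrap distribution of X^2 under H0
mean(X2.sim >= X2.obs)               # bootstrap estimate of P(X^2 > 8.8 | H0)
```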
Figure 6.3: Bootstrap distribution of X² for testing unbiasedness of die throws.
Example 6.9
The bootstrap test of independence of factors in a contingency table is illustrated using the data (sweet versus not sweet) in Example 6.5.

The hypothesis of independence is H0: pij = pi. p.j and the marginal probabilities are estimated from the data,
$$\hat p_{i.} = \frac{n_{i.}}{N} \ \text{(row probabilities)}, \qquad \hat p_{.j} = \frac{n_{.j}}{N} \ \text{(column probabilities)}.$$
Figure 6.4 displays the bootstrap distribution function of X² | H0 with the observed value of X² and the 95th percentile. This shows that P(X² > 14 | H0) < 0.01 as before.

Comment: These examples do not show any advantage of the bootstrap over the parametric χ² tests. However, understanding of the technique is a platform for Bayesian Markov chain Monte Carlo methods (later on).

A Bayesian analysis is not presented here because the setting for that is called log-linear models, which requires some more statistical machinery. This will be encountered in the unit on Linear Models.
Figure 6.4: Bootstrap distribution of X² for the contingency table test, showing the observed value and the 95th percentile.
Chapter 7

Analysis of Variance

7.1 Introduction
In Chapter 5 the problem of comparing the population means of two normal distributions was considered when it was assumed they had a common (but unknown) variance σ². The hypothesis that μ1 = μ2 was tested using the two-sample t-test. Frequently, experimenters face the problem of comparing more than two means and need to decide whether the observed differences among the sample means can be attributed to chance, or whether they are indicative of real differences between the true means of the corresponding populations. The following example is typical of the type of problem we wish to address in this chapter.
Example 7.1
Suppose that random samples of size 4 are taken from three (3) large groups of students studying Computer Science, each group being taught by a different method, and that these students then obtain the following scores in an appropriate test.

  Method A   71   75   65   69
  Method B   90   80   86   84
  Method C   72   77   76   79
The means of these 3 samples are respectively 70, 85 and 76, but the sample sizes are
very small. Does this data indicate a real difference in effectiveness of the three teaching
methods or can the observed differences be regarded as due to chance alone?
Answering this and similar questions is the object of this chapter.
7.2

Let μ1, μ2, μ3 be the true average scores which students taught by the 3 methods should get on the test. We want to decide on the basis of the given data whether or not the hypothesis

H: μ1 = μ2 = μ3 against A: the μi are not all equal

is reasonable.
The three samples can be regarded as being drawn from three (possibly different) populations. It will be assumed in this chapter that the populations are normally distributed and have a common variance σ². The hypothesis will be supported if the sample means are all nearly the same, and the alternative will be supported if the differences among the sample means are large. A precise measure of the discrepancies among the sample means is required, and the most obvious measure is their variance.
Two Estimates of σ²

Since each population is assumed to have a common variance, the first estimate of σ² is obtained by pooling s₁², s₂², s₃², where sᵢ² is the ith sample variance. Recalling that we have x̄₁ = 70, x̄₂ = 85, x̄₃ = 76, then
$$s_1^2 = \frac{(71-70)^2 + (75-70)^2 + (65-70)^2 + (69-70)^2}{3} = \frac{52}{3}.$$
Similarly, $s_2^2 = \frac{52}{3}$ and $s_3^2 = \frac{26}{3}$.

Since this estimate is obtained from within each individual sample, it will provide an unbiased estimate of σ² whether the hypothesis of equal means is true or false, since it measures variation only within each population.

The second estimate of σ² is now found using the sample means. If the hypothesis that μ1 = μ2 = μ3 is true, then the sample means can be regarded as a random sample from a normally distributed population with common mean μ and variance σ²/4 (since X̄ ∼ N(μ, σ²/n), where n is the sample size). Then if H is true we obtain,
$$s_{\bar x}^2 = \widehat{\mathrm{Var}}(\bar X) = \sum_{i=1}^{3} \frac{(\bar x_i - \bar x)^2}{3-1},$$
which estimates σ²/4.
If the second estimate (based on variation between the sample means) is much larger than the first estimate (which is based on variation within the samples, and measures variation that is due to chance alone), then it provides evidence that the means do differ and H should be rejected. In that case, the variation between the sample means would be greater than would be expected if it were due only to chance.
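For the data of Example 7.1 the two estimates can be computed directly; a sketch, with the group scores as tabulated above:

```r
A <- c(71, 75, 65, 69); B <- c(90, 80, 86, 84); C <- c(72, 77, 76, 79)
# first estimate: pool the within-sample variances (equal group sizes,
# so pooling is just the average of the three sample variances)
s2.within <- mean(c(var(A), var(B), var(C)))    # (52/3 + 52/3 + 26/3)/3
# second estimate: the variance of the 3 sample means estimates sigma^2/4,
# so multiply by the common sample size 4
s2.between <- 4 * var(c(mean(A), mean(B), mean(C)))
c(s2.within, s2.between)   # about 14.4 versus 228
```

The large discrepancy between the two estimates foreshadows the F-test developed below.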
The comparison of these 2 estimates of σ² will now be put on a rigorous basis, where it is shown that the two estimates of σ² can be compared by an F-test. The method developed here for testing H: μ1 = μ2 = μ3 is known as Analysis of Variance (often abbreviated to AOV).
7.3

                                Totals       Means   No. of observations
  Observations
  x11 x12 ... x1n1              T1 = x1.     x̄1.     n1
  x21 x22 ... x2n2              T2 = x2.     x̄2.     n2
  ...                           ...          ...     ...
  xk1 xk2 ... xknk              Tk = xk.     x̄k.     nk
Notation

$$x_{i.} = \sum_{j=1}^{n_i} x_{ij}, \qquad n = \sum_{i=1}^{k} n_i, \qquad x_{..} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}.$$
$$\sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar x_{..})^2 = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar x_{i.})^2 + \sum_{i=1}^{k} n_i(\bar x_{i.} - \bar x_{..})^2 \qquad (7.1)$$

Proof

$$\sum_{i,j}(x_{ij} - \bar x_{..})^2 = \sum_{i,j}\big[(x_{ij} - \bar x_{i.}) + (\bar x_{i.} - \bar x_{..})\big]^2
= \sum_{i,j}(x_{ij} - \bar x_{i.})^2 + \sum_i n_i(\bar x_{i.} - \bar x_{..})^2 + 2\sum_i (\bar x_{i.} - \bar x_{..})\underbrace{\sum_j (x_{ij} - \bar x_{i.})}_{=0}$$
1. $\sum_{i,j}(x_{ij} - \bar x_{..})^2$ = Total sum of squares (of deviations from the grand mean).

2. Notice that $\sum_j (x_{ij} - \bar x_{i.})^2$ is just the total sum of squares of deviations from the mean in the ith sample, and summing over these from all k groups then gives
$$SS_W = \sum_{i,j}(x_{ij} - \bar x_{i.})^2.$$
This sum of squares is only affected by the variability of observations within each sample and so is called the within subgroups sum of squares.

3. The third term is the sum of squares obtained from the deviations of the sample means from the overall (grand) mean and depends on the variability between the sample means. That is,
$$SS_B = \sum_i n_i(\bar x_{i.} - \bar x_{..})^2.$$
The hypothesis to be tested is
$$H: \mu_1 = \mu_2 = \dots = \mu_k \qquad (7.2)$$
(that is, there is no difference between group means) and the alternative,

A: the μi are not all equal.

Note, this is not the same as saying μ1 ≠ μ2 ≠ ... ≠ μk.

We will now find the probability distributions of SST, SSB, SSW under H.
(7.3)
(ii) If only the ith group is considered then $s_i^2 = \sum_j (x_{ij} - \bar x_{i.})^2/(n_i - 1)$ is also an unbiased estimate of σ². The k unbiased estimates, s₁², s₂², ..., s_k², can then be pooled to obtain another unbiased estimate of σ², that is
$$s^2 = \sum_i \sum_j (x_{ij} - \bar x_{i.})^2 \Big/ \Big(\sum_i n_i - k\Big). \qquad (7.4)$$
It can be shown that the random variables in (7.4) and (7.5) are independent. (The proof is not given as it is beyond the scope of this unit.) Thus from (7.1) we have
$$\frac{1}{\sigma^2}\sum_{i,j}(X_{ij} - \bar X_{..})^2 = \frac{1}{\sigma^2}\sum_{i,j}(X_{ij} - \bar X_{i.})^2 + \frac{1}{\sigma^2}\sum_{i=1}^{k} n_i(\bar X_{i.} - \bar X_{..})^2.$$
That is, if H is true,
$$\frac{SS_B/(k-1)}{SS_W/(n-k)} \sim F_{k-1,\,n-k}. \qquad (7.6)$$
  Source        df      Mean Square      F
  Between gps   k−1     SSB/(k−1)        [SSB/(k−1)] / [SSW/(n−k)]
  Within gps    n−k     SSW/(n−k)
  Total         n−1

Note that the term mean square (MS) is used to denote (sum of squares)/df.
Method of Computation
In order to calculate the sums of squares in the AOV table it is convenient to express the
sums of squares in a different form.
Total SS

$$SS_T = \sum_{i,j}(x_{ij} - \bar x_{..})^2 = \sum_{i,j} x_{ij}^2 - \frac{T^2}{n} \qquad (7.7)$$

where $\sum_{i,j} x_{ij}^2$ is called the raw sum of squares and $\frac{T^2}{n}$ is called the correction term.

Between Groups SS

$$SS_B = \sum_i \frac{T_i^2}{n_i} - \frac{T^2}{n} \qquad (7.8)$$

since
$$SS_B = \sum_i n_i(\bar x_{i.} - \bar x_{..})^2
= \sum_i n_i \bar x_{i.}^2 - 2\bar x_{..}\underbrace{\sum_i n_i \bar x_{i.}}_{n\bar x_{..}} + \bar x_{..}^2\sum_i n_i
= \sum_i n_i\frac{T_i^2}{n_i^2} - 2\frac{T}{n}\sum_i T_i + n\frac{T^2}{n^2}
= \sum_i \frac{T_i^2}{n_i} - \frac{T^2}{n}.$$

The same correction term is used here as appeared in the calculation of SST.
Within Groups SS
Since, SST = SSB + SSW , SSW is found by subtracting SSB from SST . Similarly the df
for within groups is found by subtracting k 1 from n 1.
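The computational forms can be checked against the definitions on a small hypothetical data set:

```r
# two hypothetical groups
g1 <- c(3, 5, 4); g2 <- c(6, 8, 7, 9)
x <- c(g1, g2); n <- length(x); T <- sum(x)
SST.def  <- sum((x - mean(x))^2)            # SST by definition
SST.comp <- sum(x^2) - T^2/n                # raw SS minus correction term, (7.7)
SSB <- sum(g1)^2/3 + sum(g2)^2/4 - T^2/n    # (7.8)
SSW <- SST.comp - SSB                       # within groups SS by subtraction
c(SST.def, SST.comp, SSB, SSW)
```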
$$\bar X_{..} = \frac{1}{n}\sum_{i,j} X_{ij} \qquad (7.9)$$

$$\bar X_{..} = \frac{1}{n}\sum_i n_i \bar X_{i.} \qquad (7.10)$$
Theorem 7.2

With SSW and SSB as defined earlier, and $\bar\mu = \sum_i n_i\mu_i/n$,

(a) $E(SS_W) = (n - k)\sigma^2$.

(b) $E(SS_B) = (k - 1)\sigma^2 + \sum_i n_i(\mu_i - \bar\mu)^2$.

Proof of (a)

$$E(SS_W) = E\Big[\sum_i\sum_j (X_{ij} - \bar X_{i.})^2\Big]
= \sum_{i=1}^{k} E\Big[\sum_j (X_{ij} - \bar X_{i.})^2\Big]
= \sum_{i=1}^{k} (n_i - 1)\sigma^2.$$

Thus
$$E\Big[\sum_{i,j}(X_{ij} - \bar X_{i.})^2\Big] = (n-k)\sigma^2.$$
Proof of (b)

$$E(SS_B) = E\Big[\sum_i n_i(\bar X_{i.} - \bar X_{..})^2\Big]
= E\Big[\sum_i n_i \bar X_{i.}^2 - 2\bar X_{..}\underbrace{\sum_i n_i \bar X_{i.}}_{n\bar X_{..}} + n\bar X_{..}^2\Big]
= E\Big[\sum_i n_i \bar X_{i.}^2 - n\bar X_{..}^2\Big]. \qquad (7.11)$$
$$E(SS_B) = \sum_i n_i\Big(\frac{\sigma^2}{n_i} + \mu_i^2\Big) - n\Big(\frac{\sigma^2}{n} + \bar\mu^2\Big)
= k\sigma^2 + \sum_i n_i\mu_i^2 - \sigma^2 - n\bar\mu^2.$$

That is,
$$E\Big[\sum_i n_i(\bar X_{i.} - \bar X_{..})^2\Big] = (k-1)\sigma^2 + \sum_i n_i(\mu_i - \bar\mu)^2 \qquad (7.12)$$

Note that sometimes Theorem 7.2 is stated in terms of the expected mean squares instead of expected sums of squares.
These results are summarized in the table below.

  Source of Variation   Sum of Squares (SS)   df    Mean Square (MS)   E(Mean Square)
  Between gps           SSB                   k−1   SSB/(k−1)          σ² + Σᵢ nᵢ(μᵢ − μ̄)²/(k−1)
  Within gps            SSW                   n−k   SSW/(n−k)          σ²
  Total                 SST                   n−1

If H is true, then
$$\frac{SS_B/(k-1)}{SS_W/(n-k)} \sim F_{k-1,\,n-k}.$$
However, if H is not true and the μi are not all equal, then
$$\sum_i n_i(\mu_i - \bar\mu)^2/(k-1) > 0,$$
and the observed value of the F ratio will tend to be large, so that large values of F will tend to cast doubt on the hypothesis of equal means. That is, H is rejected at significance level α if
$$\frac{SS_B/(k-1)}{SS_W/(n-k)} > F_\alpha(k-1,\,n-k),$$
where $F_\alpha(k-1, n-k)$ is obtained from tables. The significance level α is usually taken as 5%, 1% or .1%.
Note: The modern approach is to find the probability that the observed value of F (or one larger) would have been obtained by chance under the assumption that the hypothesis is true, and use this probability to make inferences. That is, find
$$P(F \ge F_{\text{observed}})$$
and if it is small (usually less than 5%) claim that it provides evidence that the hypothesis
is false. The smaller the probability the stronger the claim we are able to make. To use this
approach ready access to a computer with suitable software is required. With tables we can
only approximate this procedure since exact probabilities cannot in general be obtained.
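In R the exact P-value is one call to pf(); for example, for an observed F of 3.51 on 3 and 16 df (the values that arise in Example 7.2 below):

```r
# P(F >= 3.51) for F on (3, 16) degrees of freedom
p <- pf(3.51, df1 = 3, df2 = 16, lower.tail = FALSE)
p   # about 0.04
```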
Comments

1. SSW/(n − k), the within groups mean square, provides an unbiased estimate of σ² whether or not H is true.

2. When finding the F ratio in an AOV, the between groups mean square always forms the numerator. This is because its expected value is always greater than or equal to the expected value of the within groups mean square (see 7.12). This is one case where one doesn't automatically put the larger estimate of variance in the numerator. If H is true, both SSB/(k − 1) and SSW/(n − k) are estimates of σ², and in practice either one may be the larger. However, small values of F always support the hypothesis, so that if F < 1 it is always non-significant.
Example 7.2
Suppose we have 4 kinds of feed (diets) and it is desired to test whether there is any significant difference in the average weight gains by certain animals fed on these diets. Twenty (20) animals were selected for the experiment and allocated randomly to the diets, 5 animals to each. The weight increases after a period of time were as follows.

  Diet   Observations     Ti = Σj xij   x̄i    ni
  A      7  8  8 10 12        45        9.0    5
  B      5  5  6  6  8        30        6.0    5
  C      7  6  8  9 10        40        8.0    5
  D      5  7  7  8  8        35        7.0    5
                           T = 150            20
Solution: Let the random variable Xij be the weight gain of the jth animal receiving the ith diet, where Xij ∼ N(μi, σ²), j = 1, ..., 5. Test the hypothesis that all diets were equally effective, that is

H: μ1 = μ2 = μ3 = μ4 (= μ, say).
Calculations

$$SS_T = \sum_{i,j} x_{ij}^2 - \frac{T^2}{n} = 7^2 + \dots + 8^2 - \frac{150^2}{20} = 63$$

$$SS_B = \sum_i (T_i^2/n_i) - \frac{T^2}{n} \quad \text{from (7.8)}
= \frac{45^2}{5} + \frac{30^2}{5} + \frac{40^2}{5} + \frac{35^2}{5} - \frac{150^2}{20} = 25$$

Within diets SS = SSW = 63 − 25 = 38.

  Source          SS   df   MS      F
  Between diets   25    3   8.333   3.51*
  Within diets    38   16   2.375
  Total           63   19
The 5% critical value of F3,16 is 3.24, so the observed value of 3.51 is significant at the 5% level. Thus there is some reason to doubt the hypothesis that all the diets are equally effective, and we conclude that there is a significant difference in weight gain produced by at least one of the diets when compared to the other diets.
Computer Solution:
#___________ Diets.R ________
Feed <- expand.grid(unit=1:5,Diet=LETTERS[1:4])
Feed$wtgain <- c(7,8,8,10,12,5,5,6,6,8,7,6,8,9,10,5,7,7,8,8)
weight.aov <- aov(wtgain ~ Diet,data=Feed)
print(summary(weight.aov) )
            Df Sum Sq Mean Sq F value Pr(>F)
Diet         3     25   8.333    3.51   0.04 *
Residuals   16     38   2.375
---
R gives a P value of 0.04, which indicates significance at the 4% level, confirming the result (P < 5%) obtained above.
7.4
Having found a difference between the means, our job is not finished. We want to try to find exactly where the differences are. First we want to estimate the means and their standard errors.
It is useful to find confidence intervals for these means. The best estimate for μi, the mean of the ith group, is given by
$$\bar x_i = \frac{\sum_j x_{ij}}{n_i}, \quad i = 1, 2, \dots, k,$$
where ni = number of observations in the ith group. A 100(1 − α)% confidence interval for μi is then
$$\bar x_i \pm t_{\nu,\alpha/2}\,\frac{s}{\sqrt{n_i}}$$
where s² is the estimate of σ² given by the within groups (residual) mean square (in the AOV table) and is thus on ν = n − k degrees of freedom.
For straightforward data such as these, the means and their standard errors are calculated
with model.tables(),
print(model.tables(weight.aov,type="means",se=T))
Tables of means
Grand mean
7.5
Diet
A B C D
9 6 8 7
Standard errors for differences of means
Diet
0.9747
replic.
5
Note: the output specifies that this is the standard error of the differences of means, where s.e. of a difference = √2 × s.e. of a mean.
7.5

The assumptions required for the validity of the AOV procedure are that:

(i) each of the k samples is from a normal population;
(ii) each sample can be considered as being drawn randomly from one of the populations;
(iii) samples are independent of each other;
(iv) all k populations have a common variance (homogeneity of variance).

If these assumptions are violated then conclusions made from the AOV procedure may be incorrect. We need to be able to verify whether the assumptions are valid.

Assumption (i) may be tested using a chi-square goodness-of-fit test (Chapter 6), while careful planning of the experiment should ensure that (ii) and (iii) hold.

There are several tests for (iv), three of which follow.
7.5.1 Bartlett's Test
Let S₁², S₂², ..., Sk² be sample variances based on samples of sizes n₁, n₂, ..., nk. The samples are assumed to be drawn from normal distributions with variances σ₁², σ₂², ..., σk² respectively. Write νᵢ = nᵢ − 1 and define
$$Q = \frac{\left(\sum_i \nu_i\right)\log_e S^2 - \sum_i \nu_i \log_e S_i^2}{1 + \frac{1}{3(k-1)}\left[\sum_i \frac{1}{\nu_i} - \frac{1}{\sum_i \nu_i}\right]} \qquad (7.13)$$
where $S^2 = \sum_i \nu_i S_i^2 \big/ \sum_i \nu_i$.

Then under the hypothesis

H: σ₁² = σ₂² = ... = σk²,

Q is distributed approximately as χ²ₖ₋₁. The approximation is not very good for small nᵢ. The hypothesis is tested by calculating Q from (7.13) and comparing it with χ²ₖ₋₁,α found from tables of the chi-square distribution.
Example 7.3
Suppose we have 5 sample variances, 15.9, 6.1, 21.0, 3.8, 30.4, derived from samples of sizes 7, 8, 7, 6, 7 respectively. Test for the equality of the population variances.

Solution: Fmax = s²max/s²min = 30.4/3.8 = 8.0. This is probably large enough to require further checking.

For Bartlett's test, first pool the sample variances to get S²:
$$S^2 = \frac{\sum_i \nu_i S_i^2}{\sum_i \nu_i} = 15.5167.$$
Then from (7.13) we obtain Q = 7.0571, which is distributed approximately as a chi-square on 4 df. This is non-significant (P = 0.13 using R). Hence we conclude that the sample variances are compatible and we can regard them as 5 estimates of the one population variance, σ².
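Computing Q from (7.13) from first principles is a few lines; a sketch (any small difference from the value quoted above comes down to rounding in the hand calculation):

```r
s2 <- c(15.9, 6.1, 21.0, 3.8, 30.4)   # sample variances
nu <- c(7, 8, 7, 6, 7) - 1             # degrees of freedom, n_i - 1
k <- length(s2)
S2 <- sum(nu * s2) / sum(nu)           # pooled variance, 15.5167
num <- sum(nu) * log(S2) - sum(nu * log(s2))
den <- 1 + (sum(1/nu) - 1/sum(nu)) / (3 * (k - 1))
Q <- num / den
P <- pchisq(Q, df = k - 1, lower.tail = FALSE)
c(Q, P)                                # Q near 7.1, P about 0.13
```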
It should be stressed that in both these tests the theory is based on the assumption that the k random samples are from normal populations. If this is not true, a significant value of Fmax or Q may indicate departure from normality rather than heterogeneity of variance. Tests of this kind are more sensitive to departures from normality than the ordinary AOV. Levene's test (which follows) does appear to be robust to the assumption of normality, particularly when medians (instead of means, as were used when the test was first proposed) are used in its definition.
Levene's Test

Let Vij = |Xij − μ̃i| where μ̃i is the median of the ith group, i = 1, 2, ..., k, j = 1, 2, ..., ni. That is, Vij is the absolute deviation of the jth observation in the ith group from the median of the ith group. To test the hypothesis σ₁² = σ₂² = ... = σk² against the alternative that they are not all equal, we carry out a one-way AOV using the Vij as the data.
This procedure has proven to be quite robust even for small samples and performs well even if
the original data is not normally distributed.
Computer Exercise 7.1
Use R to test for homogeneity of variance for the data in Example 7.2.
Solution:
Bartlett's Test
print(bartlett.test(wtgain ~ Diet,data=Feed) )
Bartlett test of homogeneity of variances
data: wtgain by Diet
Bartlett's K-squared = 1.3, df = 3, p-value = 0.7398
Levene's Test — The first solution derives Levene's test from first principles.

# Calculate the medians for each group.
attach(Feed, pos=3)
med <- tapply(wtgain, INDEX=Diet, FUN=median)
med <- rep(med, rep(5,4))
> med
A A A A A B B B B B C C C C C D D D D D
8 8 8 8 8 6 6 6 6 6 8 8 8 8 8 7 7 7 7 7
# Find v, the absolute deviations of each observation from the group median.
v <- abs(wtgain - med)
# Analysis of variance using v (Levene's test).
levene <- aov(v ~ Diet)
summary(levene)
Diet
Residuals
There is a function levene.test() which is part of the car library of R. It would be necessary
to download this library from CRAN to use the function.
library(car)
print(levene.test(Feed$wtgain,Feed$Diet))
Levene's Test for Homogeneity of Variance
        Df F value Pr(>F)
group    3    0.37   0.78
        16
Bartlett's Test (P = 0.744) and Levene's test (P = 0.7776) both give non-significant results, so there appears no reason to doubt the hypothesis σ²_A = σ²_B = σ²_C = σ²_D.
7.6

If an AOV results in a non-significant F-ratio, we can regard the k samples as coming from populations with the same mean μ (or coming from the same population). Then it is desirable to find both point and interval estimates of μ. Clearly the best estimate of μ is $\bar x = \sum_{i,j} x_{ij}/n$ where $n = \sum_i n_i$. A 100(1 − α)% confidence interval for μ is
$$\bar x \pm t_{\nu,\alpha/2}\,\frac{s}{\sqrt{n}},$$
where s² is the estimate of σ² given by the within groups mean square (in the AOV table) and is thus on ν = n − k degrees of freedom.
Chapter 8
8.1 Introduction
Frequently an investigator observes two variables X and Y and is interested in the relationship
between them. For example, Y may be the concentration of an antibiotic in a person's blood and
X may be the time since the antibiotic was administered. Since the effectiveness of an antibiotic
depends on its concentration in the blood, the objective may be to predict how quickly the
concentration decreases and/or to predict the concentration at a certain time after administration.
In problems of this type the value of Y will depend on the value of X and so we will observe
the random variable Y for each of n different values of X, say x1 , x2 , . . . , xn , which have been
determined in advance and are assumed known. Thus the data will consist of n pairs, (x1 , y1 ),
(x2 , y2 ), . . . , (xn , yn ). The random variable Y is called the dependent variable while X is called
the independent variable or the predictor variable. (Note this usage of the word independent has
no relationship to the probability concept of independent random variables.) It is important to
note that in simple linear regression problems the values of X are assumed to be known and so
X is not a random variable.
The aim is to find the relationship between Y and X. Since Y is a random variable, its value at any one X value cannot be predicted with certainty. Different determinations of the value of Y for a particular X will almost surely lead to different values of Y being obtained. Our initial aim is thus to predict E(Y) for a given X. In general there are many types of relationship that might be considered, but in this course we will restrict our attention to the case of a straight line. That is, we will assume that the mean value of Y is related to the value of X by
$$\mu_Y = E(Y) = \alpha + \beta X \qquad (8.1)$$
where the parameters α and β are constants which will in general be unknown and will need to be determined from the data. Equivalently,
$$Y = \alpha + \beta X + \epsilon \qquad (8.2)$$
where ε is a random variable, the difference between the observed value of Y and its expected value. ε is called the error or the residual and is assumed to have mean 0 and variance σ² for all values of X.

Corresponding to xi the observed value of Y will be denoted by yi, and they are then related by
$$y_i = \alpha + \beta x_i + \epsilon_i, \quad i = 1, 2, \dots, n. \qquad (8.3)$$
A graphical representation is given in Figure 8.1.
Figure 8.1: The observed point (xi, yi) and the corresponding point on the line, (xi, α + βxi); note that εi = yi − α − βxi.
Now α and β are unknown parameters, and the problem is to estimate them from the sample of observed values (x1, y1), (x2, y2), ..., (xn, yn). Two methods of estimation (least squares and maximum likelihood) will be considered.
8.2 Estimation of α and β

It is easy to recognize that α is the intercept of the line with the y-axis and β is the slope. A diagram showing the n points {(xi, yi), i = 1, 2, ..., n} is called a scatter plot, or scatter diagram. One simple method to obtain approximate estimates of the parameters is to plot the observed values and draw in roughly the line that best seems to fit the data, from which the intercept and slope can be obtained. This method, while it may sometimes be useful, has obvious deficiencies. A better method is required.

One such method is the method of least squares. The sum of squared errors
$$\sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 \qquad (8.4)$$
is minimised by differentiating with respect to α and β and putting the resulting expressions equal to zero. The results are stated in Theorem 8.1.
Theorem 8.1

The least squares estimates of β and α are
$$\hat\beta = \frac{\sum_i (x_i - \bar x)\,y_i}{\sum_i (x_i - \bar x)^2} \qquad (8.5)$$
$$\hat\alpha = \bar y - b\bar x. \qquad (8.6)$$

Proof

For convenience we will rewrite (8.3) as
$$y_i = \alpha_0 + \beta(x_i - \bar x) + \epsilon_i, \qquad (8.7)$$
where
$$\alpha = \alpha_0 - \beta\bar x, \qquad (8.8)$$
and the minimization will be with respect to α0 and β. From (8.7),
$$\sum_{i=1}^{n}\epsilon_i^2 = \sum_{i=1}^{n}\big[y_i - \alpha_0 - \beta(x_i - \bar x)\big]^2.$$

Setting the partial derivatives with respect to α0 and β equal to zero gives
$$\frac{\partial \sum \epsilon_i^2}{\partial \alpha_0} = -2\sum_{i=1}^{n}\big[y_i - \hat\alpha_0 - \hat\beta(x_i - \bar x)\big] = 0 \qquad (8.9)$$
$$\frac{\partial \sum \epsilon_i^2}{\partial \beta} = -2\sum_{i=1}^{n}\big[y_i - \hat\alpha_0 - \hat\beta(x_i - \bar x)\big](x_i - \bar x) = 0 \qquad (8.10)$$
where α̂0 and β̂ are the solutions of the equations. Equations (8.9) and (8.10) are referred to as the normal equations.

From (8.9) we have
$$\sum_{i=1}^{n} y_i = n\hat\alpha_0 + \hat\beta\underbrace{\sum_{i=1}^{n}(x_i - \bar x)}_{=0}.$$
So
$$\hat\alpha_0 = \frac{\sum_i y_i}{n} = \bar y.$$
From (8.10),
$$\sum_i y_i(x_i - \bar x) = \hat\alpha_0\underbrace{\sum_i (x_i - \bar x)}_{=0} + \hat\beta\sum_i (x_i - \bar x)^2,$$
giving
$$\hat\beta = \frac{\sum_i (x_i - \bar x)\,y_i}{\sum_i (x_i - \bar x)^2}.$$
Then using (8.8) the estimate of α is
$$\hat\alpha = \bar y - b\bar x.$$
Comments

1. No assumptions about the distribution of the errors, εi, were made (or needed) in the proof of Theorem 8.1. The assumptions will be required to derive the properties of the estimators and for statistical inference.

2. For convenience, the least squares estimators α̂, α̂0 and β̂ will sometimes be denoted by a, a0 and b. This should not cause confusion.

3. The estimators of α and β derived here are the Best Linear Unbiased Estimators (known as the BLUEs) in that they are

(i) linear combinations of y1, y2, ..., yn;
(ii) unbiased;
(iii) of all possible linear estimators they are best in the sense of having minimum variance.
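A quick numerical check of (8.5) and (8.6) on hypothetical data, comparing the formulas with R's lm():

```r
# hypothetical data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
b <- sum((x - mean(x)) * y) / sum((x - mean(x))^2)   # slope, (8.5)
a <- mean(y) - b * mean(x)                            # intercept, (8.6)
fit <- coef(lm(y ~ x))                                # same estimates via lm()
c(a, b)
```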
Assuming now that the εi are independent N(0, σ²) random variables, the likelihood is
$$L = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{\epsilon_i}{\sigma}\right)^2}.$$
Since $\epsilon_i = y_i - \alpha_0 - \beta(x_i - \bar x)$, the likelihood can be written as
$$L = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{y_i - \alpha_0 - \beta(x_i - \bar x)}{\sigma}\right)^2\right]
$$
so that
$$\log(L) = -n\log(\sigma\sqrt{2\pi}) - \frac{1}{2}\sum_{i=1}^{n}\left(\frac{y_i - \alpha_0 - \beta(x_i - \bar x)}{\sigma}\right)^2.$$
Differentiating log(L) with respect to α0, β and σ², and setting the resultant equations equal to zero, gives
$$\sum_{i=1}^{n}\big[y_i - \hat\alpha_0 - \hat\beta(x_i - \bar x)\big] = 0$$
$$\sum_{i=1}^{n}(x_i - \bar x)\big[y_i - \hat\alpha_0 - \hat\beta(x_i - \bar x)\big] = 0$$
$$-\frac{n}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4}\sum_{i=1}^{n}\big[y_i - \hat\alpha_0 - \hat\beta(x_i - \bar x)\big]^2 = 0.$$
The first two of these equations are just the normal equations, (8.9) and (8.10), obtained previously by the method of least squares. Thus the maximum likelihood estimates of α and β are identical to the estimates (8.5) and (8.6) obtained by the method of least squares. The maximum likelihood estimate of σ² is
$$\hat\sigma^2 = \frac{\sum_{i=1}^{n}\big[y_i - \hat\alpha_0 - \hat\beta(x_i - \bar x)\big]^2}{n}.$$
This estimate is biased.
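The bias lies entirely in the divisor: the ML estimate divides by n, while the unbiased estimate of Theorem 8.3 divides by n − 2. A sketch on hypothetical data:

```r
# hypothetical data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
fit <- lm(y ~ x)
e <- resid(fit)
n <- length(y)
sig2.mle <- sum(e^2) / n          # ML estimate: divisor n (biased)
sig2.unb <- sum(e^2) / (n - 2)    # unbiased estimate: divisor n - 2
c(sig2.mle, sig2.unb)
```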
Comments

1. The fitted line, $\widehat{E(Y)} = \hat\alpha_0 + \hat\beta(x - \bar x)$, is called the regression line of Y on X.

2. The regression line passes through the point (x̄, ȳ).

3. In our notation we will not distinguish between the random variables α̂ and β̂ and their observed values.

4. The estimate of σ² can be written as
$$\hat\sigma^2 = \frac{\sum_{i=1}^{n} e_i^2}{n} \quad \text{where } e_i = y_i - \hat\alpha - \hat\beta x_i. \qquad (8.11)$$
(Here all sums are over i = 1, 2, ..., n.) To verify that these are equivalent, note that
$$\sum(x_i - \bar x)(y_i - \bar y) = \sum(x_i - \bar x)y_i - \bar y\underbrace{\sum(x_i - \bar x)}_{=0}
= \sum x_i y_i - \bar x\sum y_i = \sum x_i y_i - \frac{\sum x_i \sum y_i}{n}. \qquad (8.13)$$
Theorem 8.2

$$\mathrm{Var}(\hat\alpha_0) = \frac{\sigma^2}{n}, \qquad \mathrm{Var}(\hat\beta) = \frac{\sigma^2}{\sum(x_i - \bar x)^2}, \qquad (8.14)$$
$$\mathrm{cov}(\hat\alpha_0, \hat\beta) = 0. \qquad (8.15)$$

Proof — First,
$$\hat\alpha_0 = \frac{1}{n}Y_1 + \frac{1}{n}Y_2 + \dots + \frac{1}{n}Y_n,$$
so that
$$E(\hat\alpha_0) = \frac{1}{n}\sum_i E(Y_i) = \frac{1}{n}\sum_i\big(\alpha_0 + \beta(x_i - \bar x)\big)
= \alpha_0 + \frac{\beta}{n}\underbrace{\sum_i (x_i - \bar x)}_{=0} = \alpha_0.$$
Secondly,
$$\hat\beta = \frac{x_1 - \bar x}{\sum(x_i - \bar x)^2}Y_1 + \frac{x_2 - \bar x}{\sum(x_i - \bar x)^2}Y_2 + \dots + \frac{x_n - \bar x}{\sum(x_i - \bar x)^2}Y_n,$$
giving
$$E(\hat\beta) = \frac{1}{\sum(x_i - \bar x)^2}\Big[(x_1 - \bar x)\big(\alpha_0 + \beta(x_1 - \bar x)\big) + \dots + (x_n - \bar x)\big(\alpha_0 + \beta(x_n - \bar x)\big)\Big]
= \frac{1}{\sum(x_i - \bar x)^2}\Big[\alpha_0\underbrace{\sum_i(x_i - \bar x)}_{=0} + \beta\sum_i(x_i - \bar x)^2\Big] = \beta.$$
Next,
$$\mathrm{Var}(\hat\alpha_0) = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n},$$
and
$$\mathrm{Var}(\hat\beta) = \frac{1}{\big[\sum_i(x_i - \bar x)^2\big]^2}\big[(x_1 - \bar x)^2\sigma^2 + \dots + (x_n - \bar x)^2\sigma^2\big]
= \frac{\sigma^2}{\big[\sum_i(x_i - \bar x)^2\big]^2}\sum_i(x_i - \bar x)^2 = \frac{\sigma^2}{\sum_i(x_i - \bar x)^2}.$$
Finally,
$$\mathrm{cov}(\hat\alpha_0, \hat\beta) = \sigma^2\,\frac{1}{n\sum_i(x_i - \bar x)^2}\underbrace{\big[(x_1 - \bar x) + \dots + (x_n - \bar x)\big]}_{=0} = 0.$$
It follows (Corollary 8.2.1) that
$$\mathrm{Var}(\hat\alpha) = \sigma^2\left(\frac{1}{n} + \frac{\bar x^2}{\sum(x_i - \bar x)^2}\right).$$

8.3 Estimation of σ²

Theorem 8.3

Assuming that E(Y) = α + βX and Var(Y) = σ², then
$$\hat\sigma^2 = \frac{1}{n-2}\sum_i\big(y_i - \bar y - b(x_i - \bar x)\big)^2 \qquad (8.16)$$
is an unbiased estimate of σ².
Proof

(ii) $Y_i = \alpha + \beta x_i + \epsilon_i$, and
$$\bar Y = \frac{\sum_i Y_i}{n} = \alpha + \beta\bar x + \frac{\sum_j \epsilon_j}{n},$$
so that
$$Y_i - \bar Y = \beta(x_i - \bar x) + \epsilon_i - \frac{\sum_j \epsilon_j}{n}. \qquad (8.17)$$
Then
$$\sum_i\big(y_i - \bar y - \hat\beta(x_i - \bar x)\big)^2
= \sum_i(y_i - \bar y)^2 - 2\hat\beta\sum_i(x_i - \bar x)(y_i - \bar y) + \hat\beta^2\sum_i(x_i - \bar x)^2
= \sum_i(y_i - \bar y)^2 - \hat\beta^2\sum_i(x_i - \bar x)^2, \qquad (8.18)$$
since $\sum_i(x_i - \bar x)(y_i - \bar y) = \hat\beta\sum_i(x_i - \bar x)^2$.

Now
$$E\Big[\sum_i(Y_i - \bar Y)^2\Big] = \sum_i\Big[\Big(1 - \frac{1}{n}\Big)\sigma^2 + \beta^2(x_i - \bar x)^2\Big]
= (n-1)\sigma^2 + \beta^2\sum_i(x_i - \bar x)^2.$$
Also
$$E\Big[\hat\beta^2\sum_i(x_i - \bar x)^2\Big] = \sum_i(x_i - \bar x)^2\big[\mathrm{Var}(\hat\beta) + (E(\hat\beta))^2\big]
= \sigma^2 + \beta^2\sum_i(x_i - \bar x)^2.$$
Hence
$$E\Big[\sum_i\big(Y_i - \bar Y - \hat\beta(x_i - \bar x)\big)^2\Big] = (n-1)\sigma^2 - \sigma^2 = (n-2)\sigma^2.$$
Thus σ̂² given by (8.16) is an unbiased estimate of σ².
8.4 Inference about α̂, β̂ and Ŷ
So far we have not assumed any particular probability distribution of the εi or, equivalently, the Yi. To find confidence intervals for α̂0, β̂ and Ŷ let us now assume that the εi are normally and independently distributed (with means 0 and variances σ²). Since
$$Y_i = \alpha_0 + \beta(x_i - \bar x) + \epsilon_i = \alpha + \beta x_i + \epsilon_i,$$
it follows that the Yi are independently distributed N(α + βxi, σ²). Since α̂0 and β̂ are linear combinations of the Yi, and both α̂ and Ŷi are linear combinations of α̂0 and β̂, then each of α̂0, α̂, β̂ and Ŷi is normally distributed. The means and variances of α̂0 and β̂ are given in Theorem 8.2. Means and variances for α̂ are given in Corollary 8.2.1.

Now it can be shown that α̂0 and β̂ are independent of σ̂² given in (8.16), so hypotheses about these parameters may be tested and confidence intervals for them found in the usual way, using the t-distribution. Thus to test the hypothesis H: β = β0, we use the fact that
$$\frac{\hat\beta - \beta_0}{\sqrt{\widehat{\mathrm{Var}}(\hat\beta)}} \sim t_{n-2}.$$
A 100(1 − α)% CI for β is given by
$$\hat\beta \pm t_{n-2,\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat\beta)}.$$
Similarly, to test H: α = α0 we use
$$\frac{\hat\alpha - \alpha_0}{\sqrt{\widehat{\mathrm{Var}}(\hat\alpha)}} \sim t_{n-2},$$
and a 100(1 − α)% CI for α can be found using
$$\hat\alpha \pm t_{n-2,\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat\alpha)}.$$

For Ŷi,
$$E(\hat Y_i) = \alpha_0 + \beta(x_i - \bar x) \qquad (8.19)$$
and, since cov(α̂0, β̂) = 0,
$$\mathrm{Var}(\hat Y_i) = \mathrm{Var}(\hat\alpha_0) + (x_i - \bar x)^2\,\mathrm{Var}(\hat\beta)
= \frac{\sigma^2}{n} + \frac{(x_i - \bar x)^2\sigma^2}{\sum_i(x_i - \bar x)^2}
= \sigma^2\left[\frac{1}{n} + \frac{(x_i - \bar x)^2}{\sum_i(x_i - \bar x)^2}\right] \qquad (8.20)$$
so that a 100(1 − α)% confidence interval for E(Ŷi) is given by $\hat Y_i \pm t_{n-2,\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat Y_i)}$. That is,
$$\hat\alpha_0 + \hat\beta(x_i - \bar x) \pm t_{n-2,\alpha/2}\,\hat\sigma\sqrt{\frac{1}{n} + \frac{(x_i - \bar x)^2}{\sum_i(x_i - \bar x)^2}}. \qquad (8.21)$$
Comment

Notice that in (8.18), $y_i - \bar y - \hat\beta(x_i - \bar x) = e_i$ is an estimate of the (true) error, so the left hand side is called the Error Sum of Squares (Error SS). The first term on the right hand side, $\sum(y_i - \bar y)^2$, is the total variation in y (Total SS), and $\hat\beta^2\sum(x_i - \bar x)^2$ is the Sum of Squares due to the regression, which we will call the Regression SS. Thus in words, (8.18) can be expressed in the form

Error SS = Total SS − Regression SS.

This information can be summarised in an Analysis of Variance Table, similar in form to that used in the single factor analysis of variance.
  Source       df     SS                                           MS
  Regression   1      β̂² Σ(xi − x̄)²                               β̂² Σ(xi − x̄)²
  Error        n−2    Σ(yi − ȳ)² − β̂² Σ(xi − x̄)²                 [Σ(yi − ȳ)² − β̂² Σ(xi − x̄)²]/(n − 2)
  Total        n−1    Σ(yi − ȳ)²

It can be shown that the ratio of the Regression MS to the Error MS has an F distribution on 1 and n − 2 df and provides a test of the hypothesis H: β = 0, which is equivalent to the t-test above.
Question: Why should you expect these two tests to be equivalent?
Example 8.1

The following data refer to age (x) and blood pressure (y) of 12 women.

  x    56   42   72   36   63   47   55   49   38   42   68   60
  y   147  125  160  118  149  128  150  145  115  140  152  155

$$\sum x = 628, \quad \sum y = 1684, \quad \sum xy = 89894, \quad \sum x^2 = 34416, \quad \sum y^2 = 238822.$$

(i) Regression coefficients in the equation $\hat Y = \hat\alpha_0 + \hat\beta(x - \bar x)$ are given by
$$\hat\alpha_0 = \bar y = 1684/12 = 140.33$$
$$\hat\beta = \frac{89894 - \frac{1057552}{12}}{34416 - \frac{394384}{12}} = \frac{1764.667}{1550.667} = 1.138.$$
The regression equation is
$$\hat y = 140.33 + 1.138(x - 52.33) = 80.778 + 1.138x.$$
(ii) The estimate of σ² is
$$\hat\sigma^2 = \frac{2500.67 - 2008.19}{10} = 49.25.$$
Hence
$$t = \frac{\hat\beta - 0}{\text{estimated sd of } b} = \frac{1.138}{\sqrt{49.25/1550.667}} = 6.4.$$
Comparing this with the critical value from the t-distribution on 10 degrees of freedom, we see that our result is significant at the .1% level.

(iii) For x = 45, Ŷ = 80.778 + 1.138 × 45 = 132.00. Now the 95% confidence limits for the mean blood pressure of women aged 45 years are
$$132.00 \pm 2.228\sqrt{49.25\left(\frac{1}{12} + \frac{(45 - 52.33)^2}{1550.67}\right)} = 132.00 \pm 5.37.$$
Computer Solution: Assuming the data for age (x) and blood pressure (y) are in the text file bp.txt,

  age   bloodpressure
   56   147
   42   125
   72   160
   36   118
   63   149
   47   128
   55   150
   49   145
   38   115
   42   140
   68   152
   60   155
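The commands that read the data and fit the model did not survive in these notes; the following is a reconstruction consistent with the object names (bp, bp.lm, bp.summ) used in the output below, entering the data directly rather than reading bp.txt.

```r
# reconstructed setup: the data of Example 8.1, entered directly
bp <- data.frame(
  age           = c(56, 42, 72, 36, 63, 47, 55, 49, 38, 42, 68, 60),
  bloodpressure = c(147, 125, 160, 118, 149, 128, 150, 145, 115, 140, 152, 155))
bp.lm <- lm(bloodpressure ~ age, data = bp)
bp.summ <- summary(bp.lm)   # kept for the covariance matrix computed later
print(anova(bp.lm))         # 1. the AOV table for the regression
```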
Response: bloodpressure
          Df  Sum Sq Mean Sq F value    Pr(>F)
age        1 2008.20 2008.20  40.778 7.976e-05 ***
Residuals 10  492.47   49.25
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2. We extract the coefficients for the regression line by using the summary() command.
> summary(bp.lm)
Call:
lm(formula = bloodpressure ~ age, data = bp)
Residuals:
Min
1Q Median
-9.02 -4.35 -3.09
3Q
6.11
Max
11.43
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)
80.778
9.544
8.46 7.2e-06 ***
age
1.138
0.178
6.39 8.0e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7 on 10 degrees of freedom
Multiple R-Squared: 0.803,
Adjusted R-squared: 0.783
F-statistic: 40.8 on 1 and 10 DF, p-value: 7.98e-05
You will notice that the summary command provides us with t-tests of the hypotheses α = 0 and β = 0 as well as the residual standard error (s) and R². The F-test of the hypothesis β = 0 is also reported and of course is identical to the F-test in the AOV table.
3. Confidence intervals for the regression coefficients
print(confint(bp.lm))
             2.5 % 97.5 %
(Intercept)  59.51  102.0
age           0.74    1.5
4. The variance-covariance matrix of the regression coefficients may be needed for further work.
VB <- bp.summ$sigma^2 * bp.summ$cov.unscaled
print(VB)
            (Intercept)    age
(Intercept)       91.1  -1.662
age             -1.662   0.032

print(sqrt(diag(VB)) )
A quick check shows the connection between the variance-covariance matrix of the regression coefficients and the standard errors of the regression coefficients. The diagonal of the matrix is (91.1, 0.032), and the square root of these numbers gives (9.54, 0.18), which are the standard errors of the regression coefficients.
5. We can now use our model to predict the blood pressure of 45 and 60 year old subjects. When the model is fitted, there are estimates of the regression coefficients, i.e. α̂ = 80.8 and β̂ = 1.14. Given a new value of x (say x = 45), the predicted value is ŷ = 80.8 + 1.14 × 45. The standard error and CI for this predicted value can also be calculated. This is achieved by supplying a new data frame of explanatory variables and calculating predictions with predict() and appropriate arguments.
This is achieved by supplying a new data frame of explanatory variables and calculating
predictions with predict() and appropriate arguments.
newdata <- data.frame(age=c(45,60) )
preds <- predict(bp.lm,new=newdata,interval="confidence")
newdata <- cbind(newdata,preds)
print(newdata)
age fit lwr upr
1 45 132 127 137
2 60 149 144 155
(You should make sure you can match this output up with the calculations made in Example 8.1. For example, σ̂² = 49.2 = Error MS. Also, from the AOV table, the F value is 40.78 = 6.39², where 6.39 is the value of the t statistic for testing the hypothesis β = 0.)
The fitted model and the code that produced it:

plot(bloodpressure ~ age, data=bp, las=1)
abline(lsfit(bp$age, bp$bloodpressure))
8.5 Correlation

The correlation between X and Y is
$$\rho = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}, \qquad -1 \le \rho \le 1.$$
To test H: ρ = 0 we use the fact that, under H,
$$\frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2}. \qquad (8.23)$$
Alternatively, a table of percentage points of the distribution of Pearson's correlation coefficient r when ρ = 0 may be used.
Example 8.2
Suppose a sample of 18 pairs from a bivariate normal distribution yields r = .32; test H0: ρ = 0 against H1: ρ > 0.

Solution: Now
$$\frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{.32\sqrt{16}}{\sqrt{1 - .1024}} = 1.35.$$
The probability of getting a value at least as large as 1.35 is determined from the t-distribution on 16 degrees of freedom,

> pt(df=16,q=1.35,lower.tail=F)
[1] 0.098

so there is no reason to reject H.
Example 8.3
Suppose a sample of 10 pairs from a bivariate normal distribution yields r = .51. Test H: ρ = 0 against A: ρ > 0.

Using the blood pressure data of Example 8.1, where r = .896 and n = 12,
$$\frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{.896\sqrt{10}}{\sqrt{1 - .896^2}} = 6.39.$$
Notice this is exactly the same t-value as obtained in the R output (and in Example 8.1) for testing the hypothesis H: β = 0. These tests are equivalent. That is, if the test of ρ = 0 is significant (non-significant), the test of β = 0 will also be significant (non-significant), with exactly the same P value.
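The equivalence is easy to verify numerically for the Example 8.1 data (entered directly here):

```r
age <- c(56, 42, 72, 36, 63, 47, 55, 49, 38, 42, 68, 60)
bp  <- c(147, 125, 160, 118, 149, 128, 150, 145, 115, 140, 152, 155)
r <- cor(age, bp)
t.rho  <- r * sqrt(10) / sqrt(1 - r^2)                          # test of rho = 0
t.beta <- summary(lm(bp ~ age))$coefficients["age", "t value"]  # test of beta = 0
c(t.rho, t.beta)    # both about 6.39
```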