Design of Experiments
Course Notes for STA2005S
Cover illustration: a 5 × 5 Latin square

3 5 1 2 4
1 2 5 4 3
5 1 4 3 2
2 4 3 5 1
4 3 2 1 5
August 21, 2019
Contents

1 Introduction
1.2 Definitions
Treatment Structure
An example
Randomisation
2.4 An Example
Speed-reading Example
The effect of different constraints on the solution to the normal equations
The general case: Parameter estimates for the single-factor completely randomised design
Speed-reading Example
F test of H0 : α1 = α2 = · · · = αa = 0
ANOVA table
Some Notes
Speed-reading Example
4.1 Introduction
4.2 Contrasts
Comments on Contrasts
Bonferroni Correction
Tukey’s Method
Scheffé’s Method
4.8 Summary
4.10 References
Model
Example
Estimation of µ, αi (i = 1, 2, . . . , a) and βj (j = 1, 2, . . . , b)
The Analysis of Variance Table for the Latin Square Design
Example
6 Power and Sample Size in Experimental Design
Interactions
Why are factorial experiments better than experimenting with one factor at a time?
Testing H0 : σ²a = 0 versus Ha : σ²a > 0
A Tables
Chapter 1
Introduction
There are two fundamental ways to obtain information in research: by observation or by ex-
perimentation. In an observational study the observer watches and records information about
the subject of interest. In an experiment the experimenter actively manipulates the variables
believed to affect the response. Contrast the two great branches of science: Astronomy, in which the astronomer observes the universe, and Physics, where knowledge is gained through the physicist changing the conditions under which phenomena are observed.
In the biological world, an ecologist may record the plant species that grow in a certain area
and also the rainfall and the soil type, then relate the condition of the plants to the rainfall and
soil type. This is an observational study. Contrast this with an experimental study in which the
biologist grows the plants in a greenhouse in various soils and with differing amounts of water.
He decides on the conditions under which the plants are grown and observes the effect of this
manipulation of conditions on the response.
Both observational and experimental studies give us information about the world around us,
but it is only by experimentation that we can infer causality. In a carefully planned experiment
if a change in variable A, say, results in a change in the response Y, then we can be sure
that A caused this change, because all other factors were controlled and held constant. In
an observational study if we note that as variable A changes Y changes, we can say that A
is associated with a change in Y but we cannot be certain that A itself was the cause of the
change.
Both observational and experimental studies need careful planning to be effective. In this course
we concentrate on the design of experimental studies.
Exercise
1.2 Definitions
For the most part the experiments we consider are comparative. Their aim is to compare the
effects of a number of treatments. The treatments are carefully chosen and controlled by the
experimenter.
1. Experimental Unit: this is the entity to which a treatment is assigned. Note that the
experimental unit may differ from the observational or sampling unit, which is the entity
from which a measurement is taken. For example, one may apply the treatment of ‘high
temperature and low water level’ to a pot of plants containing 5 individual plants. Then
we measure the growth of each of these plants. The experimental unit is the pot, the
observational units are the plants. This distinction is very important because it is the ex-
perimental units which determine the (experimental) error variance, not the observational
units. In other words, the response here will be the average growth in the pot (one value
per experimental unit).
2. If there are no distinguishable differences between the experimental units prior to the
experiment the experimental units are said to be homogeneous. The more homogeneous
the experimental units are the smaller the experimental error variance (natural variation
between observations which have received the same treatment) will be. It is generally desirable to have homogeneous experimental units for experiments, because this allows us to detect the differences between treatments more clearly.
3. If the experimental units are not homogeneous, but heterogeneous, we can group sets of
homogeneous experimental units and thereby account for differences between these groups.
This is called blocking.
5. In a single-factor experiment the treatments will correspond to the levels of the treatment
factor (e.g. for the water level experiment the treatments will be low, medium and high).
With more than one treatment factor, the treatments can be constructed by crossing all
factors: every possible combination of the levels of factor A and the levels of factor B is a
treatment. Experiments with crossed treatment factors are called factorial experiments.
More rarely in true experiments, factors can be nested (see Section 1.5).
6. In the experiments we will consider, each experimental unit receives one and only one
treatment. But each treatment can consist of a combination of factor levels, e.g. high
temperature combined with low water level can be one treatment, high temperature with
high water level another treatment.
7. The treatments are applied at random to the experimental units in such a way that each
unit is equally likely to receive a given treatment. The process of assigning treatments to
the experimental units in this way is called randomisation.
9. If a treatment is applied independently to more than one experimental unit it is said to
be replicated. Treatments must be replicated! Making more than one observation on the
same experimental unit is not replication. If the measurements on the same experimental
unit are taken over time, there are methods for repeated measures (longitudinal data); see
Ch.8. If the measurements are all taken at the same time, as in the pot with 5 plants
example above, this is just pseudoreplication. Pseudoreplication is a common fallacy
(Hurlbert 1984), and will invalidate the experiment.
10. We are mainly interested in the effects of the different treatments: by how much does the
response change with treatment i relative to the overall mean response. These are also
called fixed effects: we are interested in the change in response for each of the treatments
chosen. We can also formulate model parameters as random effects if we are interested in
the amount of variation in the response they contribute, rather than the exact change in
the response of each. This formulation is often used when the treatments or blocks are a
random sample from a population (see Ch.8).
There are two reasons why experiments should be conducted when possible:
1. An experiment is almost the only way in which one can control all factors to such an
extent as to eliminate any other possible explanation for a change in response other than
the treatment factor of concern. This then allows one to infer causality. This is achieved
by carefully controlling all factors which might influence the experiment.
2. Well designed experiments are easy to analyse. They give conclusive answers to the hypotheses that the experiment set out to test, telling us whether or not the treatment factor has an effect on the response, independent of any other factors in the model. This
is achieved through the experimental design, arranging treatments in such a way that
the estimates will be independent/orthogonal.
Experiments are frequently used to find optimal levels of settings (treatment factors) which will
maximise (or minimise) the response. Such experiments can save enormous amounts of time
and money.
There are three fundamental principles when planning experiments. These will help to ensure
the validity of the analysis and to increase its power:
1. Replication: each treatment must be applied independently to several experimental units. True, independent replication demands that the treatment is set up anew for each experimental unit; one should not set up the experiment for a specific treatment and then run all experimental units under that treatment at the same time. This would result in pseudoreplication, where effectively the treatment is applied only once. For example, if there are 2 teaching methods, each used on one class of students, there is no true replication. The experimental units are the two classes (not the students), and the methods are randomly assigned to the two classes (not the students).
2. Randomisation: treatments are assigned to the experimental units at random. This ensures that:
(a) there is no bias on the part of the experimenter, either conscious or unconscious, in the assignment of the treatments to the experimental units;
(b) possible differences between experimental units are equally distributed amongst the treatments, thereby reducing or eliminating confounding;
(c) randomisation helps to prevent confounding with underlying, possibly unknown, variables (e.g. changes over time);
(d) randomisation allows us to assume independence between observations.
Both the allocation of treatments to the experimental material and the order in which
the individual runs or trials of the experiment are to be performed must be randomly
determined!
3. Blocking refers to the grouping of experimental units into homogeneous sets, called blocks.
This can reduce the unexplained (error) variance, resulting in increased power for compar-
ing treatments. Variation in the response may be caused by variation in the experimental
units, or by external factors that might change systematically over the course of the ex-
periment (e.g. if the experiment is conducted on different days). Such nuisance factors
should be blocked for whenever possible (else randomised).
Natural grouping or blocking factors are: time, age, sex, litter of animals, batch of mate-
rial, spatial location, size of a city.
Blocking also offers the opportunity to test treatments over a wider range of conditions,
e.g. if I only use people of one age (e.g. students) I cannot generalise my results to older
people. However if I use different blocks (each an age category) I will be able to tell
whether the treatments have similar effects in all age groups or not. If there are known
groups in the experimental units, blocking guards against unfortunate randomisations.
Blocking aims to reduce (or control) any variation in the experimental material, where
possible, with the intention to increase power (sensitivity). Hence, these three principles
are sometimes called the three R’s of experimental design.
Another way to reduce error variance is to keep all factors not of interest as constant as
possible. This principle will affect how experimental material is chosen.
Q: Suppose, that in a simple experiment two drugs are to be compared. Drug A was given
to the males, all aged between 20 and 30; drug B was given to the female patients, all aged
between 50 and 65. What will you be able to conclude about the relative effectiveness of the
two drugs? Explain your answer.
1.5 Experimental Design
The design that will be chosen for a particular experiment depends on the treatment structure
(determined by the research question) and the blocking structure (determined by the experi-
mental units available).
Treatment Structure
Single (treatment) factor experiments are fairly straightforward. One needs to decide on which
levels of the factor to choose. If the treatment factor is continuous, e.g. temperature, it may
be wise to choose equally spaced levels, e.g. 50, 100, 150, 200. This simplifies the analysis if you later want to fit a polynomial curve, i.e. investigate the form of the relationship between temperature and the response.
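As a sketch of such an analysis in R (the temperature levels are those from the example above; the response values are invented purely for illustration):

```r
# Fit a quadratic curve to a response measured at equally spaced levels.
temp <- rep(c(50, 100, 150, 200), each = 2)
y    <- c(3.1, 2.9, 4.0, 4.2, 4.8, 5.1, 4.6, 4.4)   # hypothetical responses
summary(lm(y ~ temp + I(temp^2)))
```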
If the treatment factor is categorical one should always check whether a control treatment
is needed. A control treatment is a benchmark treatment to evaluate the effectiveness of ex-
perimental treatments. For example, if I test 3 taste enhancers (A, B, C), each is tested by
4 specially trained taste judges and the average scores for the 3 enhancers are: 45, 48, 44, I
will not be able to say whether all of them are good or all of them are useless, unless I have a
(control) score which tells me how good the food tastes without any of these enhancers.
If there is more than one treatment factor, these can be crossed, giving rise to a factorial
experiment, or nested.
Factorial Experiments
In factorial experiments the total number of treatments (and experimental units required) in-
creases rapidly, as each factor level combination is included. For example, if we have tempera-
ture, soil and water level, each with 2 levels there are 2×2×2 = 8 combinations = 8 treatments.
And we will need replications, so at least 16 experimental units are required.
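As a quick illustration, the crossed treatments can be listed in R with expand.grid (the level labels below are invented for the example):

```r
# The 2 x 2 x 2 = 8 treatment combinations of a factorial experiment.
treatments <- expand.grid(temperature = c("low", "high"),
                          soil        = c("sandy", "clay"),
                          water       = c("low", "high"))
nrow(treatments)   # 8 treatments; with 2 replicates each, 16 units needed
```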
Often, factorial experiments are illustrated by a graph such as shown in Figure 1.1. This quickly
summarizes which factors, factor levels and which combinations are used in an experiment.
One important advantage that factorial experiments have over one-factor-at-a-time experiments,
is that one can investigate interactions. If two factors interact, it means that the effect of the
one depends on the level of the other factor, e.g. the change in response when changing from
level a1 to a2 (of factor A) depends on what level of B is being used. Often, the interesting
research questions are concerned with interaction effects. Interaction plots are very helpful
when trying to understand interactions. As an example, the success of advertising may depend
on how well the online website works (RHS of Figure 1.2): advertising may increase sales if the
website works well, but decrease sales if there are certain problems with the website.
Nested Factors
When factors are nested the levels of one factor, B, will not be identical across all levels of
another factor A. Each level of factor A will contain different levels of factor B. These designs
are common in observational studies; we will briefly look at their analysis in Chapter 8.
Figure 1.1: One way to illustrate a 3 × 2 factorial experiment. The three dots at each treatment
illustrate three replicates per treatment.
Figure 1.2: On the left factors A and B do not interact (their effects are additive). On the right
A and B interact, the effect of one depends on the level of the other factor. The dots represent
the mean response at a certain treatment. The lines join treatments with the same level of
factor B, for easier reference.
Example of nested factors: In an animal breeding study we could have two bulls (sires) and six cows (dams). Progeny are nested within dams, and dams are nested within sires.
sire:         1                          2
dam:      1       2       3       4       5       6
progeny: 1 2     3 4     5 6     7 8    9 10   11 12
If humans are involved as experimental units or as observers, some psychological effects can creep
into the results. In order to preempt any such possibility it is often necessary to blind either or
both observer and experimental unit: single- or double-blinded studies. This means that
they do not know which treatment the experimental unit has received. This prevents biased
recording of results, because expectations could consciously or unconsciously influence results.
In medical studies, in order to withhold the identity of the treatment from the patient (to blind
the patient) a placebo is often used as a control treatment. A placebo looks and tastes like
the treatment but contains no active ingredients.
Placebo Effect: The physician’s belief in the treatment and the patient’s faith in the physician
exert a mutually reinforcing effect; the result is a powerful remedy that is almost guaranteed to
produce an improvement and sometimes a cure (Follies and Fallacies in Medicine). The placebo effect is a measurable, observable or felt improvement in health or behaviour that is not attributable to a medication or treatment.
It is tricky to measure the placebo effect, but often, to get an idea, 2 control treatments are
used, a placebo and a no-treatment control.
The most important aim of blocking is to reduce unexplained variation (error variance), and
thereby to obtain more precise parameter estimates. Here one should look at the experimental
units available: Are there any structures/differences that need to be blocked? Do I want to
include experimental units of different types to make the results more general? How many ex-
perimental units are available in each block? For the simplest designs covered in this course, the
number of experimental units in each block will correspond to the total number of treatments.
However, in practice this can often not be achieved.
The grouping of the experimental units into homogeneous sets called blocks and the subsequent
randomisation of the treatments to the units in a block form the basis of all experimental
designs. We will study three designs which form the basis of other more complex designs. They
are:
1. The completely randomised design
This design is used when the experimental units are homogeneous; the treatments (with their replications) are assigned completely at random to the experimental units.

2. The randomised block design
This design is used when the experimental units are not all homogeneous but can be grouped into sets of homogeneous units called blocks. The treatments are randomly assigned to the units within each block.

3. The Latin square design
This design is used when the experimental units can be blocked by two factors at the same time; each treatment occurs exactly once in each row block and once in each column block.
In all of these designs the treatment structure can be a single factor or factorial (crossed factors).
Analysis of variance is frequently used for non-experimental studies. The analysis will be the
same, the conclusions will differ in that no causality can be inferred. In observational studies
design refers to how the sampling is done (on the explanatory variables), and is referred to as
sampling design. The aim is, as in experimental studies, to achieve the best possible estimates
of effects.
Randomisation refers to the random allocation of treatments to the experimental units. This
can be done using random number tables or using a computer or calculator to generate random
numbers. Note that when all experimental runs cannot be performed at the same time, both the
assignment of treatment to experimental unit and the sequence in which the runs are performed
have to be randomised!
When assigning treatments to experimental units, each permutation must be equally likely, i.e.
each possible assignment of treatments to experimental units must be equally likely.
For completely randomised designs the experimental units are not blocked, so the treatments
(and their replications) are assigned completely at random to all experimental units available
(hence completely randomised).
If there are blocks, the randomisation of treatments to experimental units occurs in each block.
An example of a table of random numbers can be found in the attached tables, Table 1.
To randomly assign 2 treatments (A and B) to 12 experimental units, 6 experimental units per
treatment, you can:
1. Decide to let odd numbers ≡ treatment A and let even numbers ≡ treatment B and choose
a place to start by using a pin:
For example, let's start in row 6, second digit of column 4, and assign A to odd numbers and B to even numbers:

79 76 49 31 93 54 17 36 91 50 11 38 87 79 97 69 22
A  B  A  A  A  B  A  B  A  B  B  B

(Once one treatment has been assigned six times it is full: 91 is the sixth A, so the remaining three units all receive B.)
or decide to assign treatment A for two-digit numbers 00 - 49, and treatment B for
two-digit numbers 50 - 99. For example, let's start in row 2, column 8, 3rd digit:
67 49 72 48 95 39 03 22 46 87 71 16 70
B A B A B A A A A B B A B
A random ordering of the treatments can be generated for each block in the same way:

96 09 58 89 23 71 38 etc.
  ↓         ↓        ↓
AABB      BBAA     ABAB
block 1   block 2  block 3
2. Use any computer software program (or your calculator) to generate a (uniformly distributed) random number on an interval (e.g. 0 to 1), or uniformly distributed integers (e.g. between 0 and 99). Make sure you have a random starting point (usually you will be able to set this ‘seed’, or even better choose it randomly). Then the sequence of random numbers can be used as in the random number table examples above.

3. First number the experimental units, preferably in a random order. Write the treatments on separate pieces of paper (with replications on more separate pieces of paper), put these in a hat, shuffle, and draw one treatment for each experimental unit from the hat.

4. Use shuffled red and black cards if there are only 2 treatments.

5. etc.
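As a minimal R sketch of the computer option above (the seed, unit numbering and block sizes are arbitrary choices for illustration):

```r
# Randomly assign treatments A and B to 12 units, 6 per treatment.
set.seed(2005)                          # random starting point ('seed')
treatments <- rep(c("A", "B"), each = 6)
data.frame(unit = 1:12, treatment = sample(treatments))

# With 3 blocks of 4 units, randomise separately within each block:
within_blocks <- unlist(lapply(1:3, function(b) sample(rep(c("A", "B"), each = 2))))
data.frame(block = rep(1:3, each = 4), unit = 1:12, treatment = within_blocks)
```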
Note: Randomization cannot be emphasized too much. Randomization is necessary for con-
clusions drawn from the experiment to be correct, unambiguous and defensible!
To prevent systematic changes over time from influencing results one must ensure that the order
of the treatments over time is random. If a clear time effect is suspected, it might be best to
block for time. In any case, randomisation over time helps to ensure that the time effect is
approximately the same, on average, in each treatment group, i.e. treatment effects are not
confounded with time.
Q: Think of a simple experiment (with 3 treatments, say) where the treatments may have been
randomly assigned to experimental units, but then the order in which the experiment was run
was not randomised. What would happen?
For the same reason one would block spatially arranged experimental units, or if this is not
possible, randomise treatments in space.
1. Treatment Structure: What is the research question? What is the response? What are
the treatment factors? What levels for each treatment factor should I choose? Do I need
a control treatment? Do I want to look at interactions?
2. Experimental Units: How many replications do I need? How many experimental units
can I get, afford?
3. Blocking: Do I need to block the experimental units? Do I need to control other unwanted
sources of variation?
5. Design: randomisation
Chapter 2
This chapter gives a brief overview of the analysis for the three basic designs. You should not
be too concerned about any details at this stage, but it may be useful to read over this chapter
again after you have covered randomised block and Latin square designs.
2.1 Completely Randomised Design (CRD)

This design is used when the experimental units are homogeneous. This means that there are
no known differences between the experimental units before the treatments are applied, i.e. no
blocks are needed. Each treatment is randomly assigned to r experimental units. Each unit is
equally likely to receive any of the a treatments. There are N = r × a experimental units.
Effect is a formal term used in the analysis of experiments to denote the amount the response
changes when using the particular treatment compared to the overall mean. The reason for
formulating models with effects has more to do with mathematical convenience, and not so
much with using meaningful terms (see later).
2. Simple analysis even when there are unequal replications of some treatments.
An example
Units 1 2 3 4 5 6 7 8 9 10 11 12
Treatments B C A A C A B D C D B D
2.2 Randomised Block Design (RBD)
This design is used if the experimental material is not homogeneous but can be divided into
blocks of homogeneous material. Before the treatments are applied there are no known differ-
ences between the units within a block, but there may be very large differences between units
from different blocks.
Differences between blocks are usually an advantage because they allow us to demonstrate that treatment differences found to be significant hold over a wide range of conditions. Treatments
are assigned at random to units within a block.
In a complete block design each treatment occurs once and only once in each block (randomised
complete block design). If there are not sufficient units within a block to allow all the treatments
to be applied an Incomplete Block Design can be used (not covered here, see Hicks & Turner
(1999) for details).
Randomised block designs are easy to design and analyse. The number of experimental units
in each block must be the same as the number of treatments. Blocking allows more sensitive
comparisons of treatment effects. On the other hand, missing data can cause problems in the
analysis.
Any known variability in the experimental procedure or the experimental units can be controlled
for by blocking. A block could be:
A litter of animals.
A single subject.
2.3 Latin Square Design

A Latin Square Design allows blocking for two sources of variation, without having to increase
the number of experimental units. Call these sources, row variation and column variation. The
p2 experimental units are grouped by their row and column position. The p treatments are
assigned so that they occur exactly once in each row and in each column.
C1 C2 C3 C4
R1 A B C D
R2 B C D A
R3 C D A B
R4 D A B C
Randomisation
The Latin Square is chosen at random from the set of standard Latin Squares of order p. Then
the rows and columns are permuted randomly.
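A small R sketch of this procedure, starting from the standard 4 × 4 square shown above (a random relabelling of the treatment letters could be added as a further step):

```r
# Permute the rows and columns of a standard 4 x 4 Latin square at random.
standard <- matrix(c("A","B","C","D",
                     "B","C","D","A",
                     "C","D","A","B",
                     "D","A","B","C"), nrow = 4, byrow = TRUE)
set.seed(1)                                   # arbitrary seed
randomised <- standard[sample(4), sample(4)]  # random row and column order
randomised  # still a Latin square: each letter once per row and column
```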
The Latin Square design can be used to block for time periods and order of presentation of
treatments such as:
Period 1 Period 2 Period 3 Period 4
Order ABCD BCDA CDAB DABC
Latin square designs are efficient in the number of experimental units used when there are two
blocking factors. However, the number of treatments must equal the number of row blocks and
the number of column blocks. One experimental unit must be available at every combination
of the two blocking factors. Also, the assumption of no interactions between treatment and
blocking factors should hold.
2.4 An Example
An experiment is conducted to compare 4 methods of treating motor car tyres. The treatments
(methods), labelled A, B, C and D, are assigned to 16 tyres, four tyres receiving A, four others receiving B, etc. Four cars are available; treated tyres are placed on each car and the tread loss after 20 000 km is measured.
Suppose each treatment had been assigned to one car: all four A tyres on one car, all four B tyres on another, and so on. This design is terrible! Apparent treatment differences could also be car differences: Treatment
and car effects are confounded.
We could use a Completely Randomized Design (CRD). We would assign the treated tyres ran-
domly to the cars hoping that differences between the cars will average out. Here is one such
randomisation. The numbers in the brackets are the observed tread losses.
Car 1 Car 2 Car 3 Car 4
C(12) A(14) D(10) A(13)
A(17) A(13) C(10) D(9)
D(13) B(14) B(14) B(8)
D(11) C(12) B(13) C(9)
To test for differences between treatments, analysis of variance (ANOVA) is used. We will
present these tables here, but only as a demonstration of what happens to the Error Mean
Square (MSE) when we change the design, or account for variation between blocks. The MSE
in this table is identical to the regression MSE.
The ANOVA table for testing the hypothesis of no difference between treatments, H0 : µA = µB = µC = µD , can be computed as follows.
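As a sketch in R, with the tread losses entered column by column from the CRD layout above:

```r
# One-way ANOVA for the completely randomised tyre experiment.
tyres <- data.frame(
  loss = c(12, 17, 13, 11,  14, 13, 14, 12,  10, 10, 14, 13,  13, 9, 8, 9),
  trt  = c("C", "A", "D", "D",  "A", "A", "B", "C",
           "D", "C", "B", "B",  "A", "D", "B", "C"),
  car  = rep(paste0("car", 1:4), each = 4)
)
anova(lm(loss ~ trt, data = tyres))
# treatments: SS = 33 on 3 df; error: SS = 51 on 12 df, so MSE = 4.25
```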
Is the Completely Randomised Design the best we can do? Note that A is never used on Car
3, and B is never used on Car 1.
Any variation in A may reflect variation in Cars 1, 2 and 4. The same remarks apply to B and
Cars 2, 3 and 4. So the random error SS may contain this variation. Can we remove it? Yes -
by blocking for cars.
Even though we randomized, there is still a bit of confounding (between cars and treatments)
left. To remove this problem we should block for car, and use every treatment once per car, i.e.
use a Randomised Block Design. We would randomly assign one of the four tyres treated with
A to each car. And repeat this with tyres treated with B, C and D. The cars are the blocks.
Differences between the responses to the treatments within a car will reflect the effect of the
treatments.
Car 1 Car 2 Car 3 Car 4
B(14) D(11) A(13) C(9)
C(12) C(12) B(13) D(9)
A(17) B(14) D(10) B(8)
D(13) A(14) C(10) A(13)
Tread loss by treatment and car:

Treatment   Car 1   Car 2   Car 3   Car 4
A             17      14      13      13
B             14      14      13       8
C             12      12      10       9
D             13      11      10       9
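Continuing the R sketch, the randomised block analysis can be fitted from this table; blocking by car moves the car-to-car variation out of the error term:

```r
# Randomised block ANOVA: car is the blocking factor.
tread <- data.frame(
  loss = c(17, 14, 13, 13,  14, 14, 13, 8,  12, 12, 10, 9,  13, 11, 10, 9),
  trt  = rep(c("A", "B", "C", "D"), each = 4),
  car  = rep(paste0("car", 1:4), times = 4)
)
anova(lm(loss ~ car + trt, data = tread))
# error SS drops from 51 (12 df, CRD) to 11.5 (9 df), sharpening the F test
```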
Another source of variation would be from the wheels on which a treated tyre was placed. To
have a tyre of each type on each wheel position on each car would mean that we would need 64
tyres for the experiment, rather expensive! Using a Latin Square Design makes it possible to
put a treated tyre in each wheel position and use all four treatments on each car.
Wheel position Car 1 Car 2 Car 3 Car 4
1 A B C D
2 B C D A
3 C D A B
4 D A B C
Within this arrangement A appears in each car and in each wheel position, and the same applies
to B and C and D, but we have not had to increase the number of tyres needed.
Note that in reality one cannot change the analysis after the experiment has been run. The above example is for illustration only, to show what could happen under different designs if the experiment is not carefully planned. The design determines the model, and all the above considerations of whether one should block by car and wheel position have to be thought through carefully at the planning stage of the experiment.
References
1. Hicks CR, Turner Jr KV. (1999). Fundamental Concepts in the Design of Experiments.
5th edition. Oxford University Press.
Chapter 3
The models we have written for the three designs in the previous chapter are linear models. We
can use the linear model theory used in the regression section of this course to fit these models
to data, and to obtain estimates for the parameters.
yi = β0 + β1 L1i + β2 L2i + · · · + ei

with L1 , L2 , etc. dummy variables indicating the group to which observation i belongs.
However, when dealing with only categorical explanatory variables, as is typical in experimental
data, it is more common to write the above model in the following form:
Yij = µ + αi + eij    (3.1)

The dummy variables are implicit but not written.
that they make exactly the same assumptions and describe exactly the same structure of the
data. Model 3.1 is more convenient for experiments because it avoids writing out of the dummy
variables. This form is sometimes referred to as an ANOVA model, as opposed to a regression
model.
1. Estimate the parameters of the model (µ, the αi 's, and σ²).

2. Test whether there are differences between the treatments, i.e. whether a subset of the β’s is zero: H0 : α1 = . . . = αa = 0.
3. If we can conclude that there are differences between the treatments, we want to know
which treatments differ, which treatments are best, and by how much they differ. This
part is the most important! It is the one that will answer our research question.
After one week’s training all students were asked to read an identical passage on a film, which
was delivered at a rate of 300 words per minute. Students were then asked to answer questions
on the passage read and their marks were recorded. They were as follows:

METHOD
   I    II   III
  82    71    91
  80    79    93
  81    78    84
  83    74    90
              88
We want to know whether comprehension is higher for some of these methods of speed-reading,
and if so, which methods work better.
When we have models where all explanatory variables (factors) are categorical (as in experi-
ments), it is common to write them as follows. You will see later why this parameterisation is
convenient for such studies.
Yij = µ + αi + eij
Here αi = µi − µ, i.e. the change in mean response with treatment i relative to the overall mean.
µi is the mean response with treatment i: µi = µ + αi .
By effect we mean here: the change in response with the particular treatment compared to the
overall mean. For categorical variables, effect in general refers to a change in response relative to
a baseline category or an overall mean. For continuous explanatory variables, e.g. in regression
models, we also talk about effects, and then mostly mean the change in mean response per unit
increase in x, the explanatory variable.
Note that we need 2 subscripts on the Y - one to identify the group and the other to identify
the subject within the group. Then:
Y1j = µ + α1 + e1j
Y2j = µ + α2 + e2j
Y3j = µ + α3 + e3j
To put our model and data into matrix form we string out the data for the groups into an N × 1 vector, where the first n1 elements are the observations on Group 1, the next n2 elements the observations on Group 2, etc. Then the linear model, Y = Xβ + e, has the form
$$
Y = \begin{pmatrix} Y_{11}\\ Y_{12}\\ Y_{13}\\ Y_{14}\\ Y_{21}\\ Y_{22}\\ Y_{23}\\ Y_{24}\\ Y_{31}\\ Y_{32}\\ Y_{33}\\ Y_{34}\\ Y_{35} \end{pmatrix}
= \begin{pmatrix}
1 & 1 & 0 & 0\\ 1 & 1 & 0 & 0\\ 1 & 1 & 0 & 0\\ 1 & 1 & 0 & 0\\
1 & 0 & 1 & 0\\ 1 & 0 & 1 & 0\\ 1 & 0 & 1 & 0\\ 1 & 0 & 1 & 0\\
1 & 0 & 0 & 1\\ 1 & 0 & 0 & 1\\ 1 & 0 & 0 & 1\\ 1 & 0 & 0 & 1\\ 1 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} \mu\\ \alpha_1\\ \alpha_2\\ \alpha_3 \end{pmatrix}
+ \begin{pmatrix} e_{11}\\ e_{12}\\ e_{13}\\ e_{14}\\ e_{21}\\ e_{22}\\ e_{23}\\ e_{24}\\ e_{31}\\ e_{32}\\ e_{33}\\ e_{34}\\ e_{35} \end{pmatrix}
$$
Note that
1. The entries of X are either 0 or 1 (because here all terms in the structural part of the
model are categorical). X is often called the design matrix because it describes the
design of the study, i.e. it describes which of the factors in the model contributed to each
of the response values.
2. The last three columns of X add up to the first column. Thus X is a 13 × 4 matrix with column rank 3. The matrix X′X will be a 4 × 4 matrix of rank 3.
To find estimates for the parameters we can use the method of least squares or maximum
likelihood, as for regression. The least squares estimates minimise the error sum of squares:
$$
SSE = (Y - X\beta)'(Y - X\beta) = \sum_i \sum_j (Y_{ij} - \mu - \alpha_i)^2
$$

(since β′ = (µ, α1 , α2 , α3 ))
and the estimates are given by the solution to the normal equations

X′Xβ = X′Y

Since the sum of the last three columns of X is equal to the first column there is a linear dependency between the columns of X, and X′X is a singular matrix, so we cannot write

β̂ = (X′X)⁻¹X′Y
The set of equations X′Xβ = X′Y is consistent, but has an infinite number of solutions.
Note that we could have used only 3 parameters µ1 , µ2 , µ3 , and we actually only have enough
information to estimate these 3 parameters, because we only have 3 group means. Instead we
have used 4 parameters, because the parameterization using the effects αi is more convenient in
the analysis of variance, especially when calculating treatment sums of squares (see later). But,
the way in which we have defined µ and the αi ’s above, there is a unique connection between
the four parameters in the one model and the 3 parameters in the other model:
By definition,

N µ = n1 µ1 + n2 µ2 + n3 µ3 = n1 µ + n1 α1 + n2 µ + n2 α2 + n3 µ + n3 α3

The RHS becomes (n1 + n2 + n3 )µ + Σi ni αi = N µ + Σi ni αi . From this it follows that Σi ni αi = 0. The normal equations don't know this, so we add this additional equation (to calculate the fourth parameter from the other three) as a constraint in order to get a unique solution. In practice we could add the row (0, n1 , n2 , n3 ) to X, with corresponding entry in Y and e = 0. X′X would then have rank 4 and we can get unique estimates for our four parameters.

In other words, if we have Σi ni αi = 0 then the αi 's have exactly the meaning intended above: they measure the difference in mean response with treatment i compared to the overall mean; µi = µ + αi . The sum-to-zero constraint Σi ni αi = 0 together with the treatment effects model (3.1) is the parameterisation most frequently used for models in experimental design.
Note that we could define the αi ’s differently, by using a different constraint, e.g.
Yij = µ + αi + eij
α1 = 0
Here the mean for treatment 1 is used as a reference category and equals µ. Then α2 and α3
measure the difference in mean between group 2 and group 1 and between group 3 and group 1
respectively. This parameterization is the one most common in regression: when you add a categorical variable to a regression model, the β estimates are defined like this, as differences to the first category.
2. The constraint must remove the linear dependency, so it cannot be any linear combination
of the rows of X. Denote the constraint by Cβ = 0.
3. The estimate of β subject to the given constraint is unique. For this reason the constraint
should be specified as part of the model. So we write
Yij = µ + αi + eij ,   Σ ni αi = 0
or in matrix notation
Y = Xβ + e
Cβ = 0
4. Although the estimates of β depend on the constraints used, quantities such as the fitted values Ŷ = Xβ̂ and β̂′X′Y (and hence the sums of squares) are unique, as the example below shows.
Speed-reading Example
$$
Y = \begin{pmatrix} 82\\ 80\\ 81\\ 83\\ 71\\ 79\\ 78\\ 74\\ 91\\ 93\\ 84\\ 90\\ 88 \end{pmatrix}
= \begin{pmatrix}
1 & 1 & 0 & 0\\ 1 & 1 & 0 & 0\\ 1 & 1 & 0 & 0\\ 1 & 1 & 0 & 0\\
1 & 0 & 1 & 0\\ 1 & 0 & 1 & 0\\ 1 & 0 & 1 & 0\\ 1 & 0 & 1 & 0\\
1 & 0 & 0 & 1\\ 1 & 0 & 0 & 1\\ 1 & 0 & 0 & 1\\ 1 & 0 & 0 & 1\\ 1 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} \mu\\ \alpha_1\\ \alpha_2\\ \alpha_3 \end{pmatrix}
+ \begin{pmatrix} e_{11}\\ e_{12}\\ \vdots\\ e_{35} \end{pmatrix}
= X\beta + e
$$
The normal equations are X′Xβ = X′Y. Multiplying out X′X and X′Y gives

$$
\begin{pmatrix} 13 & 4 & 4 & 5\\ 4 & 4 & 0 & 0\\ 4 & 0 & 4 & 0\\ 5 & 0 & 0 & 5 \end{pmatrix}
\begin{pmatrix} \mu\\ \alpha_1\\ \alpha_2\\ \alpha_3 \end{pmatrix}
= \begin{pmatrix} 1074\\ 326\\ 302\\ 446 \end{pmatrix}
= \begin{pmatrix} \sum_{ij} Y_{ij}\\ \sum_j Y_{1j}\\ \sum_j Y_{2j}\\ \sum_j Y_{3j} \end{pmatrix}
$$
The sum of the last three columns of X′X equals the first column:

4 + 4 + 5 = 13
4 + 0 + 0 = 4
0 + 4 + 0 = 4
0 + 0 + 5 = 5

so the columns are linearly dependent and X′X is singular. There are an infinite number of solutions that satisfy the equations! To find the particular solution we require, we add the constraint, which defines how the parameters are related to each other.
We illustrate the effect of different sets of constraints on the least squares estimates using the speed-reading example. The normal equations are:

$$
X'X\beta = \begin{pmatrix} 13 & 4 & 4 & 5\\ 4 & 4 & 0 & 0\\ 4 & 0 & 4 & 0\\ 5 & 0 & 0 & 5 \end{pmatrix}
\begin{pmatrix} \mu\\ \alpha_1\\ \alpha_2\\ \alpha_3 \end{pmatrix}
= \begin{pmatrix} 1074\\ 326\\ 302\\ 446 \end{pmatrix} = X'Y
$$
a) The sum-to-zero constraint Σ ni αi = 0

Adding 4α1 + 4α2 + 5α3 = 0 to the first equation reduces it to 13µ = 1074, so the system becomes

$$
\begin{pmatrix} 13 & 0 & 0 & 0\\ 4 & 4 & 0 & 0\\ 4 & 0 & 4 & 0\\ 5 & 0 & 0 & 5 \end{pmatrix}
\begin{pmatrix} \mu\\ \alpha_1\\ \alpha_2\\ \alpha_3 \end{pmatrix}
= \begin{pmatrix} 1074\\ 326\\ 302\\ 446 \end{pmatrix}
$$

with solution

µ̂ = 82.62, α̂1 = −1.12, α̂2 = −7.12, α̂3 = 6.58

The fitted values are

Ŷ1j = µ̂ + α̂1 = 81.5 in Group 1
Ŷ2j = µ̂ + α̂2 = 75.5 in Group 2
Ŷ3j = µ̂ + α̂3 = 89.2 in Group 3

and

$$
\hat\beta' X'Y = \begin{pmatrix} 82.62 & -1.12 & -7.12 & 6.58 \end{pmatrix}
\begin{pmatrix} 1074\\ 326\\ 302\\ 446 \end{pmatrix} = 89153.2
$$
Note that

$$
\hat\beta' X'Y = SS_{mean} + SS_{treatment} = \sum_i\sum_j (\bar Y_{··} - 0)^2 + \sum_i\sum_j (\bar Y_{i·} - \bar Y_{··})^2
$$

This assumes that the total sum of squares is calculated as Σi Σj Y²ij . β̂′X′Y is used here because it is easier to calculate by hand than the usual treatment sum of squares β̂′X′Y − N Ȳ··².
b) The constraint α1 = 0

This constraint is important as it is the one used most frequently for regression models with dummy or categorical variables, e.g. regression models fitted in R.
Now

$$
C\beta = \begin{pmatrix} 0 & 1 & 0 & 0 \end{pmatrix}
\begin{pmatrix} \mu\\ \alpha_1\\ \alpha_2\\ \alpha_3 \end{pmatrix} = \alpha_1 = 0
$$
This constraint is equivalent to removing α1 from the model, so we strike out the row and column of X′X corresponding to α1, and the normal equations become

$$
\begin{pmatrix} 13 & 4 & 5\\ 4 & 4 & 0\\ 5 & 0 & 5 \end{pmatrix}
\begin{pmatrix} \mu\\ \alpha_2\\ \alpha_3 \end{pmatrix}
= \begin{pmatrix} 1074\\ 302\\ 446 \end{pmatrix}
$$

with solution

µ̂ = 81.5, α̂1 = 0, α̂2 = −6.0, α̂3 = 7.7

and

$$
\hat\beta' X'Y = \begin{pmatrix} 81.5 & 0 & -6.0 & 7.7 \end{pmatrix}
\begin{pmatrix} 1074\\ 326\\ 302\\ 446 \end{pmatrix} = 89153.2
$$
Ŷ1j = µ̂ = 81.5 in Group 1
Ŷ2j = µ̂ + α̂2 = 75.5 in Group 2
Ŷ3j = µ̂ + α̂3 = 89.2 in Group 3
which are the same as previously. However, the interpretation of the parameter estimates is different: µ is the mean of treatment 1, α2 is the difference in means between treatment 2 and treatment 1, etc. Treatment 1 is the baseline or reference category. This is the parameterization typically used when fitting regression models, e.g. in R, which calls it ’treatment contrasts’.
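A short R sketch of the two parameterizations, using the speed-reading data. Note that R's contr.sum imposes the unweighted constraint Σ αi = 0, whereas the notes use the weighted constraint Σ ni αi = 0; the two coincide only for equal group sizes, so here the sum-contrast estimates differ slightly from those above:

```r
marks  <- c(82, 80, 81, 83,  71, 79, 78, 74,  91, 93, 84, 90, 88)
method <- factor(rep(c("I", "II", "III"), times = c(4, 4, 5)))

# Default 'treatment contrasts' (constraint alpha_1 = 0):
coef(lm(marks ~ method))   # intercept 81.5, then -6.0 and 7.7 as above

# Sum-to-zero contrasts (unweighted: sum of alpha_i = 0):
coef(lm(marks ~ method, contrasts = list(method = contr.sum)))
```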
c) The constraint µ = 0
This will result in the cell means model: Yij = αi + eij (equivalently Yij = µi + eij ).
We will be using almost exclusively the sum-to-zero constraint as this has a convenient inter-
pretation and connection to sums of squares, and the analysis of variance.
The general case: Parameter estimates for the single-factor completely ran-
domised design
Suppose an experiment has been conducted as a completely randomised design: N subjects were randomly assigned to a treatments, where the i-th treatment has ni subjects, with Σ ni = N, and Yij = the j-th observation in the i-th treatment group. The data have the form:
Group         I        II      . . .      a
            Y11      Y21               Ya1
            Y12      Y22               Ya2
             ⋮         ⋮                  ⋮
            Y1n1     Y2n2              Yana
Means       Ȳ1·      Ȳ2·               Ȳa·      Ȳ··
Totals      Y1·      Y2·               Ya·      Y··
Variances   s²1      s²2               s²a
The first subscript is for the treatment group, the second for the replication. The group totals and means are expressed in the following dot notation:

group total:  Yi· = Σ_{j=1}^{ni} Yij
group mean:  Ȳi· = Yi· / ni
overall total:  Y·· = Σi Σj Yij
The model is

Yij = µ + αi + eij ,   Σ ni αi = 0

where
µ = general mean,
αi = effect of the i-th level of treatment factor A,
eij = random error, distributed as N(0, σ²).
In matrix notation,

Y = Xβ + e,   e ∼ N(0, σ²I),   Cβ = 0

where

$$
\begin{pmatrix} Y_{11}\\ Y_{12}\\ \vdots\\ Y_{1n_1}\\ Y_{21}\\ Y_{22}\\ \vdots\\ Y_{2n_2}\\ \vdots\\ Y_{a1}\\ \vdots\\ Y_{an_a} \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 0 & \cdots & 0\\
1 & 1 & 0 & \cdots & 0\\
\vdots & & & & \vdots\\
1 & 1 & 0 & \cdots & 0\\
1 & 0 & 1 & \cdots & 0\\
1 & 0 & 1 & \cdots & 0\\
\vdots & & & & \vdots\\
1 & 0 & 1 & \cdots & 0\\
\vdots & & & & \vdots\\
1 & 0 & 0 & \cdots & 1\\
\vdots & & & & \vdots\\
1 & 0 & 0 & \cdots & 1
\end{pmatrix}
\begin{pmatrix} \mu\\ \alpha_1\\ \alpha_2\\ \vdots\\ \alpha_a \end{pmatrix}
+
\begin{pmatrix} e_{11}\\ e_{12}\\ \vdots\\ e_{an_a} \end{pmatrix}
= X\beta + e
$$

and the constraint is

$$
C\beta = \begin{pmatrix} 0 & n_1 & n_2 & \cdots & n_a \end{pmatrix}
\begin{pmatrix} \mu\\ \alpha_1\\ \alpha_2\\ \vdots\\ \alpha_a \end{pmatrix} = 0
$$
The least squares criterion is

$$
S = (Y - X\beta)'(Y - X\beta) = \sum_{ij} (Y_{ij} - \mu - \alpha_i)^2
$$

where Σij = Σi Σj . Let's put numbers to all of this and assume a = 3, n1 = 4, n2 = 4 and n3 = 5. Then
$$
\begin{pmatrix} Y_{11}\\ Y_{12}\\ Y_{13}\\ Y_{14}\\ Y_{21}\\ Y_{22}\\ Y_{23}\\ Y_{24}\\ Y_{31}\\ Y_{32}\\ Y_{33}\\ Y_{34}\\ Y_{35} \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 0 & 0\\ 1 & 1 & 0 & 0\\ 1 & 1 & 0 & 0\\ 1 & 1 & 0 & 0\\
1 & 0 & 1 & 0\\ 1 & 0 & 1 & 0\\ 1 & 0 & 1 & 0\\ 1 & 0 & 1 & 0\\
1 & 0 & 0 & 1\\ 1 & 0 & 0 & 1\\ 1 & 0 & 0 & 1\\ 1 & 0 & 0 & 1\\ 1 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} \mu\\ \alpha_1\\ \alpha_2\\ \alpha_3 \end{pmatrix}
+ e
$$

subject to

$$
C\beta = \begin{pmatrix} 0 & 4 & 4 & 5 \end{pmatrix}
\begin{pmatrix} \mu\\ \alpha_1\\ \alpha_2\\ \alpha_3 \end{pmatrix} = 0
$$
Then

$$
X'X\beta = \begin{pmatrix} 13 & 4 & 4 & 5\\ 4 & 4 & 0 & 0\\ 4 & 0 & 4 & 0\\ 5 & 0 & 0 & 5 \end{pmatrix}
\begin{pmatrix} \mu\\ \alpha_1\\ \alpha_2\\ \alpha_3 \end{pmatrix}
= \begin{pmatrix} 13\mu + 4\alpha_1 + 4\alpha_2 + 5\alpha_3\\ 4\mu + 4\alpha_1\\ 4\mu + 4\alpha_2\\ 5\mu + 5\alpha_3 \end{pmatrix}
$$

and

$$
X'Y = \begin{pmatrix} \sum_{ij} Y_{ij}\\ \sum_j Y_{1j}\\ \sum_j Y_{2j}\\ \sum_j Y_{3j} \end{pmatrix}
= \begin{pmatrix} N\bar Y_{··}\\ n_1 \bar Y_{1·}\\ n_2 \bar Y_{2·}\\ n_3 \bar Y_{3·} \end{pmatrix}
$$
Applying the constraint 4α1 + 4α2 + 5α3 = 0 to the first equation gives 13µ = 13Ȳ·· , which implies that

µ̂ = Ȳ··

and, from the remaining equations ni µ + ni αi = ni Ȳi· ,

α̂i = Ȳi· − Ȳ··
In general the normal equations X′Xβ = X′Y are

$$
\begin{pmatrix}
N & n_1 & n_2 & \cdots & n_a\\
n_1 & n_1 & 0 & \cdots & 0\\
n_2 & 0 & n_2 & \cdots & 0\\
\vdots & & & \ddots & \vdots\\
n_a & 0 & 0 & \cdots & n_a
\end{pmatrix}
\begin{pmatrix} \mu\\ \alpha_1\\ \vdots\\ \alpha_a \end{pmatrix}
=
\begin{pmatrix} N \bar Y_{··}\\ n_1 \bar Y_{1·}\\ \vdots\\ n_a \bar Y_{a·} \end{pmatrix}
$$

with constraint

$$
C\beta = \begin{pmatrix} 0 & n_1 & n_2 & \cdots & n_a \end{pmatrix}\beta = 0
$$

Written out, the equations are

N µ = N Ȳ··
n1 µ + n1 α1 = n1 Ȳ1·
⋮
na µ + na αa = na Ȳa·

so that

µ̂ = Ȳ·· ,   µ̂i = Ȳi· ,   α̂i = Ȳi· − Ȳ··
for i = 1, . . . , a. Parameter estimation for many of the standard experimental designs is straight-
forward! From general theory we know that the above are unbiased estimators of µ and the
αi ’s. An unbiased estimator of σ² is found by using the minimum value of the residual sum of squares, SSE, and dividing by its degrees of freedom:

min(SSE) = Σij (Yij − µ̂ − α̂i )² = Σij (Yij − Ȳi· )²

Assuming equal group sizes ni = n for the moment, each term has expectation

E(Yij − Ȳi· )² = Var(Yij − Ȳi· ) = σ²(1 − 1/n)

Then

E[SSE] = E[Σi Σj (Yij − Ȳi· )²] = a n σ²(1 − 1/n) = a(n − 1)σ²

and

E[MSE] = E[SSE/(N − a)] = σ²

So

s² = (1/(N − a)) Σij (Yij − Ȳi· )²
Mostly, the estimates we are interested in are linear combinations of treatment means. In such
cases it is relatively straightforward to calculate the corresponding variances (of the estimates):
Var(µ̂) = Var(Ȳ·· ) = Var(Σi Σj Yij / N) = (1/N²) Σi Σj Var(Yij ) = (1/N²) · Nσ² = σ²/N

The estimated variance is then s²/N, where s² is the mean square for error (the least squares estimate, see above).
Var(α̂i ) = Var(Ȳi· − Ȳ·· ) = Var(Ȳi· ) + Var(Ȳ·· ) − 2 Cov(Ȳi· , Ȳ·· )

Consider Cov(Ȳi· , Ȳ·· ) = Cov(Ȳi· , (1/N) Σk nk Ȳk· ) = (1/N) Σk nk Cov(Ȳi· , Ȳk· ). But since the groups are independent, Cov(Ȳi· , Ȳk· ) is zero if i ≠ k. If i = k then Cov(Ȳi· , Ȳk· ) = Var(Ȳi· ) = σ²/ni . Using this result and summing, we find Cov(Ȳi· , Ȳ·· ) = σ²/N. Hence

Var(α̂i ) = σ²/ni + σ²/N − 2σ²/N = (N − ni )σ²/(ni N) = σ²/ni − σ²/N
Q: What does sampling distribution refer to? How do we know what the sampling distribution
of an estimator is?
How do we estimate σ²? By

s² = (1/(N − a)) Σij (Yij − Ȳi· )²

Confidence intervals for such estimates then have the form

estimate ± t(α/2, ν) × standard error

where t(α/2, ν) is the α/2 percentile of Student's t distribution with ν degrees of freedom. The degrees of freedom of t are the degrees of freedom of s².
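A small R sketch of such an interval for each treatment mean in the speed-reading example (the 95% level is chosen for illustration; ν = N − a = 10):

```r
marks  <- c(82, 80, 81, 83,  71, 79, 78, 74,  91, 93, 84, 90, 88)
method <- factor(rep(c("I", "II", "III"), times = c(4, 4, 5)))
ni     <- tapply(marks, method, length)
means  <- tapply(marks, method, mean)
s2     <- sum((marks - means[method])^2) / (length(marks) - 3)  # MSE, 10 df
se     <- sqrt(s2 / ni)                      # standard error of each mean
cbind(mean  = means,
      lower = means - qt(0.975, 10) * se,
      upper = means + qt(0.975, 10) * se)
```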
Speed-reading Example
Note that so far we have not used the assumption of eij ∼ N (0, σ 2 ). The least squares estimates
do not require the assumption of normal errors! However, to construct a test for the above
hypothesis we need the normality assumption. In what follows, we assume that the errors
are identically and independently distributed as N (0, σ 2 ). Consequently the observations are
normally distributed, though not identically. We must check this assumption of independent,
normally distributed errors, else our test of the above hypothesis could give a very misleading
result.
Let’s assume Yij are data obtained from a CRD, and we are assuming model 3.1:
Yij = µ + αi + eij
In statistics, sums of squares refer to squared deviations (from a mean or expected value), e.g.
the residual sum of squares is the sum of squared deviations of observed from fitted values. Let's rewrite the above model by substituting observed values and rewriting the terms as deviations from means:

Yij − Ȳ·· = (Yij − Ȳi· ) + (Ȳi· − Ȳ·· )

Make sure you agree with the above. Now square both sides and sum over all N observations:
$$
\sum_i\sum_j (Y_{ij} - \bar Y_{··})^2 = \sum_i\sum_j (Y_{ij} - \bar Y_{i·})^2 + \sum_i\sum_j (\bar Y_{i·} - \bar Y_{··})^2 + 2\sum_i\sum_j (Y_{ij} - \bar Y_{i·})(\bar Y_{i·} - \bar Y_{··})
$$
The cross-product term is zero after summation over j, since it can be written as

$$
2\sum_i (\bar Y_{i·} - \bar Y_{··}) \sum_j (Y_{ij} - \bar Y_{i·})
$$

and the inner sum is the sum of deviations of the observations in the i-th group about their mean value, which is zero for each i. Hence
$$
\sum_i\sum_j (Y_{ij} - \bar Y_{··})^2 = \sum_i\sum_j (\bar Y_{i·} - \bar Y_{··})^2 + \sum_i\sum_j (Y_{ij} - \bar Y_{i·})^2
$$
So the total sum of squares partitions into two components: (1) squared deviations of the
treatment means from the overall mean, and (2) squared deviations of the observations from
the treatment means. The latter is the residual sum of squares (as in regression, the treatment
means are the fitted values). The first sum of squares is the part of the variation that can be
explained by deviations of the treatment means from the overall mean. We can write this as

SStotal = SStreatment + SSerror

The analysis of variance is based on this identity: the total sum of squares equals the sum of squares between groups plus the sum of squares within groups.
Quadratic Forms

The sums of squares can be written as quadratic forms in Y:

Source        SS                           df
treatment A   SSA = Y′(H − (1/n)J)Y        a − 1
residual      SSE = Y′(I − H)Y             N − a
total         SST = Y′(I − (1/n)J)Y        N − 1

where J is the n × n matrix of ones, and H is the hat matrix X(X′X)⁻¹X′.
Q: From your regression notes, what does that imply for the distributions of these sums of
squares?
Cochran’s Theorem

Let Z1 , Z2 , . . . , Zv be independent N(0, 1) random variables and suppose

$$
\sum_{i=1}^{v} Z_i^2 = Q_1 + Q_2 + \dots + Q_s
$$

where each Qi is a quadratic form in the Zi with vi degrees of freedom (rank). If

v = v1 + v2 + . . . + vs

then Q1 , . . . , Qs are independent and Qi ∼ χ²(vi ).

A useful preliminary result: if X1 , . . . , Xn are independent with means µi and common variance σ², then

$$
E(X_i - \bar X)^2 = (\mu_i - \bar\mu)^2 + \frac{n-1}{n}\sigma^2,
\qquad \bar\mu = \frac{1}{n}\sum_{i=1}^{n} \mu_i
$$
THEOREM A. Under the assumptions for the model Yij = µ + αi + eij , and assuming all ni = n,

$$
E(SS_{error}) = E\Big[\sum_i\sum_j (Y_{ij} - \bar Y_{i·})^2\Big] = (N - a)\sigma^2
$$

$$
E(SS_{treatment}) = E\Big[\sum_i\sum_j (\bar Y_{i·} - \bar Y_{··})^2\Big]
= \sum_i n_i E(\bar Y_{i·} - \bar Y_{··})^2
= \sum_i n_i \alpha_i^2 + (a - 1)\sigma^2
$$
MSE = SSerror /(N − a) may be used as an estimate for σ². It is an unbiased estimator. If all the αi are equal to zero, then the expectation of SStreatment /(a − 1) is also σ²!
F test of H0 : α1 = α2 = · · · = αa = 0
THEOREM B. If the errors are independent and normally distributed with means 0 and variances σ², then SSerror /σ² follows a chi-square distribution with (N − a) degrees of freedom. If, additionally, the αi are all equal to zero, then SStreatment /σ² follows a chi-square distribution with a − 1 degrees of freedom and is independent of SSerror .

Proof. For each i,

$$
\frac{1}{\sigma^2}\sum_{j=1}^{n_i} (Y_{ij} - \bar Y_{i·})^2
$$

follows a chi-square distribution with ni − 1 degrees of freedom. There are a such sums in SSerror , and they are independent of each other since the observations are independent. The sum of a independent chi-square random variables with ni − 1 degrees of freedom each follows a chi-square distribution with Σi (ni − 1) = N − a degrees of freedom. The same reasoning can be applied to SStreatment , noting that Var(Ȳi· ) = σ²/ni .
We next prove that the two sums of squares are independent of each other. SSerror is a function
of the vector U, which has elements Yij − Ȳi. , where i = 1, . . . , a and j = 1, . . . , n. SStreatment
is a function of the vector V, whose elements are Ȳi. . Thus, it is sufficient to show that these
two vectors are independent of each other. First, if i 6= i0 , Yij − Ȳi. and Ȳi0 . are independent
since they are functions of different observations. Second, Yij − Ȳi. and Ȳi. are independent (by
another Theorem from STA2004F). This completes the proof of the theorem.
Under H0 : α1 = α2 = · · · = αa = 0,

$$
F = \frac{SS_{treatment}/(a-1)}{SS_{error}/(N-a)} = \frac{MS_{treatment}}{MSE} \sim F_{a-1,\,N-a}
$$

The mean of an F distribution with denominator degrees of freedom d2 is

$$
E[F] = \frac{d_2}{d_2 - 2}
$$

so under H0 we expect values of F near 1, and large values of F are evidence against H0.
THEOREM C. Under the assumption that the errors are normally distributed, the null
distribution of F is the F distribution with (a − 1) and (N − a) degrees of freedom.
Proof. The proof follows from the definition of the F distribution, as the ratio of two independent
chi-square random variables divided by their degrees of freedom.
ANOVA table
This is a one-way analysis of variance. The ‘one-way’ refers to there only being one factor in
the model and thus in the ANOVA table. Note that the ANOVA table is still based on the
model 3.1, and will have one SS for each term in the model (except the mean):

Source        SS      df      MS                   F
Treatment A   SSA     a − 1   MSA = SSA /(a − 1)   MSA /MSE
Error         SSE     N − a   MSE = SSE /(N − a)
Total         SStot   N − 1

To test H0 : α1 = α2 = · · · = αa = 0, we use F = MSA /MSE ∼ Fa−1,N−a .
Some Notes
1. If the total sum of squares is left uncorrected, there is an extra term due to the mean, with 1 degree of freedom. The error and treatment SS, and the F test, are the same as previously. In that case we have split the total variation, SStot = Σi Σj Y²ij , with N degrees of freedom, into three parts, namely

N = 1 + (a − 1) + (N − a)
respectively. Each SS can be identified with a term in the model

Yij = µ + αi + eij ,   i = 1, . . . , a;  j = 1, . . . , ni
2. We have closed form expressions for each of the Sums of Squares. This is in contrast to
multiple regression, where usually explicit expressions cannot be given for the individual
regression sum of squares. Furthermore, subject to the constraints, we have closed form
expressions for the parameter estimates as well.
3. The error sum of squares pools the within-group variances:

SSE = Σ_{i=1}^{a} (ni − 1) s²i ,   where  s²i = Σj (Yij − Ȳi· )² / (ni − 1)

For a = 2, this is the estimate of σ² we use in the two-sample t-test (assuming equal variances).
4. MSA = SSA /(a − 1) measures the variation in the group means about the overall mean. If the means do not differ, i.e. H0 is true, then SSA /(a − 1) should also estimate σ². So the test of

H0 : α1 = α2 = · · · = αa = 0

made using

MSA /MSE = [SSA /(a − 1)] / [SSE /(N − a)] ∼ Fa−1,N−a

is an F-test comparing variances. This is the origin of the term Analysis of Variance.
5. The Analysis of Variance for comparing the means of a number of groups is equivalent to
a regression analysis. However, in ANOVA, the emphasis is slightly different. In regression
analysis, we test if an arbitrary subset of the parameters is zero [H0 : β⁽²⁾ = 0]. In ANOVA
we are interested in testing if a particular subset, namely α1 , α2 , . . . αa , are zero. So the tests
are more structured. For this reason the matrix formulation is not so convenient. We shall
usually describe models by giving an explicit expression for a typical observation, Yij , such
as
Yij = µ + αi + eij
where
i = 1, . . . , a
j = 1, . . . , ni
and a is the number of levels of treatment factor A. The least squares estimates are then
found by minimizing

SSE = Σi Σj (Yij − µ − αi )² = (Y − Xβ)′(Y − Xβ)
6. MSE = SSe /(N − a) is of vital importance in subsequent analyses, because it is an unbiased estimator of σ², provided the model used is the correct one.

MSE = σ̂² = (1/(N − a)) Σi Σj (Yij − Ȳi· )² = s²

Note that N − a = Σ_{i=1}^{a} (ni − 1). This estimate of σ² is used for all comparisons of the treatment/group means.
7. Computing formulae

The formulae derived by theoretical regression methods are not the most convenient for computing the ANOVA table. To compute the ANOVA table we find the quantities:

(a) the correction factor C = N Ȳ··² ;

(b) the total sums of squares Σi Σj Y²ij and Σi ni Ȳ²i· .

Then:

(c)
SStotal = Σi Σj Y²ij − N Ȳ··² = Σi Σj Y²ij − C
SSA = Σi ni Ȳ²i· − N Ȳ··² = Σi ni Ȳ²i· − C
SSe = SStotal − SSA

Always use plenty of figures when computing. If the data are given to 3 figures, compute with 6 figures and round at the end. (In general, if the data are given to n figures, use 2n figures for the computation.)
Speed-reading Example
METHOD
                  I       II      III
                 82       71       91
                 80       79       93
                 81       78       84
                 83       74       90
                                   88
Mean           81.5     75.5     89.2
Std Deviation   1.3      3.7      3.4
ni                4        4        5
Sums of Squares

N Ȳ··² = 88729
Σij Y²ij = 89246
Σi ni Ȳ²i· = 89153

SStotal = 89246 − 88729 = 517 with (N − 1 = 12) df
SSA = 89153 − 88729 = 424 with (a − 1 = 2) df
SSerror = SStotal − SSA = 517 − 424 = 93 with (N − a = 10) df
ANOVA Table

Source             SS    df   Mean Square   F stat   p-value
Teaching methods   424    2       212        22.8    0.0001862
Error               93   10       9.3
Total              517   12
From this table we would conclude that there is strong evidence (p < 0.001) that mean comprehension marks differ between teaching methods (F = 22.8 ∼ F2,10 ).
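The same table can be obtained in R, as a check on the hand computation:

```r
marks  <- c(82, 80, 81, 83,  71, 79, 78, 74,  91, 93, 84, 90, 88)
method <- factor(rep(c("I", "II", "III"), times = c(4, 4, 5)))
anova(lm(marks ~ method))   # SS 424 and 93, F = 22.8 on 2 and 10 df
```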
Now that we have found evidence that the teaching methods differ, we can finally answer the
really important question: Which is the best method? How much better is it? For this we need
to compare the three methods (means) amongst each other. We do this in Chapter 4. We could
also skip the ANOVA table and jump to the real questions of interest immediately. But very
often, an ANOVA is performed to obtain a summary of which factors are responsible for most
of the variation in the data.
Randomization tests can be used for data from properly randomized experiments. In fact, the
ONLY assumption required for randomization tests to be valid and give exact p-values is that
treatments were randomly assigned to experimental units according to the rules of the partic-
ular experimental design. No assumptions about normality, equal variance, random samples,
independence are needed. Therefore, some call randomisation tests the ultimate nonparametric
tests (Edgington, 2007). Sir R. A. Fisher (1936) used this fact as one of his strong arguments for
the requirement of randomisation in experiments and said about the F and t tests mostly used
for the analysis of experiments: “conclusions have no justification beyond the fact that they
agree with those which could have been arrived at by this elementary method”, the elementary
method being the randomization tests.
Example: Suppose there are 6 subjects, 2 treatments, 3 subjects randomly assigned to each
treatment. The response measured is the reaction time (in seconds).
The null hypothesis will state that the mean (or median) reaction time is the same for each
treatment. The alternative hypothesis states that at least one subject would have provided a
different reaction time under a different treatment.
Because the treatments were randomly assigned, all (6 choose 3) = 20 possible randomisations were equally likely. Under H0 the treatment has no effect, i.e. the observed values reflect differences between the subjects not due to treatments, and thus the observed values would stay the same under a different randomisation.
We can now construct a reference distribution (under H0 ) from the 20 test statistics obtained for
the different possible randomisations. The observed test statistic is compared to this reference
distribution and the p-value calculated as the proportion of values ≥ observed test statistic (or
≤, depending on the alternative hypothesis).
Questions
1. What will the test statistic in the above example be? Are there different possible choices of
test statistic? How will the choice of test statistic influence the calculation of the reference
distribution?
Sample
Eleven plants in a single row, randomly assigned so that 5 were given standard fertilizer
mixture A and 6 were fed a supposedly improved mixture B.
Method of randomisation
The gardener took 11 playing cards, 5 red and 6 black, thoroughly shuffled these and then
dealt them to result in a given sequence of red (A) and black (B) cards.
The layout and the observed yields (pounds of tomatoes) were:

Position in row      1     2     3     4     5     6     7     8     9    10    11
Fertilizer           A     A     B     B     A     B     B     B     A     A     B
Pounds of Tomatoes 29.9  11.4  26.6  23.7  25.3  28.5  14.2  17.9  16.5  21.1  24.3

nA = 5:  Σ yA = 104.2, ȳA = 20.84
nB = 6:  Σ yB = 135.2, ȳB = 22.53
difference in means (modified minus standard) ȳB − ȳA = 1.69
H0 : modifying the fertilizer mixture has no effect on the results and therefore, in partic-
ular, no effect on the mean. H0 : µB − µA = 0
There are 11!/(5! 6!) = 462 possible ways of allocating 5 A's and 6 B's to the 11 plants.
The given experimental arrangement is just one of the 462, any one of which could equally
well have been chosen. To calculate the randomisation distribution appropriate to the H0 that
modification is without effect (i.e. that µA = µB ), we need to calculate all 462 differences in the
averages obtained from the 462 possible arrangements.
The table above shows one such arrangement with its corresponding difference in means = 1.69.
Another arrangement could have been:
Position in row 1 2 3 4 5 6 7 8 9 10 11
Fertilizer A B B A A B A B B A B
Pounds of Tomatoes 29.9 11.4 26.6 23.7 25.3 28.5 14.2 17.9 16.5 21.1 24.3
nA = 5    nB = 6
Σ yA = 114.2    Σ yB = 125.2
ȳA = 22.84    ȳB = 20.87
ȳB − ȳA = −1.97
Figure 3.1: Randomisation distribution for tomato plant data. The red cross indicates the observed difference ȳB − ȳA = 1.69.

There are 460 more such arrangements with resulting differences in means. These 462 differences are summarised by the histogram in Figure 3.1.
The observed difference of 1.69 is indicated with a cross. We find that in this example, 154 of the possible 462 arrangements yield differences greater than or equal to 1.69: p = 154/462 = 0.33.
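This calculation can be reproduced in R: the 11 yields are treated as fixed, and we relabel them in all 462 possible ways.

# Tomato yields (pounds) for the 11 plants, in row order
y <- c(29.9, 11.4, 26.6, 23.7, 25.3, 28.5, 14.2, 17.9, 16.5, 21.1, 24.3)

# All 462 ways of choosing which 5 plants receive fertilizer A
A.sets <- combn(11, 5)

# Difference in means (B minus A) for every possible arrangement
diffs <- apply(A.sets, 2, function(a) mean(y[-a]) - mean(y[a]))

# Proportion of arrangements with a difference >= the observed 1.69
mean(diffs >= 1.69)   # 154/462 = 0.33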
Questions
1. What can be concluded about the fertilizers with the above results?
2. Compare the above p-value to that obtained when using a two-sample t-test.
The Likelihood Ratio Test

In general, likelihood ratio tests compare two nested models by comparing their likelihoods (in the form of a ratio). The likelihood ratio compares the relative support for the two models based on the information in the data.
For the hypothesis test H0 : α1 = . . . = αa = 0 we will compare the following two models: let
model Ω assume that there are differences between the treatments, and let model ω assume
that the treatments have no effect, i.e. a model that corresponds to H0 being true.
(a) Model Ω is

Yij = µ + αi + eij ,    eij ∼ N(0, σ²)

or equivalently Yij ∼ N(µ + αi , σ²).

(b) Model ω is

Yij = µ + eij ,    eij ∼ N(0, σ²)

or equivalently Yij ∼ N(µ, σ²).
The likelihood ratio is

λ = L(ω̂) / L(Ω̂)    (3.2)
where L(ω̂) is the maximized likelihood if H0 is true and L(Ω̂) is the maximum value of the
likelihood when the parameters are unrestricted. Since the observations are independent, under
the Ω assumptions we can multiply the probabilities of the observations:
L(µ, αi , σ²) = ∏ij (2πσ²)^{−1/2} exp{ −(1/2σ²) (Yij − µ − αi )² }    (3.3)
The log-likelihood is

l(µ, αi , σ²) = −(N/2) log 2π − (N/2) log σ² − (1/2σ²) Σi Σj (Yij − µ − αi )²    (3.4)

where N = Σi ni . For fixed σ² this is maximised when the last term is a minimum. But this
term is exactly the sum of squares that was minimized when finding the least squares estimate!
So the least squares estimates are the same as the maximum likelihood estimates. (Note that
this is only true for normal models, normal errors).
Let

RSS(Ω) = Σi Σj (Yij − µ̂ − α̂i )²    (3.5)
then the maximized log-likelihood for fixed σ² is

ℓ(Ω̂) = c − RSS(Ω̂)/(2σ²)    (3.6)

where

c = −(N/2) log(2πσ²)
Similarly, under the ω assumptions,

L(ω) = (2πσ²)^{−N/2} exp{ −(1/2σ²) Σi Σj (Yij − µ)² }

l(ω) = −(N/2) log(2π) − (N/2) log(σ²) − (1/2σ²) Σi Σj (Yij − µ)²    (3.7)

This is maximised when

RSS(ω) = Σi Σj (Yij − µ)²    (3.8)

is a minimum, which occurs when µ is the least squares estimate, so RSS(ω̂) = Σi Σj (Yij − µ̂)², where µ̂ = Ȳ·· .
ℓ(ω̂) = c − RSS(ω̂)/(2σ²)    (3.9)
We now take minus twice the difference of the log-likelihoods (corresponding to minus twice the log of the likelihood ratio)

λ = L(ω̂) / L(Ω̂)    (3.10)

−2 log λ = [ RSS(ω̂) − RSS(Ω̂) ] / σ²    (3.11)
We estimate σ² by

σ̂² = RSS(Ω̂) / (N − a)

for which

σ̂² (N − a) / σ² ∼ χ²_{N−a}

Under the assumption of normality the likelihood ratio statistic −2 log λ has an exact chi-square distribution when σ² is known. When σ² is estimated we use

F = [ (RSS(ω̂) − RSS(Ω̂)) / (a − 1) ] / [ RSS(Ω̂) / (N − a) ] ∼ F_{a−1, N−a}
Q: Verify that this is equivalent to the F test found in the sums of squares derivation above.
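In R this model comparison can be carried out with anova() on two nested lm() fits, which reports exactly this F statistic. A sketch with simulated data (for illustration only):

# Simulated data: 3 treatments, 5 replicates each
set.seed(1)
d <- data.frame(group = factor(rep(1:3, each = 5)),
                y     = rnorm(15, mean = rep(c(10, 12, 15), each = 5)))

fit.Omega <- lm(y ~ group, data = d)   # model Omega: treatment effects
fit.omega <- lm(y ~ 1,     data = d)   # model omega: corresponds to H0

# The F statistic reported is (RSS(omega) - RSS(Omega))/(a - 1) divided
# by RSS(Omega)/(N - a), i.e. the F test derived above
anova(fit.omega, fit.Omega)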
Some Notes:
1. An important result emerges from this derivation. We have shown that least squares is a
method (or algorithm) for obtaining the maximum likelihood estimates of the parameters
when assuming a normal error distribution.
2. When the data are not normally distributed, least squares still provides estimates that minimize the squared deviations of the observed values from the fitted values. They provide a best fit.
3. The least squares estimates (= the maximum likelihood estimates for normal models) are µ̂ = Ȳ·· and α̂i = Ȳi· − Ȳ·· .
The Kruskal-Wallis Test

This is a nonparametric test to compare more than two independent populations. It is the nonparametric version of the one-way ANOVA F-test, which relies on normality of the populations.
The assumptions are that we have k independent random samples of sizes n1 , n2 , . . . , nk (k ≥ 3),
independent observations within samples, populations are identical except possibly with respect
to location, and the data must be at least ordinal.
Hypotheses:

H0 : all k populations have the same location (the populations are identical)
H1 : at least one population tends to yield larger (or smaller) values than the others
To calculate the test statistic we rank all observations from 1 to N (N = Σᵏᵢ₌₁ ni ); for ties, assign the mean of the tied ranks to each observation. The test statistic is based on comparing each group's mean rank with the mean of all ranks (weighted by sample size). For each group i calculate Ri = sum of the ranks in group i. Then

H = 12/(N(N+1)) Σᵏᵢ₌₁ ni ( Ri /ni − (N+1)/2 )²
  = 12/(N(N+1)) Σ Ri²/ni − 3(N+1)
For large sample sizes the distribution of the Kruskal-Wallis test statistic can be approximated
by the χ2 -distribution with k − 1 degrees of freedom:
H ≈ χ²_{k−1}
When sampling from a normal distribution, the power for the Kruskal-Wallis test is almost
equal to that of a classical F-test. When outliers are present, the Kruskal-Wallis test is much
more reliable than the F-test.
Example
As well as statistics and general books I have a number of travel books. I count the number of pages in a random sample of each kind (8 travel, 16 general and 12 statistics books). Examine the validity of the hypothesis that these may be samples from the same population.
nT = 8 nG = 16 nS = 12 N = 36
RT = 169 RG = 227 RS = 270
H = 12/(36 × 37) × ( 169²/8 + 227²/16 + 270²/12 ) − 3 × 37 = 4.91
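A sketch of this calculation in R, from the rank sums given above (with the raw page counts one would simply call kruskal.test()):

n <- c(8, 16, 12)       # travel, general, statistics
R <- c(169, 227, 270)   # rank sums
N <- sum(n)             # 36

H <- 12 / (N * (N + 1)) * sum(R^2 / n) - 3 * (N + 1)
H                                         # 4.91

# Approximate p-value from the chi-squared distribution with k - 1 = 2 df
pchisq(H, df = 2, lower.tail = FALSE)     # about 0.086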
Chapter 4
4.1 Introduction
In most experiments we want to compare a number of treatments. For example, a new method
of communicating with employees is compared to the current system. We want to know 1)
whether there is an improvement (or change) in employee happiness, and, more importantly,
2) how large this change is. For the latter we need estimates and standard errors or confidence
intervals.
The analysis of variance table tells us how much evidence there is that the means differ. In this
chapter we consider what the next steps in the analysis are. If there is no evidence for differences
between the means, technically speaking the analysis is complete at this stage. However, one
should remember that there are two possible situations/reasons that could both lead to this
outcome of no evidence that the means differ.
1. There is truly no difference between the means (or it is so small that it is not interesting).
2. There is a difference, but the F -test did not have enough power.
Technically, the power of a test is the probability of rejecting H0 when it is false. The main reasons for lack of power are 1) too few replicates and 2) a large experimental error variance. Both of these are design issues. Therefore, it is crucial to think about power at the design stage of the experiment (see Chapter 6).
Suppose however, that we have found enough evidence to warrant further investigation into
which means differ, which don’t and by how much they differ. To do this we contrast one
group of means with another, i.e. we compare groups of means to find out where the differences
between treatments are. We can do this in two ways: either by constructing a test of the form

H0 : µA − µB = 0

or by constructing a confidence interval of the form

est(diff) ± tν × SE(diff)
As always, a confidence interval is much more informative than the result from a hypothesis
test.
4.2 Contrasts
Consider the model Yij = µ + αi + eij with constraint Σ αi = 0. We will be assuming that all ni are equal. (It is possible to construct contrasts with unequal ni , but this gets very complicated. This is one reason to design a CRD experiment with an equal number of experimental units per treatment.)

A contrast is a linear combination of the treatment effects, L = Σᵃ₁ hi αi with Σ hi = 0. Its estimate is
L̂ = Σᵃ₁ hi α̂i
   = Σᵃ₁ hi (Ȳi· − Ȳ·· )
   = Σᵃ₁ hi Ȳi·      since Σ hi Ȳ·· = Ȳ·· Σ hi = 0

Var(L̂) = Σᵃ₁ hi² Var(Ȳi· ) = σ² Σᵃ₁ hi² / ni

V̂ar(L̂) = s² Σ hi² / ni
where s2 is the mean square for error with ν degrees of freedom, e.g. ν = N − a in a CRD.
Examples
1. α1 − α2 is a contrast with h1 = 1, h2 = −1, h3 = · · · = ha = 0. Its estimate is Ȳ1· − Ȳ2· with variance s² ( 1/n1 + 1/n2 ).
3. I might want to compare average salaries in small companies to those in medium and large companies, maybe to answer the question whether one earns less in small companies. To do this I would construct a contrast: µsmall − µmed or large = µsmall − ½(µmed + µlarge ) (assuming that the number of companies in each group was the same), i.e. I am comparing/contrasting two average salaries. The coefficients hi sum to zero: 1 − ½ − ½ = 0.
Sometimes contrasts are defined in terms of treatment totals, e.g. Yi· , instead of treatment
means. We will mainly compare/contrast treatment means.
Comments on Contrasts
1. Although we have defined contrasts in terms of the α’s, contrasts estimate differences
between means, so they are very often simply called contrasts of means. Essentially they
estimate the differences between groups of means.
2. In designed experiments the means are usually based on the same number of observations,
n. However contrasts can be defined if there are unequal numbers of observations.
3. Two contrasts L1 = Σ hi αi and L2 = Σ gi αi are said to be orthogonal if Σ hi gi = 0. Orthogonality implies that they are independent, i.e. that they summarize different dimensions of the data.
4. The estimate for σ 2 is given by the Mean Square for Error, s2 = M SE. Its degrees of
freedom depend on the design of the experiment, and the number of replicates.
6. The standard error (SE) of a contrast is a measure of its precision or uncertainty, how
well were we able to estimate the difference in means? It is the standard deviation of the
sampling distribution of the contrast.
7. An important contrast is that of the difference between two means Ȳ1· − Ȳ2· . Its standard
error is called the standard error of the difference s.e.d.
s.e.d. = √[ s² (1/n + 1/n) ] = s √(2/n)
Example
Suppose an experiment is conducted to determine the wearing quality of paint. The paint was
tested under four conditions:
1. Hard wood dry climate µ1 .
Q: What can you say about the treatment structure? How many factors are involved?
3. Does the difference between wet and dry climates depend on whether or not the wood was
hard or soft?
These questions can be formulated before we see the results of the experiment. To answer them we would want to test the following hypotheses:
1. H0 : ½(µ1 + µ2 ) = ½(µ3 + µ4 ), or equivalently H0 : ½µ1 + ½µ2 − ½µ3 − ½µ4 = 0

2. H0 : ½(µ1 + µ3 ) = ½(µ2 + µ4 ) ≡ H0 : ½µ1 + ½µ3 − ½µ2 − ½µ4 = 0

3. H0 : ½(µ1 − µ2 ) = ½(µ3 − µ4 ) ≡ H0 : ½µ1 − ½µ2 − ½µ3 + ½µ4 = 0
This last contrast is testing for an interaction between type of wood and climate, i.e. does
the effect of climate depend on type of wood (see Chapter 7).
Clearing these contrasts of fractions we can write the coefficients hki in a table, where hki is the
ith coefficient of the k th contrast.
h1 h2 h3 h4
1. Hard vs soft wood 1 1 -1 -1
2. Dry vs wet climate 1 -1 1 -1
3. Climate effect depends on wood type 1 -1 -1 1
Although it is easier to manipulate contrasts that don't contain fractions, and hypothesis tests will lead to the same results with or without fractions, confidence intervals will differ. Keeping the ½s will lead to confidence intervals for the difference in means. Without the fractions, we obtain a confidence interval for 2× the difference in means. As we do want to understand what these values tell us, the first (with the ½s) is much more useful.
Note that Σ⁴ᵢ₌₁ h1i = Σ⁴ᵢ₌₁ h2i = Σ⁴ᵢ₌₁ h3i = 0 by definition of our contrasts. But also Σ⁴ᵢ₌₁ h1i h2i = 0, i.e. contrasts 1 and 2 are orthogonal. This means that their estimates will be
statistically independent under normal theory (or uncorrelated if non-normal). You can verify
that contrasts 2 and 3 and 1 and 3 are also orthogonal.
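These checks are one-liners in R, using the coefficients from the table above:

# Rows: the three contrasts; columns: the four treatment means
h <- rbind(wood        = c(1,  1, -1, -1),
           climate     = c(1, -1,  1, -1),
           interaction = c(1, -1, -1,  1))

rowSums(h)   # all 0: each row is a valid contrast
h %*% t(h)   # all off-diagonal entries 0: mutually orthogonal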
From the four means we have found three mutually orthogonal (independent) contrasts. In
general, given p means we can find (p − 1) orthogonal contrasts. There are many sets of
(p − 1) orthogonal contrasts - we can select a set that is convenient to us.
If it so happens that the questions of interest result in a set of orthogonal contrasts, this will
simplify the interpretation of results. However, it is more important to ask the right questions
than to obtain a set of orthogonal contrasts.
In some cases it is convenient to test contrasts in the context of analysis of variance. The
treatment sums of squares, SSA say, have (a − 1) degrees of freedom, when a treatments are
compared. This sum of squares can be split into (a − 1) mutually orthogonal (independent)
sums of squares, each with 1 degree of freedom, each corresponding to a contrast, so that

SSA = SS1 + SS2 + · · · + SSa−1

We can test for these (a − 1) orthogonal contrasts simultaneously within the ANOVA table.
For convenience we assume the treatments are equally replicated (i.e. the same number of
observations on each treatment). Let
a = number of treatments,
n = number of replicates per treatment,
Ȳi. = mean for treatment i,
SSA = Σᵃᵢ₌₁ n (Ȳi. − Ȳ.. )², the treatment SS with (a − 1) df.
Then:
1. L = h1 Ȳ1. + h2 Ȳ2. + . . . + ha Ȳa. is a contrast if Σᵃ₁ hi = 0.

2. Var(L) = (s²/n) Σi hi² where s² = MSE.

3. L1 and L2 are orthogonal if Σ h1i h2i = 0.

4. The sum of squares for L is

SSL = n L² / Σ hi²

and has one degree of freedom.

5. If L1 and L2 are orthogonal then

SS2 = n L2² / Σ h2i²

is a component of SSA − SS1 .
7. The hypothesis H0 : Li = 0 versus H1 : Li ≠ 0 can be tested using F = SSi /MSE with 1 and ν degrees of freedom, where ν = degrees of freedom of the MSE.
8. Orthogonal contrasts can be defined if there are unequal numbers of replications in each group, but the simplicity of the interpretation breaks down. With n1 , n2 , . . . , na observations in each group, L = h1 Ȳ1. + · · · + ha Ȳa. is a contrast iff

n1 h1 + n2 h2 + . . . + na ha = 0
An equal number of replicates for each treatment ensures that we have meaningful sets of
orthogonal contrasts, each of which will explain some aspect of the experiment
independently of any others. This gives a very clear interpretation of the results. If we have
unequal numbers of replications of the treatments the different aspects cannot be
completely separated.
9. The word “orthogonal” is used in the same sense as in mechanics, where two orthogonal forces ↑→ act independently of each other. In an a-dimensional space the contrasts can be seen as (a − 1) perpendicular vectors.
Example
To compare the durability of different methods of finishing a piece of mass produced furniture,
the production manager set up an experiment. There were two types of paint available (called
A and B), and two methods of applying it: by brush or spray. Six pieces of furniture were
randomly assigned to each treatment and the durability of each was measured.
3. How methods compare within the two paints, i.e. is the difference between brush and
spray the same for both paints?
The treatment means were:
Treatment
1 2 3 4
100 120 40 70
Source SS df MS F
Treatments 22050 3 7350 50.69
Error 2900 20 145
The treatment sum of squares can be split into three mutually orthogonal contrasts, as shown
in the table:
Li = Σ⁴ⱼ₌₁ hij Ȳj. ,    SSi = n Li² / Σj hij² ,    n = 6,    Σj hij² = 4

Mean Durability for Paint A = ½(100 + 40) = 70.00
Mean Durability for Paint B = ½(120 + 70) = 95.00
A 95% confidence interval for the difference in durability between paint B and A:
25 ± t20 √(2 × 145/12) = [14.7; 35.3]
So, paint B is estimated to last, on average, between 14.7 and 35.3 months longer than paint A. Note the 12 in the denominator when calculating the standard error: the mean for paint A is based on 12 observations, as is the mean for paint B.
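The interval can be verified in R:

s2 <- 145                  # MSE with 20 df
d  <- 95 - 70              # paint B mean minus paint A mean
se <- sqrt(2 * s2 / 12)    # each paint mean is based on 12 observations

d + c(-1, 1) * qt(0.975, df = 20) * se   # [14.7, 35.3]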
Mean Durability using Brush = ½(100 + 120) = 110.00
Mean Durability using Spray = ½(40 + 70) = 55.00
Exercise: Construct confidence intervals for the brush and interaction contrasts.
Overall the above information suggests that the brush method gives a more durable surface,
and that the best combination for durability is paint B applied with a brush. The brush
method is preferable to the spray method irrespective of which paint is used.
We have so far avoided the Neyman-Pearson paradigm for statistical hypothesis testing.
However, for discussing the problem of multiple comparisons, it can be useful to temporarily
revert to thinking in terms of making a decision based on some predetermined cut-off level, i.e.
reject, or fail to reject H0 .
We know that when we make a statistical test, we have a small probability α of rejecting the
null hypothesis when true (α = Type I error). In the completely randomised design, the
means of the groups fall naturally into a family, and statements we make will be made in
relation to the family, or experiment as a whole, i.e. we cannot ignore what other tests have
been performed when interpreting the outcome of any single test. We would like to be able to
control the overall Type I error, also called the experiment-wise Type I error rate, i.e. the
probability of rejecting at least one hypothesis that is true. Controlling the Type II error
(accepting at least one false hypothesis) is more difficult, as for this we would need to know
the true differences between the means.
What is the overall Type I error? We cannot say exactly but we can place an upper bound on
it.
Consider a family of k tests. Let Ei be the event {the ith hypothesis is rejected when true},
i = 1, . . . , k, i.e. the event that we make a type I error in test i. Then for test i, P(Ei ) = αi ,
the significance level of the ith test.
Extending the result P (E1 ∪ E2 ) = P (E1 ) + P (E2 ) − P (E1 ∩ E2 ) to k events, we have that
P(∪ᵏ₁ Ei ) = Σᵏᵢ₌₁ P(Ei ) − Σ_{i<j} P(Ei ∩ Ej ) + Σ_{i<j<l} P(Ei ∩ Ej ∩ El ) − · · · + (−1)^{k+1} P(E1 ∩ . . . ∩ Ek )
An upper bound for this probability can be obtained:
P(∪ Ei ) ≤ Σᵏᵢ₌₁ P(Ei ) = kα    if all P(Ei ) = α    (4.1)
This is called the Bonferroni inequality. The Bonferroni inequality implies that when conducting k tests, the overall probability of a Type I error can be as high as k × α. For example, when conducting 10 tests, each at a 5% significance level, the probability of one or more Type I errors (wrong decisions) can be as high as 50%.
If the tests are independent, the probability of at least one Type I error can be calculated exactly:

P(∪ᵏ₁ Ei ) = 1 − P(∩ᵏ₁ Ēi )
          = 1 − ∏ᵏ₁ P(Ēi )      (if the Ei 's are independent)
          = 1 − ∏ᵏ₁ (1 − αi )
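For example, for k = 10 independent tests, each at α = 0.05:

1 - (1 - 0.05)^10   # 0.40: exact probability under independence
10 * 0.05           # 0.50: the Bonferroni upper bound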
Suppose we want to compare lots of treatments, either in the form of null hypothesis tests or
confidence intervals. Each of these tests, if the null hypothesis is true, has a small chance of
resulting in a type I error (because of the random nature of data). If I conduct many tests, a
small proportion of the results will be Type I errors. Type I errors lead to false claims and
recommendations, in health, science and business, and therefore should be avoided.
We have a dilemma. If we control the Type I error rate (using Bonferroni, Tukey, Scheffé or any other of the many methods available), the Type II error rate increases, i.e. power decreases, meaning we could be missing some interesting differences. If we do NOT control, we could end up with a large Type I error rate (detect differences that are not real). And the number of tests grows quickly: with 20 treatments, all pairwise comparisons alone already amount to (20 choose 2) = 190 tests, more if other contrasts are added.
The following summarizes my personal opinion on how one can approach the above dilemma.
In a confirmatory study you will know exactly what you are looking for, will have a-priori
hypotheses that you want to test, you are collecting more evidence towards a particular goal,
e.g. do violent video games result in more aggressive behaviour? Confirmatory studies will
mostly have only a few a-priori hypotheses, and therefore not such a large experiment-wise
Type I error rate. Also, often, the null hypothesis might not be true (because we are looking
to confirm a difference/effect). If the null hypothesis is false, it is not possible to make a Type
I error, and any method to control for Type I errors would just increase the Type II error rate.
On the other hand, in exploratory studies we have no clear expectations, we are just looking
for patterns, relationships. Here we don’t need to make a decision (reject or not a null
hypothesis). But these are often the studies that generate a very large number of tests, and
hence a potentially very large number of Type I errors. If we control Type I errors, we might
be missing some of the interesting patterns. A solution here is to treat the results with
suspicion/caution: be aware that many of the small p-values may not be repeatable, and need
to be confirmed before you can become more confident that an effect/difference is present. Use
these studies as hypothesis generating, and follow up with confirmatory studies. Use p-values
to measure the strength of evidence, avoid the Neyman-Pearson approach.
Declare whether your study is exploratory or confirmatory. This makes it clear to the reader
how to interpret your results.
A big problem is of course if one wants to show that something DOES NOT HAVE AN
EFFECT, e.g. that violent video games do not increase aggression. Remember that a large
p-value would not mean that the null hypothesis is true or even likely. It could mean that
your power is too low to detect the effect. In this case it is important to make sure that power
is sufficient to detect an effect, AND remember that large p-values DO NOT mean that there
is NO effect. Here we really want to compare evidence for H0 vs evidence for H1. This is not
possible with p-values or the classical statistical hypothesis tests. One solution to this is to
calculate a likelihood ratio (related to Bayes factors): likelihood under null hypothesis, vs
likelihood under the alternative hypothesis, as a measure of how much support the null vs the
alternative hypothesis has. This of course would also need to be repeated, as the likelihood
ratio is just as prone to spurious outcomes as any other statistic.
In summary, a good way to find out if something was a type I error or not (whether an effect
is real or not), is to follow up with other studies. Secondly, it is important to admit how many
tests were conducted, and which of these were post-hoc, so that it is clear what the chances of
type I errors are. Also consider how plausible the null hypothesis is in the first place, based on
other information and context. If the null hypothesis is likely (e.g. you would never expect this particular gene to be linked to a particular disease such as cancer), but your statistical analysis has just produced a small p-value, you should be very suspicious, i.e. suspect a Type I error.
Also, Type I errors depend on making a decision. If we don’t use the Neyman-Pearson
approach (i.e. don’t make a decision), but just report p-values (collecting evidence), we just
need to be aware that small p-values can occur even if the null hypothesis is true (e.g.
P (p < 0.05|H0 ) = 0.05).
Having said all of the above and having recommended not to correct, you should still know that in many fields controlling the Type I error rate is expected. Bonferroni's, Tukey's and Scheffé's methods are the most commonly used methods to control the Type I error rate, although there are many, many more.
For Bonferroni’s method we need to know exactly how many tests were performed. Therefore,
this method is mainly used for a-priori hypotheses. Also, it is very conservative (see
Bonferroni’s inequality, it corrects for the worst-case scenario). If the number of tests exceeds
approximately 10, the correction is so severe, that only the very large effects will still be
picked up.
Tukey’s method is used to control the experiment-wise Type I error rate when all pairwise
comparisons between the treatments are performed.
Scheffé's method is used when lots of contrasts, not all of which are pairwise, are performed.
Many special methods have been devised to control Type I error rates, also sometimes referred
to as false positives. We discuss some methods for Type I error rate adjustment that are
commonly used, but many others are available for special problems, such as picking the largest
mean, or comparing all treatments to a control (see the list of references at the end of this
chapter).
Bonferroni Correction
Bonferroni's method of Type I error rate adjustment can be quite conservative (severe). Therefore, it is only used in cases where there is a small number of planned comparisons. The correction is based on the Bonferroni inequality.
2. Adjust the percentile of the t-distribution: use the (α/2m)th percentile instead of the (α/2)th for each statement (Appendix Table 7). In other words, the significance level used for each individual test/contrast/confidence interval is αC = αE /m, where αE is the experiment-wise Type I error rate, or the maximum Type I error rate we are willing to allow over all tests. Often, αE = 0.05.
The Bonferroni inequality ensures that the probability that all m intervals cover the true
parameters is at least (1 - α), i.e. the probability of no Type I error.
Confidence intervals:
Given m statements of the form Σ hi αi , the confidence intervals have the form

Σᵃ₁ hi Ȳi· ± t_ν^{(α/2m)} [ s² Σᵃ₁ hi² / ni ]^{1/2}

where t_ν^{(α/2m)} is the (α/2m)th percentile of tν , and ν = degrees of freedom for s² (MSE, the mean square for error).
For example, if we have decided on five comparisons (a-priori), we would make each
comparison at the (two-sided) 1 % level. Then the probability that all five confidence intervals
cover the true parameter values is at least 95%.
Hypothesis tests:
H0 : Σ hi αi = 0
HA : Σ hi αi ≠ 0
1. Reject H0 if
| Σ hi Ȳi· | / [ s² Σ hi² / ni ]^{1/2} > t_ν^{(α/2m)}
2. or, equivalently, calculate each p-value as usual, but then reject only if p < αE /m, where
αE is the experiment-wise type I error we are willing to allow.
Tukey’s Method
The Studentised range is the distribution of q = R/s, where R is the range (largest minus smallest) of a set of values x1 , . . . , xa and s is an independent estimate of their standard deviation. The parameters of the Studentised range are a (the number of values xi ) and ν (the degrees of freedom of s²). The upper α point of q is denoted by q^α_{a,ν} , i.e. P(q ≥ q^α_{a,ν} ) = α (see Tables 2 and 3).
Tukey’s method is used when we want to make all pairwise comparisons between a treatment
means (µi − µj ). It corrects confidence intervals and hypothesis tests for all these possible
pairwise comparisons.
Let’s say we are comparing a total of a treatment means. We assume that we have the same
number of observations per treatment/group, n, say. The appropriate standard error will be √(s²/n), the SE for a treatment mean. Under the null hypothesis of no differences between means,

P( R / √(s²/n) ≥ q^α_{a,ν} ) = α
and, by implication, the probability that any difference, under H0 , exceeds this threshold is at
most α, i.e. at most α of any pairwise differences will exceed the α threshold of the
studentised range distribution.
Then, to construct confidence intervals, let Σ hi αi be a contrast of the form L = Ȳi. − Ȳj. (with the sum of the positive hi 's equal to 1, and Σ hi = 0). Then a confidence interval adjusted for all possible pairwise comparisons is

L̂ ± q^α_{a,ν} s/√n
The overall experiment-wise Type I error will be α (see Tables 2 and 3 in the Appendix). Here s² = MSE with ν degrees of freedom and q^α_{a,ν} is the upper α point of the Studentised range distribution.
The corresponding p-value is

P( q_{a,ν} > L̂ / (s/√n) )
1. Using Tukey’s method we can construct as many intervals as we please, either before or
after looking at the data. The method allows for all possible contrasts to be examined.
2. All intervals have the same length, which is 2 q^α_{a,ν} s/√n.
3. Tukey’s method gives shorter intervals for pairwise comparisons (compared to Scheffé’s
method), i.e. it has more power, and is thus used almost exclusively for pairwise
comparisons.
5. Need equal numbers per group. For unequal numbers see Spjøtvoll and Stoline (1973).
6. Under the Neyman-Pearson approach, the hypothesis H0 : Σ hi αi = 0 is rejected if

| Σ hi Ȳi· | / (s/√n) > q^α_{a,ν}

Equivalently, compute the honestly significant difference

HSD = q^α_{a,ν} s/√n

Then any two means that differ by more than this HSD are significantly different.
Scheffé’s Method
The confidence intervals have the form

Σᵃ₁ hi Ȳi· ± ( (a − 1) F^α_{a−1,ν} )^{1/2} ( s² Σ hi² / ni )^{1/2}
Note that the last part is just the standard error of the contrast. Also note how the factor
(a − 1) with which the F-quantile is multiplied will stretch the reference distribution to the
right, making the observed values less extreme.
The corresponding p-value is

P( S > L̂ / SE(L̂) ) ,    where S = ( (a − 1) F_{a−1,ν} )^{1/2}
1. Scheffé’s method is better than Tukey’s method for general contrasts, but intervals are
longer for pairwise comparisons.
2. Intervals are longer than Bonferroni intervals, but we do not have to specify them in
advance.
3. Scheffé’s method covers all possible contrasts, but this makes the intervals longer because
protection is given for many cases of no practical interest.
5. Can be used in multiple regression as well as ANOVA, in fact anywhere where the F -test is
used.
6. When the hypothesis of equal treatment means was rejected in the ANOVA, there will be
at least one significant difference among all possible contrasts. No other method has this
property.
7. Under the Neyman-Pearson paradigm, to test H0 : Σ hi αi = 0, reject if

| Σ hi Ȳi· | / [ s² Σ hi² / ni ]^{1/2} > ( (a − 1) F_{a−1,ν} )^{1/2}
Example

In this experiment the strengths of welds produced by four different welding techniques (A, B,
C, D) were compared. Each welding technique was used to weld five pairs of metal plates in a
completely randomized design. The average strengths were:
Technique: A B C D
Mean: 69 83 75 71
The estimate of experimental error variance for the experiment was M SE = 15 with 16
degrees of freedom.
We are going to use all three methods to control the Type I error rate on this example, although in practice one would probably use only one of them, depending on the type of contrasts.
Bonferroni
Suppose we had planned 3 a-priori contrasts: compare every technique to C, maybe because C
is the cheapest welding technique, and we will only start using another technique if it yields
considerably stronger welds.
These are pairwise comparisons, but we can still use Bonferroni’s method. We are going to
assume that the number of replicates for each treatment was the same, i.e. 5 (check this).
We could approach this by constructing a confidence interval for each contrast/difference. For
example, a 95% confidence interval for the difference between techniques A and C:
69 − 75 ± t16^{1−α/(2×3)} × √(2 × 15/5) = −6 ± 2.67 × √6 = [−12.54, 0.54]
Check that you agree with the standard error, and that you can find the critical value in table
A.8 (appendix to notes).
The above confidence interval tells us that we estimate the true difference in average weld
strength to lie between -12.54 and 0.54 units. Most of the interval is on the negative side,
indicating that there is some evidence that A produces weaker welds, but there is a lot of
uncertainty, and we cannot exclude the possibility that there is not actually a difference.
Confidence intervals provide much more information than hypothesis tests, and you should
always prefer presenting information as a confidence interval rather than as a hypothesis test
if possible. But, as an exercise, let’s do the hypothesis test also, first using the
Neyman-Pearson approach (bad) and then Fisher’s approach (p-value, much better).
H0 : µA = µC
H1 : µA 6= µC although, in this case we might want to test H1 : µA > µC (we would get the
critical value from the t-tables: t0.05/3 = t0.017 = 2.318 (qt(0.017, 16, lower.tail = F))).
tobs = −6 / √(2 × 15/5) = −2.449
For the two-sided test we reject if |tobs | > t0.05/(2×3) = 2.67 (Table A.8).
Here, we cannot reject the null hypothesis, i.e. there is no evidence that techniques A and C
differ in terms of average weld strengths. But note that this does NOT mean that they are the
same in strength. We just don’t know, and might need an experiment with greater power to
find out.
To find the p-value we would use R: 2 * pt(-2.449, 16) = 2 × 0.013 = 0.026. This is the
uncorrected p-value. To correct we multiply by 3 (Bonferroni correction): p = 0.078. Now
that we have the exact (adjusted) p-value, it seems that there is possibly a difference in weld
strength between techniques A and C (little, but not no evidence). This example shows what
happens when adjusting p-values (controlling the Type I error rate). What you can conclude
about techniques A and C really depends on whether this was an exploratory or confirmatory
study, and what you already know about these two techniques. In any case, the p-value gives
you a much better understanding of the situation than the Neyman-Pearson approach above.
Bonferroni's correction ensures that the overall probability of a Type I error, over the 3 tests conducted, is at most 0.05, but note that the probability of a Type I error for each individual test (or confidence interval) is much smaller, namely 0.05/3 = 0.017.
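A sketch of these calculations in R:

mse <- 15; df <- 16; n <- 5; m <- 3   # MSE, error df, replicates, no. of tests
d   <- 69 - 75                        # A minus C
se  <- sqrt(2 * mse / n)

# Bonferroni-adjusted 95% confidence interval
d + c(-1, 1) * qt(1 - 0.05 / (2 * m), df) * se   # [-12.54, 0.54]

# Unadjusted and Bonferroni-adjusted p-values
p <- 2 * pt(-abs(d / se), df)   # 0.026
min(m * p, 1)                   # 0.078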
Tukey
If we had wanted to make all pairwise comparisons between the treatment means, and wanted
to control the experiment-wise Type I error rate, we would use Tukey’s method. Note that we
could also make all pairwise comparisons without controlling Type I error, and therefore not
use Tukey’s method, i.e. Tukey’s method just refers to a way of controlling Type I error rates
for the particular case of pairwise comparisons.
Tukey’s method is based on the distribution of the maximum difference under the null
hypothesis that all means come from the SAME normal distribution, i.e. that there are no
differences between treatments. It uses the studentized range distribution, which gives the
density of the maximum difference (studentized range). In other words, it defines how often
the maximum difference (under H0) will exceed a certain threshold. For example, if the
maximum difference exceeds a value c only with a 5% probability, that tells us that 95% of the
time the maximum observed difference (in a sample of size a, under H0) should be less than c,
and hence the probability that ANY difference should be less than c is 0.95, and, voila, we
have fixed the experiment-wise type I error at a maximum of 5%.
The weld example had 4 treatments. This makes (4 choose 2) = 6 possible different pairwise comparisons. There is a trick to do this, which unfortunately only works for the Neyman-Pearson approach: sort the means from smallest to largest; calculate the HSD (honestly significant difference); any difference between two means which exceeds the HSD is declared ‘significant’.
Technique: A D C B
Mean: 69 71 75 83
HSD = q^{0.05}_{4,16} × √(15/5) = 4.05 × √3 = 7.01
Note: The √(15/5) part here is NOT the standard error of a difference, but the standard deviation of a mean (in the Studentised range distribution, the standard deviation of the values being compared). The critical value can be found in Table A.3 (see the notes below Table A.3 to help you understand what the rows and columns refer to).
We now go back to our sorted means, and draw a line under means which are NOT
significantly different:
Technique: A D C B
Mean: 69 71 75 83
no sign. diff. between A and D, and A and C
B is sign. diff. from C, and hence from all others
This can be interpreted as follows: There is evidence that B is stronger than A, D, and C, but
there is no evidence that A, D and C differ in mean weld strength. It is unlikely (< 0.05) that
even the largest difference in a sample of means would exceed 7.01 under H0, so we now have
fairly strong evidence that B produces stronger welds than the other 3 welding techniques.
If our means had been slightly different (C = 77), this whole procedure would change to:
Technique: A D C B
Mean: 69 71 77 83
This can then be interpreted as follows: There is evidence that B is stronger than A and D,
but not C, C is stronger than A, but not D, and no evidence that A and D differ in strength.
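In R, qtukey() gives the quantiles of the Studentised range distribution, so the HSD procedure is:

hsd <- qtukey(0.95, nmeans = 4, df = 16) * sqrt(15 / 5)
hsd                                     # about 7.01

means <- c(A = 69, D = 71, C = 75, B = 83)
abs(outer(means, means, "-")) > hsd     # TRUE marks 'significant' pairs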
Scheffé
Lastly, we will briefly use Scheffé’s method to construct a confidence interval for the difference
between B and the other 3 techniques. Scheffé’s method for controlling Type I error rates is
used when we have lots of contrasts (explicit or implicit), and they are not only the pairwise
comparisons, but may be more complicated, as the one above. We wouldn’t use Scheffé’s
method if the contrast above was the only contrast we were going to test, but usually, once
you have invested time and money into conducting an experiment, you might as well try and
get all the information you can from it (even if you only use the results to generate hypotheses
for a future experiment).
Scheffé's method is very similar to a t-test, except that the critical value is not from a t-distribution but rather

( (a − 1) F^α_{a−1,ν} )^{1/2}
where a is the total number of treatment means in the experiment, and ν is the error degrees
of freedom. The factor (a − 1) has the effect of shifting the critical region to the right, i.e.
reducing type I error of the individual test, but also reducing power of the individual test.
A 95% confidence interval for the difference between B and the average of the other 3
techniques is found as follows:
L̂ ± c × SE(L̂)
Let's first find the standard error = √Var of the contrast:
Var(L̂) = Var( µ̂B − (µ̂A + µ̂D + µ̂C )/3 )
        = 15/5 + (1/9) × 3 × 15/5
        = 4
( 83 − (69 + 71 + 75)/3 ) ± ( (4 − 1) F^{0.05}_{4−1,16} )^{1/2} × √4
= 11.33 ± √(3 × 3.24) × 2
= [5.095; 17.57]
The F -value is from the usual F table (Table A.6). a is the total number of treatment means
involved in the contrasts, and ν is the error degrees of freedom (from the ANOVA table).
The above confidence interval tells us that mean weld strength with technique B is estimated
to be between 5.1 and 17.6 units stronger than the average strength of the other 3 techniques.
From this confidence interval we can learn how big the difference is, and how much
uncertainty there is about this estimate. The whole range of the confidence interval is positive, which indicates that B results in stronger welds than the other 3 techniques on average. This confidence interval gives us the same conclusion as the pairwise comparisons above; it is just answering a slightly different question. In other words, the choice of contrast should depend predominantly on the question you want to answer.
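The same interval in R:

mse <- 15; df <- 16; a <- 4; n <- 5
L    <- 83 - (69 + 71 + 75) / 3                 # contrast estimate, 11.33
se   <- sqrt(mse / n * (1 + 3 * (1 / 3)^2))     # = 2, as derived above
crit <- sqrt((a - 1) * qf(0.95, a - 1, df))     # Scheffe critical value

L + c(-1, 1) * crit * se                        # [5.10, 17.57]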
4.7 Multiple Comparison Procedures: The Practical Solution
The three methods discussed above all control the significance level, α, and protect against overall Type I error. Controlling the significance level causes some reduction in power and increases the Type II error (accepting at least one false hypothesis). Very little work has been done on multiple comparison methods that protect against Type II errors. A practical way to increase power (i.e. lower the probability of making a Type II error) is to raise the significance level, e.g. use α = 10%.
The paper by Saville (Saville, DJ. 1990. Multiple Comparison Procedures: The Practical
Solution. The American Statistician 44: 174–180. http://www.jstor.org/stable/2684163),
provides a good overview of the issues in multiple and unplanned comparisons, and gives
practical advice on how to proceed. We recommend Saville’s approach for this course and also
for your future work as a statistician.
The problem of multiple testing does not only occur when comparing means from
experimental data. It also occurs in stepwise procedures when fitting regression models, or
whenever comparing a record to many others (e.g. DNA records in criminology), medical
screening for a disease, and many other areas. Behind all of this is uncertainty, randomness in
data. Humans want certainty, but this can never be achieved from a single experiment, or a
single data point or even a large single data set, because of the inherent nature of data; we can
approach certainty only by accumulating more and more evidence.
It would be much better to take a sceptical approach, and to never treat results from a single
experiment or data set as conclusive evidence. Always keep in the back of your mind how
many tests were done and that some of the small p-values will be spurious results that
happened because of some chance outcome in the particular data set. Discuss your results in
this light, including the above warning. Especially in the case where we look for patterns or
interesting differences (unplanned comparisons), i.e. outcomes we didn’t expect, we should
remain sceptical until further experiments confirm the same effect. In other words, interesting
results found in an exploratory study, can at most generate hypotheses that need to be
corroborated by replicating results.
On the other hand, when we have an a-priori hypothesis that we can test with our data, we
take a much greater step in the process towards knowing if an effect is real or not: therefore
the importance of a-priori hypotheses (planned before having seen the data).
4.8 Summary
You now know how to answer specific questions about how treatments compare, and how to interpret these results considering the problem of multiple comparisons and the plausibility of the null hypotheses you are testing. This is practically the most important part of the experiment: HOW do the treatments differ, and what can you actually learn and say about
the treatments (considering all the uncertainty involved in data and statistical tests and
estimates).
The methods in this chapter do not only apply to completely randomized designs, but to all
designs. The aim of most experiments is to compare treatments.
One situation where you would not compare treatments in the way we have done above, is
where the levels of the treatments are levels of a continuous treatment factor, e.g. you have
measured the response at temperature levels 20, 40, 60, 80 degree Celsius. It would not make
sense to do all pairwise comparisons here. Rather you would want to know how the response
changes as a (continuous) function of temperature. This is done using special contrasts called
orthogonal polynomials.
In experiments, the treatments are often levels of a quantitative variable, e.g. temperature, or
amount of fertilizer. In such a case one might be interested in how the response changes with
increasing X, similarly as we do in linear regression. Suppose the levels are equally spaced,
such as 10◦ , 20◦ , 30◦ in the case of temperatures, and that there is an equal number of
replications for each treatment. The mean response Y may be plotted against X, and we may
wish to test if there is a linear or quadratic or cubic relationship between them. These
relationships can be described by polynomials (linear, quadratic, third order, etc.). For the
analysis of experiments this is often done using orthogonal polynomials.
In regression you will have come across using polynomial terms to account for non-linear
relationships. Orthogonal polynomials have the same purpose as polynomial regression terms.
They have the advantage of being, well, orthogonal, which means that the terms are
independent. This avoids problems of collinearity, and allows us to identify exactly which
component(s) of the polynomial are important in describing the relationship.
The hi coefficients to construct these orthogonal polynomials can be found in Table 4.1. They
are used to define linear, quadratic, cubic polynomial contrasts in the treatment means. We
can test for the presence of each of these contrasts using an F-test, similarly as for orthogonal
contrasts. In effect they create orthogonal contrasts which test for the presence of a linear
component, a quadratic component, etc..
If the treatment levels are not equally spaced, or the number of observations differs between treatment levels, this method cannot be used. Instead one could use a regression model to fit the polynomial.
The main objective in using orthogonal polynomials is to find the lowest possible order
polynomial which adequately describes the relationship between the treatment factor and the
response.
Example: If we have 4 treatments we can construct 3 orthogonal polynomials, and can thus
test for the presence of a linear, quadratic and cubic effect. We would construct 3 orthogonal
contrasts as follows:
Table 4.1: Orthogonal Polynomial Coefficients

                        Ordered Treatment Number
No. of Levels  Order    1    2    3    4    5    6    7    8   Divisor     λ
3                1     -1    0   +1                                2       1
                 2     +1   -2   +1                                6       3
4                1     -3   -1   +1   +3                          20       2
                 2     +1   -1   -1   +1                           4       1
                 3     -1   +3   -3   +1                          20    10/3
5                1     -2   -1    0   +1   +2                     10       1
                 2     +2   -1   -2   -1   +2                     14       1
                 3     -1   +2    0   -2   +1                     10     5/6
                 4     +1   -4   +6   -4   +1                     70   35/12
6                1     -5   -3   -1   +1   +3   +5                70       2
                 2     +5   -1   -4   -4   -1   +5                84     3/2
                 3     -5   +7   +4   -4   -7   +5               180     5/3
                 4     +1   -3   +2   +2   -3   +1                28    7/12
7                1     -3   -2   -1    0   +1   +2   +3           28       1
                 2     +5    0   -3   -4   -3    0   +5           84       1
                 3     -1   +1   +1    0   -1   -1   +1            6     1/6
                 4     +3   -7   +1   +6   +1   -7   +3          154    7/12
8                1     -7   -5   -3   -1   +1   +3   +5   +7     168       2
                 2     +7   +1   -3   -5   -5   -3   +1   +7     168       1
                 3     -7   +5   +7   +3   -3   -7   -5   +7     264     2/3
                 4     +7  -13   -3   +9   +9   -3  -13   +7     616    7/12
L1 = −3Ȳ1. − 1Ȳ2. + 1Ȳ3. + 3Ȳ4.
L2 = +1Ȳ1. − 1Ȳ2. − 1Ȳ3. + 1Ȳ4.
L3 = −1Ȳ1. + 3Ȳ2. − 3Ȳ3. + 1Ȳ4.
L1 is used to test for a linear relationship, L2 for a quadratic effect, etc. (Order 1 = linear;
order 2 = quadratic; order 3 = cubic; order 4 = quartic).
λ = factor used to convert coded coefficients to regression coefficients. We will not be using
this, but rather fit a regression model once we have determined what order polynomial to use.
Consider the second differences: ∆2 µi = (µi − µi−1 ) − (µi−1 − µi−2 ) = µi − 2µi−1 + µi−2 .
If the points (1, µ1 ), (2, µ2 ), . . . lie on a straight line ∆2 µi = 0, if the points lie on a
quadratic curve ∆2 µi 6= 0.
If the means are equal all contrasts will equal 0. If the means increase as a linear function,
all contrasts are zero except the linear contrast. If the means are on a quadratic curve all
contrasts except the quadratic contrast are zero.
Testing m orthogonal polynomial terms is equivalent to adding the first m polynomial basis
functions (Figure 4.1) to a regression model, treating these as if they were values of m
explanatory variables, and estimating the regression coefficients.
Suppose we want to test for the presence of a linear effect. Then H0 : Llinear = 0, i.e. no linear
effect is present. A large L̂linear is evidence for presence of a linear effect, likewise, a small
p-value is evidence for a linear effect.
For this we investigate ‘lack of fit’ after fitting each term sequentially.

1. Compute SSlin , the sum of squares for the linear contrast, with 1 df.

2. Compute SSLOF = SSA − SSlin with (a − 1) − 1 df. LOF = lack of fit. SSLOF measures the unexplained variation (lack of fit) after having fitted the linear term.
[Figure 4.1: The orthogonal polynomial basis functions, plotted as relative weight against ordered treatment level.]
3. Test for lack of fit with F = MSLOF /MSE . If there is evidence of lack of fit, add the next higher-order term.

4. Compute SSquad , and continue in the same way.
Example
The following data are from an experiment to test the tensile strength of a cotton fibre used to
manufacture men’s shirts. Five different qualities of fibre are available with percentage cotton
contents of 15%, 20%, 25%, 30% and 35% and five measurements were obtained from each
type of fibre.
cotton % 15 20 25 30 35
mean 9.8 15.4 17.6 21.6 10.8
--------------------------------------------------
Df Sum Sq Mean Sq F value Pr(>F)
cotton 4 476 118.9 14.8 9.1e-06 ***
Residuals 20 161 8.1
--------------------------------------------------
Percentage cotton has a significant effect on strength. We now partition the treatment sum of
squares into linear, quadratic and cubic effects using the coefficients table:
          % Cotton:   15    20    25    30    35   Divisor     Li     SSi       F        p
mean                 9.8  15.4  17.6  21.6  10.8
.L                    -2    -1     0     1     2      10       8.2    33.6     4.15    0.055
.Q                     2    -1    -2    -1     2      14     -31     343      42.35  < 0.001
.C                    -1     2     0    -2     1      10     -11.4    65       8.02    0.010
.4                     1    -4     6    -4     1      70     -21.8    33.9     4.19    0.054
sum                                                                  476
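The partition in this table can be reproduced in R from the treatment means alone; a sketch:

means <- c(9.8, 15.4, 17.6, 21.6, 10.8)   # cotton 15, 20, 25, 30, 35 %
n <- 5; mse <- 8.1                        # replicates; MSE with 20 df

h <- rbind(linear    = c(-2, -1,  0,  1, 2),
           quadratic = c( 2, -1, -2, -1, 2),
           cubic     = c(-1,  2,  0, -2, 1),
           quartic   = c( 1, -4,  6, -4, 1))

Li  <- h %*% means                  # contrast estimates
SSi <- n * Li^2 / rowSums(h^2)      # one-df sums of squares
Fi  <- SSi / mse
cbind(Li, SSi, Fi, p = pf(Fi, 1, 20, lower.tail = FALSE))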
The null hypothesis for the .L line (linear effect) is H0 : there is no linear component in the
relationship between % cotton and strength.
2. To test whether we need any higher order terms other than a linear term, we do a lack of
fit test:
we find

SSLOF = SScotton − SSlin = 476 − 33.6 = 442.4 with 4 − 1 = 3 df.

This is the unexplained sum of squares, all that the linear term cannot explain. The F-statistic is

F = MSLOF / MSe = (442.4/3) / 8.1 = 18.3 ∼ F3,20

with p < 0.001. There is strong evidence for lack of fit. So we add a quadratic term, and repeat.
Here is a table that summarises the lack-of-fit tests. The null hypothesis tested in each
line is that there is no lack of fit after having added the corresponding term (and all
preceding), i.e. all higher order contrasts equal 0.
-----------------------------------------
SS.lof df.lof F.lof p.lof
linear LOF 442.140 3 18.285 0.000
quadratic LOF 98.926 2 6.137 0.008
cubic LOF 33.946 1 4.212 0.053
quartic LOF 0.000 0 NaN NaN
-----------------------------------------
Figure 4.2: Tensile strength of cotton fibre with respect to percentage cotton. Dots are obser-
vations. The line is a fitted cubic polynomial curve.
3. We need a linear, quadratic and cubic term. We always keep all lower-order terms in the
model! The quartic effect might help to improve the relationship but for simplicity we
prefer not to construct too complicated polynomials.
The relationship between tensile strength and percentage cotton content is described by
a cubic polynomial (see fitted curve in Figure 4.2).
4.10 References
1. Abdi, H. and Williams, L. 2010. Contrast Analysis. In: Salkind N. (Ed.), Encyclopedia of
Research Design. Sage. (This is a very good overview of contrasts).
https://www.utd.edu/~herve/abdi-contrasts2010-pretty.pdf.
2. Miller, R.G (Jnr.) (1981). Simultaneous Statistical Inference. 2nd edition. Springer.
3. Dunn. O.J. and Clarke, V. (1987). Applied Statistics: Analysis of Variance and Regression.
Wiley.
4. O'Neill, R.O. and Wedderburn, G.B. (1971). The Present State of Multiple Comparisons. Journal of the Royal Statistical Society Series B, 33, 218–244.
5. Hochberg, Y. and Tamhane, A.C. (1987). Multiple Comparison Procedures. John Wiley
and Sons.
6. Petersen, R. (1986). Design and Analysis of Experiments. Marcel Dekker.
8. Ruxton, G. D. and G. Beauchamp (2008). Time for some a priori thinking about post hoc
testing. Behavioral Ecology 19, 690–693.
10. Scheffé, H. (1953). A method for judging all contrasts in the analysis of variance.
Biometrika 40, 87–104.
12. Saville, D.J. (1990) Multiple comparison procedures: the practical solution. The American
Statistician 44, 174–180.
Chapter 5
We have seen how to analyse data from a single factor completely randomised design, using a
one-way ANOVA. Completely randomized designs are used when the experimental units are
homogeneous or similar. In this chapter we will look more closely at designs which have used
blocking factors (one or two), but still a single treatment factor. Recall that blocking is done
to reduce experimental error variance. This is done by separating the experimental units into
blocks of similar (homogeneous) units. This makes it possible to account for the differences
between the blocks, thereby reducing the experimental (remaining) error variance. Any
differences in experimental units which are not blocked for or measured will end up in the
error variance.
5. The positions of the experimental units when they occur along a gradient in spatial
settings (e.g. from light to dark, or from lots of nutrients to few nutrients).
The experimental units within a block should be homogeneous so that ideally the only thing
that can affect the response will be the different treatments. The treatments are assigned at
random to the units within each block so that a given unit is equally likely to receive any of
the treatments. Randomization minimizes the effects of other factors that may influence the
result, but which may not have been blocked out. One does not usually test for block
differences - if the blocking was successful, the F statistic will be greater than 1.
Randomized block designs are used when the experimental units are not very homogeneous
(similar). Similar experimental units are grouped together in blocks.
Here is an example that could represent an agricultural experiment. Typically in agriculture
one needs to use fields that are not homogeneous. For example, one side may be higher up, have fewer nutrients, or be less water-logged. Agricultural experiments almost always use designs with blocks.
[Figure: sketch of the field, divided into plots. Each of the four rows contains the five treatments A–E in randomised order; the lighter half of the field differs from the darker half.]
The experimental units on the light blue side of the field are more similar and thus grouped
into a block, the experimental units (plots) on the dark blue side are grouped together.
Experimental units within blocks are similar (homogeneous), but differ between blocks.
Typical blocking factors include age, sex, material from the same batch, time (e.g. week day,
year), spatial gradients.
Ideally (easiest to analyse and interpret results), we would have each treatment once in every
block. This means that each block has the same number of experimental units. Often it is
worth choosing the experimental units in such a way that we can have such a complete
randomized block design.
Randomization is not complete but restricted to each block. This means we randomly assign
the a treatments to the a experimental units in block 1, then randomize in block 2, etc..
The main purpose of blocking is to reduce experimental error variance (unexplained variation).
This increases power and precision of estimates. If there is lots of variation between blocks,
this variation, that would otherwise end up in the experimental error variance, is now
absorbed into the block effects (variation due to blocks). Therefore, experimental error
variance can be reduced considerably if there are large differences between the blocks.
If there are only small differences between blocks, error variance will not decrease very much. Additionally we will lose error degrees of freedom, and may end up with less power. So it is important to consider carefully at the design stage whether blocks are necessary or not.
Model

Yij = µ + αi + βj + eij ,   eij ∼ N(0, σ²),   Σi αi = 0,   Σj βj = 0

βj is the effect of block j, i.e. the change in mean response with block j relative to the overall mean. Again, the (identifiability) constraints are required because the model is over-parametrized; they ensure unique estimates.
There are two subscripts: i refers to the treatment, j refers to the block (essentially the
replicate). The block effect is defined exactly like we defined the treatment effect before: the
difference between the block mean and the overall mean, i.e. the change in average/expected
response when the observation is from block j relative to the overall mean.
E(Yij ) = µ + αi + βj
i.e. the observation from treatment i and block j is made up of an overall mean, a treatment i
effect and a block j effect (and the effect of treatment i does not depend on what block we
have, i.e. the effects are additive = there is no interaction between the blocking and the
treatment factor).
If we take µ over to the LHS of the equation, the deviation of the observed value from the
overall mean equals the sum of 3 effects (or deviations from a mean): treatment effect, block
effect and experimental unit effect (or error term).
If we assume that the effect of treatment i is the same in every block, then the average (over
all blocks j) of the Yij − Ȳ.j deviations will give us an estimate of the effect of treatment i. If
we cannot make this assumption, i.e. the effect of treatment i depends on the block, or is
different in every block (there is an interaction between the blocking and the treatment
factor), then the treatment effect in block j is confounded with the error term of the
observation (because there is no replicate of treatment i in block j).
The above is the central, crucial idea to understanding how randomized block designs work,
how we get the estimates and why the no interaction assumption plays a role.
Here is another way to look at it. Assume we have just two blocks and 2 treatments.
Assume the effect of treatment 2 is the same in every block (here it slightly increases the response). Then we can estimate α2 by taking the average effect of treatment 2 (relative to the block means). It turns out that we could also get α2 from the average of the Y2j − Ȳ.. , because Σ βj = 0.

If you sum the values in the first column (treatment 1), you would be estimating (µ + α1 + β1 ) + . . . + (µ + α1 + βb ) = bµ + bα1 + 0 (because Σ βj = 0, see the identifiability constraint in the model). And the column mean would be estimating µ + α1 . Therefore, we could estimate αi by Ȳi. − Ȳ.. , as before. Try the same for row one.
In this we are assuming that αi is the same in every block: αi essentially gives us an average estimate of how the response changes with treatment i. With no replication of treatments within blocks, this is all we can do; we CANNOT estimate a separate effect of treatment i for every block.
Table 5.1: Sketch of data table for randomized block design.

                      treatment
block        1     2     3    ...    a     mean
1                                          Ȳ.1
2                                          Ȳ.2
...
b                                          Ȳ.b
mean        Ȳ1.   Ȳ2.         Ȳa.          Ȳ..
So, when we use a randomized block design, we need to make an important assumption,
namely that the effect of treatment i, αi , is the same in every block. Technically we refer to
this as there is no interaction between the treatment and the blocking factors, or the effects are
additive.
What happens if block and treatment factors DO interact? The model would still need to
make the assumption of no interaction, our estimates would still calculate average effects, but
this might not be very meaningful, or not a very useful description of what happens if we use
treatment i in block j. Also, the residuals and thus the experimental error variance might
become quite large, because the observations deviate quite a bit from the average values.
Additivity of block and treatment effects is therefore another thing we need to check, in
addition to the previous normal, equal variance residual checks. A good way to check for this
is through an interaction plot.
Yij = µ + αi + βj + eij
Again start with the model. Take µ to the left hand side. Then all terms on the RHS are
deviations from a mean: treatment means around overall mean, block means around overall
mean, observations around µ + αi + βj .
As we did for CRD, we can substitute observed values, square both sides and sum over all
observations to obtain
SStotal = SStreatments + SSblocks + SSE
with
ab − 1 = (a − 1) + (b − 1) + (a − 1)(b − 1)
degrees of freedom, respectively. The error degrees of freedom can just be calculated from the
rest: (ab − 1) − (a − 1) − (b − 1).
Note that SSblocks = Σi Σj (Ȳ·j − Ȳ·· )² = a Σj (Ȳ·j − Ȳ·· )², where Ȳ·j denotes the mean in
block j, and
SSE = Σi Σj (Yij − Ȳi· − Ȳ·j + Ȳ·· )²
When we have data from a completely randomized design we do not have a choice about
which terms need to be in the model.
But just to illustrate how the blocks reduce the SSE, compare the model with block effects
to a model we would use for a single-factor CRD (essentially ignoring that we actually had
blocks):
Yij = µ + αi + eij
The SStotal and SStreatment will be exactly the same in both models. If we add block effects
the SSE is reduced, but so are its degrees of freedom. MSE will only become smaller if the
reduction in SSE is large relative to the number of degrees of freedom lost.
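As a quick sketch in R (with generic names y, treatment and block assumed; this is not code from the notes), the two fits can be compared directly:

m.crd <- aov(y ~ treatment)          # ignoring the blocks
m.rbd <- aov(y ~ block + treatment)  # accounting for the blocks
summary(m.crd)   # same SStreatment; larger SSE with more df
summary(m.rbd)   # smaller SSE with fewer df; compare the two MSEs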
Usually we are not interested in formally testing for block effects. Actually the F-test is not
quite valid for block effects because we haven’t randomized the blocks to experimental units.
If we want to test for differences between blocks, we can use the F-test, but remember that we
cannot make causal inference about blocks. If we are only interested in whether blocks have
reduced the MSE, we can look at the F-value for blocks: blocking has reduced the MSE iff
F > 1 (iff = if and only if).
Example
Executives were exposed to one of 3 methods of quantifying the maximum risk premium they
would be willing to pay to avoid uncertainty in a business decision. The three methods are: 1)
U: utility method, 2) W: worry method, 3) C: comparison method. After using the assigned
method, the subjects were asked to state their degree of confidence in the method on a scale
from 0 (no confidence) to 20 (highest confidence).
You can see that the experimenters blocked for age of the executives. This would have been a
reasonable thing to do if they expected, for example, lower confidence in older executives, i.e.
different response due to inherent properties of the experimental units (which here are the
executives).
Table 5.2: Layout and randomization for premium risk experiment.
Experimental Unit
1 2 3
Block 1 (oldest executives) C W U
2 C U W
3 U W C
4 W U C
5 (youngest executives) W C U
We have a randomized block design with blocking factor age, treatment factor method of
quantifying risk premium, response = confidence in method. The executives in one block are
of a similar age. If the experiment was conducted correctly, the three methods were randomly
assigned to the three experimental units in each block.
Here is the ANOVA table. NOTE: one source of variation (SS) for every term in the model!
------------------------------------------------
> m1 <- aov(rate ~ block + method)
> summary(m1)
We are mainly interested in the treatment factor method. We can use the ANOVA table to
test H0 : α1 = α2 = α3 = 0, that method has no effect on confidence.
There is strong evidence that average confidence differs with different methods (p = 0.0001).
[Remember NOT to use the Neyman-Pearson approach, i.e. no significance levels, no 0.05, and
avoid saying ’reject the null hypothesis’ !!]
There is evidence for differences between the age groups. And blocking in this experiment was
effective as the F -value is much larger than 1, i.e. by blocking for age we have been able to
reduce experimental error variance, and thus to increase power to detect the treatment effects.
Is it reasonable to assume that block and treatment effects are additive? The interaction plot
can give some indication of how acceptable this assumption is. Usually we plot the treatments
on the x-axis and each block is represented by a line or trace. On the y-axis we show the mean
response for treatment i and block j. Because in RBDs there is mostly only a single
observation for treatment i and block j, the points shown here just represent the observations.
If block and treatment do not interact, i.e. method and age do not interact, the lines should
be roughly parallel: no interaction means that the effect of method i is the same in every
block, so when changing from method 1 to method 2, say, the change in mean response should
be roughly the same in every block. HOWEVER, remember that here the points are based on single
observations, and we still expect some variation between executives, as part of natural
variation between experimental units.
Figure 5.2: Interaction plot for risk premium data. In R: interaction.plot(method, block,
rate, cex.axis = 1.5, cex.lab = 1.5, lwd = 2, ylab = "confidence rating"), first
the factor that goes on the x-axis, then the trace factor, then the response.
Even though the lines are not parallel here, there is no indication that the effect of method is
very different in the different blocks. We also did not (and could not!) show that the residuals
ARE normal, we are only worried if they are drastically non-normal. Here we are only worried
if there are clear indications of interactions. There are not, and averaging the effects over
blocks would give us a reasonable indication of what happens when the different methods of
risk assessment are used.
Moving beyond the ANOVA, we might now want to compare the treatment means directly to
find out which method results in highest confidence, and to find out HOW BIG the differences
are.
We can do this exactly as we did for CRDs. For example if we compare two treatment means
we need the standard error of a treatment mean, and the standard error of a difference
between treatment means.
SE(Ȳi· ) = √Var(Ȳi· ) = √(MSE/b) = √(2.98/5) = 0.77
This is a measure of the uncertainty of a specific mean (how close it is to the true treatment
mean, or how well it estimates the true treatment mean). The variance of repeated
observations from this particular treatment is estimated by MSE. Each treatment mean is
based on 5 observations (one from each block). The standard error for the difference between
two means:
SED = √(2 × MSE / b) = √(2 × 2.98 / 5) = 1.09
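In R these quantities could be computed from the fitted model, e.g. (a sketch, assuming the object m1 fitted above):

MSE <- deviance(m1) / df.residual(m1)   # estimate of sigma^2, here 2.98
b <- 5                                  # number of blocks = replicates per treatment mean
sqrt(MSE / b)                           # SE of a treatment mean, 0.77
sqrt(2 * MSE / b)                       # SE of a difference between two means, 1.09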
5.1 The Randomised Block Design
In a randomised block design we have one blocking factor. In each block each treatment will
appear once and only once, but see the note below. In the randomised block design we assume
that the treatments and blocks do not interact - that is to say the effect of the treatment
does not depend on which block it is in, i.e. the effect of treatment i is the same in every
block. This is also referred to as additive block and treatment effects.
Note: If the number of treatments is greater than the number of units in a block we must use
an incomplete block design. See the references for more information on incomplete block
designs. The analysis is not difficult provided the design is chosen carefully. On the other
hand, if we have more experimental units in a block than treatments, we can replicate some of
the treatments in each block. This will allow us to estimate block-treatment interactions.
Q: Describe in your own words how randomisation for a randomised block design should be
done. What happens if we randomly assign treatments to blocks, e.g. treatment 1 to block 3,
treatment 2 to block 1, etc.?
Suppose we wish to compare a treatments and have N experimental units arranged in b blocks
each containing a homogeneous experimental units: N = ab. The a treatments, A1 , A2 , . . . Aa
say are assigned to the units in the j th block at random.
Note that the design (blocking and treatment factors and the randomisation) as good as
dictates the model to be used:
Let Yij be the response to the ith treatment in the j th block. The linear model for the RBD is:
Yij = µ + αi + βj + eij
i = 1 . . . a and
j = 1...b
where
Σi αi = Σj βj = 0   (i = 1, . . . , a; j = 1, . . . , b)
µ overall mean
αi effect of the ith treatment
βj effect of the j th block
eij random error of observation
eij ∼ N (0, σ 2 ) and are independent
This model says that the response depends on a treatment effect, a block effect and the overall
mean. It also says that these effects are additive. In other words we now have a × b
distributions/populations corresponding to the a treatments in each of the b blocks. The
means of these a × b populations are given by:
µ + αi + βj , i.e. the property of additivity (no interaction)
What do we mean by additive effects? It means that we are assuming that the effect of the ith
treatment on the response is the same (αi ) regardless of the block in which the treatment is
used. Similarly, the effect of the jth block is the same (βj ) regardless of the treatment.
If the additivity assumption is not valid, the effect of treatment i will differ depending on
block. The response can then not be described as in the model above but we need another
term, the interaction effect, which describes the difference in effect of treatment i in block j,
compared to the additive model. To be able to estimate these interaction effects we need at
least 2 replications of each treatment in each block. In general, for randomised block designs,
we make the assumption of additivity, but then need to check this.
Estimation of µ, αi (i = 1, 2, . . . a) and βj (j = 1, 2 . . . b)
When assuming a normally distributed error term, the maximum likelihood and least squares
estimates of the parameters are the same and are found by minimizing
S = Σi Σj (Yij − µ − αi − βj )²

∂S/∂µ = −2 Σij (Yij − µ − αi − βj ) = 0
∂S/∂αi = −2 Σj (Yij − µ − αi − βj ) = 0   i = 1, . . . , a
∂S/∂βj = −2 Σi (Yij − µ − αi − βj ) = 0   j = 1, . . . , b
Note the limits of the summation. Using the constraints we find the a + b + 1 normal equations
abµ = Y··
bαi + bµ = Yi·
aβj + aµ = Y·j
The unbiased estimate of σ 2 is found by substituting these estimates into SSE to give
SSresidual = Σij (Yij − µ̂ − α̂i − β̂j )²
and
σ̂² = s² = SSresidual / ((a − 1)(b − 1))
The model is
Yij = µ + αi + βj + eij
so
Yij − µ = αi + βj + eij
Yij − Ȳ·· = (Ȳi· − Ȳ·· ) + (Ȳ·j − Ȳ·· ) + (Yij − Ȳi· − Ȳ·j + Ȳ·· )
Squaring and summing over all observations gives
Σij (Yij − Ȳ·· )² = b Σi (Ȳi· − Ȳ·· )² + a Σj (Ȳ·j − Ȳ·· )² + Σij (Yij − Ȳi· − Ȳ·j + Ȳ·· )²
since the cross products vanish when summed. This can be written symbolically as
SStotal = SStreat + SSblocks + SSE
with degrees of freedom
(ab − 1) = (a − 1) + (b − 1) + (a − 1)(b − 1)
Thus the total sums of squares can be split into three sums of squares for treatments, blocks
and error respectively. Using the theory of quadratic forms, the sums of squares are
independent and each has a χ2 distribution (Cochran’s Theorem).
E(Ȳi· − Ȳ·· ) = αi
Then
E(MStreat) = E[ SStreat / (a − 1) ]
           = E[ (b/(a − 1)) Σi (Ȳi· − Ȳ·· )² ]
           = σ² + (b/(a − 1)) Σi αi²
as for the CRD, except that now blocks are the replicates.
Also
E(MSblocks) = σ² + (a/(b − 1)) Σj βj²
and
E(MSE) = σ²
So, under H0 : α1 = α2 = · · · = αa = 0,
F = MSA / MSE ∼ F(a−1),(a−1)(b−1)
and we reject H0 if F > Fα; (a−1),(a−1)(b−1) (the upper α critical value).
If H0 is false, MSA/MSE has a non-central F distribution with non-centrality parameter
λ = b Σi αi² / σ²
and (a − 1) and (a − 1)(b − 1) degrees of freedom. This distribution can be used to find the
power of the F-test and to determine the number of blocks needed to guarantee a specific
power (see Chapter 6.)
Table 5.3: Analysis of Variance Table for the Randomised Block Design with model Yij =
µ + αi + βj + eij
Source        SS                                     df              MS                          F          EMS
Treatments A  SSA = b Σi (Ȳi· − Ȳ·· )²               a − 1           MSA = SSA/(a − 1)           MSA/MSE    σ² + b Σ αi² /(a − 1)
Blocks B      SSB = a Σj (Ȳ·j − Ȳ·· )²               b − 1           MSB = SSB/(b − 1)           MSB/MSE    σ² + a Σ βj² /(b − 1)
Error         SSE = Σij (Yij − Ȳi· − Ȳ·j + Ȳ·· )²    (a − 1)(b − 1)  MSE = SSE/((a − 1)(b − 1))             σ²
Total         SStotal = Σij (Yij − Ȳ·· )²            ab − 1
As for the CRD, treatment comparisons can now be made, e.g. using orthogonal contrasts or
procedures for unplanned comparisons.
Computing Formulae
C = (Σij Yij )² / (ab) = ab Ȳ··²
SS“tot” = Σij Yij² − C
SSA = b Σi Ȳi·² − C
SSB = a Σj Ȳ·j² − C
SSe = SS“tot” − SSA − SSB
Estimates
µ̂ = Ȳ··
α̂i = Ȳi· − Ȳ··
β̂j = Ȳ·j − Ȳ··
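In R, these estimates can be read off a fitted aov object; for example, for the risk-premium model m1 above:

model.tables(m1, type = "effects")   # the estimated treatment and block effects
model.tables(m1, type = "means")     # the corresponding means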
Example
Current recommendations for nitrogen fertilisation were developed through the use of periodic
stem tissue analysis for nitrate content of the plant. This was thought to be an effective
means to monitor nitrogen content of the crop and a basis for predicting optimum production.
However, stem nitrate tests were found to over-predict nitrogen amounts. Consequently the
researcher wanted to evaluate the effect of several different fertilization timing schedules on
the stem tissue nitrate amounts and wheat production to refine the recommendation
procedure (Source: Kuehl 2000). The data and R code can be found in nitrogen2.txt and
nitrogen2.R, respectively.
The treatment structure included six different nitrogen application timing and rate schedules
that were thought to provide the range of conditions necessary to evaluate the process. For
comparison, a control treatment of no nitrogen was included as was the current standard
recommendation.
The experiment was conducted in an irrigated field with a water gradient along one direction
of the experimental plot area as a result of irrigation. Since plant responses are affected by
variability in the amount of available moisture, the field plots were grouped into blocks of six
Table 5.4: Observed nitrate content (ppm ×102 ) from samples of wheat stems from each plot.
First row in each block indicates treatment number.
Block 1   treatment     2       5       4       1       6       3
          nitrate     40.89   37.99   37.18   34.98   34.89   42.07
Block 2   treatment     1       3       4       6       5       2
          nitrate     41.22   49.42   45.85   50.15   41.99   46.69
Block 3   treatment     6       3       5       1       2       4
          nitrate     44.57   52.68   37.61   36.94   46.65   40.23
Block 4   treatment     2       4       6       5       3       1
          nitrate     41.90   39.20   43.29   40.45   42.91   39.97
(The irrigation gradient runs downward from Block 1 to Block 4.)
plots such that each block occurred in the same part of the water gradient. Thus, any
differences in plant responses caused by the water gradient could be associated with the
blocks. The resulting experimental design was a randomized (complete) block design with four
blocks of six field plots to which the nitrogen treatments were randomly allocated.
The layout of the experimental plots in the field is shown in Table 5.4. The observed nitrate
content (ppm ×102 ) from a sample of wheat stems is shown for each plot along with the
treatment numbers, which appear in the small box of each plot.
Yij = µ + αi + βj + eij
where µ is the overall mean, αi is the nitrogen treatment effect, βj is the block effect, and eij
is the experimental error assumed ∼ N (0, σ 2 ). Treatment and block effects are assumed to be
additive.
ANOVA Table:
----------------------------------------------
Df Sum Sq Mean Sq F value Pr(>F)
TREATMNT 5 201.32 40.263 5.5917 0.004191
BLOCK 3 197.00 65.668 9.1198 0.001116
Residuals 15 108.01 7.201
----------------------------------------------
The blocked design will markedly improve the precision on the estimates of the treatment
means if the reduction in SSE with blocking is substantial.
The F statistic to test the null hypothesis of no differences among the treatment means is
F = 5.59. The p-value is 0.004, suggesting differences between the nitrogen treatments with
respect to stem nitrate. There is usually little interest in a formal inference about block
effects, although we might be interested in whether blocking increased the efficiency of the
design, which it did if F > 1.
Treatment 4 was the standard fertilizer recommendation for wheat. We could now compare
each of the treatments to treatment 4 to see if any differ from the current recommended
treatment. The control gives a means of evaluating the nitrogen available without fertilization.
Easy analysis of the Randomised Block Design depends on having an observation in each cell
of the two-way table, i.e. each treatment appears once in each block. We call this a balanced
design. Balanced designs ensure that block and treatment effects can be estimated
independently. This greatly simplifies interpretation of results. More generally, data or designs
are balanced when we have the same number of observations for all factor level combinations.
Missing observations result in unbalanced data.
What happens if some of the observations in our RBD experiment are missing? This could
happen if an experimental unit runs away or explodes, or dies or becomes sick during the
experiment and can no longer participate.
Then we no longer have a balanced design. Refer back to the layout of the RBD (Table 5.1).
If we have no missing observations, and we compare treatment 1 to treatment 2, on average,
they don’t differ with respect to anything except for the treatment (exactly the same block
contributions are made in each treatment, and there are no interactions, which means that the
block effect is the same for each treatment). Now, what happens if one observation is missing?
The problem is the same we have in regression, where coefficients are interpreted conditional
on the values of all other variables in the model. There the variables are all more or less
correlated. In such a case it is not entirely possible to extract the effect of a single predictor
variable, the coefficient or effect depends on what other terms are in the model. The same
happens when the data from an experiment become unbalanced. For example, the treatment i
effect can no longer be estimated by Ȳi· − Ȳ·· ; this would now give a biased estimate of αi .
There are two strategies to deal with unbalanced data. The first is to estimate the value,
substitute it back in, but reduce the error degrees of freedom accordingly. The advantage is
that the data become balanced, and the results are as easy to interpret as before: we are
exactly able to attribute variation caused by differences between treatments and variation
caused by differences between blocks. The second strategy is to fit a regression model. The
least squares estimates from this model will still give you the best possible estimate of the
treatment effects, provided you have accounted for blocks, i.e. the blocking factor must be in
the model.
Sums of squares can’t be split exactly any more, but we would base our F-test for treatments
on the change in variation explained relative to a full model except for the treatment term, i.e.
change in variation explained when the treatment factor is added last.
You don’t need to remember the formulae for estimating the missing values. You would get
the same value when fitting a regression model and from that obtain the estimate for the
missing value; you should know how to go about this in practice (both strategies).
1. In the case of only one or two observations missing, one could estimate the value of the
missing treatment, based on the other observations. The error degrees of freedom are
reduced accordingly, by the number of estimated observations.
Suppose observation Yij is missing. Let u be our estimate of the missing observation.
The least squares estimate of the observation Yij would be
u = µ̂ + α̂i + β̂j
µ̂ = (G′ + u)/N
α̂i = (T′ + u)/b − (G′ + u)/N
β̂j = (B′ + u)/a − (G′ + u)/N
where G′ is the grand total, T′ the total for treatment i, and B′ the total for block j, all
computed from the N − 1 observed values. So
u = (G′ + u)/N + (T′ + u)/b − (G′ + u)/N + (B′ + u)/a − (G′ + u)/N
  = (T′ + u)/b + (B′ + u)/a − (G′ + u)/N
Hence
u = (aT′ + bB′ − G′) / ((a − 1)(b − 1))
The estimate of the missing value is a linear combination of the other observations. It
can be shown that it is the value u which minimizes the SSE when ordinary ANOVA is
carried out on the N data points (the (N − 1) actual observations and u).
Since the missing value is a linear combination of the other observations it follows that Ȳi· ,
the mean of the ith treatment, is correlated with the other means. If there is a missing
observation on the ith treatment it can be shown that the variance of the estimated
difference between treatment i and any other, i′, is
Var(Ȳi· − Ȳi′· ) = σ² [ 2/b + a/(b(b − 1)(a − 1)) ]
If there are 2 missing values we can repeat the procedure above and solve the
simultaneous equations
u1 = µ̂ + α̂i + β̂j
u2 = µ̂ + α̂i′ + β̂j′
One degree of freedom is subtracted from the error degrees of freedom for each missing
value estimated. Thus the degrees of freedom of s2 (MSE) are (a − 1)(b − 1) − k, where
k is the number of missing values. The degrees of freedom of the F tests are adjusted
accordingly.
2. Alternatively, one can estimate the parameters using a linear regression model. But
because treatments and blocks are no longer orthogonal (independent), the order in
which the terms enter the model will become important, and interpretation may become
more difficult.
The estimates obtained from fitting a regression model are ‘last one in’ estimates. This
means they estimate the change in response after all other variables in the model have
explained what they can, i.e. variation in residuals. So are the t-tests. If we want to
construct an ANOVA table in a similar way (last one in) we cannot use R’s aov function,
which calculates sequential SS. Sequential ANOVA tables test change in variance
explained when adding each term, given all previous terms in the model. The SSs and
F-tests will thus change and give different results depending on the order in which the
terms appear in the model.
The Anova function in R’s car package uses Type II sums of squares, i.e. calculating SSs
as last-one-in, as the regression t-tests do: each SS is calculated as change in SS
explained compared to SS explained given all other terms in the model, see nitrogen2.R
for an illustration.
What design was used here? Can you identify all blocking and treatment factors? What are
the experimental units? Is there a pseudo-replication problem in this experiment? Try and
answer these questions before you read on.
This is a special kind of randomized block design. Each subject is used for all treatments, but
because the treatments are randomly assigned to different time slots, and we only obtain one
observation per treatment application, this is not pseudo-replication. The only problem might
be that the subject learns over time (gets better with balancing over time). But we can deal
with this by randomizing the order of treatments within subjects. Here the subject is a block:
the 3 time slots within one subject are expected to be very similar. When we do model
checking, we need to check whether order (within subject) had an effect. If it did, we could
just add order as a term in the model to try and account for differences because of order (but
this might make the data unbalanced).

Table 5.5: Data for balancing times experiment. The response is balancing time (seconds).
Do the treatment conditions influence balancing time? There is not much evidence that they
do (p = 0.1720). And there seem to be relatively small differences between subjects.
[Figure: box plots of balancing time (seconds) by treatment; residual diagnostic plots
(Residuals vs Fitted, Normal Q−Q, Scale−Location, Residuals vs Factor Levels); and an
interaction plot of mean balancing time for the treatments speaking, humming and silent,
with one trace per subject (id 1−4).]
The box plots suggest that balancing times are generally longer with silence. The residual
plots here are not very informative (partly because we have only 12 observations). But the
interaction plot might help us identify what the problem is. Remember that in the model for
the RBD we assumed that there is no interaction between the subject and the treatment. The
interaction plot seems to indicate that some subjects do well with silent, but not all. There is
very large error variance, i.e. it is really not clear how repeatable these results are, and there
might be a subject-treatment interaction.
Now suppose the 5th observation went missing (speaking from subject 2).
The mean for subject 2 is high because it does not include a ‘speaking’ observation. The mean
for ‘speaking’ is low, possibly because it does not include a ’subject 2’ observation. So the
effects can now no longer be estimated as Ȳi. − Ȳ.. , but depend on the other estimates (the
estimate for speaking depends on the subject 2 estimate). This is a lot more complicated than
before.
Table 5.6: Data for balancing times experiment with one missing observation. The response is
balancing time (seconds).
Let’s see what happens to the ANOVA table when we want to test whether treatments affect
balancing time. In the 2 ANOVA tables below, note that the order in which the terms are
added to the model differs. Compare the p-values.
> bal.time[5]
[1] 5.5
>
> m4 <- aov(bal.time[-5] ~ treat[-5] + id[-5])
> summary(m4)
Df Sum Sq Mean Sq F value Pr(>F)
treat[-5] 2 100.802 50.401 1.9171 0.241
id[-5] 3 26.716 8.905 0.3387 0.799
Residuals 5 131.451 26.290
>
> m5 <- aov(bal.time[-5] ~ id[-5] + treat[-5])
> summary(m5)
Df Sum Sq Mean Sq F value Pr(>F)
id[-5] 3 41.117 13.706 0.5213 0.6861
treat[-5] 2 86.401 43.200 1.6432 0.2828
Residuals 5 131.451 26.290
What is happening here? In this example the p-values are not terribly different, but in other
cases order could make a huge difference. R’s aov function constructs sequential F-tests,
testing change in variance explained, relative to what was in the model before. This is NOT
the right approach for unbalanced data.
R’s lm() summary gives last-one-in t-tests. This is a much better approach, given all else in
the model, how much does the specific term contribute to the variation explained.
The Anova function in R’s car package provides a last-one-in ANOVA. This is also called Type
II sums of squares.
> library(car)
> Anova(m4)
Anova Table (Type II tests)
Response: bal.time[-5]
Sum Sq Df F value Pr(>F)
treat[-5] 86.401 2 1.6432 0.2828
id[-5] 26.716 3 0.3387 0.7990
Residuals 131.451 5
This gives us a test for treatment effects after adjusting for block effects. There is still not
much evidence for treatment effects.
To obtain the least squares estimate for the missing observation, use the regression estimates
below. Can you show that the least squares estimate for the missing value is 12.77?
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.7611 3.6756 2.112 0.0885
id[-5]2 4.6889 4.8342 0.970 0.3766
id[-5]3 1.8333 4.1865 0.438 0.6797
id[-5]4 0.8333 4.1865 0.199 0.8501
treat[-5]silent 6.0000 3.6256 1.655 0.1589
treat[-5]speaking 0.3167 4.0536 0.078 0.9408
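One way to check this (a sketch, using the coefficient names exactly as printed above):

cf <- coef(m4)
cf["(Intercept)"] + cf["id[-5]2"] + cf["treat[-5]speaking"]
# 7.7611 + 4.6889 + 0.3167 = 12.77, the estimate for the missing 'speaking' value of subject 2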
Now we would substitute this back into the data, obtain a new ANOVA table (which is now
balanced), but manually we would need to decrease the error degrees of freedom by 1 (for the
estimated value), and conduct the F-test with the adjusted degrees of freedom (also by hand).
flip of a coin. The following table gives the data. It illustrates that A showed less wear than
B, since (B − A) > 0 in most cases.
The observed sequence of tosses leading to the above treatment allocations was (head implies
A worn on right foot):
T T H T H T T T H T
Under H0 , A and B are merely labels and could be swopped without making a difference.
Hence boy 1 could have worn A on the right and B on the left foot, resulting in a difference of
B − A = 13.2 − 14.0 = −0.8. Similarly, boy 6 could have worn A on the right and B on the left
foot, giving B − A = 6.6 − 6.4 = +0.2. This implies the actual values of wear, and hence the
values of the differences do not change, but the signs associated with these differences do.
The given sequence of coin tosses is one of 2¹⁰ = 1024 equally probable outcomes: there exist
2 orderings for each pair, the 10 pairs are independent, hence there are 2 × 2 × 2 × · · · × 2 = 2¹⁰
different orderings.
To test H0 , the actually observed difference of 0.41 may be compared with all possible 1024
average differences that could have occurred as a result of different outcomes of coin tosses. To
obtain these 1024 average differences we need to average the differences for all possible
combinations of + and − signs. This is hard work! So let’s think about how we can obtain
differences greater than the observed 0.41: Only when the positive differences stay the same
and one or both of the negative differences become positive (since the 2 negative differences
were associated with the smallest absolute values!). This implies 3 possible differences > 0.41.
Four further samples give values of d̄ = 0.41. This implies a p-value of 7/1024 = 0.007.
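The full randomisation distribution is easy to generate in R (a sketch, assuming d holds the 10 observed differences B − A, which are not shown here):

signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), 10)))  # all 2^10 sign patterns
dbar <- signs %*% abs(d) / 10    # the 1024 equally likely average differences
mean(dbar >= mean(d))            # one-sided p-value, 7/1024 = 0.007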
Questions
1. What are your conclusions about the shoe sole materials?
2. What parametric test could be used for the above example? What are its assumptions
about the data?
The Friedman test is a non-parametric test for differences between treatments
(k ≥ 3) in a randomized block design (several related samples). It is the non-parametric
equivalent of the F-test in a two-way ANOVA. This is an extension of matched pairs to more
than two samples. The test can be used on ranked data.
T = [12 / (bk(k + 1))] Σi (Ri − b(k + 1)/2)²
  = [12 / (bk(k + 1))] Σi Ri² − 3b(k + 1),   i = 1, . . . , k
where Ri is the sum of the ranks for treatment i over the b blocks. Under H0 , T is
approximately χ² distributed with k − 1 degrees of freedom.
If ties are present, the above statistic T needs to be adjusted. In that case one can use the
sums of squares of the ranks to calculate the following statistic:
T2 = MStreatment / MSE ∼ F(k−1),(b−1)(k−1)
Critical Values
Example
Six quality control laboratories are asked to analyze 5 chemicals to see if they are performing
analyses in the same manner. Determine whether any of the labs are different from any others
if the ranked data are as follows.
             Lab
Chemical     1    2    3    4    5    6
A            1    5    3    2    4    6
B            3    2    1    4    6    5
C            3    4    2    5    1    6
D            1    4    6    3    2    5
E            4    5    1    2    3    6
Ri          12   20   13   16   16   28
k = 6, b = 5
T = [12 / (5 × 6 × 7)] (12² + 20² + 13² + 16² + 16² + 28²) − 3 × 5 × 7 = 9.8
Using T ∼ χ²₅ we obtain a p-value = 0.08. There is little evidence to indicate that the labs are
performing the analyses differently.
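R’s friedman.test gives the same statistic when handed the table above as a matrix (blocks in rows, groups in columns):

ranks <- matrix(c(1, 5, 3, 2, 4, 6,
                  3, 2, 1, 4, 6, 5,
                  3, 4, 2, 5, 1, 6,
                  1, 4, 6, 3, 2, 5,
                  4, 5, 1, 2, 3, 6), nrow = 5, byrow = TRUE)
friedman.test(ranks)   # Friedman chi-squared = 9.8, df = 5, p-value = 0.08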
A Latin Square of order p is an arrangement of p letters each repeated p times into a square of
side p so that each letter appears exactly once in each row and once in each column (a bit like
Sudoku).
Example
A B        B A
B A        A B
A Latin Square can be changed into another one of the same order by interchanging rows and
columns. The Latin Square is a useful way of blocking for 2 factors each with p levels without
increasing the number of treatments. The rows of the square denote one blocking factor and
the columns the other. The entries are p treatments which are to be compared. We require
only p² experimental units. If we had made an observation on each combination of the 2 blocking
factors and p treatments we would have needed p³ experimental units.
Let Yijk = observation on the k th treatment in the ith row and j th column of the square.
Then a suitable model is
Yijk = µ + αi + βj + γk + eijk
with
Σi αi = Σj βj = Σk γk = 0
where
µ general mean
αi ith row effect
βj j th column effect
γk k th treatment effect
eijk random error of observation
eijk ∼ N (0, σ 2 ) and are independent
S = Σ{ijk}∈D (Yijk − µ − αi − βj − γk )²
∂S/∂µ = 0 gives Σ{ijk}∈D Yijk = p²µ, since there are p² observations, so
µ̂ = Ȳ··· = Σ{ijk} Yijk / p²
∂S/∂γk = −2 Σij (Yijk − µ − αi − βj − γk ) = 0
Using the constraints we find
pµ + pγk = Y··k
γ̂k = Ȳ··k − Ȳ···
Similarly, α̂i = Ȳi·· − Ȳ··· and β̂j = Ȳ·j· − Ȳ··· .
SSresidual is found by substituting µ̂ , α̂i , β̂j and γ̂k into the SSE to give
SSresidual = Σ{ijk} (Yijk − µ̂ − α̂i − β̂j − γ̂k )²
and
σ̂² = SSresidual / ((p − 1)(p − 2))
As in the other cases, we could derive a likelihood ratio test for H0 , by fitting two models -
one which contains the γ’s and one that does not.
We shall use the short method.
Consider
Yijk − Ȳ··· = (Ȳi·· − Ȳ··· ) + (Ȳ·j· − Ȳ··· ) + (Ȳ··k − Ȳ··· ) + (Yijk − Ȳi·· − Ȳ·j· − Ȳ··k + 2Ȳ··· )
Squaring and summing over the p² (ijk)’s present in the design (the cross products again vanish) gives
SS“tot” = SSrow + SScol + SStreat + SSe
with df
p² − 1 = (p − 1) + (p − 1) + (p − 1) + (p − 1)(p − 2)
As in previous examples, the SS’s are independent and when divided by σ 2 each have a χ2
distribution with appropriate degrees of freedom.
Source      SS                                                  df               MS        F
Rows        SSrow = p Σi (Ȳi·· − Ȳ··· )²                        p − 1
Columns     SScol = p Σj (Ȳ·j· − Ȳ··· )²                        p − 1
Treatments  SStreat = p Σk (Ȳ··k − Ȳ··· )²                      p − 1            MStreat   MStreat/MSe
Error       SSe = Σ(ijk) (Yijk − Ȳi·· − Ȳ·j· − Ȳ··k + 2Ȳ··· )²  (p − 1)(p − 2)   MSe
Total       SS“tot” = Σ(ijk) (Yijk − Ȳ··· )²                    p² − 1
Usually only treatments are tested but row and column differences can be tested in the usual
way. The power of the F test can be found using the non-central F with (p − 1) and
(p − 1)(p − 2) df and non-centrality parameter
λ = p Σk γk² / σ²
1. Can only be used when the number of treatments = number of rows = number of
columns.
The Latin Square can also be used in factorial experiments for experimenting with 3 factors
each with p levels using only (1/p)th of the complete design. For example a 5³ factorial
experiment has 125 treatments. If we are willing to assume there is no interaction we can
arrange the 3 factors in a Latin Square with rows for one factor, columns for another and
letters for the third. We then use only 25 treatments instead of 125, an 80% reduction.
Missing Values
The analysis of unbalanced data from Latin Square Designs follows along the same lines as for
Randomized Block Designs (see Section 5.3).
Suppose the Y value corresponding to the ith row, jth column and kth letter is missing. It can
be estimated by
u = [p(R′ + C′ + T′) − 2G′] / [(p − 1)(p − 2)]
where R′, C′ and T′ are the totals of the observed values in the row, column and letter
containing the missing observation, and G′ is the grand total of the observed values
(analogous to the randomised block case above).
Example
Another source of variation in the tyre experiment would be the position of tyres on the
wheels of the car. In the Randomised Block Design we tried to average this effect out by
assigning the tyres to the wheels at random on each car. The effect of the wheel position could
be tested by putting a tyre of each brand on each wheel position of the car. This would make
the experiment very expensive and also time consuming, since the car would have to travel
four times as far. A design that will allow us to block for both wheel position and car without
increasing the number of tyres (16) in the experiment is the Latin Square Design. Let the
numbers 1, 2, 3 and 4 represent the wheel positions. We arrange the tyres on the cars as in
the design below:
                         Car
Wheel Position      I    II   III   IV
      1             A    B    C     D
      2             B    C    D     A
      3             C    D    A     B
      4             D    A    B     C
We notice that Brand A appears once in every car (as in the Randomised Block Design) and
once in every wheel position. The same applies to Brands B, C and D. Thus we can find a sum
of squares for tyres, cars, wheel position and error. We can even incorporate randomisation
into the experiment by choosing our Latin Square at random from the totality of Latin
Squares and then assigning the cars and wheel positions at random to the column and row
numbers of the square. Suppose our observations are the same as in the Randomised Block
design - that is, tyre A on Car 1 had tread loss of 17 mm, tyre B on Car 1 had tread loss of
14 mm, etc. We obtain
                         Car
Wheel Position      I       II      III     IV
      1             A(17)   B(14)   C(10)   D(9)
      2             B(14)   C(12)   D(10)   A(13)
      3             C(12)   D(11)   A(13)   B(8)
      4             D(13)   A(14)   B(13)   C(9)
The wheel position averages are given by the row averages of the square and the car averages
by the column averages. A table of observations on the brands of tyres is obtained from the
entries in the cells of the Latin Square.
Treatments
A B C D
17 14 10 9
13 14 12 10
13 8 12 11
14 13 9 13
Totals 57 49 43 43
Means 14.25 12.25 10.75 10.75
Wheels
1 2 3 4
17 14 12 13
14 12 11 14
10 10 13 13
9 13 8 9
Totals 50 49 44 49
Means 12.50 12.25 11 12.25
Cars
I II III IV
17 14 10 9
14 12 10 13
12 11 13 8
13 14 13 9
Totals 56 51 46 39
Means 14 12.75 11.50 9.75
Grand Total Y··· = 192
Correction Factor C = N Ȳ···² = 16 × 12² = 2304

SStyres  = 4 Σk Ȳ··k² − C = 4(14.25² + 12.25² + 10.75² + 10.75²) − 2304 = 4 × 584.25 − 2304 = 33.0
SScars   = 4 Σj Ȳ·j·² − C = 4(14.0² + 12.75² + 11.50² + 9.75²) − 2304 = 4 × 585.875 − 2304 = 39.5
SSwheels = 4 Σi Ȳi··² − C = 4(12.50² + 12.25² + 11² + 12.25²) − 2304 = 4 × 577.375 − 2304 = 5.50
SStotal  = Σ Yijk² − C = (17² + 14² + · · · + 13² + 9²) − 2304 = 84.0
SSE = SS“tot” − SStreat − SScars − SSwheels = 84.00 − 33.00 − 39.50 − 5.50 = 6.00
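The same analysis can be reproduced in R (a sketch; the data are entered row by row from the square above):

wear  <- c(17, 14, 10, 9,  14, 12, 10, 13,  12, 11, 13, 8,  13, 14, 13, 9)
wheel <- factor(rep(1:4, each = 4))                          # rows of the square
car   <- factor(rep(c("I", "II", "III", "IV"), times = 4))   # columns of the square
brand <- factor(c("A","B","C","D", "B","C","D","A",
                  "C","D","A","B", "D","A","B","C"))         # letters
summary(aov(wear ~ wheel + car + brand))
# SS: wheel 5.5, car 39.5, brand 33.0, residuals 6.0, as computed above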
This section is just added out of interest. Blocking for three factors, e.g. cars, wheel positions
and drivers, can be achieved by using a Graeco-Latin Square. A Graeco-Latin Square is
formed by taking a Latin Square and superimposing a second square with the treatments in
Greek letters. For example
A B C D and α β γ δ gives Aα Bβ Cγ Dδ
B A D C γ δ α β Bγ Aδ Dα Cβ
C D A B δ γ β α Cδ Dγ Aβ Bα
D C B A β α δ γ Dβ Cα Bδ Aγ
If the two squares have the property that each Greek letter coincides with each Latin letter
exactly once then the squares are called orthogonal. Complete sets of (p − 1) mutually
orthogonal Graeco-Latin Squares exist whenever p is a prime or a power of a prime. No
orthogonal Graeco-Latin Square exists for p=6.
Chapter 6
Power and Sample Size in Experimental Design
6.1 Introduction
Although this is often referred to as sample size calculations, in experiments we are not really
talking about samples, but the number of replicates needed per treatment.
Questions:
1. How can an experiment fail if the sample size was too small?
2. What is statistical power, in your own words?
3. Which are the 3 key ingredients that will determine statistical power (in the
experimental setting)?
Basically, the smaller the differences (effects) are that you want to detect, the larger the
sample sizes will have to be!
To calculate power we need the distribution of F if H0 is false. Recall the Expected Mean
Squares for treatment. If H0 is false, the test statistic
F0 = MStreat / MSE
has a noncentral F distribution with a − 1 and N − a degrees of freedom (in the case of a
CRD) and noncentrality parameter
λ = r Σi (µi − µ̄)² / σ²,   i = 1, . . . , a
Rather than having to specify the size of all effects it will be easier to specify the smallest
difference between any two treatment means that would be physically meaningful. Suppose we
want to detect a significant difference if any two treatment means differ by D = µi − µj . With
a larger non-centrality parameter P r[F > c] increases, and power increases. So we want to
ensure the smallest λ in a given situation will lead to a rejection. This will ensure that the
power is at least as specified. The minimum λ when there is a difference of at least D will
then be
λ = rD² / (2σ²)
Example: In a study to compare the effects of 4 diets on the weight of mice of 20 days old,
the experimenter wishes to detect a difference of 10 grams. The experimenter estimates that
the standard deviation σ is no larger than 5 grams. How many replicates are necessary to
have a probability of 0.9 to detect a difference of 10 grams using the F test with significance
level α = 0.05?
A difference of 10 grams means that the maximum and the minimum treatment means differ
by 10 grams. The noncentrality parameter is smallest when the other two treatment means are
all in the middle, i.e. the four treatment means are a, a + 5, a + 5, a + 10 for some constant a.
Here is some R code for the above example, followed by the output (see power.R).
----------------------------------------------------------------------------
a <- 4
D <- 10
sigma <- 5
alpha <- 0.05
df1 <- a - 1
cat("r", "power", "ncp", "\n")
for (r in 2:9)
{ df2 <- a * (r - 1)
  ncp <- r * D^2 / (2 * sigma^2)
  fcrit <- qf(alpha, df1, df2, lower.tail = FALSE)
  # this is the critical value of the
  # F-distribution under H0
  power <- 1 - pf(fcrit, df1, df2, ncp)
  cat(r, power, ncp, "\n")
}
r power ncp
2 0.1698028 4
3 0.3390584 6
4 0.503705 8
5 0.6442332 10
6 0.7545861 12
7 0.8361289 14
8 0.8935978 16
9 0.9325774 18
-----------------------------
ncp is the non-centrality parameter. r = 9 replicates will give us a power > 0.90. See
RforED.html for more details on how to calculate power using R.
Consider factor A with a levels, and a second factor B (a blocking or a treatment factor) with
b levels. The non-centrality parameter will be
λ = b Σi (µi − µ̄)² / σ²,   i = 1, . . . , a
and power for detecting differences between the levels of A can be calculated similarly to
above, except that the degrees of freedom in the F-tests will change.
These notes refer to power for the special case of the ANOVA F-test. In general we would
need to know the distribution under the alternative hypothesis in order to calculate power.
References
http://www.stat.purdue.edu/~zhanghao/STAT514/handout/chapter03/PowerSampleSize.pdf
Chapter 7
Factorial Experiments
7.1 Introduction
1. The yield of a crop might depend on the amount of nitrogen and the amount of
potassium in the fertilizer.
2. The response to the drug may depend on the age of the patient and the severity of his
illness.
3. The yield of a chemical compound may depend on the pressure and the temperature at
which the chemical reaction takes place.
Factorial experiments allow us to evaluate the effect of each factor on its own and to study
the effect of a number of them working together or interacting.
Three types of adhesive (glue) are being tested in an adhesive assembly of glass specimens. A
tensile test is performed to determine the bond strength of the glass to glass assembly. Three
different types of assembly (cross-lap, square-centre and round-centre) are tested. The
following table shows the bond strength on 45 specimens.
Assembly
Adhesive Cross-lap Square-Centre Round-Centre
047 16 17 13
14 23 19
19 20 14
18 16 17
19 14 21
00T 23 24 24
18 20 21
21 12 25
20 21 29
21 17 24
001 27 14 17
28 26 18
14 14 13
26 28 16
17 27 18
Factor                 No. of Levels   Level 1   Level 2   Level 3
Adhesive (Factor A)    3               047       00T       001
Assembly (Factor B)    3               Cross     Square    Round
Response (Y): Bond Strength
Model
Yijk = µ + αi + βj + (αβ)ij + eijk
with
Σi αi = Σj βj = Σi (αβ)ij = Σj (αβ)ij = 0
1. Factor
Any feature of the experiment that can be changed from trial to trial is called a factor.
Factors can be qualitative or quantitative.
Examples of qualitative factors are: colour, sex, social class, severity of disease,
residential area. Strictly speaking, sex and social class are not treatment factors.
However, one may still be interested in differences (between sexes, say). In this case
one analyses such data exactly as a factorial experiment. However, interpretation can
only be about association, not causality.
Examples of quantitative factors are: temperature, pressure, age, income.
2. Levels
The various values of the factor examined in the experiment are called its levels.
Suppose temperature (T) is a factor in the experiment. Then the levels of T might be
chosen as 0◦ C, 10◦ C, 20◦ C. If Colour is a factor C then the levels of C might be Red,
Green, Blue. Sometimes the levels of a quantitative factor are treated qualitatively, e.g. the
levels of temperature T are cold, warm and hot. The levels of a factor are denoted by
subscripts: T1 , T2 , T3 are the levels of factor T.
3. Treatment
A combination of a single level from each factor in the experiment is called a treatment.
Example:
Suppose we wish to determine the effects of temperature (T) and Pressure (P) on the yield
of Y of a chemical compound. If T has two levels 0◦ and 10◦ and P has three levels, Low,
Medium and High, the treatments would be:
0◦ and Low pressure 10◦ and Low pressure
0◦ and Medium pressure 10◦ and Medium pressure
0◦ and High pressure 10◦ and High pressure
There are 2 × 3 = 6 treatments in the experiment and a number of measurements on Y
would be made on each of the treatments.
4. Effect of a factor
The change in the response produced by a change in the level of the factor is called the
effect of the factor. There are two types of effects, main effects and interaction effects.
5. A Main effect is the average change in the response produced by changing the level of a
single factor. It is the average over all the levels of the other factors in the experiment.
Thus in the experiment above, the main effect of a temperature of 0◦ would be the average
change in yield of the compound averaged over the three pressures low, medium and high
relative to the overall mean. All the effects we have looked at so far were main effects.
6. Interaction: If the effect of a factor depends on the level of another factor that is present
then the two factors interact. For example, consider the amount of risk people are willing
to take and the two factors gender and situation. Women might be willing to take high
risks in one situation but very little in another, while for men this willingness might be
directly opposite. So the response depends not only on situation or only on gender but one
has to look at the particular combination of factors. Therefore, if interactions are present,
it will not be meaningful to interpret the main effects, it is not very informative to know
what women risk on average, one will have to look at the combination of gender and
situation to understand the willingness to take risks.
7. Interaction effect: The interaction effect is the change in response (compared to the
overall mean) over and above the main effects at a certain combination.
8. Fixed and Random effects: Effects can also be classified as Fixed or Random. Suppose
the experiment were repeated a number of times. If the levels of the factors are the same
each time the experiment is repeated then the effects are called fixed and the results only
apply to the levels used in the experiments. If the levels of a factor are chosen at random
from a population of levels each time the experiment is repeated, then the effects are called
random.
Example:
If we are only interested in temperatures of 0◦ and 10◦ the temperature would be a fixed
effect. If the experiment were repeated we would use 0◦ and 10◦ again. If we were interested
in the range of temperatures from 0◦ to 20◦ say, and each time we ran the experiment we
decided on two Temperatures at random, then temperature would be a random effect.
The arithmetic of the analysis of variances is exactly the same for both fixed and random
effects but the interpretations of the results, the expected mean squares and tests are
completely different. We shall deal mainly with fixed effects.
9. In a complete factorial experiment every combination of factor levels is studied and the
number of treatments is the product of the number of levels of each factor.
Example:
If we examine the effect of 3 factors A, B and C on response Y and A has two levels, B has
3 and C has 5 then we have 2 × 3 × 5 = 30 treatments. The design is called a 2 × 3 × 5
factorial design.
Very often, more than one factor affects the response. For example, in a chemical experiment,
temperature and pressure affect yield. In an agricultural experiment, nitrogen and phosphate
in the soil affect yield. In a sports health experiment, improvement in fitness may not only
depend on the physical training program, but also on the type of motivation offered.
Effect of selling price (R55, R60, R65) and type of promotional campaign (radio, newspaper,
website pop-ups) on the number of products sold (new type of cell-phone contract). There are
two treatment factors (price and type of promotion). If we are going to use a factorial
treatment structure, there are 3 × 3 = 9 treatments. The experimental units could be different
towns.
If we only experiment with one factor at a time, we need to keep all other factors constant.
But in this way, we can never find out whether factor A would have influenced the response
differently at another level of B, i.e. we cannot look at interactions (if we did two separate
experiments to repeat all levels of A at another level of B, B and time would be confounded).
Figure 7.1: Illustration of different ways of investigating two treatment factors. The first two
figures are two one-factor-at-a-time experiments. The figure on the RHS illustrates a factorial
experiment.
Interactions
Factors A and B are said to interact if the effects of factor A depend on the level of factor B
(or the other way around).
Figure 7.2: Two different scenarios of response in a factorial experiment. On the LHS factors
A and B do not interact. On the RHS, factors A and B do interact. Y is the response. A is a
factor with 2 levels (a1 and a2), B is a factor with levels b1 and b2. The points indicate the
mean response at a certain treatment, or factor level combination.
Consider Figure 7.2. This is a factorial experiment (although it could also be illustrating
results in a RBD, where B, with levels b1 and b2, would denote a blocking factor instead of a
second treatment factor). Remember, what we mean by effect: the change in mean response
relative to some baseline level. In ANOVA models, effect usually refers to change in mean
response relative to overall mean. But, to understand interactions, I am here going to use
effect as the change in mean response relative to the baseline level (first level of the factor).
In the left-hand plot, when changing from a1 to a2 at level b1, the mean response increases by
a certain amount. When changing from a1 to a2 at level b2, that change in mean response is
exactly the same. In other words the effect of A (when changing from a1 to a2) on the mean
response does not depend on the level of B, i.e. A and B do not interact.
In the right-hand plot, when changing from a1 to a2 at level b1, the mean response increases,
when changing from a1 to a2 at level b2, the mean response decreases: the effect of A depends
on the level of B, i.e. A and B interact.
Note that interaction (between A and B) and independence of A and B are two completely
different things/concepts! We have designed the experiment so that A and B are independent!
Yet they can interact. The interaction refers to what happens to the response at a certain
combination of factor levels, not whether A and B are correlated or not.
Interactions are interesting because they will indicate particularly good or bad combinations
and how one factor affects the response in the presence of another factor; very often the
response to one factor depends on what else you manipulate or keep constant.
If interactions are suspected to be present, factorial experiments are much more efficient than
one-factor-at-a-time experiments. Consider again the campaign example above. If I
investigated first the one factor, then in a second experiment only the second factor, I would
need at least 12 experimental units (2 × 3 + 2 × 3, 2 replicates per treatment), and probably
twice the amount of time. On the other hand, I could run a factorial experiment all at once,
with a minimum of 9 experimental units which would allow me to estimate all main effects. I
would need a minimum of 18 experimental units (two replicates for each treatment) to also
estimate interaction effects.
Figure 7.3: Three sketches of a 3 × 3 factorial experiment (price × type of campaign) with a
single observation (x) per treatment; the levels of each factor are nevertheless replicated
(hidden replication).
In a factorial experiment one can estimate main effects even if treatments are not replicated.
See Figure 7.3. On the LHS, I can estimate the average response (number of items sold) with
a web campaign. This will give me the main effect of web campaign when compared to the
overall mean, i.e. on average, what happens with web campaign. In other words, the main
effects measure the average change in response with the particular level, averaged over all
levels of the other factors, averaged over all levels of price in this example.
Similarly, I can estimate the main effect of price R55, by taking the average response with R55
over all levels of campaign type. This is sometimes called hidden replication: even though the
treatments are not replicated, the levels of each factor are.
Can I estimate the interaction effects when the treatments are not replicated? In an a × b
factorial experiment, there are a × b interaction effects, one for every treatment. The
interaction effect measures how different the mean response is relative to the sum of the main
effects (µ + αi + βj ).
Consider the RHS plot in Figure 7.3, and the typical model for a factorial experiment with 2
treatment factors:
Yijk = µ + αi + βj + (αβ)ij + eijk
In order to estimate the interaction effect (αβ)ij , we need to compare the mean response at
this treatment to µ + αi + βj (the sum of the main effects). But there is only one observation
here, and we need this observation to estimate eijk , i.e. the experimental unit and interaction
effect are confounded here. The only solution to this is to have replication at that level. For
example, if we want to estimate the interaction effect of newspaper campaign with a price of
R65, we need to have replication at newspaper and R65 (and every other campaign x price
treatment).
One always needs to be able to estimate an error term (experimental unit effect). If there is
only one observation per treatment, we need to assume that the effect of newspaper is the
same for all price levels. Then we can estimate an average (main) effect for newspaper. But we
cannot test or establish whether the effect of newspaper is different in the different price levels.
If I want to estimate the effect of a particular combination of factor levels, over and above the
average effects (i.e. the interaction effect), then I need replication at that combination of
treatment levels.
Sums of Squares
To understand the sums of squares and the corresponding degrees of freedom, think in terms
of the design that was used. For example, Figure 7.4 shows the break-down of the total sum of
squares in a completely randomized factorial experiment. Exactly like in a CRD, the total
sum of squares is split into error and treatment sums of squares. There are a × b treatments,
thus ab − 1 treatment degrees of freedom. There are abn experimental units in total (ab
treatments, each replicated n times), thus abn − 1 error degrees of freedom. Sums of squares
for main effects are as before, and the interaction degrees of freedom is just the rest (or again
the typical cross-tabulation degrees of freedom that you have come across in the RBD and in
the chi-squared test for independence).
The treatment mean is calculated as the average of the n observations for treatment ij, as
before, and the interaction effect (αβ)ij is estimated as
Ȳij· − Ȳi·· − Ȳ·j· + Ȳ···
SStotal [abn − 1] = SStreatment [ab − 1] + SSerror [ab(n − 1)]
SStreatment [ab − 1] = SSA [a − 1] + SSB [b − 1] + SSAB [(a − 1)(b − 1)]
Figure 7.4: Breakdown of total sum of squares in a completely randomized factorial experiment,
with two treatment factors.
The data below show tensile strength (Pa) of asphaltic concrete specimens for two aggregate
types with each of four compaction methods. Tensile strength is the amount of force one must
apply to pull something apart until it breaks, and is here measured in Pascal = newtons/m2 .
Suppose this was a factorial experiment and designed properly. Then, how many treatment
factors? How many treatments? How many experimental units? How many replicates per
treatment?
There are 2 treatment factors: aggregate type with 2 levels, and compaction method with 4
levels. This gives 2 × 4 = 8 treatments (Table 7.1). There are 3 replicates of each treatment.
You can see this in Table 7.1 or from the degrees of freedom in the ANOVA table. Total
number of experimental units = 24, assuming the same number of replicates for each
treatment, makes 3 replicates per treatment.
Note how the model was fitted in R. One term for every term in the model, the interaction is
written as aggregate:compaction.
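The call would have looked something like this (a sketch; the exact statement is not shown in the notes, with variable names taken from Table 7.2 and the output further below):

m1 <- aov(strength ~ aggregate + compaction + aggregate:compaction)
summary(m1)   # gives Table 7.2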
Table 7.2: ANOVA table for asphalt strength example.
Df Sum Sq Mean Sq F value Pr(>F)
aggregate 1 1734.0 1734.0 182.526 3.628e-10
compaction 3 16243.5 5414.5 569.947 < 2.2e-16
aggregate:compaction 3 1145.0 381.7 40.175 1.124e-07
Residuals 16 152.0 9.5
In R, aggregate*compaction means all interactions and lower order terms, i.e. main effects. So
aov(strength ~ aggregate * compaction)
is equivalent to the earlier R statement, i.e. fits exactly the same model. And these model
terms again correspond exactly to the terms in the ANOVA table.
From the ANOVA table we can see which terms are important or explain the variation in the
data. Usually one starts with the highest-order interaction term. Here, the
aggregate-compaction interaction term.
H0 : (αβ)ij = 0 ∀ij
i.e. all interaction terms equal 0, i.e. there is no interaction between aggregate type and
compaction method, i.e. the average response can be explained just using main effects (note,
the ∀ij is important: the hypothesis tests whether they are all simultaneously equal to 0). The
alternative hypothesis is that at least one of these interaction terms does not equal 0.
There is strong evidence that aggregate type and compaction method do interact (p < 0.0001).
This means that the effect of compaction method depends on type of aggregate being used. If
interaction effects are present, it is mostly useful to look at an interaction plot to further
interpret the results.
At this stage it is important to consider what the experimenter actually wanted to learn from
this experiment, i.e. what is important or interesting from a practical point of view. If you are
lucky, the experimenter will have stated this in terms of clear contrasts or hypotheses. Here,
let’s see if we can come up with some interesting points.
The response is tensile strength, so it might be interesting to see which aggregate type,
compaction method, or combination result in the strongest and the weakest asphalt.
Generally, basalt seems to result in stronger asphalt, except with the static compaction
method, where it does not seem to matter which aggregate is used: the two lines are fairly
parallel, except at static compaction, where the difference is much smaller.
Of the compaction methods, regular seems to produce the strongest asphalt, and very low
compaction the weakest.
What about the main effects? They measure the average effect. Some people say that it does
not make sense to interpret or test main effects when interactions are present.

Figure 7.5: Interaction plot for asphalt strength data. On the y-axis is plotted the mean
response (tensile strength, Pa) against compaction method, with separate lines for basalt
and silicious aggregate.

However, I
believe that they can still give you some information. One important thing to note when
interpreting main effects in the presence of interactions (e.g. from an ANOVA table) is that
NO or very little evidence for main effects does NOT mean that the factor has no effect on
the response. It only means that, on average (averaged over all levels of the other factor),
there is no difference between the levels. But the interaction tells you that there is an
effect; it just differs for each level of the other factor.
In the asphalt example, there is evidence for main effects of aggregate type and for the main
effects of compaction method (both p < 0.0001). Compaction method actually seems to be the
most important factor in determining asphalt strength, as it has clearly the largest mean
square. The main effects here can be interpreted, but one should proceed with caution, they
only give average effects. For example, on average, a basalt aggregate results in stronger
asphalt than a siliceous aggregate, but this is not true for all compaction methods (the
interaction). And there are fairly clear differences between compaction methods.
Confidence intervals
In the end we want to know more about how large some of these differences are (contrasts). So,
let’s construct a 95% confidence interval to estimate the difference in asphalt strength between
the two aggregate types when using a static compaction method. This asks us to compare the
two static treatments. Here is some R output from which we can get the relevant means:
------------------------------------------
> model.tables(m1, "means")
Tables of means
Grand mean
78.75
aggregate
aggregate
basalt silicious
87.25 70.25
compaction
compaction
low regular static very low
79.0 120.0 66.5 49.5
aggregate:compaction
compaction
aggregate low regular static very low
basalt 97.33 129.00 65.33 57.33
silicious 60.67 111.00 67.67 41.67
------------------------------------------
The contrast of interest is L = µ(basalt, static) − µ(silicious, static), estimated by
L̂ = 65.33 − 67.67 = −2.34. Its variance is estimated by

Var(L̂) = (σ̂²/3) × 2 = (9.5 × 2)/3 = 6.333

Each treatment mean is based on 3 observations, and the estimate for σ² comes from the MSE in
the ANOVA table. Then the 95% confidence interval is:

L̂ ± t16 × √(9.5 × 2/3) = −2.34 ± 2.120 × 2.517 = [−7.68; 3.00]
This tells us that with the static compaction method, asphalt with basalt is estimated to be
between 7.68 Pa weaker and 3.00 Pa stronger than asphalt with siliceous aggregate, i.e. there
is no clear indication that it makes a difference which aggregate type is used; or we could say
that the difference between the two methods is not large enough to say anything with
certainty.
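The interval can be checked numerically; a small sketch (cell means and MSE taken from the
output above):

------------------------------------------
Lhat <- 65.33 - 67.67              # basalt - silicious, static compaction
se   <- sqrt(9.5 * 2 / 3)          # MSE = 9.5, 3 observations per cell
Lhat + c(-1, 1) * qt(0.975, df = 16) * se   # 95% CI, about [-7.68, 3.00]
------------------------------------------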
To estimate effects, we also need the above R output. For example, the main effect for
basalt is estimated as

α̂basalt = 87.25 − 78.75 = 8.5
This tells us that ON AVERAGE asphalt with basalt is 8.5 Pa stronger than the overall mean,
and, perhaps more meaningfully, 17 Pa stronger than if made with siliceous aggregate, averaged
over all levels of compaction method. This works here because there are only 2 levels of
aggregate type, and because the effect estimates sum to zero. The actual difference in strength
(with basalt vs silicious aggregate type) at a particular compaction method level depends on
compaction method level.
Similarly, the estimated interaction effect for the basalt-static cell is

(αβ̂)basalt,static = 65.33 − 87.25 − 66.5 + 78.75 = −9.67

This tells us that mean asphalt strength with the static compaction method and basalt
aggregate is 9.67 Pa lower relative to the sum of the main effects, i.e. relative to the mean
under an additive (without interactions) model.
What is the predicted strength for the regular-basalt treatment combination? Assuming we use
the model including the interaction term, this is just the observed treatment mean,
Ŷ = Ȳbasalt,regular = 129.00 Pa. The residual corresponding to the first observation from this
treatment would be that observation minus the fitted value of 129.00.
If the tensile strength had been tested only once for each treatment, we would not have been
able to estimate the interaction effects. So our model would necessarily have to also exclude
the interaction term. We would only be able to estimate average/main effects, and would miss
out on the interaction effects.
I could make all pairwise comparisons of treatments, as before. There are 8 treatments, so
there are (8 choose 2) = 28 pairwise comparisons. I might then choose to use Tukey's method of
controlling the experiment-wise type I error rate.
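In R this is a one-liner; a sketch using the fitted object m1 assumed above:

------------------------------------------
# Tukey HSD intervals for all 28 pairwise treatment comparisons
TukeyHSD(m1, "aggregate:compaction")
------------------------------------------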
When constructing contrasts of treatment means, keep track of how many observations each
mean was based on (for calculating standard errors). For example, if you compare average
strength with basalt aggregate to average strength with silicious aggregate, then each of those
two means are based on 12 observations (see Table 7.1). So the standard error of this
difference would be
SE = √(9.5 × 2 / 12)
Here are data from a completely randomized factorial experiment with 3 treatment factors.
Four-week weight gain of shrimp cultured in aquaria at different levels of temperature (T),
density of shrimp populations (D), and water salinity (S) was investigated.
TEMPERATURE SALINITY DENSITY GAIN
25 10 80 86, 52, 73
25 10 160 86, 53, 73
25 25 80 544, 371, 482
25 25 160 393, 398, 208
25 40 80 390, 290, 397
25 40 160 249, 265, 243
35 10 80 439, 436, 349
35 10 160 324, 305, 364
35 25 80 249, 245, 330
35 25 160 352, 267, 316
35 40 80 247, 277, 205
35 40 160 188, 223, 281
Here are two equivalent ways to fit the full factorial model in R:
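A sketch of the two fits (the data frame name shrimp is an assumption; the three variables
must be coded as factors):

------------------------------------------
m1 <- aov(GAIN ~ TEMPERATURE * SALINITY * DENSITY, data = shrimp)
m2 <- aov(GAIN ~ TEMPERATURE + SALINITY + DENSITY +
            TEMPERATURE:SALINITY + TEMPERATURE:DENSITY +
            SALINITY:DENSITY + TEMPERATURE:SALINITY:DENSITY,
          data = shrimp)
summary(m1)   # identical ANOVA table for both fits
------------------------------------------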
What is the experimental unit? The individual shrimp? An aquarium? It is an aquarium, the
entity to which a specific combination (treatment) of temperature, salinity and density can be
applied. There are 12 treatments, each replicated 3 times, giving 36 experimental units.
What can we learn from the above ANOVA table? By far the largest sum of squares, and
mean squares, is accounted for by the temperature-salinity interaction term. This is some
indication that this term is the most important, and can explain much of the variability in the
response. Next, salinity seems quite important.
But formally, we again start at the highest order interaction, here, the three-way interaction
between all three factors (temp:salinity:density). There is evidence of a three-way interaction.
This is a bit problematic, as it becomes almost impossible to understand these things. A
three-way interaction basically means that the two-way interactions are different for the
different levels of the third factor.
One option is then to consider the two-way interactions separately at each level of the third
factor, which is easier to interpret. For the shrimp example, this would leave only the
temperature-salinity interaction with a small p-value.
But here, let’s see what we can learn from the current data and model. Again, if factors
interact, it is a good idea to look at interaction plots. We saw that salinity is quite important,
so here are shown two plots, one for each level of temperature. The temperature-salinity
interaction term captures much of the variability in the data, so let’s see how the effect of
salinity (on weight gain) changes at different levels of temperature. This is, by definition, the
salinity:temperature interaction.
Figure 7.6: Interaction plots for shrimp example, one panel per temperature level (25 °C and
35 °C), showing mean weight gain against salinity.
At the lower temperature level (25 °C) shrimp don’t do very well at low salinity, but much
better at medium salinity levels. At the higher temperature level (35 °C), shrimp do better at
the lower salinity level, but weight gain decreases with increasing salinity.
The three-way interaction here means that the interaction between salinity and density
changes with temperature level.
This example demonstrates how complicated understanding of experiments can become. And
that the ANOVA table can give you a bit of information on which terms seem to be
important, but that the really hard part is to make sense of the results and try and
understand what is really going on, how the different variables affect the response, and how
effects are influenced by other factors.
Interaction plot: mean N uptake (%) against timing, with separate lines for the two inhibitor
levels.
The experiment, laid out in three blocks, compared 3 timings of nitrogen application
(early, optimum, late) and 2 levels of nitrification inhibitor (none, 1.5 kg/ha). Data are
percentage of labelled N taken up by sweet corn plants. The nitrification inhibitor prevents
the nitrogen being released all at once, and disappearing. With the inhibitor, nitrogen is
therefore available for longer.
Nitrogen Inhibitor
None 1.5 kg/ha
Block Early Optimum Late Early Optimum Late mean
1 21.4 50.8 53.2 54.8 56.9 57.7 49.13
2 11.3 42.7 44.8 47.9 46.8 54.0 41.25
3 34.9 61.8 57.8 40.1 57.9 62.0 52.42
mean 22.53 51.77 51.93 47.6 53.87 57.9
1. Write down the linear model for this experiment (including all constraints).
5. Draw the interaction plot by hand to illustrate interactions between the two treatment
factors.
6. Summarize what you can learn from the ANOVA table, and the interaction plot about
HOW timing and presence or absence of the nitrogen inhibitor affect nitrogen found in
plants.
7. Estimate effects, standard errors, confidence intervals for means and contrasts.
8. Be able to apply any of the methods of controlling for type I error to calculating
confidence intervals, tests.
Note that ‘factorial experiment’ refers to the treatment structure and is not one of the basic
experimental designs. Factorial experiments can be conducted as any of the 3 designs we have
seen in earlier chapters.
1. To get a proper estimate of σ 2 more than one observation must be taken on each
treatment - i.e. we must have replication. In factorial experiments a replication is a
replication of all treatments (all factor level combinations).
2. The same number of units should be assigned to each treatment combination. The total
sum of squares can then be uniquely split into components associated with the main
effects of the factors and their interactions. The sums of squares are independent. This
allows us to assess the effect of factor A, say, independently of factor B, etc.. If unequal
numbers of units are assigned to the treatments, the simplicity of the ANOVA and
interpretation breaks down. There is no longer a unique split of the total sums of
squares into independent sums of squares associated with each factor. The sum of
squares for factor A will depend upon whether or not factor B has been fitted. The
conclusions that can be drawn are not as clear as they are with a balanced design.
Unbalanced designs are difficult to analyse.
3. Factorial designs can generate a large number of treatments. If factor A has 2 levels, B
has 3 and C has 4, then there are 24 treatments. If three replications are made then 72
experimental units will be needed. If there are not sufficient homogeneous experimental
units, they can be grouped into blocks. Each replication could be made in a different
block. Incomplete factorial designs are available, in which only a carefully selected
number of treatments are used. See any of the recommended texts for details.
Why are factorial experiments better than experimenting with one factor at
a time?
Consider the following simple example: Suppose the yield of a chemical process depends on 2
factors: (T) - the temperature at which the reaction takes place and (P) - the pressure at
which the reaction takes place.
Suppose T has 2 levels T1 and T2
and P has 2 levels P1 and P2
Suppose we experiment with one factor at a time. We would need at least 3 observations to
give us information on both factors. Thus we would observe Y at T1 P1 and T2 P1 which would
measure a change in temperature only because the pressure is kept constant. If we then
observed Y at T1 P2 we could measure the effect of change on pressure only. Thus we have the
following design
T1 T2
P1 (1) (2)
P2 (3)
For a factorial experiment with the above treatments, we consider every treatment
combination.
T1 T2
P1 (1) (2)
P2 (3) (4)
If there is no interaction, i.e. the effect of changing temperature does not depend on the
pressure level, then the two estimates of the temperature effect, (2) − (1) and (4) − (3),
only differ by experimental error and their average gives the effect of temperature just as
precisely as the duplicate observations in (1) and (2) we needed in the one-factor experiment.
The same is true for the pressure effect. Hence, if there is no interaction of factors, we can
obtain as much information with 4 observations in a factorial experiment as we did with 6
observations varying only one factor at a time. This is because all 4 observations are used to
measure each effect in the factorial experiment, whereas in the one-factor-at-a-time
experiment only 2/3 of the observations are used to estimate each effect.
Suppose the factors interact. If we experiment with one factor at a time we have the situation
as shown above. We see that T1 P2 and T2 P1 give higher yields than T1 P1 . Could we assume
that T2 P2 would be better than both? This would be true if factors didn’t interact. If they
interact then T2 P2 might be very much better than both T1 P2 and T2 P1 or it may be very
much worse. The “one factor at a time” experiments do not tell us this because we do not
experiment at the most favourable (or least favourable) combination.
Suppose we investigate the effects of 2 factors, A and B where A has a levels and B has b
levels. The a × b treatments are arranged in a factorial design and the design is replicated n
times. The a × b × n experimental units are assigned to the treatments as in the completely
randomised design with n units assigned to each treatment.
        B1   B2   . . .   Bb
A1       *    *   . . .    *
A2       *    *   . . .    *
 .       .    .            .
Aa       *    *   . . .    *

where each cell (*) contains the n observations on that treatment combination.
The entire design should be completed, then a second replicate made, etc. This is
relatively easy to do in agricultural experiments, since the replicates would be made
simultaneously on different plots of land. In chemical, industrial or psychological experiments
there is a tendency for a treatment combination to be set up and a number of observations
made, before passing on to the next treatment. This is not a replication and if the experiment
is analysed as though it were, the estimate of the error variance will be too small. Performed
this way the observations within a treatment are correlated and a different analysis should be
used. See Winer (pg. 391–394) for details.
7.8 Interaction
Consider the table of true means µij, with rows A1, . . . , Aa (Factor A) and columns
B1, . . . , Bb (Factor B); µi· denotes the mean of row i, µ·j the mean of column j, and µ·· the
grand mean.
Then the main effect of the ith level of A is

µi· − µ··    (1)

and the effect of Ai at the jth level of B is

µij − µ·j    (2)

If there is no interaction, Ai will have the same effect at each level of B. If there is
interaction it can be measured by the difference (2) − (1):

µij − µ·j − µi· + µ··

The same formula for the AiBj interaction would be found if we started with the main effect
for the jth level of B (µ·j − µ··) and compared it with the effect of Bj at the ith level of A,
µij − µi·.
From this formula we see that the interaction involves every cell in the table, so if any
cells are empty (i.e. no observations on that treatment) the interaction cannot be estimated.
In practice we have a random error as well. Replacing the true means by their sample means, we
estimate the interaction by

(αβ̂)ij = Ȳij· − Ȳi·· − Ȳ·j· + Ȳ···
7.9 Interpretation of results of factorial experiments
No interaction
Only rarely, when we a priori do not expect the presence of any interactions, would we fit the
following model:
Yijk = µ + αi + βj + eijk
Interpretation is very simple. No interaction means that factors act on the response
independently of each other. Apart from random variation, the difference between
observations corresponding to any level of A is the same for all levels of B and vice versa. The
main effects summarise the whole experiment.
Interaction present

Most often, interactions are one of the main points of interest in factorial experiments, and
we would fit the full model:

Yijk = µ + αi + βj + (αβ)ij + eijk
The plots of the means of B at each level of A may interweave or diverge since the mean of B
depends on what level of A is present.
If interactions are present, main effects estimate the average effect (averaged over all levels of
the other factor). For example if α1 > α2 , we must say that averaged over B α1 > α2 but for
some levels of B, it may happen that α1 < α2 . Interaction plots are very useful when
interpreting results in the presence of interaction effects. These plots can give a good
indication as to patterns and could be used to answer some of the following or similar
questions.
Which levels of A are always better than other levels, regardless of the level of B?
Some of these questions should be part of your a-priori hypothesis. In statistical reports,
however, plots are not enough. For a report on any of the questions, we would rephrase them
in the form of a contrast and give a confidence interval to back up our statement.
Sometimes a large interaction indicates a non-linear relationship between response and
treatment factors. In this case a non-linear transformation of the response variable (e.g. a
log-transformation) may produce a model with no interactions. Such transformations are called
power transformations.
Power transformations

Z = (y^λ − 1)/λ,   λ ≠ 0
Z = log(y),        λ = 0
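In R, λ can be estimated by maximum likelihood with MASS::boxcox; a sketch, applied to a
previously fitted aov/lm object (here called m1, an assumption):

------------------------------------------
library(MASS)
# profile log-likelihood of the Box-Cox parameter; pick lambda near the peak
boxcox(m1, lambda = seq(-2, 2, 0.1))
------------------------------------------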
Special cases of these include the square root transformation or the log transformation. A log
transformation of the observations means that we really have a multiplicative model
If the data are transformed for analysis, then all inferences, such as mean differences and
confidence intervals, are calculated from the transformed values. Afterwards these quantities
are back-transformed and the results expressed in terms of the original data.
The value of λ has to be found by trial and error or can be estimated by maximum likelihood.
It can sometimes make the experiment difficult to interpret. Alternatively, if the interaction
is very large, the data can be analysed as a one-way layout with (nb) observations per
treatment, or as a completely randomised design with (ab) treatments with n observations per
treatment.
treatment. When experimenting with more than 2 factors higher order interactions may be
present, for example a 3 factor interaction or 4 factor interaction. Higher order interactions
are difficult to interpret. A direct interpretation in terms of interactions is rarely enlightening.
A good discussion of principles to follow when higher order interactions are present is given by
Cox (1984). If higher order interactions are present he recommends attempting one or more of
the following approaches:
4. Splitting the factor combinations on the basis of one or more factors, e.g. considering
AB for each level of C.
5. Adopting a new system of factors for the description of the treatment combinations.
Yijk = µ + αi + βj + (αβ)ij + eijk,   i = 1, . . . , a;  j = 1, . . . , b;  k = 1, . . . , n

with the constraints

Σi αi = Σj βj = Σi (αβ)ij = Σj (αβ)ij = 0

where
µ = general mean
αi = main effect of the ith level of A
βj = main effect of the j th level of B
(αβ)ij = interaction between the ith level of A and the j th level of B.
Note that (αβ) is a single symbol and does not mean that the
interaction is a product of the two main effects.
The “sum to zero” constraints are defined as part of the model, and ensure that the parameter
estimates subject to these constraints are unique. Other commonly used constraints are the
“corner-point” constraints α1 = β1 = (αβ)1j = (αβ)i1 = 0. Again these estimates are unique
subject to these constraints, but different from those given by the “sum to zero” constraints.
The maximum likelihood/least squares estimates are found by minimizing
S = Σijk (Yijk − µ − αi − βj − (αβ)ij)²    (7.5)
Differentiating with respect to each of the (ab + a + b +1) parameters and setting the
derivatives equal to zero gives
∂S/∂µ = −2 Σijk (Yijk − µ − αi − βj − (αβ)ij) = 0
∂S/∂αi = −2 Σjk (Yijk − µ − αi − βj − (αβ)ij) = 0,   i = 1, . . . , a
∂S/∂βj = −2 Σik (Yijk − µ − αi − βj − (αβ)ij) = 0,   j = 1, . . . , b
∂S/∂(αβ)ij = −2 Σk (Yijk − µ − αi − βj − (αβ)ij) = 0,   i = 1, . . . , a;  j = 1, . . . , b
Applying the constraints, these reduce to

abnµ = Y···
bnµ + bnαi = Yi··,   i = 1, . . . , a
anµ + anβj = Y·j·,   j = 1, . . . , b
nµ + nαi + nβj + n(αβ)ij = Yij·,   i = 1, . . . , a;  j = 1, . . . , b

with solutions

µ̂ = Ȳ···
α̂i = Ȳi·· − Ȳ···,   i = 1, . . . , a
β̂j = Ȳ·j· − Ȳ···,   j = 1, . . . , b
(αβ̂)ij = Ȳij· − Ȳi·· − Ȳ·j· + Ȳ···,   i = 1, . . . , a;  j = 1, . . . , b
Note that s² is obtained by pooling the within-cell variances and could be written as

s² = Σij (n − 1)s²ij / (ab(n − 1))
When interpreting an ANOVA table, one should always start with the highest order
interactions. If strong interaction effects are present, interpreting main effects needs to
consider this. For example if there is no evidence for main effects of factor A, this DOES NOT
mean that factor A does not affect the response.
The hypotheses of interest are HA: αi = 0 ∀ i, HB: βj = 0 ∀ j, and HAB: (αβ)ij = 0 ∀ i, j. The
alternative hypothesis is, in each case, that at least one of the parameters in H is non-zero.
The F-test for each of these hypotheses effectively compares the full model to one of the
three reduced models:
1. Yijk = µ + αi + βj + eijk which omits the interaction effects.
2. Yijk = µ + αi + eijk which omits effects due to B.
3. Yijk = µ + βj + eijk which omits effects due to A.
The residual sum of squares from the full model provides the sum of squares for error s2 with
ab(n − 1) degrees of freedom. Denote this sum of squares by SSE. To obtain the appropriate
sums of squares for each of the three hypotheses, we could obtain the residual sums of squares
from each of the models.
Under HAB, for example, we minimise

S = Σijk (Yijk − µ − αi − βj)²    (7.7)
Differentiating and equating to zero, e.g.

∂S/∂µ = −2 Σijk (Yijk − µ − αi − βj) = 0
Note that the least squares estimates for µ, αi and βj are the same as under the full model.
This is because the X’X matrix is orthogonal (or equivalently block-diagonal). This will not
be the case if the numbers of observations per treatment differ (unbalanced designs). The
residual sum of squares under HAB is
SSres = Σijk (Yijk − µ̂ − α̂i − β̂j)²    (7.8)
The numerator sum of squares for the F test of HAB is given by the difference between (7.8)
and the residual sum of squares from the full model. Regrouping the terms of (7.5) as
Σijk [(Yijk − µ̂ − α̂i − β̂j) − (αβ̂)ij]²    (7.9)

and expanding, this equals

Σijk (Yijk − µ̂ − α̂i − β̂j)² − n Σij (αβ̂)²ij    (7.10)
since the cross-product terms are zero in summation. From (7.8) and (7.10) we see that the
numerator sum of squares for the F tests is
SSAB = n Σij (αβ̂)²ij = n Σij (Ȳij· − Ȳi·· − Ȳ·j· + Ȳ···)²
The test of HAB is made using

MSAB / MSE ∼ F(a−1)(b−1), ab(n−1)

where MSAB = SSAB/((a − 1)(b − 1)) and MSE = SSE/(ab(n − 1)).
Similar results can be derived for the test of HA and HB . Because of the orthogonality of the
design, the estimates of the main effects under the reduced models are the same as under the
full model. Hence we can split the total sum of squares about the grand mean
SStotal = Σijk (Yijk − µ̂)² = Σijk (Yijk − Ȳ···)²

uniquely as

SStotal = SSA + SSB + SSAB + SSE
Table 7.3: Analysis of variance table for two-factor factorial experiment

Source            SS                                          df              MS     F          EMS
A Main Effects    SSA = nb Σi (Ȳi·· − Ȳ···)²                  a − 1           MSA    MSA/MSE    σ² + nb Σi αi²/(a − 1)
B Main Effects    SSB = na Σj (Ȳ·j· − Ȳ···)²                  b − 1           MSB    MSB/MSE    σ² + na Σj βj²/(b − 1)
AB Interactions   SSAB = n Σij (Ȳij· − Ȳi·· − Ȳ·j· + Ȳ···)²   (a − 1)(b − 1)  MSAB   MSAB/MSE   σ² + n Σij (αβ)ij²/((a − 1)(b − 1))
Error             SSE = Σijk (Yijk − Ȳij·)²                   ab(n − 1)       MSE               σ²
Total             SStotal = Σijk (Yijk − Ȳ···)²               abn − 1
Comments on the ANOVA table
1. Computing formulae for the ANOVA table (used for hand calculations):

   C = abn Ȳ···²
   SScells = n Σij Ȳij·² − C
   SSA = bn Σi Ȳi··² − C
   SSB = an Σj Ȳ·j·² − C
   SSAB = SScells − SSA − SSB
2. The expected mean squares are found by replacing the observations in the sums of
squares by their expected values under the full model, dividing by the degrees of freedom
and adding σ 2 . For example
SSA = nb Σi (Ȳi·· − Ȳ···)²

Now Ȳi·· − Ȳ··· = α̂i and E(α̂i) = αi, so that

E(MSA) = σ² + nb Σi αi²/(a − 1)
3. Power analysis and sample size: The non-centrality parameter of the F test is found
using a similar technique. For
HA:  λ = nb Σi αi²/σ²,   φ² = nb Σi αi²/(aσ²),   and the non-central F has (a − 1) and
ab(n − 1) df;
HB:  λ = na Σj βj²/σ²,   φ² = na Σj βj²/(bσ²),   and the non-central F has (b − 1) and
ab(n − 1) df;
HAB: λ = n Σij (αβ)ij²/σ²,   φ² = n Σij (αβ)ij²/(abσ²),   and the non-central F has
(a − 1)(b − 1) and ab(n − 1) df.
The non-centrality parameters can be used to determine the number of replications
necessary to achieve a given power for certain specified configurations of the parameters.
Note that the error degrees of freedom is ab(n-1) where a = number of levels of A and b
is the number of levels of B. As a rough rule of thumb, we should aim to have enough
replications to give about 20 degrees of freedom for error. In higher-way layouts, where
some of the interactions may be zero, we can allow even fewer degrees of freedom for the
error. In practice, the number of replications possible is often determined by the amount
of experimental material, and the time and resources of the experimenter. Nonetheless, a
few power calculations are helpful, especially if the F test fails to reject the null
hypothesis; the reason could be that the F test has insufficient power.
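A sketch of such a power calculation for HA, using the non-central F distribution (all
parameter values below are assumptions for illustration):

------------------------------------------
a <- 3; b <- 2; n <- 4; sigma2 <- 1
alpha  <- c(-0.5, 0, 0.5)                  # hypothesised main effects of A
lambda <- n * b * sum(alpha^2) / sigma2    # non-centrality parameter
df1 <- a - 1; df2 <- a * b * (n - 1)
Fcrit <- qf(0.95, df1, df2)                # 5% critical value
1 - pf(Fcrit, df1, df2, ncp = lambda)      # power of the F test
------------------------------------------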
If the interactions were significant it makes sense to compare the ab cell means (treatment
combinations). If no interactions were found, one can compare the levels of A and the levels of
B separately. If the treatment levels are ordered, it is preferable to test effects using
orthogonal polynomials. Both the main effects and the interactions can be decomposed into
orthogonal polynomial contrasts.
Questions:
1. On how many observations is each of the cell means based? What is the standard error
for the difference between two cell means?
2. If we compare levels of factor A only, on how many observations are the means based?
What is the standard error for a difference between two means now?
With p factors, the treatment sum of squares splits into

(p choose 1) main effects A, B, C, etc.
(p choose 2) two-factor interactions
(p choose 3) three-factor interactions
. . .
(p choose p) = 1 p-factor interaction
If there are n > 1 observations per cell we can split SStotal into 2^p − 1 component SS's and
an error SS. If p > 4 or 5, factorial experiments are very difficult to carry out.
7.14 Examples
Example 2
A small experiment has been carried out investigating the response of a particular crop to two
nutrients, N and K. A completely randomised design was used with six treatments arranged in
a 3 × 2 factorial structure. N was applied in 0, 4 and 8 units, whilst K was applied in 0 and 4
units.
K0 K4
10.02 10.72
N0 11.74 14.08
13.27 11.87
20.65 19.33
N4 18.88 20.77
16.92 21.70
19.47 21.45
N8 20.06 20.92
20.74 24.87
There is not enough evidence for an interaction between N and K (p = 0.63). There is strong
evidence that different N (nitrogen) levels affect yield (p < 0.0001), but only little evidence
that yield differed with different levels of K (potassium) (p < 0.06).
Note that the levels of nitrogen are equally spaced. As a next step we could fit orthogonal
polynomials to see if the relationship between yield and nitrogen is quadratic (levels off), as
perhaps suggested by the interaction plot. From the interaction plot it seems that perhaps the
effect of K increases with increasing N, but the differences are too small to say anything with
certainty (about K) from this experiment.
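A sketch of the orthogonal-polynomial fit (the data frame name npk2 and the column names
Yield, N, K are assumptions):

------------------------------------------
npk2$N <- ordered(npk2$N)      # ordered factor => polynomial contrasts
m <- aov(Yield ~ N * K, data = npk2)
summary(m, split = list(N = list(linear = 1, quadratic = 2)))
------------------------------------------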
Figure 7.7: Interaction plot for nitrogen and potassium factorial experiment (mean yield
against N level 0, 4, 8, with separate lines for the two K levels).
Return to the bond strength example from the beginning of the chapter.
Model

Yijk = µ + αi + βj + (αβ)ij + eijk

with Σi αi = Σj βj = Σi (αβ)ij = Σj (αβ)ij = 0
Cross-Lap and Square-Centre assembly with adhesive 001 appear to be more variable than the
other treatments — but this is not significant:
A modern robust test for the homogeneity of variances across groups is Levene’s test. It is
based on absolute deviations from the group medians. It is available in R as leveneTest from
package car.
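A sketch of the call (the data frame name bond and the column names are assumptions):

------------------------------------------
library(car)
# groups defined by the 9 assembly-adhesive combinations
leveneTest(strength ~ assembly * adhesive, data = bond)
------------------------------------------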
Figure 7.8: These interaction plots show (a) the mean bond strength for adhesive at different
levels of assembly, and (b) the mean bond strength for assembly at different levels of adhesive.
-----------------------------------------------------------
Levene’s Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 8 1.0689 0.406
36
-----------------------------------------------------------
The null hypothesis is that all variances are equal; with p = 0.406 there is no evidence of
differences between the variances. Here is the ANOVA table:
---------------------------------------------------
Df Sum Sq Mean Sq F value Pr(>F)
adhesive 2 127.51 63.756 3.6432 0.03623
assembly 2 4.98 2.489 0.1422 0.86791
adhesive:assembly 4 196.09 49.022 2.8013 0.04015
Residuals 36 630.00 17.500
---------------------------------------------------
Adhesive and assembly method interact in their effects on strength (p = 0.04). So we look no
further at the main effects but instead look at the interaction plots (Figure 7.8) to give us an
idea of how these factors interact to influence strength.
The round-centre assembly method works well with adhesive 00T (best of all combinations),
but relatively poorly with the other two adhesives. The other two assembly methods seem to
work best with adhesive 001, intermediate with 00T and worst with 047. We could now do
some post-hoc tests (with corrections for multiple testing) to see for example whether the two
assembly methods (square-centre and cross-lap) make any difference.
Questions:
2. Refer back to the ANOVA table, to the assembly line. Does the large p-value imply that
assembly method has no effect on strength? With the help of the interaction plots
briefly explain.
Example 3
A hypothetical experiment from Dean and Voss (1999): In 1994 the Statistics Department at
a university introduced a self-paced version of a data analysis course that is usually taught by
lecture. Suppose the department is interested in student performance with each of the 2
methods, and also how student performance is affected by the particular instructor teaching
the course. The students are randomly assigned to one of the six treatments.
Figure 7.9 shows different hypothetical interaction plots that could result from this study. The
y-axis shows student performance, the x-axis shows instructor. In which of these do method
and instructor interact?
Figure 7.9: Possible configurations of effects present for two factors, Instructor and Teaching
Method.
Chapter 8
So far we have assumed that the treatments used in an experiment are the only ones of
interest. We aimed to estimate the treatment means and to compare differences between
treatments. If the experiment were to be repeated, the same treatments would be used again.
This means that each factor used in defining the treatments would have the same levels.
When this is the case we say that the treatments or the factors defining them are fixed. The
model is referred to as a fixed effects model. The simplest fixed effects model is the completely
randomised design, or one-way lay-out, in which a treatments are compared. N experimental
units are randomly assigned to the a treatments, usually with n subjects per treatment and
the j th observation on the ith treatment has the structure
Yij = µ + αi + eij,   i = 1, . . . , a;  j = 1, . . . , n,   with Σi αi = 0
where
µ = general mean
αi = effect of treatment i
eij = random error such that eij ∼ N (0, σ 2 )
Example

Suppose a Department of Health wants to compare four particular laboratories that determine
cholesterol levels, and sends ten identical blood samples to each. The determinations of
the cholesterol levels were returned to the Department and the results analysed using a
one-way analysis of variance (model (9.1)). Here, the parameter αi measured the effect of the
ith laboratory. Significant differences between the αi ’s would mean that some laboratories
tended to return, on average, higher values of cholesterol than others.
Now consider the other situation. Suppose the Department believes that the determination of
cholesterol levels varies from laboratory to laboratory about some mean value and they want
to measure the amount of variation. Now they are not interested in any particular laboratory,
so they select 4 laboratories at random from a list of all laboratories that perform such
analyses, and send each laboratory ten samples as before.
Now if this experiment were repeated, there is very little chance of the same four
laboratories being used, so the effect of the laboratory is random. We now write a model for
the jth determination from the ith laboratory as:

Yij = µ + ai + eij
Here we assume that ai is a random variable such that E(ai ) = 0 and V ar(ai ) = σa2 , where σa2
measures the component in the variance of Yij that is due to laboratory. We also assume that
ai is independent of eij . The term eij has E(eij ) = 0 and V ar(eij ) = σe2 .
Hence
E(Yij ) = µ
V ar(Yij ) = σa2 + σe2
i = 1, . . . , a
j = 1, . . . , b
This is called the random effects model or the variance components model. To distinguish
between a random or fixed factor, ask yourself this question: If the experiment were repeated,
would I observe the same levels of A? If the answer is Yes, then A is fixed. If No, then A is
random.
Note that the means of the observations do not have a structure. In more complex situations
it is possible to formulate models in which some effects are fixed and others are random. In
this case a structure would be defined for the mean. For a two-way classification with A fixed
and B random the model is

Yij = µ + αi + bj + eij    (9.3)

where αi is the fixed effect of A, and bj is the value of a random variable with E(bj) = 0 and
Var(bj) = σb², due to the random effect of B. The model (9.3) is called a mixed effects model.
8.2 The Random Effects Model
Assume that the levels of factor A are a random sample of size a from a large population of
levels. Assume n observations are made on each level of A. Let Yij be the j th observation on
the ith level of A, then
Yij = µ + ai + eij
where
µ is the general mean
ai is a random variable with E(ai ) = 0 and V ar(ai ) = σa2
eij is a random variable with E(eij ) = 0 and V ar(eij ) = σe2
ai and eij are uncorrelated
Further it is usually assumed that ai and eij are normally distributed. The Analysis of
Variance table is set up as for the one-way fixed effects model.
The fixed and random effects models are very similar and we can show that the fixed effects
MSE provides an unbiased estimate of σe², i.e. E(MSE) = σe² (the proof is left as an exercise).
Does the fixed-effects mean square for treatments, MSA, provide an unbiased estimate for σa²?
Not quite! However, we can show that MSA is an unbiased estimator for nσa² + σe², i.e.
E(MSA) = nσa² + σe².
Then

E[(MSA − MSE)/n] = σa²
NOTE: this estimator can be negative, even though σa² cannot be negative. This will happen
when MSE > MSA and is most likely to happen when σa² is close to zero. If MSE is considerably
greater than MSA, the model should be questioned.
Can we use the same test as we used in the fixed effects model for testing the equality of
treatment effects?
If H0: σa² = 0 is true, then

E(MSA) = 0 + σe² = σe² = E(MSE),  so that  E(MSA)/E(MSE) = 1
However, if σa2 is large, the expected value of the numerator is larger than the expected value
of the denominator and the ratio should be large and positive, which is a similar situation to
the fixed effects case.
Table 8.1: ANOVA for simple random effects model, with expected mean squares (EMS):

Source    df       MS     EMS
A         a − 1    MSA    σe² + nσa²
Error     N − a    MSE    σe²
Does MSA/MSE ∼ Fa−1; N−a under H0? We can show (see exercise below) that

SSA/(nσa² + σe²) ∼ χ²a−1

and that

SSE/σe² ∼ χ²N−a

so that

[MSA/(nσa² + σe²)] / [MSE/σe²] ∼ Fa−1, N−a

Under H0 (σa² = 0) this reduces to MSA/MSE ∼ Fa−1, N−a. Under H1, MSA/MSE is distributed as
(1 + nσa²/σe²) times a central F variable, so the ratio tends to be large.
Exercise
1. Show that Cov(Yis, Yit) = σa², where Yis and Yit are two observations in group i.
2. Use the above to show that (for the simple random effects model) Var(Ȳi.) = σa² + σe²/n.
This implies that the observed variance between group means does not directly estimate σa².
3. Hence show that SSA/(nσa² + σe²) ∼ χ²a−1. [Hint: consider the distribution of (Ȳi. − Ȳ..)².]
Expected Mean Squares for the Random Effects Model

SSE = Σij Yij² − Σi ni Ȳi.²

Then

Ȳi. = µ + ai + (1/ni) Σj eij

E(Ȳi.) = µ,   Var(Ȳi.) = σa² + σ²/ni,   E(Ȳi.²) = σa² + σ²/ni + µ²

E(SSE) = Σij (σa² + σ² + µ²) − Σi ni (σa² + σ²/ni + µ²)
       = Nσ² − νσ²
       = (N − ν)σ²

E(MSE) = σ²

Above, N = Σ ni, and ν is the number of random effect levels.
SSA = Σi ni Ȳi.² − N Ȳ..²

Ȳ.. = µ + (1/N) Σi ni ai + (1/N) Σij eij

E(Ȳ..) = µ,   Var(Ȳ..) = (Σ ni²/N²) σa² + σ²/N

E(Ȳi.) = µ,   Var(Ȳi.) = σa² + σ²/ni

E(SSA) = Σi ni (σa² + σ²/ni + µ²) − N [(Σ ni²/N²) σa² + σ²/N + µ²]
       = (N − Σ ni²/N) σa² + (ν − 1)σ²

E(MSA) = E(SSA)/(ν − 1) = cσa² + σ²,   where c = (N − Σ ni²/N)/(ν − 1)

Thus

E[(MSA − MSE)/c] = σa²

If all ni = n then c = n.
Rather than testing whether or not the variance of the population of treatment effects is zero,
one may want to test whether the variance is equal to (or less than) some proportion of the
error variance, i.e,
H0: σa²/σe² = θ0, for some constant θ0.

We can use the same F-statistic, but reject H0 if F > (1 + nθ0) Fα; a−1; N−a.
Variance Components
Usually estimation of the variance components is of greater interest than the tests. We have
already shown that
σ̂a² = (MSA − MSE)/n
and
σ̂e2 = M SE
To determine whether or not flavours of ice cream melt at different speeds, a random sample
of three flavours were selected from a large population of flavours. The three flavours of ice
cream were stored in the same freezer in similar-sized containers. For each observation one
teaspoonful was taken from the freezer, transferred to a plate, and the melting time at room
temperature was observed to the nearest second. Eleven observations were taken on each
flavour:
Flavour Melting Times (sec)
1 24 876 1150 1053 1041 1037
1125 1075 1066 977 886
2 891 982 1041 1135 1019 1093
994 960 889 967 838
3 817 1032 844 841 785 823
846 840 848 848 832
Anova Table:
Source df SS MS F p
Flavour 2 173 009.8788 86 504.9394 12.76 0.0001
Error 30 203 456.1818 6 781.8727
Total 32 376 466.0306
An unbiased estimate for σe² is σ̂e² = MSE = 6 781.8727 sec². An unbiased estimate for σa² is

σ̂a² = (MSA − MSE)/n = (86 504.9394 − 6 781.8727)/11 = 7 247.5515 sec²

H0: σa² = 0 vs Ha: σa² > 0 can be tested using F = 12.76, p = 0.0001, where the p-value comes
from the F2;30 distribution.
In such an experiment there will be a lot of error variability in the data due to fluctuations of
room temperature and the difficulty of determining the exact time at which the ice cream has
melted completely. Hence variability in melting times of different flavours (σa2 ) is unlikely to
be of interest unless it is larger than the error variance:
H0: σa² ≤ σe² vs Ha: σa² > σe²,  equivalently  H0: σa²/σe² ≤ 1 vs Ha: σa²/σe² > 1.

Again use F = 12.76 but compare with (1 + 11 × 1) F2;30 = 12 F2;30 =⇒ there is no evidence
that variation between flavours is larger than the error variance.
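The variance components can also be estimated directly from a mixed-model fit; a sketch
(package lme4, and the data frame and column names, are assumptions):

------------------------------------------
library(lme4)
# random intercept for flavour: estimates sigma_a^2 and sigma_e^2 by REML
m <- lmer(time ~ 1 + (1 | flavour), data = icecream)
VarCorr(m)   # for balanced data these match the ANOVA estimates above
------------------------------------------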
Nested designs are common in sampling designs, and less common for real experiments.
However, many of the principles of analysis, and ANOVA still apply for such carefully
designed studies. For example, the data are still balanced.
Nested designs that are more like real experiments occur in animal breeding studies, e.g.
where a bull is crossed with a number of cows, but the cows are nested in bull. Also in
microbiology, where you can have daughter clones from mother clones (bacteria or fungi).
City 1 City 2 City 3
S1 S2 S3 S1 S2 S3 S1 S2 S3
H1 8876 9141 9785 9483 10049 9975 9990 10023 10451
H2 8745 9827 9585 8461 9720 11230 9218 10777 11094
H3 8601 10420 9009 8106 12080 10877 9472 11839 10287
To be precise we should really label the streets as a factor with 9 levels, S1, S2, . . . , S9,
since there are 9 different streets. The same remarks apply to the houses. We say that the
streets are nested in the cities, since to locate a street we must also state the city;
likewise, the houses are nested in the streets. We denote nested factors as S(C). The effects
associated with the nested factor will have 2 subscripts, bij, where i denotes the level of C
and j the level of S within city i.
Clearly we need another model, since the factor S (street) is nested in the cities, and H
(house) is nested in the streets. Also, since the streets and houses within each city were
sampled, if the study were repeated the same houses and streets would not be selected again
(assuming there is a large number of streets and houses in each city). So both S and H are
random factors. If Yijk is the amount of electricity consumed by the kth household in the jth
street in the ith city, the model is

Yijk = µ + αi + bij + eijk,   i = 1, 2, 3;  j = 1, 2, 3;  k = 1, 2, 3

where

αi is the fixed effect of the ith city,
bij is the random effect of the jth street in the ith city, with E(bij) = 0 and
Var(bij) = σb², and
eijk is the random effect of the kth house in the jth street in the ith city.

The objectives of the study are:
1. To estimate the mean consumption in each city, and to compare the mean consumptions
2. To estimate the variance component due to streets and to households within streets.
We assume that these components are constant over the three cities.
Ȳ··· estimates µ
(Ȳi·· − Ȳ···) estimates αi,  i = 1, . . . , a
(Ȳij· − Ȳi·· ) measures the contribution from the j th street in city i
(Ȳijk − Ȳij· ) measures the contribution from the k th house in the j th street in the ith city
We can construct expressions for the ANOVA table from the identity

Yijk − Ȳ··· = (Ȳi·· − Ȳ···) + (Ȳij· − Ȳi··) + (Yijk − Ȳij·)

Squaring and summing over ijk, the cross products vanish on summation and we find the sums of
squares are

Σijk (Yijk − Ȳ···)² = nb Σi (Ȳi·· − Ȳ···)² + n Σij (Ȳij· − Ȳi··)² + Σijk (Yijk − Ȳij·)²
We denote these sums of squares as SStotal = SSC + SSS(C) + SSE and they have
abn − 1 = (a − 1) + a(b − 1) + ab(n − 1) degrees of freedom.
Here we assume that there are a cities, b streets are sampled in each city and n houses in each
street. Note that the last term SSE should strictly speaking be written as SSH(S(C)) .
However, it is exactly the same expression as would be evaluated for an error sum of squares,
so it is called SSE. We give the ANOVA table and state the values for the expected mean
squares. A complete derivation is given in Scheffé (1959).
Source              SS        df          MS                  EMS
Cities (Fixed)      SSC       a − 1       SSC/(a − 1)         σe² + nσb² + bn Σi αi²/(a − 1)
Streets (Random)    SSS(C)    a(b − 1)    SSS(C)/(a(b − 1))   σe² + nσb²
Houses (Random)     SSE       ab(n − 1)   SSE/(ab(n − 1))     σe²

F = MSC/MSS(C) ∼ Fa−1, a(b−1)
σb² is estimated by (MSS(C) − MSe)/n, and σe² is estimated by MSe. These are method of moments
estimators. The maximum likelihood estimates can also be found; see Graybill.
No special program is needed for a balanced nested design. Any program that will calculate a
factorial ANOVA can be used.
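For instance, a sketch in R (the data frame name elec and the column names are assumptions):

------------------------------------------
# test the fixed city effect against street-within-city variation
m <- aov(consumption ~ city + Error(city:street), data = elec)
summary(m)
------------------------------------------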
For our example we calculate the ANOVA table as though it were a three-way cross
classification with factors C, S and H. The ANOVA table is:
Source SS df
Cities 488.4 2
S Streets 1090.6 2
H Houses 49.1 2
CS 142.6 4
CH 32.3 4
SH 592.6 4
CSH 203.3 8
Source SS df MS F
Cities 488.4 2 244.20 1.19
Streets 1233.3 6 205.55 4.22
Houses 877.3 18 48.7
M SC 244.2
= = 1.19
M SS(C) 205.5
This is distributed as F2,6 and is not significant. Conclude that there is no difference in
mean consumption of electricity between the three cities.
MSS(C)/MSe = 205.5/48.7 = 4.22 ∼ F6,18
Reject H0 .
σ̂s² = (MSS(C) − MSe)/n. Then

σ̂s² = (205.55 − 48.7)/3 = 52.28
and
σ̂e2 = 48.7
We note that the variation attributed to streets is about the same size as the variation
attributed to houses. Since there is no significant difference in mean consumption between
cities, we estimate the mean consumption as
Ȳ··· = 9896.85
For further reading on these designs see Dunn and Clark (1974), Applied Statistics: Analysis
of Variance and Regression.
A1 A2 A3
B1 B2 B1 B2 B1 B2
- - - - - -
- - - - - -
- - - Y222 - -
- - - - - -
Ȳ11· Ȳ12· Ȳ21· Ȳ22· Ȳ31· Ȳ32·
Ȳ1·· Ȳ2·· Ȳ3··
Ȳ···
Yijk = kth observation at the jth level of B nested in the ith level of A

Yijk = µ + αi + bij + eijk,   Σi αi = 0,   bij ∼ N(0, σB²)
Factor B can also be regarded as fixed. Instead of estimating an overall variance for factor B,
the contribution of the levels of B within each level A is of interest. The sum of squares for B
(nested in A) is
SSB(A) = n Σi Σj (Ȳij· − Ȳi··)²
and has a(b-1) degrees of freedom. This can be split into “a” component sums of squares each
with (b-1) degrees of freedom
where SSB(Ai) = n Σj (Ȳij· − Ȳi··)²
So tests of the levels of B nested in level i of A can be made using MSB(Ai)/MSe.
1. Confidence intervals for linear combinations of the means can be found. The variance will
be estimated by MSB(A).
2. We have assumed that the levels of B are sampled from an infinite (or very large)
“population” of levels. If there are not a large number of possible values of the levels of
B, a correction factor is included in the EMS for A. For example, if b levels are drawn from
a possible K levels in the population, then E(MSA) = σe² + n(1 − b/K)σB² + nb Σi αi²/(a − 1).
8.4 Repeated Measures
A form of experimentation often used in medical and psychological studies is one in which a
number of subjects are measured on several occasions or undergo several different treatments.
The aim of the experiment is to compare treatments or occasions. It is hoped that by using
the same subjects on each treatment more sensitive comparisons can be made because
variation will be reduced. The treatments are regarded as a fixed effect and the subjects are
assumed to be sampled from a large population of subjects so we have again a mixed model.
More complex experimental set-ups than the one described here are used, for details see Winer
(1971). The general theory of balanced mixed models is given in Scheffé (1959). Other
methods for such data are given in Hand and Crowder (1996). Repeated measures data,
which is the term to describe the data from such experiments, can also be analysed by
methods of multivariate analysis. However, for relatively small numbers of subjects, the
ANOVA methods described here are useful.
Example
A psychologist is studying memory retention. She takes 10 subjects and each subject is asked
to learn 50 nonsense words. She then tests each subject 4 times: after 12, 36, 60 and 84 hours
after learning the words. On each occasion she scores the subject’s performance. The data
have the form:
Subjects
1 2 ... j ... 10
1 Y11 Y12 . .
. . . . .
Times
i . . Yij .
4 . . . Y4,10
Interest centres on comparing the recalls at each time. If the same subjects had not been
tested each time, the design would have been a completely randomised design. However, it is
not because of the repeated measurement on each subject.
Let
Yij = µ + αi + bj + eij
i = 1, . . ., a
j = 1, . . ., b
where
µ = general mean
αi = effect of the ith occasion or treatment
bj = effect of the j th subject
eij = random error
Assume
E(bj ) = 0 and V ar(bj ) = σb2
E(eij ) = 0 and V ar(eij ) = σe2
and that bj and eij are independent
From the formulation, we see that we have a 2-way design with one fixed effect (times) and
one random effect (subjects). Formally it appears to be the same as a randomised block
design with subjects forming the block, but there is one important difference: the times could
not have been randomly assigned in the blocks. Thus we have, in our example with 10
subjects, exactly 10 experimental units receiving 4 treatments each. Strictly speaking with a
randomised block design we would have had 40 experimental units, arranged in 10 blocks with
four homogenous units in each. The units within a block would have been assigned at random
to the treatments, giving 40 independent observations. The observations are independent both
within each block and between blocks. With repeated measurement data, the four
observations within a block all made on the same subject are possibly dependent, even if the
treatments can be randomly assigned. Of course the observations on different subjects are
independent. If the observations made on the same subject are correlated the data strictly
speaking should be handled as a multivariate normal vector with mean µ and covariance
matrix Σ. Tests of hypothesis about the mean vector of treatments can be made (Morrison).
However, if the covariance matrix Σ has a pattern such that σii + σjj − 2σij is constant for
all i, j (i.e. the variance of the difference between any two occasions is constant), then the
ANOVA approach used here is valid. It can also be shown that for a small number of subjects
the ANOVA test has greater power.
We proceed with the calculations in exactly the same way as for the Randomised Block
Design, and obtain the same ANOVA table (consult the earlier notes for the exact formulae
and calculations). The ANOVA table is:

Source                 df               MS        EMS
Occasions              a − 1            MSo       σe² + b Σi αi²/(a − 1)
Subjects               b − 1            MSsub     σe² + aσb²
Occasions × Subjects   (a − 1)(b − 1)   MSo×sub   σe²

From the Expected Mean Squares column, we see that the hypothesis H0: no differences
between occasions/treatments, i.e. α1 = α2 = · · · = αa = 0, can be tested using

F = MSo/MSo×sub ∼ F(a−1); (a−1)(b−1)
The test is identical to that of the treatment in the randomised block design.
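A sketch of this analysis in R (the data frame name memory and the column names score, time
and subject are assumptions; both time and subject must be factors):

------------------------------------------
# same arithmetic as the randomised block ANOVA: subject plays the block role
m <- aov(score ~ time + subject, data = memory)
summary(m)   # with one observation per cell, the residual is the
             # time:subject interaction, so the time F test uses it as error
------------------------------------------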
For this and all other more complex repeated measurement designs, see Winer Chapter 4. For
other methods of handling repeated measurement data see Hand and Crowder (1996).
References
1. Hand, D.J. and Crowder, M.J. (1996). Practical Longitudinal Data Analysis. Chapman and
Hall, Texts in Statistical Science.
2. Winer, B.J. (1971). Statistical Principles in Experimental Design. McGraw Hill. Gives a
detailed account of the ANOVA Approach to repeated measures.
Appendix A
Tables
Table A.1: Random digits

94624 84407 45836 58183 36101 73520 99122 19331 38822 52221
27773 48118 18689 35557 37737 93664 34373 44674 97248 95390
32246 87711 67008 75556 79643 91530 27001 93596 47409 90699
05911 10106 92494 59476 26405 58259 55267 74589 47411 97392
16206 28851 36012 73942 47459 96562 93845 95200 35009 61712
96095 88923 71388 27976 49319 35417 36915 01138 87799 76922
94970 87741 49306 38968 19504 40641 85510 12277 79581 32057
73650 91850 79223 38101 98276 52227 41620 76817 64831 00309
69946 75822 36055 57946 76155 02751 92400 78527 30124 67665
89400 61375 79686 36559 70057 99931 32555 85051 22440 35351
Table A.2: Upper 1% Points of Studentized Range q
p∗
n2 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 13.90 19.02 22.56 25.37 27.76 29.86 31.73 33.41 34.93 36.29 37.53 38.66 39.70 40.66 41.54 42.36 43.13 43.85 44.53
2 8.26 10.62 12.17 13.32 14.24 15.00 15.65 16.21 16.71 17.16 17.57 17.95 18.29 18.62 18.91 19.20 19.46 19.71 19.95
3 6.51 8.12 9.17 9.96 10.58 11.10 11.54 11.92 12.26 12.57 12.84 13.09 13.32 13.53 13.72 13.91 14.08 14.24 14.39
4 5.70 6.98 7.80 8.42 8.91 9.32 9.67 9.97 10.24 10.48 10.70 10.89 11.08 11.24 11.40 11.55 11.68 11.81 11.93
5 5.24 6.33 7.03 7.56 7.97 8.32 8.61 8.87 9.10 9.30 9.48 9.65 9.81 9.95 10.08 10.21 10.32 10.43 10.54
6 4.95 5.92 6.54 7.00 7.37 7.68 7.94 8.17 8.37 8.55 8.71 8.86 9.00 9.12 9.24 9.35 9.46 9.55 9.65
7 4.75 5.64 6.20 6.62 6.96 7.24 7.47 7.68 7.86 8.03 8.18 8.31 8.44 8.55 8.66 8.76 8.85 8.94 9.03
8 4.60 5.43 5.96 6.35 6.66 6.91 7.13 7.33 7.49 7.65 7.78 7.91 8.03 8.13 8.23 8.33 8.41 8.49 8.57
9 4.48 5.27 5.77 6.14 6.43 6.67 6.87 7.05 7.21 7.36 7.49 7.60 7.71 7.81 7.91 7.99 8.08 8.15 8.23
10 4.39 5.15 5.62 5.97 6.25 6.48 6.67 6.84 6.99 7.13 7.25 7.36 7.46 7.56 7.65 7.73 7.81 7.88 7.95
11 4.32 5.05 5.50 5.84 6.10 6.32 6.51 6.67 6.81 6.94 7.06 7.17 7.26 7.36 7.44 7.52 7.59 7.66 7.73
12 4.26 4.96 5.40 5.73 5.98 6.19 6.37 6.53 6.67 6.79 6.90 7.01 7.10 7.19 7.27 7.35 7.42 7.48 7.55
13 4.21 4.89 5.32 5.63 5.88 6.08 6.26 6.41 6.54 6.66 6.77 6.87 6.96 7.05 7.13 7.20 7.27 7.33 7.39
14 4.17 4.84 5.25 5.56 5.80 5.99 6.16 6.31 6.44 6.55 6.66 6.76 6.84 6.93 7.00 7.07 7.14 7.20 7.26
15 4.13 4.79 5.19 5.49 5.72 5.92 6.08 6.22 6.35 6.46 6.56 6.66 6.74 6.82 6.90 6.97 7.03 7.09 7.15
16 4.10 4.74 5.14 5.43 5.66 5.85 6.01 6.15 6.27 6.38 6.48 6.57 6.66 6.73 6.81 6.87 6.94 7.00 7.05
17 4.07 4.70 5.09 5.38 5.60 5.79 5.94 6.08 6.20 6.31 6.41 6.50 6.58 6.65 6.73 6.79 6.85 6.91 6.97
18 4.05 4.67 5.05 5.33 5.55 5.73 5.89 6.02 6.14 6.25 6.34 6.43 6.51 6.58 6.65 6.72 6.78 6.84 6.89
19 4.02 4.64 5.02 5.29 5.51 5.69 5.84 5.97 6.09 6.19 6.28 6.37 6.45 6.52 6.59 6.65 6.71 6.77 6.82
20 3.96 4.55 4.91 5.17 5.37 5.54 5.69 5.81 5.92 6.02 6.11 6.19 6.26 6.33 6.39 6.45 6.51 6.56 6.61
24 3.89 4.45 4.80 5.05 5.24 5.40 5.54 5.65 5.76 5.85 5.93 6.01 6.08 6.14 6.20 6.26 6.31 6.36 6.41
30 3.82 4.37 4.70 4.93 5.11 5.26 5.39 5.50 5.60 5.69 5.76 5.83 5.90 5.96 6.02 6.07 6.12 6.16 6.21
40 3.76 4.28 4.59 4.82 4.99 5.13 5.25 5.36 5.45 5.53 5.60 5.67 5.73 5.78 5.84 5.89 5.93 5.97 6.01
60 3.70 4.20 4.50 4.71 4.87 5.01 5.12 5.21 5.30 5.37 5.44 5.50 5.56 5.61 5.66 5.71 5.75 5.79 5.83
∞ 3.64 4.12 4.40 4.60 4.76 4.88 4.99 5.08 5.16 5.23 5.29 5.35 5.40 5.45 5.50 5.54 5.58 5.61 5.65
∗p is the number of quantities (e.g. means) whose range is involved; n2 is the degrees of freedom in the error estimate.
produced using R: R Development Core Team (2008). R: A language and environment for statistical computing.
R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
Table A.3: Upper 5% Points of Studentized Range q
p∗
n2 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
2 6.08 8.33 9.80 10.88 11.73 12.43 13.03 13.54 13.99 14.40 14.76 15.09 15.39 15.67 15.92 16.16 16.38 16.59 16.78
3 4.50 5.91 6.82 7.50 8.04 8.48 8.85 9.18 9.46 9.72 9.95 10.15 10.35 10.52 10.69 10.84 10.98 11.11 11.24
4 3.93 5.04 5.76 6.29 6.71 7.05 7.35 7.60 7.83 8.03 8.21 8.37 8.52 8.66 8.79 8.91 9.03 9.13 9.23
5 3.64 4.60 5.22 5.67 6.03 6.33 6.58 6.80 6.99 7.17 7.32 7.47 7.60 7.72 7.83 7.93 8.03 8.12 8.21
6 3.46 4.34 4.90 5.30 5.63 5.90 6.12 6.32 6.49 6.65 6.79 6.92 7.03 7.14 7.24 7.34 7.43 7.51 7.59
7 3.34 4.16 4.68 5.06 5.36 5.61 5.82 6.00 6.16 6.30 6.43 6.55 6.66 6.76 6.85 6.94 7.02 7.10 7.17
8 3.26 4.04 4.53 4.89 5.17 5.40 5.60 5.77 5.92 6.05 6.18 6.29 6.39 6.48 6.57 6.65 6.73 6.80 6.87
9 3.20 3.95 4.41 4.76 5.02 5.24 5.43 5.59 5.74 5.87 5.98 6.09 6.19 6.28 6.36 6.44 6.51 6.58 6.64
10 3.15 3.88 4.33 4.65 4.91 5.12 5.30 5.46 5.60 5.72 5.83 5.93 6.03 6.11 6.19 6.27 6.34 6.40 6.47
11 3.11 3.82 4.26 4.57 4.82 5.03 5.20 5.35 5.49 5.61 5.71 5.81 5.90 5.98 6.06 6.13 6.20 6.27 6.33
12 3.08 3.77 4.20 4.51 4.75 4.95 5.12 5.27 5.39 5.51 5.61 5.71 5.80 5.88 5.95 6.02 6.09 6.15 6.21
13 3.06 3.73 4.15 4.45 4.69 4.88 5.05 5.19 5.32 5.43 5.53 5.63 5.71 5.79 5.86 5.93 5.99 6.05 6.11
14 3.03 3.70 4.11 4.41 4.64 4.83 4.99 5.13 5.25 5.36 5.46 5.55 5.64 5.71 5.79 5.85 5.91 5.97 6.03
15 3.01 3.67 4.08 4.37 4.59 4.78 4.94 5.08 5.20 5.31 5.40 5.49 5.57 5.65 5.72 5.78 5.85 5.90 5.96
16 3.00 3.65 4.05 4.33 4.56 4.74 4.90 5.03 5.15 5.26 5.35 5.44 5.52 5.59 5.66 5.73 5.79 5.84 5.90
17 2.98 3.63 4.02 4.30 4.52 4.70 4.86 4.99 5.11 5.21 5.31 5.39 5.47 5.54 5.61 5.67 5.73 5.79 5.84
18 2.97 3.61 4.00 4.28 4.49 4.67 4.82 4.96 5.07 5.17 5.27 5.35 5.43 5.50 5.57 5.63 5.69 5.74 5.79
19 2.96 3.59 3.98 4.25 4.47 4.65 4.79 4.92 5.04 5.14 5.23 5.31 5.39 5.46 5.53 5.59 5.65 5.70 5.75
20 2.95 3.58 3.96 4.23 4.45 4.62 4.77 4.90 5.01 5.11 5.20 5.28 5.36 5.43 5.49 5.55 5.61 5.66 5.71
24 2.92 3.53 3.90 4.17 4.37 4.54 4.68 4.81 4.92 5.01 5.10 5.18 5.25 5.32 5.38 5.44 5.49 5.55 5.59
30 2.89 3.49 3.85 4.10 4.30 4.46 4.60 4.72 4.82 4.92 5.00 5.08 5.15 5.21 5.27 5.33 5.38 5.43 5.47
40 2.86 3.44 3.79 4.04 4.23 4.39 4.52 4.63 4.73 4.82 4.90 4.98 5.04 5.11 5.16 5.22 5.27 5.31 5.36
60 2.83 3.40 3.74 3.98 4.16 4.31 4.44 4.55 4.65 4.73 4.81 4.88 4.94 5.00 5.06 5.11 5.15 5.20 5.24
120 2.80 3.36 3.68 3.92 4.10 4.24 4.36 4.47 4.56 4.64 4.71 4.78 4.84 4.90 4.95 5.00 5.04 5.09 5.13
∞ 2.77 3.31 3.63 3.86 4.03 4.17 4.29 4.39 4.48 4.55 4.62 4.69 4.74 4.80 4.85 4.89 4.94 4.98 5.01
∗p is the number of quantities (e.g. means) whose range is involved; n2 is the degrees of freedom in the error estimate.
produced using R: R Development Core Team (2008). R: A language and environment for statistical computing.
R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
Table A.4: t-Distribution: One sided critical values, i.e. the value of tPn such that
P = P r[tn > tPn ], where n is the degrees of freedom
Probability Level P
n 0.4 0.3 0.2 0.1 0.05 0.025 0.01 0.005 0.0025 0.001 0.0005
1 0.325 0.727 1.376 3.078 6.314 12.706 31.821 63.657 127.321 318.309 636.619
2 0.289 0.617 1.061 1.886 2.920 4.303 6.965 9.925 14.089 22.327 31.599
3 0.277 0.584 0.978 1.638 2.353 3.182 4.541 5.841 7.453 10.215 12.924
4 0.271 0.569 0.941 1.533 2.132 2.776 3.747 4.604 5.598 7.173 8.610
5 0.267 0.559 0.920 1.476 2.015 2.571 3.365 4.032 4.773 5.893 6.869
6 0.265 0.553 0.906 1.440 1.943 2.447 3.143 3.707 4.317 5.208 5.959
7 0.263 0.549 0.896 1.415 1.895 2.365 2.998 3.499 4.029 4.785 5.408
8 0.262 0.546 0.889 1.397 1.860 2.306 2.896 3.355 3.833 4.501 5.041
9 0.261 0.543 0.883 1.383 1.833 2.262 2.821 3.250 3.690 4.297 4.781
10 0.260 0.542 0.879 1.372 1.812 2.228 2.764 3.169 3.581 4.144 4.587
11 0.260 0.540 0.876 1.363 1.796 2.201 2.718 3.106 3.497 4.025 4.437
12 0.259 0.539 0.873 1.356 1.782 2.179 2.681 3.055 3.428 3.930 4.318
13 0.259 0.538 0.870 1.350 1.771 2.160 2.650 3.012 3.372 3.852 4.221
14 0.258 0.537 0.868 1.345 1.761 2.145 2.624 2.977 3.326 3.787 4.140
15 0.258 0.536 0.866 1.341 1.753 2.131 2.602 2.947 3.286 3.733 4.073
16 0.258 0.535 0.865 1.337 1.746 2.120 2.583 2.921 3.252 3.686 4.015
17 0.257 0.534 0.863 1.333 1.740 2.110 2.567 2.898 3.222 3.646 3.965
18 0.257 0.534 0.862 1.330 1.734 2.101 2.552 2.878 3.197 3.610 3.922
19 0.257 0.533 0.861 1.328 1.729 2.093 2.539 2.861 3.174 3.579 3.883
20 0.257 0.533 0.860 1.325 1.725 2.086 2.528 2.845 3.153 3.552 3.850
21 0.257 0.532 0.859 1.323 1.721 2.080 2.518 2.831 3.135 3.527 3.819
22 0.256 0.532 0.858 1.321 1.717 2.074 2.508 2.819 3.119 3.505 3.792
23 0.256 0.532 0.858 1.319 1.714 2.069 2.500 2.807 3.104 3.485 3.768
24 0.256 0.531 0.857 1.318 1.711 2.064 2.492 2.797 3.091 3.467 3.745
25 0.256 0.531 0.856 1.316 1.708 2.060 2.485 2.787 3.078 3.450 3.725
26 0.256 0.531 0.856 1.315 1.706 2.056 2.479 2.779 3.067 3.435 3.707
27 0.256 0.531 0.855 1.314 1.703 2.052 2.473 2.771 3.057 3.421 3.690
28 0.256 0.530 0.855 1.313 1.701 2.048 2.467 2.763 3.047 3.408 3.674
29 0.256 0.530 0.854 1.311 1.699 2.045 2.462 2.756 3.038 3.396 3.659
30 0.256 0.530 0.854 1.310 1.697 2.042 2.457 2.750 3.030 3.385 3.646
31 0.256 0.530 0.853 1.309 1.696 2.040 2.453 2.744 3.022 3.375 3.633
32 0.255 0.530 0.853 1.309 1.694 2.037 2.449 2.738 3.015 3.365 3.622
33 0.255 0.530 0.853 1.308 1.692 2.035 2.445 2.733 3.008 3.356 3.611
34 0.255 0.529 0.852 1.307 1.691 2.032 2.441 2.728 3.002 3.348 3.601
35 0.255 0.529 0.852 1.306 1.690 2.030 2.438 2.724 2.996 3.340 3.591
36 0.255 0.529 0.852 1.306 1.688 2.028 2.434 2.719 2.990 3.333 3.582
37 0.255 0.529 0.851 1.305 1.687 2.026 2.431 2.715 2.985 3.326 3.574
38 0.255 0.529 0.851 1.304 1.686 2.024 2.429 2.712 2.980 3.319 3.566
39 0.255 0.529 0.851 1.304 1.685 2.023 2.426 2.708 2.976 3.313 3.558
40 0.255 0.529 0.851 1.303 1.684 2.021 2.423 2.704 2.971 3.307 3.551
45 0.255 0.528 0.850 1.301 1.679 2.014 2.412 2.690 2.952 3.281 3.520
50 0.255 0.528 0.849 1.299 1.676 2.009 2.403 2.678 2.937 3.261 3.496
60 0.254 0.527 0.848 1.296 1.671 2.000 2.390 2.660 2.915 3.232 3.460
70 0.254 0.527 0.847 1.294 1.667 1.994 2.381 2.648 2.899 3.211 3.435
80 0.254 0.526 0.846 1.292 1.664 1.990 2.374 2.639 2.887 3.195 3.416
90 0.254 0.526 0.846 1.291 1.662 1.987 2.368 2.632 2.878 3.183 3.402
100 0.254 0.526 0.845 1.290 1.660 1.984 2.364 2.626 2.871 3.174 3.390
110 0.254 0.526 0.845 1.289 1.659 1.982 2.361 2.621 2.865 3.166 3.381
120 0.254 0.526 0.845 1.289 1.658 1.980 2.358 2.617 2.860 3.160 3.373
140 0.254 0.526 0.844 1.288 1.656 1.977 2.353 2.611 2.852 3.149 3.361
160 0.254 0.525 0.844 1.287 1.654 1.975 2.350 2.607 2.846 3.142 3.352
180 0.254 0.525 0.844 1.286 1.653 1.973 2.347 2.603 2.842 3.136 3.345
200 0.254 0.525 0.843 1.286 1.653 1.972 2.345 2.601 2.839 3.131 3.340
∞ 0.253 0.524 0.842 1.282 1.645 1.960 2.326 2.576 2.807 3.090 3.291
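These one-sided critical values are upper-tail t quantiles, so they follow from base R's qt; a minimal check against the n = 10 row of the table:

  # t_n^P with P = 0.05 and n = 10, i.e. P(t_10 > 1.812) = 0.05
  qt(0.05, df = 10, lower.tail = FALSE)           # 1.812
  # The full n = 10 row
  P <- c(0.4, 0.3, 0.2, 0.1, 0.05, 0.025, 0.01, 0.005, 0.0025, 0.001, 0.0005)
  round(qt(P, df = 10, lower.tail = FALSE), 3)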
Table A.5: F distribution: $F_{\alpha,\nu_1,\nu_2}$ for $P(F \geq F_{\alpha,\nu_1,\nu_2}) = \alpha$
α = 0.01
ν1 1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 ∞
ν2
1 4052 4999.5 5403 5624 5763 5858 5928 5981 6022 6055 6106 6157 6208 6234 6260 6286 6313 6339 6365
2 98.5 99 99.17 99.25 99.3 99.33 99.36 99.37 99.39 99.4 99.42 99.43 99.45 99.46 99.47 99.47 99.48 99.49 99.5
3 34.12 30.82 29.46 28.71 28.24 27.91 27.67 27.49 27.35 27.23 27.05 26.87 26.69 26.6 26.5 26.41 26.32 26.22 26.13
4 21.2 18 16.69 15.98 15.52 15.21 14.98 14.8 14.66 14.55 14.37 14.2 14.02 13.93 13.84 13.75 13.65 13.56 13.46
5 16.26 13.27 12.06 11.39 10.97 10.67 10.46 10.29 10.16 10.05 9.89 9.72 9.55 9.47 9.38 9.29 9.2 9.11 9.02
6 13.75 10.92 9.78 9.15 8.75 8.47 8.26 8.1 7.98 7.87 7.72 7.56 7.4 7.31 7.23 7.14 7.06 6.97 6.88
7 12.25 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 6.62 6.47 6.31 6.16 6.07 5.99 5.91 5.82 5.74 5.65
8 11.26 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 5.81 5.67 5.52 5.36 5.28 5.2 5.12 5.03 4.95 4.86
9 10.56 8.02 6.99 6.42 6.06 5.80 5.61 5.47 5.35 5.26 5.11 4.96 4.81 4.73 4.65 4.57 4.48 4.40 4.31
10 10.04 7.56 6.55 5.99 5.64 5.39 5.2 5.06 4.94 4.85 4.71 4.56 4.41 4.33 4.25 4.17 4.08 4 3.91
11 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63 4.54 4.4 4.25 4.1 4.02 3.94 3.86 3.78 3.69 3.6
12 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.5 4.39 4.3 4.16 4.01 3.86 3.78 3.7 3.62 3.54 3.45 3.36
13 9.07 6.7 5.74 5.21 4.86 4.62 4.44 4.30 4.19 4.10 3.96 3.82 3.66 3.59 3.51 3.43 3.34 3.25 3.17
14 8.86 6.51 5.56 5.04 4.69 4.46 4.28 4.14 4.03 3.94 3.8 3.66 3.51 3.43 3.35 3.27 3.18 3.09 3
15 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.89 3.80 3.67 3.52 3.37 3.29 3.21 3.13 3.05 2.96 2.87
16 8.53 6.23 5.29 4.77 4.44 4.20 4.03 3.89 3.78 3.69 3.55 3.41 3.26 3.18 3.10 3.02 2.93 2.84 2.75
17 8.40 6.11 5.18 4.67 4.34 4.10 3.93 3.79 3.68 3.59 3.46 3.31 3.16 3.08 3.00 2.92 2.83 2.75 2.65
18 8.29 6.01 5.09 4.58 4.25 4.01 3.84 3.71 3.60 3.51 3.37 3.23 3.08 3.00 2.92 2.84 2.75 2.66 2.57
19 8.18 5.93 5.01 4.5 4.17 3.94 3.77 3.63 3.52 3.43 3.30 3.15 3.00 2.92 2.84 2.76 2.67 2.58 2.49
20 8.1 5.85 4.94 4.43 4.10 3.87 3.70 3.56 3.46 3.37 3.23 3.09 2.94 2.86 2.78 2.69 2.61 2.52 2.42
21 8.02 5.78 4.87 4.37 4.04 3.81 3.64 3.51 3.40 3.31 3.17 3.03 2.88 2.80 2.72 2.64 2.55 2.46 2.36
22 7.95 5.72 4.82 4.31 3.99 3.76 3.59 3.45 3.35 3.26 3.12 2.98 2.83 2.75 2.67 2.58 2.50 2.4 2.31
23 7.88 5.66 4.76 4.26 3.94 3.71 3.54 3.41 3.30 3.21 3.07 2.93 2.78 2.70 2.62 2.54 2.45 2.35 2.26
24 7.82 5.61 4.72 4.22 3.90 3.67 3.50 3.36 3.26 3.17 3.03 2.89 2.74 2.66 2.58 2.49 2.40 2.31 2.21
25 7.77 5.57 4.68 4.18 3.85 3.63 3.46 3.32 3.22 3.13 2.99 2.85 2.7 2.62 2.54 2.45 2.36 2.27 2.17
26 7.72 5.53 4.64 4.14 3.82 3.59 3.42 3.29 3.18 3.09 2.96 2.81 2.66 2.58 2.5 2.42 2.33 2.23 2.13
27 7.68 5.49 4.60 4.11 3.78 3.56 3.39 3.26 3.15 3.06 2.93 2.78 2.63 2.55 2.47 2.38 2.29 2.2 2.1
28 7.64 5.45 4.57 4.07 3.75 3.53 3.36 3.23 3.12 3.03 2.9 2.75 2.60 2.52 2.44 2.35 2.26 2.17 2.06
29 7.60 5.42 4.54 4.04 3.73 3.50 3.33 3.20 3.09 3.00 2.87 2.73 2.57 2.49 2.41 2.33 2.23 2.14 2.03
30 7.56 5.39 4.51 4.02 3.70 3.47 3.30 3.17 3.07 2.98 2.84 2.70 2.55 2.47 2.39 2.30 2.21 2.11 2.01
40 7.31 5.18 4.31 3.83 3.51 3.29 3.12 2.99 2.89 2.80 2.66 2.52 2.37 2.29 2.20 2.11 2.02 1.92 1.8
60 7.08 4.98 4.13 3.65 3.34 3.12 2.95 2.82 2.72 2.63 2.50 2.35 2.20 2.12 2.03 1.94 1.84 1.73 1.6
120 6.85 4.79 3.95 3.48 3.17 2.96 2.79 2.66 2.56 2.47 2.34 2.19 2.03 1.95 1.86 1.76 1.66 1.53 1.38
500 6.69 4.65 3.82 3.36 3.05 2.84 2.68 2.55 2.44 2.36 2.22 2.07 1.92 1.83 1.74 1.63 1.52 1.38 1.16
∞ 6.63 4.61 3.78 3.32 3.02 2.8 2.64 2.51 2.41 2.32 2.18 2.04 1.88 1.79 1.7 1.59 1.47 1.32 1.00
Table A.6: F distribution: $F_{\alpha,\nu_1,\nu_2}$ for $P(F \geq F_{\alpha,\nu_1,\nu_2}) = \alpha$
α = 0.05
ν1 1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 ∞
ν2
1 161.40 199.50 215.70 224.60 230.20 234.00 236.80 238.90 240.50 241.90 243.90 245.90 248.00 249.10 250.10 251.10 252.20 253.30 254.30
2 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.38 19.4 19.41 19.43 19.45 19.45 19.46 19.47 19.48 19.49 19.50
3 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.74 8.70 8.66 8.64 8.62 8.59 8.57 8.55 8.53
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5.91 5.86 5.80 5.77 5.75 5.72 5.69 5.66 5.63
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.68 4.62 4.56 4.53 4.50 4.46 4.43 4.40 4.36
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 4.00 3.94 3.87 3.84 3.81 3.77 3.74 3.70 3.67
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.57 3.51 3.44 3.41 3.38 3.34 3.30 3.27 3.23
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 3.28 3.22 3.15 3.12 3.08 3.04 3.01 2.97 2.93
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.07 3.01 2.94 2.90 2.86 2.83 2.79 2.75 2.71
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.91 2.85 2.77 2.74 2.70 2.66 2.62 2.58 2.54
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 2.79 2.72 2.65 2.61 2.57 2.53 2.49 2.45 2.4
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.69 2.62 2.54 2.51 2.47 2.43 2.38 2.34 2.3
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.60 2.53 2.46 2.42 2.38 2.34 2.30 2.25 2.21
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 2.53 2.46 2.39 2.35 2.31 2.27 2.22 2.18 2.13
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.48 2.40 2.33 2.29 2.25 2.20 2.16 2.11 2.07
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 2.42 2.35 2.28 2.24 2.19 2.15 2.11 2.06 2.01
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45 2.38 2.31 2.23 2.19 2.15 2.10 2.06 2.01 1.96
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 2.34 2.27 2.19 2.15 2.11 2.06 2.02 1.97 1.92
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38 2.31 2.23 2.16 2.11 2.07 2.03 1.98 1.93 1.88
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35 2.28 2.20 2.12 2.08 2.04 1.99 1.95 1.90 1.84
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32 2.25 2.18 2.10 2.05 2.01 1.96 1.92 1.87 1.81
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30 2.23 2.15 2.07 2.03 1.98 1.94 1.89 1.84 1.78
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27 2.20 2.13 2.05 2.01 1.96 1.91 1.86 1.81 1.76
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25 2.18 2.11 2.03 1.98 1.94 1.89 1.84 1.79 1.73
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24 2.16 2.09 2.01 1.96 1.92 1.87 1.82 1.77 1.71
26 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22 2.15 2.07 1.99 1.95 1.90 1.85 1.80 1.75 1.69
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20 2.13 2.06 1.97 1.93 1.88 1.84 1.79 1.73 1.67
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19 2.12 2.04 1.96 1.91 1.87 1.82 1.77 1.71 1.65
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18 2.10 2.03 1.94 1.90 1.85 1.81 1.75 1.7 1.64
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.09 2.01 1.93 1.89 1.84 1.79 1.74 1.68 1.62
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 2.00 1.92 1.84 1.79 1.74 1.69 1.64 1.58 1.51
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99 1.92 1.84 1.75 1.70 1.65 1.59 1.53 1.47 1.39
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91 1.83 1.75 1.66 1.61 1.55 1.50 1.43 1.35 1.25
500 3.86 3.01 2.62 2.39 2.23 2.12 2.03 1.96 1.90 1.85 1.77 1.69 1.59 1.54 1.48 1.42 1.35 1.26 1.11
∞ 3.84 3.00 2.60 2.37 2.21 2.10 2.01 1.94 1.88 1.83 1.75 1.67 1.57 1.52 1.46 1.39 1.32 1.22 1.00
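Both F tables are upper-tail quantiles of the F distribution and follow from base R's qf; a minimal check against the $\nu_1 = 5$, $\nu_2 = 10$ entries:

  # Table A.5 (alpha = 0.01): F_{0.01, 5, 10}
  qf(0.01, df1 = 5, df2 = 10, lower.tail = FALSE)   # 5.636, tabulated as 5.64
  # Table A.6 (alpha = 0.05): F_{0.05, 5, 10}
  qf(0.05, df1 = 5, df2 = 10, lower.tail = FALSE)   # 3.326, tabulated as 3.33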
Table A.7: Bonferroni t statistic $t_\nu^{\alpha/(2m)}$
α = 0.01
m 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20
ν
5 4.77 5.25 5.60 5.89 6.14 6.35 6.54 6.71 6.87 7.01 7.15 7.27 7.39 7.50 7.98
6 4.32 4.70 4.98 5.21 5.40 5.56 5.71 5.84 5.96 6.07 6.17 6.26 6.35 6.43 6.79
7 4.03 4.36 4.59 4.79 4.94 5.08 5.20 5.31 5.41 5.50 5.58 5.66 5.73 5.80 6.08
8 3.83 4.12 4.33 4.50 4.64 4.76 4.86 4.96 5.04 5.12 5.19 5.25 5.32 5.37 5.62
9 3.69 3.95 4.15 4.30 4.42 4.53 4.62 4.71 4.78 4.85 4.91 4.97 5.02 5.08 5.29
10 3.58 3.83 4.00 4.14 4.26 4.36 4.44 4.52 4.59 4.65 4.71 4.76 4.81 4.85 5.05
11 3.50 3.73 3.89 4.02 4.13 4.22 4.30 4.37 4.44 4.49 4.55 4.60 4.64 4.68 4.86
12 3.43 3.65 3.81 3.93 4.03 4.12 4.19 4.26 4.32 4.37 4.42 4.47 4.51 4.55 4.72
13 3.37 3.58 3.73 3.85 3.95 4.03 4.10 4.16 4.22 4.27 4.32 4.36 4.40 4.44 4.60
14 3.33 3.53 3.67 3.79 3.88 3.96 4.03 4.09 4.14 4.19 4.23 4.28 4.31 4.35 4.50
15 3.29 3.48 3.62 3.73 3.82 3.90 3.96 4.02 4.07 4.12 4.16 4.20 4.24 4.27 4.42
16 3.25 3.44 3.58 3.69 3.77 3.85 3.91 3.96 4.01 4.06 4.10 4.14 4.18 4.21 4.35
17 3.22 3.41 3.54 3.65 3.73 3.80 3.86 3.92 3.97 4.01 4.05 4.09 4.12 4.15 4.29
18 3.20 3.38 3.51 3.61 3.69 3.76 3.82 3.87 3.92 3.96 4.00 4.04 4.07 4.10 4.23
19 3.17 3.35 3.48 3.58 3.66 3.73 3.79 3.84 3.88 3.93 3.96 4.00 4.03 4.06 4.19
20 3.15 3.33 3.46 3.55 3.63 3.70 3.75 3.80 3.85 3.89 3.93 3.96 3.99 4.02 4.15
25 3.08 3.24 3.36 3.45 3.52 3.58 3.64 3.68 3.73 3.76 3.80 3.83 3.86 3.88 4.00
30 3.03 3.19 3.30 3.39 3.45 3.51 3.56 3.61 3.65 3.68 3.71 3.74 3.77 3.80 3.90
40 2.97 3.12 3.23 3.31 3.37 3.43 3.47 3.51 3.55 3.58 3.61 3.64 3.67 3.69 3.79
60 2.91 3.06 3.16 3.23 3.29 3.34 3.39 3.43 3.46 3.49 3.52 3.54 3.57 3.59 3.68
120 2.86 3.00 3.09 3.16 3.22 3.26 3.31 3.34 3.37 3.40 3.43 3.45 3.47 3.49 3.58
∞ 2.81 2.94 3.02 3.09 3.14 3.19 3.23 3.26 3.29 3.32 3.34 3.36 3.38 3.40 3.48
m = number of comparisons
Table A.8: Bonferroni t statistic $t_\nu^{\alpha/(2m)}$
α = 0.05
m 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20
ν
5 3.16 3.53 3.81 4.03 4.22 4.38 4.53 4.66 4.77 4.88 4.98 5.08 5.16 5.25 5.60
6 2.97 3.29 3.52 3.71 3.86 4.00 4.12 4.22 4.32 4.40 4.49 4.56 4.63 4.70 4.98
7 2.84 3.13 3.34 3.50 3.64 3.75 3.86 3.95 4.03 4.10 4.17 4.24 4.30 4.36 4.59
8 2.75 3.02 3.21 3.36 3.48 3.58 3.68 3.76 3.83 3.90 3.96 4.02 4.07 4.12 4.33
9 2.69 2.93 3.11 3.25 3.36 3.46 3.55 3.62 3.69 3.75 3.81 3.86 3.91 3.95 4.15
10 2.63 2.87 3.04 3.17 3.28 3.37 3.45 3.52 3.58 3.64 3.69 3.74 3.79 3.83 4.00
11 2.59 2.82 2.98 3.11 3.21 3.29 3.37 3.44 3.50 3.55 3.60 3.65 3.69 3.73 3.89
12 2.56 2.78 2.93 3.05 3.15 3.24 3.31 3.37 3.43 3.48 3.53 3.57 3.61 3.65 3.81
13 2.53 2.75 2.90 3.01 3.11 3.19 3.26 3.32 3.37 3.42 3.47 3.51 3.55 3.58 3.73
14 2.51 2.72 2.86 2.98 3.07 3.15 3.21 3.27 3.33 3.37 3.42 3.46 3.49 3.53 3.67
15 2.49 2.69 2.84 2.95 3.04 3.11 3.18 3.23 3.29 3.33 3.37 3.41 3.45 3.48 3.62
16 2.47 2.67 2.81 2.92 3.01 3.08 3.15 3.20 3.25 3.30 3.34 3.38 3.41 3.44 3.58
17 2.46 2.65 2.79 2.90 2.98 3.06 3.12 3.17 3.22 3.27 3.31 3.34 3.38 3.41 3.54
18 2.45 2.64 2.77 2.88 2.96 3.03 3.09 3.15 3.20 3.24 3.28 3.32 3.35 3.38 3.51
19 2.43 2.63 2.76 2.86 2.94 3.01 3.07 3.13 3.17 3.22 3.25 3.29 3.32 3.35 3.48
20 2.42 2.61 2.74 2.85 2.93 3.00 3.06 3.11 3.15 3.20 3.23 3.27 3.30 3.33 3.46
25 2.38 2.57 2.69 2.79 2.86 2.93 2.99 3.03 3.08 3.12 3.15 3.19 3.22 3.24 3.36
30 2.36 2.54 2.66 2.75 2.82 2.89 2.94 2.99 3.03 3.07 3.10 3.13 3.16 3.19 3.30
40 2.33 2.50 2.62 2.70 2.78 2.84 2.89 2.93 2.97 3.01 3.04 3.07 3.10 3.12 3.23
60 2.30 2.46 2.58 2.66 2.73 2.79 2.83 2.88 2.91 2.95 2.98 3.01 3.03 3.06 3.16
120 2.27 2.43 2.54 2.62 2.68 2.74 2.78 2.82 2.86 2.89 2.92 2.95 2.97 3.00 3.09
∞ 2.24 2.39 2.50 2.58 2.64 2.69 2.73 2.77 2.81 2.84 2.87 2.89 2.91 2.94 3.02
m = number of comparisons
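The Bonferroni statistic is just an ordinary t critical value evaluated at the adjusted level $\alpha/(2m)$, so both tables also follow from qt; a minimal check for m = 5 comparisons and $\nu = 10$:

  # Table A.7 (alpha = 0.01): qt at 0.01/(2*5) = 0.001
  qt(0.01 / (2 * 5), df = 10, lower.tail = FALSE)   # 4.144, tabulated as 4.14
  # Table A.8 (alpha = 0.05): qt at 0.05/(2*5) = 0.005
  qt(0.05 / (2 * 5), df = 10, lower.tail = FALSE)   # 3.169, tabulated as 3.17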