Multivariate Statistical Methods
1 "
~
l ( (
¡.
i
1
i
j
,I
¡ .
1I
\
1;-'
Bryan F.J. Manly
Department of Mathematics and Statistics
University of Otago
New Zealand
Chapman and Hall
London and New York
First published in 1986 by Chapman and Hall Ltd, 11 New Fetter Lane, London EC4P 4EE
Published in the USA by Chapman and Hall, 29 West 35th Street, New York NY 10001
Reprinted 1988, 1989

© 1986 Bryan F.J. Manly

Printed in Great Britain by J.W. Arrowsmith Ltd, Bristol

Contents

Preface ix
...
3.8 Comparison of variation for several samples 39
3.9 Computational methods 41
References 41

4 Measuring and testing multivariate distances 42
4.1 Multivariate distances 42
4.2 Distances between individual observations 42
4.3 Distances between populations and samples 47
4.4 Distances based upon proportions 52
4.5 The Mantel test on distance matrices 53
4.6 Computational methods 57
4.7 Further reading 57
References 57

5 Principal component analysis 59
5.1 Definition of principal components 59
5.2 Procedure for a principal component analysis 61
5.3 Computational methods 71
5.4 Further reading 71
References 71

6 Factor analysis 72
6.1 The factor analysis model 72
6.2 Procedure for a factor analysis 74
6.3 Principal component factor analysis 76
6.4 Using a factor analysis program to do principal component analysis 78
6.5 Options in computer programs 83
6.6 The value of factor analysis 84
6.7 Computational methods 84
6.8 Further reading 85
References 85

7 Discriminant function analysis 86
7.1 The problem of separating groups 86
7.2 Discrimination using Mahalanobis distances 87
7.3 Canonical discriminant functions 87
7.4 Tests of significance 89
7.5 Assumptions 90
7.6 Allowing for prior probabilities of group membership 96
7.7 Stepwise discriminant function analysis 96
7.8 Jackknife classification of individuals 97
7.9 Assigning of ungrouped individuals to groups 97
7.10 Computational methods 98
7.11 Further reading 98
References 99

8 Cluster analysis 100
8.1 Uses of cluster analysis 100
8.2 Types of cluster analysis 100
8.3 Hierarchic methods 101
8.4 Problems of cluster analysis 104
8.5 Measures of distance 105
8.6 Principal component analysis with cluster analysis 106
8.7 Further reading 112
References 113

9 Canonical correlation analysis 114
9.1 Generalizing a multiple regression analysis 114
9.2 Procedure for a canonical correlation analysis 116
9.3 Tests of significance 117
9.4 Interpreting canonical variates 119
9.5 Computational methods 124
9.6 Further reading 125
References 125

10 Multidimensional scaling 126
10.1 Constructing a 'map' from a distance matrix 126
10.2 Procedure for multidimensional scaling 128
10.3 Further reading 140
References 140

11 Epilogue 142
11.1 The next step 142
11.2 Some general reminders 142
11.3 Graphical methods 143
11.4 Missing values 144
References 145

Appendix: example sets of data 146
Data Set 1: Prehistoric goblets from Thailand 147
Data Set 2: Canine groups from Asia 148
Data Set 3: Protein consumption in Europe 154
References 154

Author index 155
Subject index 157

Preface

The purpose of this book is to introduce multivariate statistical methods to non-mathematicians. It is not intended to be particularly comprehensive. Rather, the intention is to keep details to a minimum while still conveying a good idea of what can be done. In other words, it is a book to 'get you going' in a particular area of statistical methods.

It is assumed that readers have a working knowledge of elementary statistics, particularly tests of significance using the normal, t, chi-square and F distributions, analysis of variance, and linear regression. The material covered in a standard first-year university service course in statistics should be quite adequate, together with a reasonable facility with ordinary algebra. Also, understanding multivariate analysis requires some use of matrix algebra. However, the amount needed is quite small if one is prepared to accept certain details on faith. Anyone who masters the material in Chapter 2 will have the required basic minimum level of matrix competency.
... largely concerned with general aspects of handling multivariate data rather than with specific techniques. Chapter 1 introduces some examples that are used in subsequent chapters and briefly describes the six multivariate methods of analysis that this book is primarily concerned with. As mentioned above, Chapter 2 provides the minimum level of matrix competency required for understanding the remainder of the book. Chapter 3 is about tests of significance and is not crucial as far as understanding the following chapters is concerned. Chapter 4 is about measuring distances with multivariate data. At least the first four sections of this chapter should be read before Chapters 7, 8 and 10.

Chapters 5 to 10 cover what I consider to be the most important multivariate techniques of data analysis. Of these, Chapters 5 and 6 form a natural pair to be read together. However, Chapters 7 to 10 can be read singly and still (I hope) make sense.

Finally, in Chapter 11, I have attempted to sum up what has been covered and make some general comments on good practices with the analysis of multivariate data. The Appendix contains three example sets of data for readers to analyse by themselves.

I am indebted to many people for their comments on the various draft versions of this book. Earl Bardsley read early versions of several of the chapters. Anonymous reviewers read all or parts of the work. John Harraway read through the final version. Their comments have led to numerous improvements. However, I take all responsibility for any errors.

Mary-Jane Campbell cheerfully typed and retyped the manuscript as I made changes. I am most grateful to her.

B.F.J. Manly
Dunedin, November 1985

CHAPTER ONE

The material of multivariate analysis

1.1 Examples of multivariate data

The statistical methods that are described in elementary texts are mostly univariate methods because they are only concerned with analysing variation in a single random variable. This is even true of multiple regression because this technique involves trying to account for variation in one dependent variable. On the other hand, the whole point of a multivariate analysis is to consider several related random variables simultaneously, each one being considered equally important at the start of the analysis. The potential value of this more general approach is perhaps best seen by considering a few examples.

Example 1.1 Storm survival of sparrows

After a severe storm on 1 February 1898, a number of moribund sparrows were taken to the biological laboratory at Brown University, Rhode Island. Subsequently about half of the birds died and Hermon Bumpus saw this as an opportunity to study the effect of natural selection on the birds. He took eight morphological measurements on each bird and also weighed them. The results for five of the variables are shown in Table 1.1, for females only.

When Bumpus collected his data in 1898 his main interest was in the light that it would throw on Darwin's theory of natural selection. He concluded from studying the data that 'the birds which perished, perished not through accident, but because they were physically disqualified, and that the birds which survived, survived because they possessed certain physical characters.' To be specific, the survivors 'are shorter and weigh less ... have longer wing bones, longer legs, longer sternums and greater brain capacity' than the non-survivors.
Table 1.1 Body measurements of female sparrows (X1 = total length, X2 = alar extent, X3 = length of beak and head, X4 = length of humerus, X5 = length of keel of sternum; all in mm). Birds 1 to 21 survived, while the remainder died.

Bird   X1    X2    X3    X4    X5
  1   156   245   31.6  18.5  20.5
  2   154   240   30.4  17.9  19.6
  3   153   240   31.0  18.4  20.6
  4   153   236   30.9  17.7  20.2
  5   155   243   31.5  18.6  20.3
  6   163   247   32.0  19.0  20.9
  7   157   238   30.9  18.4  20.2
  8   155   239   32.8  18.6  21.2
  9   164   248   32.7  19.1  21.1
 10   158   238   31.0  18.8  22.0
 11   158   240   31.3  18.6  22.0
 12   160   244   31.1  18.6  20.5
 13   161   246   32.3  19.3  21.8
 14   157   245   32.0  19.1  20.0
 15   157   235   31.5  18.1  19.8
 16   156   237   30.9  18.0  20.3
 17   158   244   31.4  18.5  21.6
 18   153   238   30.5  18.2  20.9
 19   155   236   30.3  18.5  20.1
 20   163   246   32.5  18.6  21.9
 21   159   236   31.5  18.0  21.5
 22   155   240   31.4  18.0  20.7
 23   156   240   31.5  18.2  20.6
 24   160   242   32.6  18.8  21.7
 25   152   232   30.3  17.2  19.8
 26   160   250   31.7  18.8  22.5
 27   155   237   31.0  18.5  20.0
 28   157   245   32.2  19.5  21.4
 29   165   245   33.1  19.8  22.7
 30   153   231   30.1  17.3  19.8
 31   162   239   30.3  18.0  23.1
 32   162   243   31.6  18.8  21.3
 33   159   245   31.8  18.5  21.7
 34   159   247   30.9  18.1  19.0
 35   155   243   30.9  18.5  21.3
 36   162   252   31.9  19.1  22.2
 37   152   230   30.4  17.3  18.6
 38   159   242   30.8  18.2  20.5
 39   155   238   31.2  17.9  19.3
 40   163   249   33.4  19.5  22.8
 41   163   242   31.0  18.1  20.7
 42   156   237   31.7  18.2  20.3
 43   159   238   31.5  18.4  20.3
 44   161   245   32.1  19.1  20.8
 45   155   235   30.7  17.7  19.6
 46   162   247   31.9  19.1  20.4
 47   153   237   30.6  18.6  20.4
 48   162   245   32.5  18.5  21.1
 49   164   248   32.3  18.8  20.9

Data source: Bumpus (1898).

He also concluded that 'the process of selective elimination is most severe with extremely variable individuals, no matter in which direction the variations may occur. It is quite as dangerous to be conspicuously above a certain standard of organic excellence as it is to be conspicuously below the standard.' This last statement is saying that stabilizing selection occurred, so that individuals with measurements close to the average survived better than individuals with measurements rather different from the average.

Of course, the development of multivariate statistical methods had hardly begun in 1898 when Bumpus was writing. The correlation coefficient as a measure of the relationship between two variables was introduced by Francis Galton in 1877. However, it was another 56 years before Hotelling described a practical method for carrying out a principal component analysis, which is one of the simplest multivariate analyses that can be applied to Bumpus's data. In fact Bumpus did not even calculate standard deviations. Nevertheless, his methods of analysis were sensible. Many authors have reanalysed his data and, in general, have confirmed his conclusions.

Taking the data as an example for illustrating multivariate techniques, several interesting questions spring to mind. In particular:

1. How are the different measurements related? For example, does a large value for one variable tend to occur with large values for the other variables?
2. Do the survivors and non-survivors have significant differences for the mean values of the variables?
3. Do survivors and non-survivors show the same amount of variation in measurements?
4. If the survivors and non-survivors differ with regard to their distributions for the variables, is it possible to construct some function of these variables, f(X1, X2, X3, X4, X5), which separates the two groups? It would be convenient if this function tended to be large for survivors and small for non-survivors, since it would then be an index of Darwinian fitness.
Table 1.2 Measurements on male Egyptian skulls from various epochs (X1 = maximum breadth, X2 = basibregmatic height, X3 = basialveolar length, X4 = nasal height; all in mm, as shown in Fig. 1.1).

        Early            Late             12th & 13th      Ptolemaic        Roman
        predynastic      predynastic      dynasties        period           period
Skull  X1  X2  X3  X4   X1  X2  X3  X4   X1  X2  X3  X4   X1  X2  X3  X4   X1  X2  X3  X4
  1   131 138  89  49  124 138 101  48  137 141  96  52  137 134 107  54  137 123  91  50
  2   125 131  92  48  133 134  97  48  129 133  93  47  141 125  95  53  136 131  95  49
  3   131 132  99  50  138 134  98  45  132 138  87  48  141 130  87  49  128 126  91  57
  4   119 132  96  44  148 129 104  51  130 134 106  50  135 131  99  51  130 134  92  52
  5   136 143 100  54  126 124  95  45  134 134  96  45  133 120  91  46  138 127  86  47
  6   138 137  89  56  135 136  98  52  140 133  98  50  131 135  90  50  126 138 101  52
  7   139 130 108  48  132 145 100  54  138 138  95  47  140 137  94  60  136 138  97  58
  8   125 136  93  48  133 130 102  48  136 145  99  55  139 130  90  48  126 126  92  45
  9   131 134 102  51  131 134  96  50  136 131  92  46  140 134  90  51  132 132  99  55
 10   134 134  89  51  133 125  94  46  126 136  95  56  138 140 100  52  139 135  92  54
 11   129 138  95  50  133 136 103  53  137 129 100  53  132 133  90  53  143 120  95  51
 12   134 121  95  53  131 139  98  51  137 139  97  50  134 134  97  54  141 136 101  54
 13   126 129 109  51  131 136  99  56  136 126 101  50  135 135  99  50  135 135  95  56
 14   132 136 100  50  138 134  98  49  137 133  90  49  133 136  95  52  137 134  93  53
 15   141 140 100  51  130 136 104  53  129 142 104  47  136 130  99  55  142 135  96  52
 16   131 134  97  54  131 128  98  45  135 138 102  55  134 137  93  52  139 134  95  47
 17   135 137 103  50  138 129 107  53  129 135  92  50  131 141  99  55  138 125  99  51
 18   132 133  93  53  123 131 101  51  137 135  96  54  134 125  90  60  129 135  95  47
 19   130 129 105  47  138 134  96  51  136 128  93  54  133 125  92  50  139 136  96  50
 20   132 131 101  49  134 130  93  54  136 135  94  53  131 125  88  48  145 129  89  47
 21   126 133 102  51  137 136 106  49  132 130  91  52  139 130  94  53  138 136  92  46
 22   135 135 103  47  126 131 100  48  133 131 100  50  144 124  86  50  131 129  97  44
 23   134 124  91  55  134 124  93  53  135 136  97  52  138 137  94  51  141 131  97  53
 24   128 134 103  50  129 126  91  50  130 127  99  45  130 131  98  53  143 126  89  54
 25   130 130 104  49  134 139 101  49  136 133  91  49  133 128  92  51  132 127  97  52
 26   138 135 100  55  131 134  90  53  134 123  95  52  138 126  97  54  137 125  85  57
 27   128 132  93  53  132 130 104  50  136 137 101  54  131 142  95  53  129 128  91  52
 28   127 129 106  48  130 132  93  52  133 131  96  49  136 138  94  55  140 135 102  48
 29   131 136 114  54  135 132  98  54  138 133 100  55  132 136  92  52  147 129  89  48
 30   130 128 101  51  138 133  91  46  135 130 100  51  136 133  97  51  124 138 101  46

Data source: Thomson and Randall-Maciver (1905).
Example 1.2 Egyptian skulls

For a second example, consider the data shown in Table 1.2 for measurements made on male Egyptian skulls from the area of Thebes. There are five samples of 30 skulls from each of the early predynastic period (circa 4000 BC), the late predynastic period (circa 3300 BC), the 12th and 13th dynasties (circa 1850 BC), the Ptolemaic period (circa 200 BC), and the Roman period (circa AD 150). Four measurements are available on each skull, these being as shown in Fig. 1.1.

Figure 1.1 Measurements on Egyptian skulls.

In this case it is interesting to consider the questions:

1. How are the four measurements related?
2. Are there significant differences in the sample means for the variables and, if so, do the differences reflect gradual changes with time?
3. Are there significant differences in the sample standard deviations for the variables and, if so, do the differences reflect gradual changes with time?
4. Is it possible to construct a function f(X1, X2, X3, X4) of the four variables that in some sense captures most of the sample differences?

These questions are, of course, rather similar to the ones suggested with Example 1.1.

As will be seen later, there are differences between the five samples that can be explained partly as time trends. It must be said, however, that the reasons for the changes are not known. Migration into the population was probably the most important factor.

Example 1.3 Distribution of a butterfly

A study of 16 colonies of the butterfly Euphydryas editha in California and Oregon produced the data shown in Table 1.3. Here there are two types of variable: environmental and distributional. The environmental variables are altitude, rainfall, and minimum and maximum temperatures. The distribution variables are gene frequencies for phosphoglucose-isomerase (Pgi) as determined by the technique of electrophoresis. For the present purposes there is no need to go into the details of how the gene frequencies were determined. (Strictly speaking these are not gene frequencies anyway.) It is sufficient to say that the frequencies describe the genetic distribution of E. editha to some extent. Figure 1.2 shows the geographical distribution of the colonies.

In this example questions that can be asked are:

1. Are the Pgi frequencies similar for colonies that are close in space?
2. To what extent are the Pgi frequencies related to the environmental variables?

These questions are important when it comes to trying to decide how Pgi frequencies are determined. If frequencies are largely determined by present and past migration, then gene frequencies should be similar for adjacent colonies but may show no relationship with environmental variables. On the other hand, if it is the environment that is most important, then the Pgi frequencies should be related to the environmental variables, but colonies that are close in space will have different frequencies if the environments are different. Of course, colonies that are close in space will tend to have similar environments. It may therefore be difficult to reach a clear conclusion.
[Table 1.3: altitude, annual precipitation, temperatures and Pgi gene frequencies for the 16 colonies of Euphydryas editha; the values are not legible in this scan.]

Figure 1.2 Colonies of Euphydryas editha in California and Oregon.

Example 1.4 Prehistoric dogs from Thailand

... the present. The origin of the prehistoric dog is not certain. It could descend from the golden jackal (Canis aureus) or the wolf.
However, the wolf is not native to Thailand, the nearest indigenous sources being western China (Canis lupus chanco) or the Indian subcontinent (Canis lupus pallipes).

In order to clarify the ancestry of the prehistoric dogs, mandible measurements were made on the available specimens. These were then compared with similar measurements on the golden jackal, the Chinese wolf and the Indian wolf. The comparisons were made more useful by considering also the dingo, which may have its origins in India, the cuon (Cuon alpinus), which is indigenous to southeast Asia, and modern village dogs from Thailand.

Table 1.4 gives mean values for six of the mandible measurements for specimens from all of the groups. The main question to be addressed here is how these groups are related and, in particular, how the prehistoric group is related to the others.

Table 1.4 Mean mandible measurements (mm) for modern Thai dogs, golden jackals, wolves, cuons, dingos and prehistoric dogs (X1 = breadth of mandible, X2 = height of mandible below 1st molar, X3 = length of 1st molar, X4 = breadth of 1st molar, X5 = length from 1st to 3rd molars inclusive, X6 = length from 1st to 4th premolars inclusive).

                 X1     X2     X3     X4     X5     X6
Modern dog       9.7   21.0   19.4    ...    ...    ...
Golden jackal    8.1   16.7   18.3    7.0   30.3   32.9
Chinese wolf    13.5   27.3   26.8   10.6   41.9   48.1
Cuon            10.7   23.5   21.4    8.5   28.8   37.6
Dingo            9.6   22.6   21.1    8.3   34.4   43.1

[The rows for the Indian wolf and the prehistoric dogs, and the last three entries for the modern dog, are not legible in this scan.]

Example 1.5 Employment in European countries

Finally, as a contrast to the previous biological examples, consider the data in Table 1.5. This shows the percentages of the labour force in nine different types of industry for 26 European countries. In this case ...

[Table 1.5, giving the employment percentages, is not legible in this scan.]
1.2 Preview of multivariate methods

The five examples just considered are typical of the raw material for multivariate statistical methods. The main thing to note at this point is that in all cases there are several variables of interest and these are clearly not independent of each other. However, it is useful also to give a brief preview of what is to come in the chapters that follow in relationship to these examples.

Principal component analysis is designed to reduce the number of variables that need to be considered to a small number of indices (called the principal components) that are linear combinations of the original variables. For example, much of the variation in the body measurements of sparrows shown in Table 1.1 will be related to the general size of the birds, and the total

    I1 = X1 + X2 + X3 + X4 + X5

will measure this quite well. This accounts for one 'dimension' in the data. Another index is

    I2 = X1 + X2 + X3 - X4 - X5,

which is a contrast between the first three measurements and the last two. This reflects another 'dimension' in the data. Principal component analysis provides an objective way of finding indices of this type so that the variation in the data can be accounted for as concisely as possible. It may well turn out that two or three principal components provide a good summary of all the original variables. Consideration of the values of the principal components instead of the values of the original variables may then make it much easier to understand what the data have to say. In short, principal component analysis is a means of simplifying data by reducing the number of variables.
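To make the idea concrete, here is a minimal sketch in Python with the numpy library (an illustrative choice of language, not one the book prescribes; the full procedure is described in Chapter 5). It extracts principal components as eigenvectors of a correlation matrix, using five of the birds from Table 1.1:

    import numpy as np

    # Five of the rows of Table 1.1 (variables X1 to X5); the complete
    # Bumpus data would be a 49 x 5 array.
    X = np.array([[156, 245, 31.6, 18.5, 20.5],
                  [154, 240, 30.4, 17.9, 19.6],
                  [153, 240, 31.0, 18.4, 20.6],
                  [163, 247, 32.0, 19.0, 20.9],
                  [157, 238, 30.9, 18.4, 20.2]], dtype=float)

    # Standardize so each variable has mean 0 and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Principal components are eigenvectors of the correlation matrix.
    R = np.corrcoef(Z, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(R)

    # eigh returns eigenvalues in ascending order; reverse so that the
    # first component accounts for the most variance.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    scores = Z @ eigenvectors               # component scores for each bird
    print(eigenvalues / eigenvalues.sum())  # proportion of variance explained

With data of this kind the eigenvector with the largest eigenvalue typically has coefficients all of the same sign, corresponding to the 'size' index I1 above, while a later component often contrasts subsets of the variables, like I2.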
Factor analysis also attempts to account for the variation in a number of original variables using a smaller number of index variables or factors. It is assumed that each original variable can be expressed as a linear combination of these factors, plus a residual term that reflects the extent to which the variable is independent of the other variables. For example, a two-factor model for the sparrow data assumes that

    X1 = a11 F1 + a12 F2 + e1
    X2 = a21 F1 + a22 F2 + e2
    X3 = a31 F1 + a32 F2 + e3
    X4 = a41 F1 + a42 F2 + e4
    X5 = a51 F1 + a52 F2 + e5,

where the aij values are constants, F1 and F2 are the factors, and ei represents the variation in Xi that is independent of the variation in the other X-variables. Here F1 might be the factor of size. In that case the coefficients a11, a21, a31, a41 and a51 would all be positive, reflecting the fact that some birds tend to be large and some birds tend to be small on all body measurements. The second factor F2 might then measure an aspect of the shape of birds, with some positive coefficients and some negative coefficients. If this two-factor model fitted the data well, then it would provide a relatively straightforward description of the relationship between the five body measurements.

One type of factor analysis starts by taking a few principal components as the factors in the data being considered. These initial factors are then modified by a special transformation process called 'factor rotation' in order to make them easier to interpret. Other methods for finding initial factors are also used. A rotation to simpler factors is almost always done.

Discriminant function analysis is concerned with the problem of seeing whether it is possible to separate different groups on the basis of the available measurements. This could be used, for example, to see how well surviving and non-surviving sparrows can be separated using their body measurements (Example 1.1), or how skulls from different epochs can be separated, again using size measurements (Example 1.2). Like principal component analysis, discriminant function analysis is based on the idea of finding suitable linear combinations of the original variables.

Cluster analysis is concerned with the identification of groups of similar individuals. There is not much point in doing this type of analysis with data like that of Examples 1.1 and 1.2, since the groups (survivors, non-survivors; epochs) are already known. However, in Example 1.3 there might be some interest in grouping colonies on the basis of environmental variables or Pgi frequencies, while in Example 1.4 the main point of interest is in the similarity between prehistoric dogs and other animals. Similarly, in Example 1.5 different European countries can be grouped in terms of similarity between employment patterns.
With canonical correlation the variables (not the individuals) are divided into two groups and interest centres on the relationship between these. Thus in Example 1.3 the first four variables are related to the environment while the remaining six variables reflect the genetic distribution at the different colonies of Euphydryas editha. Finding what relationships, if any, exist between these two groups of variables is of considerable biological interest.

Finally, there is multidimensional scaling. The method begins with data on some measure of the distances apart of a number of individuals. From these distances a 'map' is constructed showing how the individuals are related. This is a useful facility since it is often possible to measure how far apart pairs of objects are without having any idea of how the objects are related in a geometric sense. Thus in Example 1.4 there are ways of measuring the 'distances' between modern dogs and golden jackals, modern dogs and Chinese wolves, etc. Considering each pair of animal groups gives 21 distances altogether. From these distances multidimensional scaling can be used to produce a 'map' of the relationships between the groups. With a one-dimensional 'map' the groups are placed along a straight line. With a two-dimensional 'map' they are represented by points on a plane. With a three-dimensional 'map' they are represented by points within a cube. Four- and higher-dimensional solutions are also possible, although these have limited use because they cannot be visualized. The value of a one-, two- or three-dimensional map is clear for Example 1.4, since such a map would immediately show which groups prehistoric dogs are similar to and which groups they are different from. Hence multidimensional scaling may be a useful alternative to cluster analysis in this case. A 'map' of European countries by employment patterns might also be of interest in Example 1.5.

1.3 The multivariate normal distribution

The normal distribution for a single variable should be familiar to readers of this book. It has the well known 'bell-shaped' frequency curve. Many standard univariate statistical methods are based on the assumption that data are normally distributed.

Knowing the prominence of the normal distribution with univariate statistical methods, it will come as no surprise to discover that the multivariate normal distribution has a central position with multivariate statistical methods. Many of these methods require the assumption that the data being analysed have multivariate normal distributions.

The exact definition of a multivariate normal distribution is not too important. The approach of most people, for better or worse, seems to be to regard data as being normally distributed unless there is some reason to believe that this is not true. In particular, if all the individual variables being studied appear to be normally distributed, then it is assumed that the joint distribution is multivariate normal. This is, in fact, a minimum requirement since the definition of multivariate normality requires more than this.

Cases do arise where the assumption of multivariate normality is clearly invalid. For example, one or more of the variables being studied may have a highly skewed distribution with several outlying high (or low) values; there may be many repeated values; etc. This type of problem can sometimes be overcome by an appropriate transformation of the data, as discussed in elementary texts on statistics. If this does not work then a rather special form of analysis may be required.

One important aspect of a multivariate normal distribution is that it is specified completely by a mean vector and a covariance matrix. The definitions of a mean vector and a covariance matrix are given in Section 2.7.

1.4 Computer programs

Practical methods for carrying out the calculations for multivariate analyses were developed from the mid-1930s. However, the application of these methods for more than small numbers of variables had to wait until computing equipment was sufficiently well developed. It is only in the last 20 years or so that analyses have become reasonably easy to carry out. Nowadays there are many standard statistical packages available for calculations, for example BMDP, SAS and SPSS. It is the intention that this book should provide readers with enough information to use any package intelligently, without saying much about any particular one.
Most multivariate analyses are still done using the standard packages on medium or large computers. However, the increasing availability and power of microcomputers suggests that this will not be the case for much longer. Packages will be 'shrunk down' to fit into micros and, also, special-purpose programs will become increasingly available in languages like BASIC. Indeed, it is not difficult to write BASIC programs to do many of the standard multivariate analyses, providing advantage is taken of the availability of published algorithms to do the complicated parts of the calculations. Some limited instructions in this direction are included in the chapters that follow.

References

Bumpus, H.C. (1898) The elimination of the unfit as illustrated by the introduced sparrow, Passer domesticus. Biological Lectures, Marine Biology Laboratory, Woods Hole, 11th Lecture, pp. 209-26.
Euromonitor (1979) European Marketing Data and Statistics. Euromonitor Publications, London.
Higham, C.F.W., Kijngam, A. and Manly, B.F.J. (1980) An analysis of prehistoric canid remains from Thailand. Journal of Archaeological Science 7, 149-65.
McKechnie, S.W., Ehrlich, P.R. and White, R.R. (1975) Population genetics of Euphydryas butterflies. I. Genetic variation and the neutrality hypothesis. Genetics 81, 571-94.
Thomson, A. and Randall-Maciver, R. (1905) Ancient Races of the Thebaid. Oxford University Press.

CHAPTER TWO

Matrix algebra

2.1 The need for matrix algebra

As indicated in the Preface, the theory of multivariate statistical methods can only be explained reasonably well with the use of some matrix algebra. For this reason it is helpful, if not essential, to have a certain minimal knowledge of this area of mathematics. This is true even for those whose interest is solely in using the methods as a tool. At first sight, the notation of matrix algebra is certainly somewhat daunting. However, it is not difficult to understand the basic principles involved, providing that some details are accepted on faith.

2.2 Matrices and vectors

A matrix of size m x n is an array of numbers with m rows and n columns, considered as a single entity, of the form

    A = [ a11  a12  ...  a1n
          a21  a22  ...  a2n
          ...
          am1  am2  ...  amn ].

If m = n then A is a square matrix. If there is only one column, for instance

    c = [ c1
          c2
          ...
          cm ],

then this is called a column vector. If there is only one row, for instance

    r = (r1, r2, ..., rn),

then this is called a row vector.
The transpose of a matrix is found by interchanging the rows and columns. Thus the transpose of A above is

    A' = [ a11  a21  ...  am1
           a12  a22  ...  am2
           ...
           a1n  a2n  ...  amn ].

Also, c' = (c1, c2, ..., cm), and r' is a column vector.

There are a number of special kinds of matrix that are particularly important. A zero matrix has all elements equal to zero:

    0 = [ 0  0  ...  0
          0  0  ...  0
          ...
          0  0  ...  0 ].

A diagonal matrix is a square matrix that has elements of zero, except down the main diagonal:

    T = [ t1  0   0   ...  0
          0   t2  0   ...  0
          0   0   t3  ...  0
          ...
          0   0   0   ...  tn ].

A symmetric matrix is a square matrix that is unchanged when it is transposed, so that A has this property providing that A' = A. Finally, an identity matrix is a diagonal matrix with all diagonal terms being unity:

    I = [ 1  0  0  ...  0
          0  1  0  ...  0
          0  0  1  ...  0
          ...
          0  0  0  ...  1 ].

Two matrices are equal only if all their elements agree. For example,

    [ a11  a12  a13       [ b11  b12  b13
      a21  a22  a23   =     b21  b22  b23
      a31  a32  a33 ]       b31  b32  b33 ]

only if a11 = b11, a12 = b12, ..., a33 = b33.

The trace of a matrix is the sum of the diagonal terms, so that tr(A) = a11 + a22 + ... + ann for an n x n matrix A. This is only defined for square matrices.
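These definitions are easy to experiment with numerically. The sketch below, in Python with numpy (an illustrative choice of language, not one the book prescribes), constructs the special matrices of this section and checks their defining properties:

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
    print(A.T)                     # transpose: rows and columns interchanged

    Z = np.zeros((3, 3))           # zero matrix
    T = np.diag([1.0, 2.0, 3.0])   # diagonal matrix with t1, t2, t3
    I = np.eye(3)                  # identity matrix

    S = np.array([[2.0, 5.0],
                  [5.0, 9.0]])
    print(np.array_equal(S, S.T))  # True: S is symmetric
    print(np.trace(T))             # trace = 1 + 2 + 3 = 6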
2.3 Operations on matrices

The ordinary arithmetic processes of addition, subtraction, multiplication and division have their counterparts with matrices. With addition and subtraction it is simply a matter of working element by element. For example, if A and D are both 3 x 2 matrices, then

    A + D = [ a11 + d11   a12 + d12
              a21 + d21   a22 + d22
              a31 + d31   a32 + d32 ],

while

    A - D = [ a11 - d11   a12 - d12
              a21 - d21   a22 - d22
              a31 - d31   a32 - d32 ].

These operations can only be carried out with two matrices that are of the same size.

In matrix algebra, an ordinary number such as 20 is called a scalar. Multiplication of a matrix A by a scalar k is then defined as multiplying every element of A by k. Thus if A is 3 x 2, as shown above, then

    kA = [ ka11  ka12
           ka21  ka22
           ka31  ka32 ].

Multiplying two matrices together is more complicated. To begin with, A.B is only defined if the number of columns of A is equal to the number of rows of B. If this is the case, say with A of size m x n and B of size n x p, then the element in the ith row and kth column of A.B is

    Σj aij bjk = ai1 b1k + ai2 b2k + ... + ain bnk,

where the sum is over j = 1, 2, ..., n, so that A.B is of size m x p.

It is only when A and B are square that A.B and B.A are both defined. However, even if this is true, A.B and B.A are not generally equal. For example,

    [ 2  -1 ] [ 1  1 ]   [ 2  1 ]
    [ 1   1 ] [ 0  1 ] = [ 1  2 ],

while

    [ 1  1 ] [ 2  -1 ]   [ 3  0 ]
    [ 0  1 ] [ 1   1 ] = [ 1  1 ].
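The same example can be checked by machine. A minimal sketch, assuming Python with numpy as before:

    import numpy as np

    A = np.array([[2, -1],
                  [1,  1]])
    B = np.array([[1, 1],
                  [0, 1]])

    print(A @ B)    # [[2 1], [1 2]] -- matches the first product above
    print(B @ A)    # [[3 0], [1 1]] -- matches the second; A.B != B.A

    # Addition and scalar multiplication work element by element:
    print(A + B)
    print(20 * A)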
2.4 Matrix inversion

Matrix inversion is analogous to the ordinary arithmetic process of division. Thus for a scalar k it is, of course, true that k x k⁻¹ = 1. In a similar way, if A is a square matrix then its inverse is A⁻¹, where A x A⁻¹ = I, this being the identity matrix. Inverse matrices are only defined for square matrices. However, all square matrices do not have an inverse. When A⁻¹ exists, it is both a right and a left inverse, so that A⁻¹A = AA⁻¹ = I.

For example, the inverse of

    [ 1  2        [ -1/3   2/3
      2  1 ]  is     2/3  -1/3 ].

This can be verified by checking that

    [ 1  2 ] [ -1/3   2/3 ]   [ 1  0 ]
    [ 2  1 ] [  2/3  -1/3 ] = [ 0  1 ].

Actually, the inverse of a 2 x 2 matrix, if it exists, can be determined easily. It is given by

    [ a11  a12 ]⁻¹   [  a22/Δ  -a12/Δ
    [ a21  a22 ]   = [ -a21/Δ   a11/Δ ],

where Δ = a11 a22 - a12 a21. Here the scalar quantity Δ is called the determinant of the matrix. Clearly the inverse is not defined if Δ = 0, since finding the elements of the inverse then requires a division by zero. For 3 x 3 and larger matrices the calculation of an inverse is a tedious business best done using a computer program.

Any square matrix has a determinant value that can be calculated by a generalization of the equation just given for the 2 x 2 case. If the determinant of a matrix is zero then the inverse does not exist, and vice versa. A matrix with a zero determinant is described as being singular.

Matrices sometimes arise for which the inverse is equal to the transpose. They are then described as being orthogonal. Thus A is orthogonal if A⁻¹ = A'.

2.5 Quadratic forms

Let A be an n x n matrix and x be a column vector of length n. Then the expression

    Q = x'Ax

is called a quadratic form. It is a scalar and can be expressed by the alternative equation

    Q = Σi Σj aij xi xj,

with both sums running from 1 to n.
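A short numerical sketch of matrix inversion, determinants and quadratic forms, again in illustrative Python:

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [2.0, 1.0]])

    A_inv = np.linalg.inv(A)
    print(A_inv)              # [[-1/3, 2/3], [2/3, -1/3]], as above
    print(A @ A_inv)          # the identity matrix, apart from rounding
    print(np.linalg.det(A))   # determinant = 1*1 - 2*2 = -3

    # A quadratic form Q = x'Ax is a scalar:
    x = np.array([1.0, 2.0])
    print(x @ A @ x)          # 1 + 2*2 + 2*2 + 1*4 = 13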
2.6 Eigenvalues and eigenvectors

Consider the set of linear equations

    a11 x1 + a12 x2 + ... + a1n xn = λx1
    a21 x1 + a22 x2 + ... + a2n xn = λx2
    ...
    an1 x1 + an2 x2 + ... + ann xn = λxn,

which can be written in matrix form as

    Ax = λx,

or

    (A - λI)x = 0,

where I is an n x n identity matrix and 0 is an n x 1 zero vector. It can be shown that these equations can only hold for certain particular values of the scalar λ that are called the latent roots or eigenvalues of the matrix A. There can be up to n of these roots. Given the ith latent root λi, the equations can be solved by arbitrarily setting x1 = 1. The resulting vector of x values,

    xi = [ x1i
           x2i
           ...
           xni ]

(or any multiple of it), is called the ith latent vector or the ith eigenvector of the matrix A. The sum of the eigenvalues of A is equal to the trace of A.

Finding eigenvalues and eigenvectors is not a simple matter. Like finding a matrix inverse, it is a job best done on a computer.

2.7 Vectors of means and covariance matrices

For a single variable X, the mean of a sample of n observations is

    x̄ = Σ xi / n,

with the sum over i = 1 to n, while the sample estimate of variance is

    s² = Σ (xi - x̄)² / (n - 1).

These are estimates of the corresponding population parameters: the population mean μ and the population variance σ².

In a similar way, multivariate populations and samples can be summarized by mean vectors and covariance matrices. These are defined as follows. Suppose that there are p variables X1, X2, ..., Xp and the values of these for the ith individual in a sample are xi1, xi2, ..., xip, respectively. Then the sample mean of variable j is

    x̄j = Σ xij / n,                              (2.1)

while the sample variance is

    sj² = Σ (xij - x̄j)² / (n - 1).               (2.2)

In addition, the sample covariance between variables j and k is defined as

    cjk = Σ (xij - x̄j)(xik - x̄k) / (n - 1),      (2.3)

this being a measure of the extent to which the two variables are linearly related. In each case the sum is over the n individuals, i = 1 to n. The ordinary correlation coefficient for variables j and k, rjk say, is related to the covariance by the expression

    rjk = cjk / (sj sk).                          (2.4)

The vector of sample means is calculated using equation (2.1):

    x̄ = [ x̄1
          x̄2
          ...
          x̄p ].                                  (2.5)

This can be thought of as the centre of the sample. It is an estimate of the population vector of means

    μ = [ μ1
          μ2
          ...
          μp ].                                  (2.6)

The matrix of variances and covariances

    C = [ c11  c12  ...  c1p
          c21  c22  ...  c2p
          ...
          cp1  cp2  ...  cpp ],                  (2.7)

where cjj = sj², is called the sample covariance matrix, or sometimes the sample dispersion matrix. This reflects the amount of variation in the sample and also the extent to which the p variables are correlated. It is an estimate of the population covariance matrix.

The matrix of correlations, as defined by equation (2.4), is

    R = [ 1    r12  ...  r1p
          r21  1    ...  r2p
          ...
          rp1  rp2  ...  1 ].

Like C, this matrix is symmetric, since rjk = rkj; the diagonal elements are unity because any variable is perfectly correlated with itself.
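As the text notes, eigenvalues and the summary statistics of this section are jobs for a computer. The following sketch, assuming Python with numpy and using a few of the sparrow measurements from Table 1.1 for illustration, evaluates the quantities of equations (2.1) to (2.7) and confirms that the eigenvalues of C sum to its trace:

    import numpy as np

    # Rows are individuals, columns are variables (here X1, X2, X3 for
    # four of the birds in Table 1.1).
    X = np.array([[156, 245, 31.6],
                  [154, 240, 30.4],
                  [153, 240, 31.0],
                  [163, 247, 32.0]], dtype=float)

    mean_vector = X.mean(axis=0)            # equation (2.5)
    C = np.cov(X, rowvar=False, ddof=1)     # covariance matrix, equation (2.7)
    R = np.corrcoef(X, rowvar=False)        # correlation matrix

    eigenvalues, eigenvectors = np.linalg.eigh(C)
    print(eigenvalues.sum(), np.trace(C))   # equal, as stated in Section 2.6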
2.8 Further reading

A book by Causton (1977) gives a somewhat fuller introduction to matrix algebra than the one given here, but excludes latent roots and vectors. In addition to the chapter on matrix algebra, the other parts of Causton's book provide a good review of general mathematics. A more detailed account of matrix theory, still at an introductory level, is provided by Namboodiri (1984).

Those interested in learning more about matrix inversion and finding eigenvectors and eigenvalues, particularly methods for use on microcomputers, will find the book by Nash (1979) a useful source of information.

References

Causton, D.R. (1977) A Biologist's Mathematics. Edward Arnold, London.
Namboodiri, K. (1984) Matrix Algebra: An Introduction. Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-038. Sage Publications, Beverly Hills.
Nash, J.C. (1979) Compact Numerical Methods for Computers. Adam Hilger, Bristol.
CHAPTER THREE

Tests of significance with multivariate data

3.1 Introduction

The purpose of this chapter is to describe some tests that are available for seeing whether there is any evidence that two or more samples come from populations with different means or different amounts of variation. To begin with, two-sample situations will be covered.

3.2 Comparison of mean values for two samples: single variable case

Consider the data in Table 1.1 on the body measurements of 49 female sparrows. Consider in particular the first measurement, which is total length. A question of some interest might be whether the mean of this variable was the same for survivors and non-survivors of the storm that led to the birds being collected. There is then a sample (hopefully random) of 21 survivors and a second sample (again hopefully random) of 28 non-survivors. We wish to know whether the two sample means are significantly different. A standard approach would be to carry out a t-test.

Thus, suppose that in a general situation there is a single variable X and two random samples of values are available from different populations. Let xi1 denote the values of X in the first sample, for i = 1, 2, ..., n1, and xi2 denote the values in the second sample, for i = 1, 2, ..., n2. Then the mean and variance for the jth sample are

    x̄j = Σ xij / nj   and   sj² = Σ (xij - x̄j)² / (nj - 1),     (3.1)

with the sums taken over the nj sample members.

On the assumption that X is normally distributed in both samples, with a common within-sample variance, a test to see whether the two sample means are significantly different involves calculating the statistic

    t = (x̄1 - x̄2) / √{s²(1/n1 + 1/n2)}                          (3.2)

and seeing whether this is significantly different from zero in comparison with the t distribution with n1 + n2 - 2 degrees of freedom (d.f.). Here

    s² = {(n1 - 1)s1² + (n2 - 1)s2²} / (n1 + n2 - 2)             (3.3)

is the pooled estimate of variance from the two samples.

It is known that this test is fairly robust to the assumption of normality. Providing that the population distributions of X are not too different from normal it should be satisfactory, particularly for sample sizes of about 20 or more. The assumption of equal within-sample variances is also not too crucial. Providing that the ratio of the true variances is within the limits 0.4 to 2.5, inequality of variance will have little adverse effect on the test. The test is particularly robust if the two sample sizes are equal, or nearly so (Carter et al., 1979). If the population variances are very different then the t test can be modified to allow for this (Dixon and Massey, 1969, p. 119).
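The univariate comparison just described is simple to program. The sketch below (in Python, as elsewhere in these illustrative fragments; the book itself does not prescribe a language) computes the statistic of equations (3.1) to (3.3) for two samples:

    import numpy as np

    def pooled_t(sample1, sample2):
        """Two-sample t statistic with pooled variance, equations (3.2)-(3.3)."""
        n1, n2 = len(sample1), len(sample2)
        m1, m2 = np.mean(sample1), np.mean(sample2)
        v1, v2 = np.var(sample1, ddof=1), np.var(sample2, ddof=1)
        s2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)   # equation (3.3)
        t = (m1 - m2) / np.sqrt(s2 * (1 / n1 + 1 / n2))        # equation (3.2)
        return t, n1 + n2 - 2                                  # statistic and d.f.

    # Hypothetical samples standing in for the total lengths of survivors
    # and non-survivors; the real data are in Table 1.1.
    t, df = pooled_t([156, 154, 153, 163], [155, 157, 164, 158])
    print(t, df)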
3.3 Comparison of mean values for two samples: multivariate case

Consider again the sparrow data of Table 1.1. The test described in the previous section can obviously be employed for each of the five measurements shown in the table (total length, alar extent, length of beak and head, length of humerus, and length of keel of sternum). In that way it is possible to decide which, if any, of these variables appear to have had different mean values for survivors and non-survivors. However, in addition to these it may also be of some interest to know whether all five variables considered together suggest a difference between survivors and non-survivors. In other words: does the total evidence point to mean differences between survivors and non-survivors?

What is needed to answer this question is a multivariate test. One possibility is Hotelling's T² test. The statistic used is then a generalization of the t statistic of equation (3.2) or, to be more precise, the square of the t statistic.

In a general case there will be p variables X1, X2, ..., Xp being considered, and two samples with sizes n1 and n2. There are then two sample mean vectors, x̄1 and x̄2, with each one being calculated as shown in equations (2.1) and (2.5). There are also two sample covariance matrices, C1 and C2, with each one being calculated as shown in equations (2.2), (2.3) and (2.7).

Assuming that the population covariance matrices are the same for both populations, a pooled estimate of this matrix is

    C = {(n1 - 1)C1 + (n2 - 1)C2} / (n1 + n2 - 2).               (3.4)

Hotelling's T² statistic is defined as

    T² = n1 n2 (x̄1 - x̄2)' C⁻¹ (x̄1 - x̄2) / (n1 + n2).           (3.5)

A significantly large value for this statistic is evidence that the mean vectors are different for the two sampled populations. The significance or lack of significance of T² is most simply determined by using the fact that in the null hypothesis case of equal population means the transformed statistic

    F = (n1 + n2 - p - 1) T² / {(n1 + n2 - 2) p}                 (3.6)

follows an F distribution with p and n1 + n2 - p - 1 degrees of freedom.

Hotelling's T² statistic is based on an assumption of normality and equal within-sample variability. To be precise, the two samples being compared using the T² statistic are assumed to come from multivariate normal distributions with equal covariance matrices. Some deviation from multivariate normality is probably not serious. A moderate difference between population covariance matrices is also not too important, particularly with equal or nearly equal sample sizes (Carter et al., 1979). If the two population covariance matrices are very different, and sample sizes are very different as well, then a modified test can be used (Yao, 1965).

Example 3.1 Testing mean values for Bumpus's female sparrows

As an example of the use of the univariate and multivariate tests that have been described for two samples, consider the sparrow data shown in Table 1.1. Here it is a question of whether there is any difference between survivors and non-survivors with respect to the mean values of five morphological characters.

First of all, tests on the individual variables can be considered, starting with X1, the total length. The mean of this variable for the 21 survivors is x̄1 = 157.38, while the mean for the 28 non-survivors is x̄2 = 158.43. The corresponding sample variances are s1² = 11.05 and s2² = 15.07. The pooled variance from equation (3.3) is therefore

    s² = (20 x 11.05 + 27 x 15.07)/47 = 13.36,

and the t statistic of equation (3.2) is then -0.99, with 47 d.f. This result, together with those for the other four variables, is shown in Table 3.1; none of the t values is significantly different from zero.

Table 3.1 Comparison of mean values for survivors and non-survivors for Bumpus's female sparrows, with variables taken one at a time.

Variable                   x̄1      s1²     x̄2      s2²    t (47 d.f.)
Total length             157.38   11.05  158.43   15.07     -0.99
Alar extent              241.00   17.50  241.57   32.55     -0.39
Length beak & head        31.43    0.53   31.48    0.73     -0.20
Length humerus            18.50    0.18   18.45    0.43      0.33
Length keel of sternum    20.81    0.58   20.84    1.32     -0.10

Next, all five variables can be considered together, with the sample mean vectors and covariance matrices calculated as shown in equations (2.1), (2.5) and (2.7). For the sample of 21 survivors,

    x̄1 = [ 157.381            C1 = [ 11.048   9.100  1.557  0.870  1.286
           241.000                    9.100  17.500  1.910  1.310  0.880
            31.433     and            1.557   1.910  0.531  0.189  0.240
            18.500                    0.870   1.310  0.189  0.176  0.133
            20.810 ]                  1.286   0.880  0.240  0.133  0.575 ].

For the sample of 28 non-survivors,

    x̄2 = [ 158.429            C2 = [ 15.069  17.190  2.243  1.746  2.931
           241.571                   17.190  32.550  3.398  2.950  4.066
            31.479     and            2.243   3.398  0.728  0.470  0.559
            18.446                    1.746   2.950  0.470  0.434  0.506
            20.839 ]                  2.931   4.066  0.559  0.506  1.321 ].

The pooled covariance matrix of equation (3.4) is then

    C = (20 C1 + 27 C2)/47
      = [ 13.358  13.747  1.951  1.373  2.231
          13.747  26.146  2.765  2.252  2.710
           1.951   2.765  0.645  0.350  0.423
           1.373   2.252  0.350  0.324  0.347
           2.231   2.710  0.423  0.347  1.004 ],

where, for example, the element in the second row and third column is (20 x 1.910 + 27 x 3.398)/47 = 2.765.

The inverse of the matrix C is found to be

    C⁻¹ = [  0.2061  -0.0694  -0.2395   0.0785  -0.1969
            -0.0694   0.1234  -0.0376  -0.5517   0.0277
            -0.2395  -0.0376   4.2219  -3.2624  -0.0181
             0.0785  -0.5517  -3.2624  11.4610  -1.2720
            -0.1969   0.0277  -0.0181  -1.2720   1.8068 ].

This can be verified by evaluating the product C x C⁻¹ and seeing that this is a unit matrix (apart from rounding errors).

Substituting the elements of C⁻¹ and the other values into the quadratic form of equation (3.5) produces

    T² = {21 x 28/(21 + 28)} x [(157.381 - 158.429) x 0.2061 x (157.381 - 158.429)
         + (157.381 - 158.429) x (-0.0694) x (241.000 - 241.571)
         + ... + (20.810 - 20.839) x 1.8068 x (20.810 - 20.839)]
       = 2.824.

Using equation (3.6) this converts to an F statistic of

    F = (21 + 28 - 5 - 1) x 2.824/{(21 + 28 - 2) x 5} = 0.517,

with 5 and 43 d.f. Clearly this is not significantly large, since a significant F value must exceed unity. Hence there is no evidence of a difference in means for survivors and non-survivors, taking all five variables together.

It should be noted, however, that it is quite possible to have insignificant univariate tests but a significant multivariate test. This can occur because of the accumulation of the evidence from the individual variables in the overall test. Conversely, an insignificant multivariate test can occur when some univariate tests are significant, because the evidence of a difference provided by the significant variables is swamped by the evidence of no difference provided by the other variables.
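The matrix arithmetic of this example is exactly the kind of computation that is 'best done on a computer' (Section 2.6). A sketch in Python with numpy, entering only the summary statistics above rather than the raw data, should reproduce T² = 2.824 and F = 0.517 up to rounding:

    import numpy as np

    n1, n2, p = 21, 28, 5
    x1 = np.array([157.381, 241.000, 31.433, 18.500, 20.810])
    x2 = np.array([158.429, 241.571, 31.479, 18.446, 20.839])
    C = np.array([[13.358, 13.747, 1.951, 1.373, 2.231],
                  [13.747, 26.146, 2.765, 2.252, 2.710],
                  [ 1.951,  2.765, 0.645, 0.350, 0.423],
                  [ 1.373,  2.252, 0.350, 0.324, 0.347],
                  [ 2.231,  2.710, 0.423, 0.347, 1.004]])

    d = x1 - x2
    T2 = n1 * n2 / (n1 + n2) * d @ np.linalg.solve(C, d)   # equation (3.5)
    F = (n1 + n2 - p - 1) * T2 / ((n1 + n2 - 2) * p)       # equation (3.6)
    print(T2, F)   # approximately 2.8 and 0.5

Using np.linalg.solve rather than forming C⁻¹ explicitly is the usual numerical practice; either route gives the same quadratic form.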
3.4 Multivariate versus univariate tests

One important aspect of the use of a multivariate test, as distinct from a series of univariate tests, concerns the control of type one error rates. A type one error involves finding a significant result when, in reality, the two samples being compared come from the same population. With a univariate test at the 5% level there is a 0.95 probability of a non-significant result when the population means are the same. Hence if p independent tests are carried out under these conditions, then the probability of getting no significant results is 0.95^p. The probability of at least one significant result is therefore 1 - 0.95^p. With many tests this can be quite a large probability. For example, if p is 5, the probability of at least one significant result by chance alone is 1 - 0.95⁵ = 0.23. With multivariate data, variables are usually not independent, so 1 - 0.95^p does not quite give the correct probability of at least one significant result by chance alone if p variables are tested one by one with univariate t tests. However, the principle still applies: the more tests that are made, the higher the probability of obtaining at least one significant result by chance.

On the other hand, a multivariate test such as Hotelling's T² test using the 5% level of significance gives a 0.05 probability of a type one error, irrespective of the number of variables involved. This is a distinct advantage over a series of univariate tests, particularly when the number of variables is large.

There are ways of adjusting significance levels in order to control the overall probability of a type one error when several tests are carried out. However, the use of a single multivariate test provides a better alternative procedure in many cases. A multivariate test has the added advantage of taking proper account of the correlation between variables.

3.5 Comparison of variation for two samples: single variable case

With a single variable, the best known method for comparing the variation in two samples is the F test. If sj² is the variance in the jth sample, calculated as shown in equation (3.1), then the ratio s1²/s2² is compared with percentage points of the F distribution with (n1 - 1) and (n2 - 1) d.f. Unfortunately, the F test is known to be rather sensitive to the assumption of normality. A significant result may well be due to the fact that a variable is not normally distributed rather than to unequal variances. For this reason it is sometimes argued that the F test should never be used to compare variances.

A robust alternative to the F test is Levene's (1960) test. The idea here is to transform the original data into absolute deviations from the mean and then test for a significant difference between the mean deviations in the two samples, using a t test. Absolute deviations from the arithmetic mean are usually used, but a more robust test is possible by using absolute deviations from sample medians (Schultz, 1983). The procedure is illustrated in Example 3.2 below.

3.6 Comparison of variation for two samples: multivariate case

Most textbooks on multivariate methods suggest the use of Bartlett's test to compare the variation in two multivariate samples. This is described, for example, by Srivastava and Carter (1983, p. 333). However, this test is rather sensitive to the assumption that the samples are from multivariate normal distributions. There is always the possibility that a significant result is due to non-normality rather than to unequal population covariance matrices.

An alternative procedure that should be more robust can be constructed using the principle behind Levene's test. Thus the data values can be transformed into absolute deviations from sample means or medians. The question of whether two samples display significantly different amounts of variation is then transformed into a question of whether the transformed values show significantly different mean vectors. Testing of the mean vectors can be done using a T² test.

Another possibility was suggested by Van Valen (1978). This involves calculating

    dij = √{Σk (xijk - x̄jk)²},                                   (3.8)

where xijk is the value of variable Xk for the ith individual in sample j, and x̄jk is the mean of the same variable in the sample. The sample means of the dij values are compared by a t test. Obviously if one sample is more variable than another, then the mean dij value will be higher in the more variable sample.

To ensure that all variables are given equal weight, they should be standardized before the calculation of the dij values. Coding them to have unit variances will achieve this. For a more robust test it may be better to use sample medians in place of the sample means in equation (3.8). Then the formula for dij values is

    dij = √{Σk (xijk - Mjk)²},                                   (3.9)

where Mjk is the median for variable Xk in the jth sample.

The T² test and Van Valen's test for deviations from medians are illustrated in the example that follows. One point to note about the use of the test statistics (3.8) and (3.9) is that they are based on an implicit assumption that if the two samples ...

Example 3.2

...

    t = (2.57 - 3.29) / √{4.231(1/21 + 1/28)} = -1.21,

with 47 d.f. Since non-survivors would be more variable than survivors if stabilizing selection occurred, it is a one-sided test that is required here, with low values of t providing evidence of selection. Clearly the observed value of t is not significantly low in the present instance. The t values for the other variables are as follows: alar extent, t = -1.18; length of beak and head, t = -0.81; length of humerus, t = -1.91; length of keel of sternum, t = -1.40. Only for the length of humerus is the result significantly low at the 5% level.

Table 3.2 shows absolute deviations from sample medians for the data after it has been standardized. For example, the first value given ...
~ . ":l ¡
being tested difTer, then one sample will be móre variable than the _ 'ror variable 1, for survivors, is 0.28. This was obtained as follows.
~ il' ::"1 ,
,1
'I,¡i,
1 other for aB variables. A significantresult cannot be expected in a case :.First, the original data were coded to have a zero mean and a unit
, •.i1::,i where, for example, Xl and X 2 are more variable in sample 1 but X 3 variance for a1l49 birds. This transformed the totallength for the first
• ': . I~
and X 4 are more variable in sample 2. The efTect of the difTering survivor to (156 - 157.98)/3.617 = - 0.55. The median transformed
:i!lij¡
variances would then tend to cancel out in the calculation of dijO Thus ','lerrgth for survivors was then - 0.27. Hence the absolute deviatiQ,Il
¡!,!.
" J I'
Van Valen's test is not appropriate for situations where changes in the ' " Jrom the sample median for the first survivor is 1- 0.55 - ( - 0.27)1 = .~
0.28, as recorded.
::¡!i level of variation are not expected to be consistent for a11 variables.
Comparing the transformed sample mean vectors for the five
variables using Hotelling's T 2 test gives a test ,s tatistic of T 2 = 4.75,
:1
Example 3.2 Testing ~'ariationfor Biúnpus'sJemale sparrows
corresponding to an F statistic of 0,87 with 5 and 43 d.f. (equations
.-....
With Bu¡npus's data shown in Table 1.1, the most interesting (3. 7) and (3.6»). There is therefore 110 e\'idence of a significant
i question concerns whether the non-survivors were more variable difTerence betwecn the samples from this test.
,! than the survivors. This is what is expected if stabilizing selection took Finally, consider Van Valen's test. The d values from equation (3.9)
place. are shown in the last column ofTable 3.2. The mean for survivors is
First of aB, the individual variables can be considered one at a time, 1.760, with variance 0.41l. The mean for non-survivors is 2.265, with
., ,: starting with Xl' the total length. For Levene's test the original data variance 1.133. The t value from equation (3,2) is then - 1.92, which is
. :: , values are transformed into deviations from sample médians. The significantly low at the 5% leve!. Hence this test indicates more
,~ ': i
Ji ': I
median for survivors in 157 mm. The absolute deviations from this for variation for non-survivors than for survivors. ,
~ ... :
.' ; i the 2f birds in the sample then have a mean of Xl = 2.57 and a An explanation for the significant result with this test, but no .:..
-: I
::1
•
!I variance of si = 4.26. The median for non-survivors is 159 mm. The , significant result with the T 2 test, is not hard to find. As explained
, !
absolute deviations from this me,dian for the 28 birds in the sample aboye, 'the T 2 test is not directional. Thus if the first sample has large
!I
¡ !¡ have a mean of X2 = 3.29 with a variance of s~ = 4.21. The pooled means for sorne variables and sma11 means for others when compared
,1 ;I variance from equation (3.3) is 4.231 and the t statistic of equation to the second sample, then a11 of the difTerences contribute to T 2 • On
.l. .::;~ 1 (3.2) is the other hand, Van Valen's test is specifically for less variation in
¡: ',.:;!I
:1 ', !:i I
:," ¡j:; I
:hl e
~
I ;:i ' J I;
'"
~
"
i' CúmpOri SO ll (Jf m (; Oll S for sr: verol sO nl ¡>l es 37
¡---:
Table 3.2 Absolut c dcviations from sample med ia ns for Bumpus's fcmal e Table 3.2 (Coned.)
1'-1 dal a anu d values fr om equ alion (3.9).
Lcnglh Lellljlh
1.:1 Leng/h Leng/h Total Alar bcak & Lenlj /h keel
Tota l Alar beak & Leng/h keel leng/h ex/en/ head humcrlls s/erlllllll d
Irn ' Bird leng/h ex /elll head humerus slernum d
Bird
~/ l •
I• \5 0.00 1.00 0.13 0.72 0.82 1.48
-." ,: ¡
" 1 .16 0.28 0.60 0.64 0.90 0.31 1.32 3;7 Comprison of means for several sam ples
\7 0.28 0.80 0.00 0.00 1.02 1.32 ,,'o/.,: '
,
'J¡:HE
." I
•I ~¡
,1.
\8 \.1\ 0.40 1.14 0.54 0.31 1.75 ·Z·¿:< · When there is a single variable and several samples to be compa-red,
19 0.55 0.80 1.40 0.00 0.51 1.78 '- : the generalization of the t test is the F test from a one-factor analysis
Jr: i . 1 20 . 1.66 1.20 1.40 0.18 1.32 2.82 of variance. The caIculations are as shown in Table 3.3 .
I 2\ 0.55 0.80 0.13 0.90 0.92 1.61 When there are several variables and several samples, a so-called.
JO 22 1.11 0.40 0.13 0.90 0.00 1.48
23 0.83 0.40 0.00 0.54 0.10 1.07 'likelihood ratio test' can be used to compare the sarnple mean
1,.
" 24 0.28 0.00 1.40 0.54 1.02 1.83 vectors. This involves caIcula ting the statistic
1'- 25 1.94 1.99 1.53 2.33 0.92 4.04
26 0.28 1.59 0.25 0.54 1.83 2.52 <p = [n - 1 - -!(p - m)J 10gc[ITI/ IWIJ (3.10)
1-'- 27 1.11 1.00 0.64 0.00 0.71 1.77
28 0.55 0.60 0.89 1.79 0.71 2.27
IT 29 1.66 0.60 2.03 2.33 2.04 4.10 where n is the totat number of observations, p is the number of
l:r " , ·30 1.66 2.19 1.78 2.15 0.92 4.02 variables, m í.S th~ f4 urnber of sarnples, ITI is the detenninant of the
3i' 0.83 0.60 1.53 0.90 2.45 3.19 total sur;n pf squares and cross-products matrix, and IW I is the
Irr ¡ 32 0.83 0.20 0.13 0.54 0.61 1.19 detenninant ofthe within-sarnple surn ofsquares and cross-products
33 0.00 0.60 0.38 0.00 1.02 1.24
Wi 34 0.00 1.00 0.76 0.72 1.13 2.26
matrix. This statistic can be tested for significance by compadson
35 1.11 0.76- with the chi-squared distribution with p(m - 1) dJ. In Example 3.3
~1f 1 0.20 0.00 0.61 1.49
36 0.83 1.99 0.51 1.07 1.53 2.90 below the likelihood ratio test is used to compare the means for the
~I 37 1.94 2.39 1.40 2.15 214 4.54 five samples of male Egyptian skulls provided in Table 1.2.
38 0.00 0.00 0.89 0.54 0.20 1.06 : \
~t 39 1.11 0.80 0.38 1.07 1.43 2.28 . ..
The matrices T and W require sorne further explanation. Let Xijk
40 1.11 1.40 2.42 1.79 214 4.10
"
denote the value of variable X k for the ¡th individual in the jth sample,
1m:1,:·
41 1.1 1 0.00 0.64 0.72 0.00 1.46 : ~.
xjk denote the mean of X k in the same sarnple, and xk denote the
~·f
!I}:
" ~o
.. ·,1,f
I
..
'l .
~Illr ~ ~~}~~ ;
-~!: \\"!::Y
k~;'S:
• GOrr\?ClríGOl 0\ vmtüUaü \G~ uGYQ~üt ~~~W\\\~ ~1
,i ;
' ",
,..¡'!'
In::i, ¡ I
=: ~
¡
The e1ement in the rth row and cth column of W is
~~
<::l <::l
.,; <u ::
~:;;
, ':I,1!i 1)
o. 11 m nJ
, , 1: E
tU ~
_
~
N
.,:'1':1:
! '¡ i'
, ."'; V'I q•
"O
e ::}::~he test based on equation (3.10) involves the assumption that the
,(' I1 JI,
: .P~ L'1j:
'"o o: distribution ofthep variables is multivariate normal, witha constant
d:!i ,!;1
:o · ;" ,Within~sample covariance matrix. Ilis probably á fairly robust test in
tri,i: ¡!! '"
'':
tU ." i:tbesense that moderate deviations from this 'assumptiondo not
;; :1:1 :!i¡ > '0-_
o ..., =: =: ' :·t nduly affect the characteristics of the test.
¡ :,I .,i ·
:': ¡¡'i d:
f,
eo
e ~~
... ;¡¡ :<.:;:K~.! ~:.; '~.
.¡¡; ::
', 'Ii¡ll ' <::'lo.. :::
~~ :lldl'l' ':
:':4 :' ;, ., :
tU
...
.oS:
O"::' '}~'Coniparison of variarion for several samples
¡ , ' 1' ,·
!i:l. !"!! : :¡ o
(,)
e
~';::-BartIett's test is tbe best known for comparing the variation in several --..
:¡ ¡::I :\:: '"
'': ....-::; i>:~saniples. This test has aIread y been mentioned for the two-sample
i j :1 :', tU ... ' 'situation with several variables to be compared. See Srivastava and ';. ~ ' ,
l. ' 1: '! !
> ,>< ~
'-
1: o ¡ 'Carter (1983, p. 333) for details of the calculations involved. The test
cr. .... ....
';:¡; 'o-~ ~ ~ can be used with one or several variables. However, it does have the
>. "- '- ~
c:;
e so: :::.?w; ~¡:"';¡l problem of being rather sensitive to deviations from normality in the
c;j<:;' I -
...'" ~ E¡:"';¡;;' EW~ distribution of the variables being considered.
o '" """
ü
11 11 .... 11 Bere, robust alternatives to Bartlett's test are recommended, these
..::¿, c:: ...,.
:;... ~ being generalizations of what was suggested for the two-sample
VI
e e
~ j •
,2 situation. Thus absolute deviations from sample medians can be
, :
O
. : {'
",
."'; ,' .:!
.....
M
::.,"" ., C)
calculated for the data in m samples. For a single variable these can be
VI - C. treated as the observations for a one-factor analysis of yariance. A
.c c. E
Q.I
--..
,,~:;jH !' ::c¡
<11
¡...
o
'o . ~
E
'"""
.J:
e
'"E significant F ratio is then evidence that the samples come from
;¡ Il{~ ;; ]
.J:
.-;::, .-;::, populations witb different mean deviations, i.e., populations with
il:¡:l :: ., E .:
'-
o C;
.... different variability. If the <p statistic of equation (3.10) is calculated
:.. ;¡-1:1:,
. ' ;~ I ';', ~ 1;
::: a. ::l
e e e g
09 E .g '" o from the transfonned data for p variables, then a significant result
1'· ·1 ,,¡ ,
-'"Ec..
o
'"u '" El
! ,: :! l;;~;:¡,
.~ VI tU
tU
11
indica tes that tbe covariance matrix is not constant for the m
.... o. .J:
.-;::, ~ ::., 11 ~
- ,1 25 g¡ E
'Q IJ .c
VI .: >< POpulations sampled .
:,) ..
." !/¡-,, .: ~ e
oo '"'"e U
.: o --><~ .w!- - AIternatively, tbe variables can be standardized to have unit
,:,-;rt". 'i,j,',. '1> ._ .;;;N IW; ~ .W! I W~
~ -¡;
. rlr~ \ !I~ .
'''~¡ "Ij~ i':I
¡1" 1:11 ' ~l¡'11
.~ !"1;' 1. 11'1
1
,
~
:s ~
Ji
-5
~ ~ .,- .,
11 IJ
-
>< 1)(-
11
1)0(
-
, variances for all the data lumped togetlíer and d values calculated
".. f using equation (3.9). Tbese can then be anaIysed by a one-factor
;1;'1 ; ¡JI , ,
L'II'I'~
~ :
¡.
. .'il'l" i ll
1I.' ?! f~?:-r
'::~I' Hi~I,: ~;
:¡¡I/:,!:;
: ;·',"'ni'.-
1 , ::J~¡i!! !:
"
, , ;"'1l'
! '."' ';;'
",;·,.1,;,
1 '1" "
'lO T r:.';/s uí s ig llifir; u/l cr! HcJcrcn ccs 41
1: a nal ysis of variil llCC. This gelleraiizes Van Vale n's tes t that was
p = 4, m = 5 and the values of ITi and IWI inio ¡;quation (3.10) then
suggested for comparing the va riation in two multivariate sampJcs. A
l·' ,f '. yields cP = 61.31, with p(m -1) ""' 16 d.f. Thi~ is significantly large at
significant F ratio [rom the analysi s of variancc indicates lhat some of
the 0.1 % level in comparison with the chi-square distribution. There is
!'J ' the m populations samplcd are more variable lhan olhers. As in the
therefore c1ear evidence that the vector of mean values of the four
lin two-sample situalion, lhis test is onl)' really appropriate when sorne
variables changed with time.
samples may be more variable than others for alllhe measurements
lif1 ! being considered.
For comparing th~ amount of variation in the sa~ples it is a
straightforward matter to transform the data into absolute deviations
Iln i from sample medians. Analysis of variance then shows no significant
I'lf!' II Example 3.3 ComparisolJ of samples of Egyptian skulIs difference between the sample means of the transformed data for any
1
~I, !1 of the four variables. The ljJ statistic is not significant for all variables
As an example of the testfor comparing several samples, consider the taken together. Also, analysis of varianc~ shows no significant
IIT~ ! data shown in Table 1.2 for four measurements on male Egyptian difference between the mean d values calculated using equation (3.9).
~l:
skulls for five sampJes of different ages, ,::#
~.
It appears that mean values changed with time for the four
A one-factor analysis of variance on the first variable, maximum '->7
" variables being considered but the variation about the means
Dr! breadth, provides F = 5,95, with 4 and 145 dJ. (Table 3.3). This is ~~,~ ,. remained fairIy constant.
.n-j'¡"1
significantly large at the 0.1% leve! and hence there is clear evidence
that the mean .changed with time, For the óther three variables,
: :.44~· ·
· .l:f...-
~t ¡ \ analysis of variance provides the following results: basibregmatic '{{: 3.9 Computa~on21 metbods
pm:I·: height, F= 1.45 (significant at the 5~ ~ level); basialveolar length .1':' The multivariate tests discussed in this chapter are not difficult to
tm ll:1:l
F = 8.31 (significant at the O, I ~~ level); nasal height, F = 1.51 (not · .~; . program on a microcomputer if standard algorithms are used where
)nr·
.f
significant). It appears that the mean changed with time for the first ·~>:; possible. The T 2 statistic of equation (3.5) requires a matrix inversion.
l:! ~ three variables. ~: This can be done easily using AIgorithm 9 ofNash(1979). The ljJ test of
~, ,! Next, consider the four variables together. If lhe five samples are ' .~ equation (3.10) requires the ca1culation of two determinants. AI-
combined then the matrix ofsums of squares and products for the 150 gorithm 5 of Nash (1979) will provide these.
rrr observations, calculated using equation (3.11), is
p!.r • References
I~ 3563.89 - 222.81 - 615.16
426.73]
T = - 222.8 1 3635.17 1046.28 346.47 Carter, E.M ., Khatri, CG. and Srivastava, M.S. (1979) The effect ofinequality
1·:- of variances on the t-test. Sankhya 41, 216-25.
[ - 615.16 1046.28 4309.27 - 16.40 '
Dixon, \V.J. and Massey, F.J. (1969) lntroduction toStatisticaI Analysis (3rd
Ir 426.73 346.47 - 16.40 1533.33 edn). McGraw-Hill, New York.
Levene, H, (1960) Robust tests for equality of variance. In Contributions 10
1"" .
for which the determinant is ITI = 7.306 x 10 13 • The within-sample Probability and Statistics (Eds 1. Olkin, S.G. Ghurye, W. Hoeffding,
I1r' W.G. Madow and H.B. Mann), pp. 278-92, Stanford Univ. Press,
WI matrix of sums of squares and cross-products is found from equation
California.
1m í:; (3.12) to be .
?" Nash,J.C. (1979) Compact NumericaI Methodsfo[ Computers. Adam Hilger,
Iml; Bristol.
5.33 11.47 Schultz, B. (1983) On Levene's test and other statistics of variation.
. [3061.07
1m.,;; w= 5.33 3405.27 754.00
291.30J
412.53 EL"oIutionary Theory 6, 197-203. .
~.
Srivastava, M.S. and Carter, E.M. (1983) An Introduction to Applied
bn ,( 11.47 754.00 3505.97 164.33 ' Multivariate Statistics. North-Holland, New York.
lID:!',]!:,
291.30 412.53 164.33 1472.13
;I~ ~·
. :3 ..
Van Valen, L. (1978) The statistics ofvariation. Evolutionary Theory 4, 33-43.
(Erratum Evolutionary Theory 4, 202.)
/II,I,'¡ for which the determinant is IW I = 4.848 x 10 13
. Substituting n = 150, . .'t. Yao, Y. (1965) An approximate degrees offreedom solution to the multivari-
01 11.1r'
'dJI} ate Behrens-Fisher problem. Biometrika 52, 139-47.
1
t;-ii
Im,IHi ~:I~
¡:.
\ \\~ .~
1111'1\ ~. i
•.¡ lt
¡":, "
, j i.
,
D\stant~~
. 1:\le. \ w e(}n ' ,'
m01VlOull l' \ ' o bs€:cvo l: on s 43
i !. CHA. ? TER FOUR
i, ; variables XI' X 2>"' , X p' The valucs for individual ¡"can then be
I ;
1 ,
denoted by XiI' X i2 ,·· · ,X¡p and thosefor individualj by Xjl, xj2 , ... , x jp'
, ,
11
Measuring 'a nd testing The problem is' to measure the 'distance' between individual i and
1 ·1 individual j.
I ·1 multivoriate distances " If there are only p = 2 variables then the values can be plotted as
¡ ::i
, shown in Fig.4.1. Pythagoras' theorem then says that the length, d¡j,
¡: .:j " of the line joining the point for individual j to be the point for
Ij 'l individual j (the Euclidean distance) is
lí ,¡,;
l' ,r, . !.: ..' .
I '~ · '! 1
, ¡¡ 1' 1 4.1 Multivariate distan'ces :' dij = J {(Xii - X j ¡)2 + (X¡2 - X j2 )2} .
, '¡:'111 { wolves,jackafs, cuonsand dingos, ir is sensible.t.9 a,sk 'llo,)V far one of .;~lMi;':,~;,. ' .
l l¡',,<, .'1
' :,, '1; 1 ' )1
,'., '¡" ' j'l1
· ," ,1 t, • •
th~se groups is, ~ro~ lhe othei si~ g.roups. The i.d ea then is thát j[ two ~;:::;y:;: : x
ammals have SImIlar mean mandlbIe. measurements then they are .'
~: ¡i.i :y] 2
r¡ ,-¡.,,, I , 'close', whereas if they have rather dilTerent mean n:easur~rn.e?t~ the.n ~--:..&.h~: .
~~ ~ ;"L
1
'they are 'distant' frOm each other. Throughout thls chapter It IS thls :;)~:;~1~~'~:,; :
d ti : ;:1
¡-! ;i:J; ,:; concept of 'di~tance' that is being used. , ' '";::!, ' Xj2
A large number of distance measures have been proposed and used :;;:. :, ' -~I'¡, .-jj'J
I ', .:
'1
,1 in multivariate analyses. Here only sorne of the most common ones - ,
will be mentioned. Tt is fair to say that measuring distances is a topic
wh :,; ~e a cert:lin amount of arbitrariness seems unavoidable.
l . possible situation is that there are n objects being considered. dij ¡
with a number of measurements being taken on each of these, and the
me~surements are of t\Vo types. For example, in Table 1.3 results are
,
i
given for four environmental variables and six gene frequencies for 16
colonies of a butterfiv. Two sets of distances can therefore be
calculated between th~ colonies. One set can be environmental
distances and the other set genetic d istances. An interesting question . " .. '.
',,1--- __ /
--1 (Xi" X¡2)
. 1
1
1
I
i
! l ' I!" ,
,,;,
1
is then whether there is a significant relationship between these two I
I,, 'l.II, :.:¡:,:·¡' c. , • .•
sets of distances. Mantel's (1967) test which is descríbed inSection 4.5 élS',': ,.:
I
I
: 'I! ;.:'¡
' I!' I r !
1: : :
I; :! ; ~
is useful in this context.
I
I
I
I
... "
4.2 Distances between individual obsenations ~.
XiI
::;.' 1; (I!
Xjl
e' Xl
." ,. "1"
To begin the discussion on measuríng distances, consider the simples! Fi~re 4.1 The Euclidean distance between individuals i and j , with p = 2
:it ,.:¡.i¡I¡ii case where there are n individuals. each of which has values for p ,!~~~les.
f
',."", ,1'.",II::
::1'
"I:i¡
I¡ 42 r1~~}:,
1( ' f"
l;.:.,¡ ', I'/'':¡:;:1':I, !:. ¡
'
, -
,'¡I' í':;:; "1
:i) . :.:¡- r
- l";" Ii .
-1"":1 j,I::[,;;U¡"IIlb un(¡ l i.:S LJ n c; IlilJ ¡d VU r. L(; (j¡:;WlIC;CS D¡slunr:c s i){~ tl vt:CJI iIlcJil'idlJul ()1 )~;cn: (/lj(J !I:> 45
¡ "~ r F rom the form 6f.equation (4.1) it is cJea r thal if one of lhl: varia bies
mcas ured is muele more·. variable than the others thcn this will
Xj2
''- dominate the cal~ulation ' of distanccs. For example, to take an
":j
I
'y
X2 ':
- / - - - - -__ (X. "~o 'l'. 'i,l extreme case, suppose that 11 men are being compared and that X) is
their stature and the other variables are tooth dimensions, with all the
ilri '
I measurements being in millimetres. Stature differences will then be in
I
rn the order of perhaps 20 or 30 millimetres while tooth dimension
differences will be in the order of one or two millimetres. Simple
In dij :
;
calculations of d¡j""will then pro vide distances between individuals that
¡:-¡¡ are essentially stature differences only, wi-th tooth differences having
!
'r¡f negligible efTects. Clearly there will be a scaling problem.
In practice it is usually desirable for all variables to have about the
.WI!, -
'o' tI x¡ú-_ / same influence on the distance ca:Jculation. This is achieved by a
-----~ ( Xi!,
Irn : Xi2. Xi3)
preliminary scaling of the variables to standardize them. This can be
I . : """ done, for example, by dividing each variable by its standard deviation
I - :¡~f
I ,~:, for the n individuals being compared.
I
I "',:f-.S :.
" ~"?.
;'" -Jt~".i.,"
' :O."
.~t::"
~ ¿. Example 4.1 Distances between dogs and related species
""!"-:-
I Table 4.1 Standardized variable values ca1culated from the original data in
ni I
Table 1.4.
;n With more than three variables it is not possible to use variable
values as the coordina tes for physically plotting points. However, the Xl X2 X3 X4 Xs X6
two- and three-variable cases suggest that the generalized Euclidean
-0.50 -0.50 -0.74 -0.74 -0.49 -0.61
distance .. -1.52 -1.93 -1.12 -1.39 -0.86 -1.31
~.~.
~l · ..,
,~ ~.~
-< !t
U\S\m\(lS t~\'N~~\\ \\~~~\\~\\~~~ ~\\~ ~~t~1~~~ \,
'<\..1 (.) standard dcviation of 1.572 mm for the seven groups. The standar-
';:: ~,.,~' '-dized variable vaJues are then calculated as follows : modern dog,
2 "'" (9.7 - 10.486)/1.572 = - 0.50; golden jackal, (8.1 - 10.486)/ 1.572 =
g~
'".... -1.52; . .. ; prehistoric dog, (10.3 - 10.486)/ 1.572 = - 0.12. Standar-
~
dized values for all the variables are shown in Table 4.1.
• Using equation (4.1) the distances shown in Table 4.2 have been
calculated from the standardized variables. 1t is clear that the
o
prehistoric dogs are rather similar to modern dogs in Thailand.
""
::
'0
I~ , 1ndeed, the distance between these two groups is the smallest distance
in tbe whole tableo (Higham et al. (1980) concluded from a more
.,;
c. complicated analysis that the modem and prehistQric dogs are
;;
o.... 000
indistinguishable.)
. ;.-
"""""'
C() § oqr:
(i 1
c;¡
E
e ·i;: ~3·:Distances
--
... \;
.~
between populations and samples -...... ~ .'~
c:
e
v
;-
>/A number of measures have been proposed for the distance between
v .§~ ~
-.-. r- o- :' ·two rÍmltivariate populations when infonnation is available on the
'"e
-
~
""'=
:: ~- ~(,,¡M
v
v
: means, variarices and covariances of the populations. Here only two
~ '. Mil be considered.
Ü
.o ;" -{ Suppose that g populations are ava,iIable and the multivariate " i ":"7
en
V
U ""
~::::"
distributions in ·these populations are known for p variables
e -'<I''''N
C"""'i C' l.f"\ 1.1"')
c: .~ ~ ("¡ -.i ..; vi X¡,X 2 , ... ,X p• Let the mean ofvariable X k in the ¡th population be
.~ D - Jlki, and assume that the variance of X k is the same value, Vk , in all the -----
e populations. Penrose (1953) proposed the relatively simple measure
.g
.:...
Ü
;J
¡¡.¡
'"
;::
~~
-e u
<:
o- C' ' " ..,. 00
\O~~"""'"
¡--:vi""""N
Pij = f (Jlki - JlkY (4.2)
N \;l ....., k=¡ pVk
..
..;
::co: for the distance between population i and population j .
f-
i: A disadvantage of Penrose's measure is that it does not take into
... "'"
""'= o
f"--\O("""'\:ON
ooo-c-c-.or-
account the correIations between the p variables. This means that
0"",= Nvi",,-':-':o
~ when two variables are measuring essentially the same thing, and
hence are highly correlated, they stilJ individuaIIy both .cantribute
about the same amount to population distances as a third variable
that is independent of aIl other variables,
-
""'2::::"::::"
.g g ~"O
"'o"
""'=
.~
A measure that does take into account correlations between
variables is the Mahalanobis (1948) distance,
...
:: ....., ... ~ o
,,' ~ ~ ~ § :: e .~ '""'
-:.;
lo!.:
p
15 ::: .:: ~ ~ ~~ p p
. .- ~t3D~d25~ :,:
'.
L L1 (Jl,.i -
D& = r=l.s= Jlr¡)Ú· (p..i - Jl.j), (4.3) -"'~l
"! i: _....'.;.'-:~~
:; !:
," "" ""
. ¡
4r. Mr: us uring Clnd If: s ting ITIlJ)livuriul!: cJ is lfln r: es Dislonccs D(:il-v' ccn populatiulls and somplcs 49
¡:
1
whcre [ :r s is th(; cJcn-:~nl in the rlh ro\\' ancl sth colurnn of the in\'erse of Jhe equations (4.2) to (4.5) can obviously be used with sampl e data
1- the covariance matrix for thc p variables. This is a quadratic form that iféstimates of population mean s, varianccs and covariances are used
:
I ~ ~
can be written in the alternative w a y : in place of true values. In that case the covariance matrix Y involved
I in equations (4.3) and (4.4) should be replaced with the pooled
11i D"G = (pj - p)'y-I(pj - p), (4.4) estimate from all the samples available. To be precise, suppose that
there are m samples, with the ¡th sample being of size nj, with a sample
ni where covariance matrix of Cj. Then it is appropriate to take
!TI
¡1!
pj = Jit'
JiljJ I
C ~ JI (n¡ - l)C¡ ¡~I (n¡ - 1)
m m
(4.6)
I ~¡ . t
[ Ji p, as the pooled estimate of the eommon covariance matrix. The single-
, .. Pr ,; ! .'. , sample covariance matrix C¡ is said to have n¡ - 1 degrees offreedom,
I~Jf ~ is the vector of means for the ith population and Y is the covariance while C has a total of ¿(ni - 1) degrees of freedom. The sample
Iir! !;
matrix. This measure can only be ealculated if the population t'.. ' covariance matrices should be caJculated using equations (2.2) to
covarianee matrix is the same for al! populations. . ~~' (2.7).
Ifi t~ ! ~ The Mahalanobis .distance is frequently used to measure the '.~':: .::
',- , In principIe the Mahalanobis distance is superior to the Penrose
distance of a single multivariate observation from the centre of the "
lrrr; :; : ~
distance beca use it uses infonnation on covariances. However, this
population·that the observation comes from. If X I ,X 2 , ••. ,x p are the advantage is only present when covariances are accurately known.
rm i! ' values of XI' X 2"'" X p for the individual, with corresponding When covariances can only be estimated with a smaIl number of
~.'~
U· population mean values of Jil' Ji2"'" Ji p , then ::Q. ' degrees of freedom it is probably best to use the sim.pler Penrose
~{-- measure. It is difficuIt to say precisely what a 'small number of degrees
)n:r11; p p
~
1 ~:--- .(
~.~ l'!
~'1~3
·1
~ 1t,¡H
"r.¡I\ ¡i
\q , , o.
~
,,
~.'~"~'
\
•.
fJU 1: 1,.
. t ¡~
SIJ MC;O Si.\\' ir\?:, und tcs\in,¿¡ mul '. ;\!uríalc di5tan cC5 O\~t\\m~~,~ \\~~Wt~,\\ \\~\\\\\\\\\\\\\K ~,\\\\ U\\\\~\~~ , ~~
whilc the cov a ri z¡ ncc matrices, ca1culat~9., as indi cated by equation Pcnrose's distance meél 'i UreS of equ a tioll (·;.2) ca n now be ca l-
. ~ :. cul a tr.c.l bet ween each pai r of samplcs. Thcre are p = 4 variables with
(2.7), are
!$i" , .- variances that are estima ted by VI = 21.111, C'2 = 23.485, V3 = 24.179
. and V4 = 10.153, these being the diagonal terms in the po oled
[26.31 4.45 0.45
-0.79 725]
0.39 covariance matrix. The sample mean values given in the vectors Xl to
" :: el = 4.15 19.97
-1.92 ' , Xs are estimates of population means. For example, the distance
"
j ! ' :' 0.45 - 0.79 34.63
7.64 between sample 1 aJ1d sample 2 is calculatsd as
7.25 , 0.39 - 1.92
r'i : ~ J:'.-,' _ (131.37 - 132.37)2 (133.60 - J 32.70)2 .-.
11L :,
I '!
[3.14 1.01 4.77 5.62
1.84] . . .. '; ._
;' , .. ' ,
P12 -
4 x 21.11 J
+ -'--,---:--~-'--
4 x 23.485
e = 1.01 21.60 3.37
,
'ir:.
l i :
2 4.77 3.37 18.89 0.19 :i .F ' (99:17 - ·99.07f« 50.53 - 50.23)2
, 1:,
, 1, ' 1.84 5.62 0.19 8.74 }.f:?i.: x
+ 4 24.179 + ' -
-
~ ·~~.:-· i. :~ t.:.: : -'
-;r1:i~~~~~?:
'"
¡::
: 1' \1
li' I\' ; ~
: 1:
l :'. ~
e =
3
[IW _
0.79
0.78
0.79 , -0.78
24.79
3.59
3.59
20.72
090]
-0:09
1.67 '
= 0.23.
"'f.;~;;i.This only has meaning in comparison with the distances between the
;~;~t other pairs of samples. Calculating these as well pro vides the
-...
1.67 12.60
!¡':: l¡r: 0.90 -0.09 , , ::fJóllowing dislance matrix:
,""
t 1,
2.05] i~b\'. ~ ,
!I;:!: [ 15.36 Ear/y pre- Late pre- 12/ 13th
"
- 5.53 -217 ..
", 1,·
:1; il e
4 -
_ - 5.53
_ 2.17
26.36
8.11
8.11
21.09
6.15
5.33
; -' -.~)~'.1",:,. ;~....
: .... ;
dynastic 'dynastic dynasties Pco/emaic Roman
[ -28.63 -0.23
24.71
- 1.88
11.72
- 2.15
1.99] Pro/emaic
Romal!
0.493
0.736
0.404
0.583
0.103
0.244 0.066
0.23
¡; and C s = _ 1.88 11.72 25.57 0.40 .
It will be recalled from Example 3.3 that the mean values change
iI ¡¡ ": - 1.99 2.15 0.40 13.83 . significantly from sample to sample. The Penrose distances show that
, I ! . , the changes are cumulative oyer time: the samples that are c10sest in
Although the five sample covariance matrices appear to differ ' time are re\a tive\y' similar whereas the samples that are far apart in
I '1 : 1
somewhat, it has been shown in Example 3.3 that the difTerences are : time are very difTerent.
1:1
:llq not significant. lt is therefore reasonable to pool them using
equation (4.6). Since the sample sizes are aH 30 this just amounts to
<.. }"urning next to Mahalanobis distances, these can be ca\culated
'j"" y, f~om equation (4.3). The inverse ofthe pooled covariance matrix C is
¡'1 1' taking the average of the five matrices, which is - " :~:~~ : :.' ~'
¡.¡" I: 0.0011 0.0001
'1' : 21.111 0.037 0.079 ,'" [ 0.0483 - 0.0099]
ii' , 1 2.009J e- 1 0.0011 0.0461 - 0.0094 - 0.0121 .
't ! e= 0.037 23.485 5.200 2.845
= 0.0001 - 0.0094 0.0435 -0.0022
0.079 5.200 24.179 1.133 .
[ ,?;;¡ ~~ " - 0.0099 - 0.0121 - 0.0022 0.1041
1; 10.153
2.009 2.845 1.133 i:~t?~?;" ,
ir
,lw: with L (ni - 1) = 145 degrees of freedorn.
r:~Using this and the sample means gives the distance from sample I to
:- -:7 ~ . 0,
ri I
i~ ~, i
"ll
1,l"1 1 :1 .-:::..'
'"
M(:(Js~JriTlg O!lc/ ksting I1llJltivoriut u di s tUllC l:S
~ ')
~)~
T}¡c Monte! test on clistoncc mo tri ces re ')
J.;
samp!t: 2 lo bt: Various indices of distancc havc becn proposed with lhis t.ypc of
proportion dala. For example,
r' Di 2 = (131.37 - 132.37)0.0483(131.37 - 132.37)
K
+ (131.37 - 132.37)0.0011 (133 .60 - 132.70)
¡r ¡'
+ ... - (50.53 - 50.23)0.0022(99.17 - 99.07)
dI = ¿ /Pi -
i= 1
Qi//2, (4.7)
ni
+ (50.53 - 50.23)0.1041(50.53 - 50.23) which is half of the sum of absolute proportion difTerences, is one
1, " 1'
= 0.091. possibility. This takes the value' 1 when there is no overlap of classes
ni and the value O when 'Pi = qi for all i. Anotner possibility is
ni, ; Calculating the other distances between samples in the same vJay
1)
ni· i
q;:¡
provides the distance matrix:
'1:1 1¡.:: 1
I
~t( " The merits of these and other distance measures for propo~tion
,'
- ~. Ear/y predynascie
nltr; Lace predynascie 0.091' ~~:: data have beep debated at length in the scientific literature. Here aIl
111 1t~ 12/13th dynascies
Peo/emaie
0.903
1.881
0.729
1.594 0.443
'~r that needs to be noted is' that a large number of alternative measures
S. ·
I',! !
. I
Roman 2.697 2.í76 0.911 0.219 'i;.: existo It must be hoped that for particular applications it does not
.&:~ matter much which one is used.
-
'1 1
Illtl>¡ ' 1~ I A comparison between these tlistances and the Penrose distances i~~,
f:Jr: shows a very good agreement. the Mahalanobis distances are three %3:.
to four times as great as the Penrose distances. However, the relative ::. ·45 The Mantel test 00 disfunce matrices
distances between samples are almost the same for both measures. A rather useful test for comparing two distan ce matrices was
For example, the Penrose measure suggest that the distance from the introduced by Mantel (1967) as a solution to the problem of detecting
~:. . ,'". I¡
~ early predynastic sample to the Roman sample is 0.736/0.023 = 32.0 space and time cJustering of diseases.
timb as great as the distance from the early predynastic to th'e late To understand the nature of the procedure the following simple
predynastic sample. The corresponding ratio for the Mahalanobis example is helpful. Suppose that four objects are being studied, and
, : measure is 2.697/0.091 = 29.6. that two sets of variables have been measured for each of these. The
'1 -1
,¡
first set of variables can then be used to construct a 4 x 4 ma trix where
the entry in the ¡th row andjth column is the 'distance' between object
jr: ¡ 4.4 Distances based upon proportions i and object j. This distance matrix might be, for example,
1;-[
A particular situation that sometimes occurs is that the variables
h'q ;! being used tq measure the distance between populations or samples ,:¡
mil ml2
1
m 3 mI] [0.0 1.0 1.4 0.9]
Ind -¡ are proporiions whose sum is unity. For example, the animals of a
'J "
M = m21 m22 m23 m24 = 1.0 0.0 1.1 1.6
certain species might be cJassified into K genetic cJasses. One colony [ m31 m32 m3) m 34 lA 1.1 . 0.0 0.7
might then nave proportions PI of cJass 1, P2 of cJass 2, ... , PK of cJass -.:,.. m41 m42 m43 m4 0.9 1.6 0.7 0.0
K, while a second colony has proportions ql of class 1, q2 of class
2, ... , qt of cJass K. The question then arises of how similar the It is symmetric beca use, for example, the distance from object 2 to
cQlonies are in genetic terms. t~
~ -;
object 3 must be the same as the distance from object 3 to object 2 .
'f~
"1 :'
,.
MCQf;Uring amI testing multivariatc dislu!1ceS The Mantel test on distancG motrices " ,
' 54
(1.1 un:ts). Diilgonal elcmcnts are zero since thcse represent distances or¿:::red matrices E R , 'I-lence the observed ,.Z , will be a typical
[;om objects to themselves. randomized Z value. On the other hand, ifthe two d istance measures
,~ The second set ofvariables can also be used to construct a matrix of have a positive correlation then the observed Z will tend to be larger
distances between thc objccts. For the example this will be taken as than values given by randomization. A negative correlation between
distances should not occur but if it does then the result will be that the
i e l1 observed Z value wiII tend to be low when compared to the
e12 el3 e14J [0.0 0.5 0.8 0.6J
I e 22 e 23 e 24 _ 0.5 0.0 0.5 0.9 randomized distribution.
¡. E = e21 With n objects there are n! difTerent possible orderings of the object
i ' e32 e33 e34 - 0.8 0.5 0.0 0.4
i [ e31 .c,;;.;;:.numbers·lhere are therefo.re n! possible randomizations of the
¡. e 41 e42 e43 e 44 0.6 0.9 0.4 0.0
:i~;¿:'ei~rñé'nt~'of E, sorne .o fwhich niight 'give thesame· Z ·vaiues.Hence in
I' Like M, this is symmetric with ze!os' down the diagonal.
::-c~":ol{r"eXample with four objects the randomized'Z distribution has
,~A! = 24 equalIy likely values. It is not too difficult to caIculate al! of
Mantel's test is concerned with assessing whether the elements in M
c: ·these. More realistic cases might involve, say,ten objects, in which
and E show correlation. Assuming n x n matrices, the test statistic
I r,
>2i case the number of possible Z values is lO! = 3,628,800. Enumerating
'~,:':;' áll oC these then becomes impractical and there are two possible .
¡ i:
; "
Z
" ¡-1
= ¡=I 2j=I 1m¡J-€ij (4.9) , - ~: approaches for carrying out the Mantel test. A large number of
.~. randomized E Rmatrices can be generated on the computer and the
,
------
d
11. .· resulting distribution of Z values used in place ofthe true randomized
;j
!i
,I
"
is calculated and compared with the distribution of Z thatis obtained ' distribution. Alternatively, the mean, E(Z), and variance var(Z), ofthe .,
by taking the objects in a random order for one ofthe matrices. That is ., ' randomized distribution of Z can be caIculated, and
to say, matrix M can be left as it is. A random order can then be chosen
for the objects for matrix E. For example, suppose that. a 'random g = [Z ..;" E(Z)]/[var(Z)] 1/2
ordering of objects turns out to be 3,2,4,1. This then gives a
randomized E matrix of. can be treated as a standard normal variate.
Mantel (1967) provided formulae for the mean and variance of Z in
0.0 0.5 0.4 O.SJ the null hypothesis case of no cOfíelation between the distance
0.5 0.0 0.9 0.5 measures. There is, however, sorne doubt about the validity of the
¡ ER = 0.4 0.9 0.0 0.6 normal approximation for the test statistic g (Mielke, 1978). Given the
i
[
0.8 0.5 0.6 0.0 ready availability of computers it therefore seems best to perform
l. randomizations rather than to rely on this approximation.
f The test statistic Z ofequation (4.9) is the sum ofthe products ofthe
The entry in row 1, column 2 is 0.5, the distance between objects land
I t' elements in the lower diagonal parts of the matrices M and E. The
l
2; the entry in row 1, column 3 is 0.4, the distance between objects 3
.,~
1f
and 4; and so on. A Z value can be calculated using 1\1 and ER•
Repeating this procedure using difTerent random orders ofthe obje,~ts
only reason for using this particular statistic is that Mantel's
equations for the mean and variance are available. However, if it is
;¡ for E R produces the randomized distribution of Z. A check can then decided to determine significance by computer randomizations there
is no particular reason why the test statistic should not be changed.
be made to see whether the observed Z value is a typical value from
i this distribution.
The basic idea is that if the two measures of distance are quite
Indeed, values of Z are not particularly infqnnative except in
comparison with the mean and variance. It may therefore be more
useful to takethe correlation between the lower diagonal elements of ,---.
unrelated then the matrix E will be just like one of the randomly
I f' i
:1:,
,
".
~I : lo
I
\: j '\
:,f~ Meus urin g lind tesl in'g multivari(ll e di:itunc;P'5
,
\! a nalanobis, p,e (1943) Historical note on th e D 3 -statistic, Sankhya 9, CH A PTE H, FIVE
.!<;.
237, ..
:'"
~I
Penrase, L.W. (1953) Distance, size and shape. Annals of Eugenics 18, 337-43.
Romesburg, H .e. (1984) Cluster Analysisfor Researchers. Lifetime Learning
P ublications, Belmont, California: ' -
'11I ,'l .
\ :1 "; ,, 5.1 Definition of principal components
I ¡:
I
1·
The technique ofprincipalcomponent analysis was first <Iescribed by
! ti Karl Pearson (190t). He appafentIy believed that this ..vas the correct
"l ' solution to sorne of the problems that were of interest to biometricians
\ 1
I¡(ir
¡' at that time, although he did not propose a practical method of
" ·: calculation for more than two or three varüibles. A description oC-
:j~ practical computing methods carne much later from , Hotelling
.1
~ :'
~
"1\
.1 :
. (1933). Even then the calculations were extremely daunting for more
than a few variables since they had to be done by hand. It was not
;1 : until electronic computers became widely available that the technique
¡t! • achieved wi despread use.
Principal component analysis is one of the simplest of the
multivariate methods that will be described in this book. The object of
the analysis is to take p variables Xl' X 2"'" X p and find combin-
ations ofthese to produce indices Z l ' Z 2"'" Z p that are uncorrelatcd.
The lack of correlation is a useful property beca use it means that the
indices are measuring different 'dimensions' in the data. However, the
indices are also ordered so that Z 1 displays the largest amount of
variation, Z 2 displays the second largest amount of variation, and so
::!¡: ~
¡ on. That is, var (Z ¡) ~ var (Z 2) ~ . .. ~ var (Z p), where var (Z¡) denotes
: I I
, I
the variance of Z¡ in the data set being considered. The Z¡ are called
í/I'l .
,: ' 1
~
, the principal components. When doing a principal component
.' analysis there is always the hope that the variaÍlces of most of the
indices will be so low as to be negligible. In that case the varia tia n in
!i¡I the data set can be adeq~a.tely described·by the few Z variables with
I,1:l'!'
variances tbat are not negligible. Sorne d~gree of economy is then
achieved since the variation in the p original X variables is accounted
. for by a smaller number of Z variables.
Id
I, 1i 1 '"
i :
! ~ , (jO Principol r:urJI¡JUnr.nf ollol}'sis 1 rncec1urc ful' o priIlcip 1 CUm]iOIlcnt onol y ,;i,c; GJ
! ¡ .... ¡.
after they have bcen standardized to have zero mcans and unít
Jt must. be stresscd that n píincipal component analysis doe s not
l' standard deviatíons. Cleady 2 I is essentially just an average of the
[
) ,
alwayswork in the sensc that a large number of original variables are
redueed to a small number of transformed variables. Indeed, ir the standardized body measurements. lt can be thought of as a simple
~J . index of size. The analysis given in Example 5.1 leads to the
original variables are uncorrelated then the analysis does absolutely
nothing. The best results are obtained when the original variables are conc1usion that most of the dilTerences between the 49 birds are a
very highly correlated, positively or negatively. If that is the case thcn matter of size (rather than shape).
it is quite conceivable that 20 or 30 original variables can be
adequately represented by two or three principal components. Ifthis 5.2 Procedure for a principal component a~alysis
desirable state of alTai rs does oecur then the important principal
components will be of sorne interest as measures of underlying A principal component analysis starts with data on p variables for n'
'dimensions' in the data. However, it will also be of value to know individuals, as indicated in Table 5.2. The first principal component is
that there is a good deal of redundancy in the original variables, with then the linear combination of the variables X l' X 2"'" X P'
1 :¡ most of them measuring similar things.
ni: '1 :: Before launching into a description ofthe calculations ,involved in a
~
2 1 = a 11 X 1 + a 12 X 2 + ... + a 1;XP
, : ¡ j
.....'
n11; 1:':
i 1\ "
principal component analysis 'i t mayt>e of sorne value t~ look briefly ',:}
' -.. '"
ltd : F" that varies as muchas possible for the individuals, subject to the
u.. ,",¡i at the outcome of the analysis when it is applied to the data in
:t" condition that
Table 1.1 on five body measurements of 49 female sparrows. Details of
illt;
. ¡'. !
': i
1 ~ the analysis are given in Example 5.1. In this case the five measure-
-;;.;' .
I !
2 2
all+a12+· 2'
··+alP= 1.
ilI;: ~ ~ .~ ments are quite bighly correlated, as shown in Table 5.1. This is -,
f:; ' 1'
~:: Thus
) 11[ , •
therefore good material for the analysis in question. It tu ros out, as we the variance of Z var (2 is as large as possible given this
''l,<! ! l' 1 ),
. 1( r shall see, that the first principal component has a variance of 3.62 constraint on the constants a The constraint is introduced because
nI" ~ I . lj•
whereas the other components all have variances very much less than ir this is not done then var (2 1 ) can be increased by simply increasing
'¡
ni : this (0.53,0.39,0.30 and 0.16). This means that the first principal any one of the a¡j values. The second principal component,
)
component is by far the most important of the five components for
,,,' ¡
' . -1
'
representing the variation in the measurements of the 49 birds. The
Z2 = a21 X 1 + a22 X 2 + .. . + a2p X p'
j!-;-: '
! " first component is caJculated to be
is such that var(Z2)is as large as possible subject to the constraint
!, ,
2 1 = 0.45X 1 + 0.46X 2 + 0.45X 3 + OA7X~ + OAOX 5. that
"
aL + a~2 + .. , + a~p = 1.
where X 1 ,X 2 "",X 5 represent here the measurements in Table 1.1
r¡:,i i
Table 5.2 The form of data for a principal compo-
nent analysis.
I;¡-- ! Table 5.1 Correlations between the five body measurements oC female
sparrows caIculated from the data of Table 1.1. Individual Xp
XI Xl
Variable , XI Xl X3 X4 X5 x lp
Xli X12
X l' total length 1.000 2 Xli X21 X lp
X l' alar extent 0.735 1.000
X 3. length of beak & head 0.662 0.674 1.000
X 4' length of humerus 0.645 0.769 0.763 1.000 ; ~"
n Xol X.2 X. p
X 5' length of keel of sternum 0.605 0.529 0.526 0.607 1.000
'JI:
~. ~.-.
: ~;
: ~) .
~'~~"
,",
(l',,i',\ 1:12 Principal r.omponcnt ol\n'ysis i ProGeduro ¡al' u principal componenl anaJysis S3
,; i
and also to the condition Ihal Z I and Z 2 are uncorrelalcd, The Ihird
I
'¡
, ---"""
An important property ofthc eigenvalues is that they{~~ d up\o the
,
Li principal component, sum of the dtágonal elemcnts (the trace) of C. That is ~S"J ,r )'
:;! ' " r,
¡ ,'.':,:" ,'~:,:
?rincipal componen ts account for all of the variation in the
ongmal data.
In order to avoid one variable having an undue influence 'o n the
'"
~,
in Chapter 2. The important equations are (2.2), (2.3) and (2.7). The ""
matrix is' symmetric and has the form Cpl Cp 2 1
I
e ll el lP where cij = cji is the correlation between X¡ and X j' In other words,
C= e 71 e2 C2p
e ] the principal component analysis is carricd out on the correlation
. ,
, [ matrix. In that case, the sum of the diagonal terms, and hcnce the sum
"
/' : c pl Cp c pp of the eigenvalues, is equal to p, the number of variables.
:~ . The steps in a principal component analysis can now be stated:
" ,¡ where the diagonal element Cji is the variance of Xi and Cij is the
1. Start by coding the variables XI' X 2'"'' X p to ha ve zera means
< ~ : ,! 1 covariance of variables Xi and Xj'
;¡ , ' ! and unit variances. This is usual, but is omitted in sorne cases.
':1 : 1 The variances of the principal components are the eigenvalues of
2. Calculate the covariance matrix C. This is a correlation matrix ir
the matrix C. There are p of these, sorne of which may be zera.
step 1 has been done.
i!il l . 1 Negative eigenvalues are not possible for a covariance matrix .
Assuming that the eigenvalues are ordered as ;'1 ~ ;'2 ~ ... ~ ;,p ~ O,
¡ ,
3. Find the eigenvalues ).1' A2 •••• , Ap and the corresponding eigenvec-
I tors a l' a 2 , .•• , al" The coefficients ofthe ¡th principal component are
.1111\ I I ", then ;.¡ corresponds to the ¡th principal component 1,
.:; I ¡ ,'-
. then given by a i while J.¡ is its variance.
4. Discard any components that only account for a smaIl proportion
, I Z¡ = ailX I + ai2 X 2 + ... + Q¡pX p.
ofthe variation in the data. For example, starting with 20 variabfes
~
it might be found that the first three components account for 90~~ ~ '"
In particular, var(Z¡} = J' i and the constants ail'ai2 ... . ,aip are the
~ , .' of the total variance. On this basis the other 17 components may
elements of the corresponding eigenvector.
,~, 3ir '
reasonably be ignored.
r • ------
'~f ..,
.. ~: .,
:' ~:'
1 :,!- "
'lO . .. .
---.
;'
f
G'l l'ri /le i!w I: C()/lI fJO /, (~ n Ion CJ ly;; i s Proce durc [(JI' o prin ci[Jo] r:o mJ! () /I ~ nl ur: c ]}'s i ~ GS
1 ..
E X(//llplc-5.1 Body nwaSlIrcmenI S r{!cmale sparro\\".<; Anothcr way of .1 ooking al the relalive importan cc of princi pal
l' Some mentíon has alrcady bcen made of what happens when a
componcnts is in terms of their variance in comparison to the
variance of the original variables. After standardization the original
principal component analysis ís carried out on the data on fivc
variables all have variances of 1.0. The first principal component
body measurements of 49 female sparrows (Table 1.1). It is now worth
therefore has a variance of 3.616 original variables. However, the
while to consider the example in more detail.
second principal component has a variance of only 0.532 of that of
It is appropriate to begin with step 1 of the four parts ofthe analysis
one ofthe original variables. The other principal components account
that have just been described. Standardization of the measurements
for even less·va~riation. .
ensure that they all have equal weight in the analysis. Omitting
The first principal component is
. 1 standardization would mean that the variables XI and X 2 ' which
vary most over the 49 birds, would tend to dominate the principal
components. ZI = 0.452X 1 + 0.462X 2 + 0.451X 3 + 0.471X 4 + 0.398X 5'
The covariance matrix for the standardized variables is the
correlation matrix. This has already been given in lower triangular ~: where Xl to X 5 are standardized variables. This is an index'ofthe size
form inTable 5.1. The eigenvalues ofthis matrix are found to be 3.61 6, . ofthe sparrows. Jt seems therefore that about 72.3% ofthe variation in
0.532, 0.386, 0.302 and 0.164. These add to 5.000, the sum of the ::i' the data are related to size differences.
.diagonal terms in the correlation matrix. The corresponding eigen- Tbe second principal component is . .
'-
vectors are shown in Table 5.3, standardized so that the sum of the
squares of the . coe"fficients is unity for each one of them. These Z2 = -0.051X l + 0.300X 2 + 0.325X 3 + O.l85X 4 - 0.877X 5 •
eigenvectors then provide the coefTicients of the principal ... -
components. ", .~:::- Tbis appears to be a contrast between variables X 2 (alar extent), X 3
a
The eigenvalue Tor principal component indicates the variance (length of beak and head), and X 4 (length of humeros) on the one
i: that it accounts for out of the total varianc.e.sJ)f 5.000. Thus the first hand, and variable X 5 (length ofthe keel ofthe sternum) on the other.
- "-
Inqi, :~ principal component .accoun ts fqr (3.616/5.000) 100% = 72.3~~, the
'. second for 1O. 6~~. the th rrd for 7. 7,/~, the fourth for 6.0,/~, and the fifth
That is to say, Z 2 will be high if X 2' X 3 and X 4 are high but X 5 is
11: .'
l
i
for 3.3%. Clea rly the first component is far more important than the
lov.'. On the other hand, Z2 will be low if X 2' X 3 and X 4 are low but
X 5 is high. Hence Z2 represents a shape difference between the
others. sparrows. The low coefficient of X 1 (totallength) means that the valúe
of this variable does not affect Z2 very mucho The principal
Table 5.3 The eigenvalues and eigenvectors of the correlation matrix for live
I. , !' measurements on 49 female sparrows. The eigenvalues are the variances of
the principal com ponents. The eigenvectors give the coemcients of the
components Z3,Z4 and Z5 can be interpreted in a similar way. They
represent other aspects of shape differences.
lll:, ; standardized variables. The values of the principal components may be useful for further
Irm: ; : analyses. They are calculated in the obvious way from the standar-
Eigent'ecror, coefficient of dized variables. Thus for the first bird the original variable values
~I!,, : . : are Xl = 156, X2 = 245, X3 = 31.6, X 4 = 18.5 and X s = 20.5.
Component Eigent'alue XI X2 Xl Xs
X" =
Tbese standardize to Xl (156 - 157.980)/3.654 =-
0.542, x 2 =
1 3.616 0.452 0.462 0.451 0.471 0.398 (245 - 241.327)/5.068 = 0.725, X3 = (31.6 - 31.459)/0.795 = 0.177,
2 0.532 -0.051 0.300 0.325 0.185 - 0.877 X 4 = (18.5 - 18.469)/0.564= 0.055, and X s = (20.5 - 20.827)/0.991 =
3 0.386 0.691 0.341 - 0.455 - 0.411 - 0.179 - 0.330, where in each case the variable mean for the 49 birds has
V
I
:.::..
, ,;(
~
i '.
66 Principul Gomponenl (JJlajy~i5 \ ProGcdure lor a ?rlnG¡~,ql ~Gm~\\\\~\\\ ~~~\~~\~ ~\
compo¡;cnt for th e lirst bird is th ercfon:; \
"
1' ;
¡ Zz
2
j ~ ..
Z 1 = 0.452 x ( - 0.542) + 0.462 x 0.725 + 0.451 x 0.177
+ 0.471 x 0.055 + 0.398 x ( - 0.330) .1
I" •
•
11
i¡ ! ';.
0
"
= 0.064.
I •
•• • •
• 0.
•
• o·
••
,
' t¡~
"
first 21 of them recovered while the other 28 died. A question oJ sorne
interest is therefore whether the survivors and non-survivors show
I :-:;;;;; '. pnnclpal components, ZI and Z2' (Open Clrcles mdlcate surVlvors, c\osed
! ;'jt>.-· circles indicate non-survivors.) .
"
T1: any differences. It has been shown in Example 3.1 that there is no
j ?:~f:'
:;.:~::t.;.
.
. :>? :~.{..
; 1,; evidence o(any differences in mean values. However, in Example 3.2 A ~ "'" ,
deviations ·from medians (described in Chapter 3) gives-a significant
j: ...~;..:~•. difTerence (just) between the variation of principal component 1 for
i
¡¡ i'
1 it has been shown that the survivors seem to have been less variable .""
than the non-survivors. The situation can now be considered in terms . survivors and non-survivors on a one-sided test ~at._the 5~~ leve!. The
r o"
of principal corñponents. assumption for the one-sided test is that, if anything, non-survivors
!t
¡; The means and standard deviations of the five principal compo- were more variable than survivors. The variation is not significantly -..
ncnts are snown in Tabie 5.4 separately for survivors and non- different for survivors and non-survivors wilh Levene's test on the
survi vors. t:one of the mean differenccs bet\Yeen survivors and non- other principal components. Since principal component 1 measures
sl!r\'ivors is significant on a t test and none of the standard deviation overall size, it seems thal stabilizing selcction may have acted against
differenccs is significant on an F test. However, Levene's test on very large and very small birds.
Figure 5.1 shows a plot of the values of the 49 birds for the first t\Yo
principal components, which between them account for 82.9~~ of the
Table 5.4 Comparison between survivors and non-survivors in terms of
means and standard deviations of principal components.
variation in the data. The figure shows quite clearIy how birds with "
extreme values for the first principal com ponent failed to survive.
1" .' ! Afean Standard dedacion Indeed, there is a suggestion that this was true for principal
;, "
com ponent 2 as wel!.
....-.
; ..
!! Principal Non- Non-
1,
~J-:-
componenc SurriL'ors sun';L'ors SurviL'ors surL'Írors
:i l -0,100 • 0.075
~:
Example 5.2 Employmem in European countries
j! 0 1.506 2.176
As a second example of a principal component analysis, consider the
I ,
1: '
2
3
0.004
-0.140
-0.003
0.105
0.684
0.522
0.776
0.677 data in Table 1.5 on the percentages of people employed in .nine "
4 0.073 -0.055 0.563 0.543 "'.
,.' ; 11: : industry secto.~~ in Europe. The correlation matrix for the nine
, 5 0.023 -0.017 0.411 00408 ¡ .~ :: .;...~ -,
variables is shown in Table 5.5. Overall the values in this matrix are
:1 1¡" ~ ~?f: -..
'o I
;;Ir
;{!~("
; Il i: ~:F;'
I
. : :~. .
·,n ·
li ; I_';¿
P['uc edure fuI' O prj¡l(;ipo/ compoIH:nl UI1u/y si s "O
u.,
1...
l' . C'" ~
nol particularl y hig:h, which indica tes lha! several principal compo-
'1; Ú
eIJ :- nents will be requ ired to account ror the variation.
.r= The cigenvalues of the correlation matrix, with percenlages of the
rl'
,
"O
....
total of 9.000 in parantheses, are 3.487(38.7%), 2.130(23.6%),
'¡ rili '"o~ v:: ;;2:00:.
":,-.c
1::.. CC> oro 1.099 (12.2%), 0.995 (11.1 %), 0.543 (6.0%), 0.383 (4.2%), 0.226 (2.5%),
'C v:: ....:¿
nli, , ó
0.137(1.5%), and 0.000(0%). The last eigenvalue is exactly zero
beeause the sum of the nine variables being analysed is 100% before
c..
§ X-.c
o....
;:l
(.:.J
C
-
<:
~
O"<T
-
"":ci¿
C'I
I
standardization. The eigenvector corresponding to this eigenvalue is
precisely this sum which, of course, has a zero variance. If any linem
combination of the original variables in a principal component
'"
o
'C
r:
e:::
I.:..l § \O r- occe
\o · N
MV) -
analysis is constant then this must of necessity result in a zero
;:l v:: ~óó¿
o
(,)
eigenvalue.
This example is not as straightforward as the previous one. The first principal component only accounts for about 40% of the variation in the data. Four components are needed to account for 86% of the variation. It is a matter of judgement as to how many components are important. It can be argued that only the first two should be considered, since these are the only ones with eigenvalues much more than 1.000. On the other hand, the first four components all have eigenvalues substantially larger than the last five components, so perhaps the first four should all be considered. To some extent the choice of the number of components that are important will depend on the use that is going to be made of them. For the present example it will be assumed that a small number of indices are required in order to present the main aspects of the differences between the countries. For simplicity only the first two components will be examined further. Between them they account for about 62% of the variation.

The first component is

Z1 = 0.52(AGR) + 0.00(MIN) - 0.35(MAN) - 0.26(PS) - 0.33(CON) - 0.38(SER) - 0.07(FIN) - 0.39(SPS) - 0.37(TC),

where the abbreviations for the variables are stated in Table 1.5. Since the analysis has been done on the correlation matrix, the variables in this equation are the original percentages after they have each been standardized to have a mean of zero and a standard deviation of one. From the coefficients of Z1 it can be seen that it is primarily a contrast between numbers engaged in agriculture (AGR) and numbers
engaged in manufacturing (MAN), power supplies (PS), construction (CON), service industries (SER), social and personal services (SPS) and transport and communications (TC). In making this interpretation the variables with coefficients close to zero are ignored since they will not affect the value of Z1 greatly.

The second component is

Z2 = 0.05(AGR) + 0.62(MIN) + 0.36(MAN) + 0.26(PS) + 0.05(CON) - 0.35(SER) - 0.45(FIN) - 0.22(SPS) + 0.20(TC),

which primarily contrasts numbers in mining (MIN) and manufacturing (MAN) with numbers in service industries (SER) and finance (FIN).

Figure 5.2 shows a plot of the 26 countries against their values for Z1 and Z2. The picture is certainly rather meaningful in terms of what is known about the countries. Most of the Western democracies are grouped with low values of Z1 and Z2. Ireland, Portugal, Spain and Greece have higher values of Z1. Turkey and Yugoslavia stand out as being very high on Z1. The communist countries other than Yugoslavia are grouped together with high values for Z2.

[Figure 5.2 European countries plotted against the first two principal components, Z1 and Z2, for the employment variables; the plot is not reproduced here.]

5.3 Computational methods

Principal component analysis is one of the multivariate techniques that can be programmed reasonably easily on a microcomputer. The eigenvalues and vectors of the covariance or correlation matrix can be determined readily using a standard algorithm such as Algorithm 14 of Nash (1979).

Alternatively, many standard statistical packages will carry out a principal component analysis as one of the multivariate options. In cases where principal component analysis is not specially mentioned it may still be possible to do the calculations using the factor analysis option. More will be said about this in the chapter that follows. Briefly, principal component analysis amounts to a certain type of factor analysis without any rotation of factors.
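As a modern counterpart to the algorithmic references above, the whole computation can be expressed in a few lines. The Python sketch below is illustrative only, applied to an arbitrary stand-in data matrix rather than the Table 1.5 values.

    # Sketch: principal components from the eigendecomposition of the
    # correlation matrix of an n x p data matrix X.
    import numpy as np

    def principal_components(X):
        Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize columns
        R = np.corrcoef(Z, rowvar=False)                  # p x p correlation matrix
        eigenvalues, eigenvectors = np.linalg.eigh(R)     # ascending order
        order = np.argsort(eigenvalues)[::-1]             # sort descending
        eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
        scores = Z @ eigenvectors                         # component scores Z1, Z2, ...
        return eigenvalues, eigenvectors, scores

    X = np.random.default_rng(3).random((26, 9))          # stand-in for Table 1.5
    lam, vec, scores = principal_components(X)
    print(lam / lam.sum())        # proportion of variation for each component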
5.4 Further reading

A short monograph by Daultrey (1976) is recommended for a further discussion of principal component analysis. This is suitable for the general reader although it is aimed particularly at geographers.

References

Daultrey, S. (1976) Principal Component Analysis. Concepts and Techniques in Modern Geography, 8, Geo Abstracts, University of East Anglia, UK.
Hotelling, H. (1933) Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology 24, 417-41, 498-520.
Nash, J.C. (1979) Compact Numerical Methods for Computers. Adam Hilger, Bristol.
Pearson, K. (1901) On lines and planes of closest fit to a system of points in space. Philosophical Magazine 2, 557-72.
CHAPTER SIX

Factor analysis

6.1 The factor analysis model

Factor analysis has similar aims to principal component analysis. The basic idea is still that it may be possible to describe a set of p variables X1, X2, ..., Xp in terms of a smaller number of indices or factors, and hence elucidate the relationship between these variables. There is, however, one important difference: principal component analysis is not based on any particular statistical model, but factor analysis is based on a rather special model.

The early development of factor analysis was due to Charles Spearman. He studied the correlations between test scores of various types and noted that many of the observed correlations could be accounted for by a simple model for the scores (Spearman, 1904). For example, in one case he obtained the following matrix of correlations for boys in a preparatory school for their scores on tests in Classics (C), French (F), English (E), Mathematics (M), Discrimination of pitch (D), and Music (Mu):

           C      F      E      M      D      Mu
    C    1.00   0.83   0.78   0.70   0.66   0.63
    F    0.83   1.00   0.67   0.67   0.65   0.57
    E    0.78   0.67   1.00   0.64   0.54   0.51
    M    0.70   0.67   0.64   1.00   0.45   0.51
    D    0.66   0.65   0.54   0.45   1.00   0.40
    Mu   0.63   0.57   0.51   0.51   0.40   1.00

He noted that this matrix has the interesting property that any two rows are almost proportional if the diagonals are ignored. Thus for rows C and E there are the ratios

0.83/0.67 ≈ 0.70/0.64 ≈ 0.66/0.54 ≈ 0.63/0.51 ≈ 1.2.

Spearman proposed the idea that the six test scores are all of the form

Xi = ai F + ei,

where Xi is the ith standardized score with a mean of zero and a standard deviation of one, ai is a constant, F is a 'factor' value, which has mean of zero and standard deviation of one for individuals as a whole, and ei is the part of Xi that is specific to the ith test only. He showed that a constant ratio between rows of a correlation matrix follows as a consequence of these assumptions and that therefore this is a plausible model for the data.

Apart from the constant correlation ratios it also follows that the variance of Xi is given by

var(Xi) = var(ai F + ei) = var(ai F) + var(ei) = ai² var(F) + var(ei) = ai² + var(ei),

since ai is a constant, F and ei are independent, and the variance of F is assumed to be unity. But var(Xi) is also unity, so that

1 = ai² + var(ei).

Hence the constant ai, which is called the factor loading, is such that its square is the proportion of the variance of Xi that is accounted for by the factor.

On the basis of his work Spearman formulated his two-factor theory of mental tests: each test result is made up of two parts, one that is common to all tests ('general intelligence'), and another that is specific to the test. Later this theory was modified to allow for each test result to consist of a part due to several common factors plus a part specific to the test. This gives the general factor analysis model

Xi = ai1 F1 + ai2 F2 + ... + aim Fm + ei,

where Xi is the ith test score with mean zero and unit variance; ai1, ai2, ..., aim are the factor loadings for the ith test; F1, F2, ..., Fm are m uncorrelated common factors, each with mean zero and unit variance;
and ei is a factor specific only to the ith test, which is uncorrelated with any of the common factors and has mean zero.

With this model,

var(Xi) = 1 = ai1² var(F1) + ai2² var(F2) + ... + aim² var(Fm) + var(ei)
            = ai1² + ai2² + ... + aim² + var(ei),

where ai1² + ai2² + ... + aim² is called the communality of Xi (the part of its variance that is related to the common factors) while var(ei) is called the specificity of Xi (the part of its variance that is unrelated to the common factors). It can also be established that the correlation between Xi and Xj is

rij = ai1 aj1 + ai2 aj2 + ... + aim ajm.

Hence two test scores can only be highly correlated if they have high loadings on the same factors. Furthermore, -1 ≤ aij ≤ +1 since the communality cannot exceed one.

6.2 Procedure for a factor analysis

The data for a factor analysis have the same form as for a principal component analysis. That is, there are p variables with values for these for n individuals, as shown in Table 5.2.

There are three stages to a factor analysis. To begin with, provisional factor loadings aij are determined. One way to do this is to do a principal component analysis and neglect all of the principal components after the first m, which are themselves taken to be the m factors. The factors found in this way are then uncorrelated with each other and are also uncorrelated with the specific factors. However, the specific factors are not uncorrelated with each other, which means that one of the assumptions of the factor analysis model does not hold. This will probably not matter much providing that the communalities are high.

Whatever way the provisional factor loadings are determined, it is possible to show that they are not unique. If F1, F2, ..., Fm are the provisional factors, then linear combinations of these of the form

F'1 = d11 F1 + d12 F2 + ... + d1m Fm
F'2 = d21 F1 + d22 F2 + ... + d2m Fm
...
F'm = dm1 F1 + dm2 F2 + ... + dmm Fm

can be constructed that are uncorrelated and 'explain' the data just as well. There are an infinite number of alternative solutions for the factor analysis model, and this leads to the second stage in the analysis, which is called factor rotation. Thus the provisional factors are transformed in order to find new factors that are easier to interpret. To 'rotate' in this context means essentially to choose the dij values in the above equations.

The last stage of an analysis involves calculating the factor scores. These are the values of the factors F1, F2, ..., Fm for each of the individuals.

Generally the number of factors (m) is up to the factor analyst, although it may sometimes be suggested by the nature of the data. When a principal component analysis is used to find a provisional solution, a rough 'rule of thumb' is to choose m equal to the number of eigenvalues greater than unity for the correlation matrix of the test scores. The logic here is the same as was explained in the previous chapter: a factor associated with an eigenvalue of less than unity 'explains' less variation in the overall data than one of the original test scores. In general, increasing m will increase the communalities of the variables. However, communalities are not changed by factor rotation.

Factor rotation can be orthogonal or oblique. With orthogonal rotation the new factors are uncorrelated, like the old factors. With oblique rotation the new factors are correlated. Whichever type of rotation is used, it is desirable that the factor loadings for the new factors should be either close to zero or very different from zero. A near-zero aij means that Xi is not strongly related to the factor Fj. A large (positive or negative) value of aij means that Xi is determined by Fj to a large extent. If each test score is strongly related to some factors, but not at all related to the others, then this makes the factors easier to identify than would otherwise be the case.

One method of orthogonal factor rotation that is often used is called varimax rotation. This is based on the assumption that the interpretability of factor j can be measured by the variance of the squares of its factor loadings, i.e., the variance of a1j², a2j², ..., apj². If this variance is large then the aij² values tend to be either close to zero or close to unity. Varimax rotation therefore maximizes the sum of these variances for all the factors. H.F. Kaiser first suggested this approach. Later he modified it slightly by normalizing the factor loadings before maximizing the variances of their squares, since this appears to give improved results (Kaiser, 1958). Varimax rotation can therefore be
carried out with or without Kaiser normalization. Numerous other methods of orthogonal rotation have been proposed. However, varimax is recommended as the standard approach.

Sometimes factor analysts are prepared to give up the idea of the factors being uncorrelated in order that the factor loadings should be as simple as possible. An oblique rotation may then give a better solution than an orthogonal one. Again, there are numerous methods available to do the oblique rotation.

Various methods have also been suggested for calculating the factor scores for individuals. A method for use with factor analysis based on principal components is described in the next section. Two other more general methods are estimation by regression and Bartlett's method. See Harman (1976, Chapter 16) for more details.

6.3 Principal component factor analysis

It has been remarked above that one way to do a factor analysis is to begin with a principal component analysis and use the first few principal components as unrotated factors. This has the virtue of simplicity although, since the specific factors e1, e2, ..., ep are correlated, the factor analysis model is not quite correct. Experienced factor analysts often do a principal component factor analysis first and then follow this with other types of analysis. [Part of this page is illegible in the scan; equations (6.1), which express each standardized variable in terms of all p principal components, are reconstructed here from the surviving fragments:]

X1 = b11 Z1 + b21 Z2 + ... + bp1 Zp
X2 = b12 Z1 + b22 Z2 + ... + bp2 Zp
...
Xp = b1p Z1 + b2p Z2 + ... + bpp Zp.        (6.1)

For a factor analysis only m of the principal components are retained, so the last equations become

X1 = b11 Z1 + b21 Z2 + ... + bm1 Zm + e1
X2 = b12 Z1 + b22 Z2 + ... + bm2 Zm + e2
...
Xp = b1p Z1 + b2p Z2 + ... + bmp Zm + ep.

All that needs to be done now is to scale the principal components Z1, Z2, ..., Zm to have unit variances and hence make them into proper factors. To do this, Zi must be divided by its standard deviation, which is √λi, the square root of the corresponding eigenvalue of the correlation matrix. The equations then become

X1 = √λ1 b11 F1 + √λ2 b21 F2 + ... + √λm bm1 Fm + e1
X2 = √λ1 b12 F1 + √λ2 b22 F2 + ... + √λm bm2 Fm + e2
...
Xp = √λ1 b1p F1 + √λ2 b2p F2 + ... + √λm bmp Fm + ep,        (6.2)

where Fi = Zi/√λi represents the new ith factor. The original factors Fi can be expressed exactly as linear combinations of the X variables by scaling equations (6.1). The rotated factors can also still be expressed
as exact linear combinations of the X variables, the equations being given in matrix form as

F* = (G'G)⁻¹ G'X,        (6.4)

where G is the matrix of loadings of the X variables on the rotated factors F*1, F*2, ..., F*m, as in equations (6.3). [Equations (6.3), which express each Xi in terms of the rotated factors and its specific factor ei, are not legible in this scan.]

6.4 Using a factor analysis program to do principal component analysis

Since many computer programs for factor analysis allow the option of using principal components as initial factors, it is possible to use the programs to do principal component analysis. All that has to be done is to extract the same number of factors as variables and not do any rotation. The factor loadings will then be as given by equations (6.2).

Example 6.1 Employment in European countries

In Example 5.2 a principal component analysis was carried out on the data on the percentages of people employed in nine industry groups in 26 countries in Europe (Table 1.5). It is of some interest to continue the examination of these data using a factor analysis model.

The correlation matrix for the nine percentage variables is given in Table 5.5. The eigenvalues and eigenvectors are shown in Table 6.1. There are three eigenvalues greater than unity, so the 'rule of thumb' suggests that three factors should be considered. However, the fourth eigenvalue is almost equal to the third, so that either two or four factors can also reasonably be allowed. To begin with, the four-factor solution will be considered.

[Table 6.1 Eigenvalues and eigenvectors of the correlation matrix for the European employment data; the numerical values are not recoverable from this scan.]

The eigenvectors in Table 6.1 give the coefficients bij of the set of equations (6.1). These are changed into factor loadings for four factors as indicated in equations (6.2) to give the factor model:

X1 =  0.98F1 + 0.08F2 - 0.05F3 + 0.03F4   (0.97)
X2 =  0.00F1 + 0.90F2 + 0.21F3 + 0.06F4   (0.86)
X3 = -0.65F1 + 0.52F2 + 0.16F3 - 0.35F4   (0.83)
X4 = -0.48F1 + 0.38F2 + 0.59F3 + 0.39F4   (0.87)
X5 = -0.61F1 + 0.08F2 - 0.16F3 - 0.67F4   (0.84)
X6 = -0.71F1 - 0.51F2 + 0.12F3 - 0.05F4   (0.78)
X7 = -0.14F1 - 0.66F2 + 0.62F3 - 0.05F4   (0.84)
X8 = -0.72F1 - 0.32F2 - 0.33F3 + 0.41F4   (0.90)
X9 = -0.69F1 + 0.30F2 - 0.39F3 + 0.31F4   (0.81)

The values in parentheses are the communalities. For example, the communality for variable X1 is 0.98² + 0.08² + (-0.05)² + 0.03² = 0.97, apart from rounding errors. It can be seen that the communalities are fairly high. That is to say, most of the variance for the variables X1 to X9 is accounted for by the four common factors.

Factor loadings that are greater than 0.50 (ignoring the sign) are underlined in the above equations in the original text. These large and moderate loadings indicate how the variables are related to the factors. It can be seen that X1 is almost entirely accounted for by factor 1 alone, X2 is accounted for mainly by factor 2, X3 is accounted for by factor 1 and factor 2, and so on. An undesirable property of this choice of factors is that four of the nine X variables (X3, X5, X6, X7) are related strongly to two of the factors. This suggests that a rotation may provide simpler factors.

A varimax rotation with Kaiser normalization was carried out. This produced the following model [the rotated loadings for X4 and X5 are not legible in the scan]:

X1 =  0.68F1 - 0.27F2 - 0.31F3 + 0.57F4
X2 =  0.22F1 + 0.70F2 - 0.55F3 - 0.13F4
X3 =  0.13F1 + 0.49F2 - 0.12F3 - 0.75F4
X6 = -0.53F1 - 0.03F2 + 0.62F3 - 0.33F4
X7 = -0.07F1 + 0.03F2 + 0.91F3 + 0.05F4
X8 = -0.94F1 - 0.05F2 + 0.17F3 - 0.04F4
X9 = -0.77F1 + 0.23F2 - 0.33F3 - 0.23F4

The communalities are unchanged (apart from rounding errors) and the factors are still uncorrelated. However, this is a slightly better solution than the previous one since only X1, X2 and X6 are now appreciably dependent on more than one factor.

At this stage it is usual to try to put labels on factors. It is fair to say that this often requires a degree of inventiveness and imagination! In the present case it is not too difficult.

Factor 1 has a high positive loading for X1 (agriculture) and high negative loadings for X6 (service industries), X8 (social and personal services) and X9 (transport and communications). It therefore measures the extent to which people are employed in agriculture rather than services and communications. It can be labelled 'emphasis on agriculture and a lack of service industries'.

Factor 2 has high positive loadings for X2 (mining) and X4 (power supplies). This can be labelled 'emphasis on mining and power supplies'.

Factor 3 has high positive loadings on X6 (service industries) and X7 (finance) and a high negative loading on X2 (mining). This can be labelled 'emphasis on financial and service industries rather than mining'.

Finally, factor 4 has high negative loadings on X3 (manufacturing) and X5 (construction) and a high positive loading on X1 (agriculture). 'Lack of industrialization' seems to be a fair label in this case.
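For readers who wish to experiment with rotations themselves, varimax is short enough to implement directly. The Python sketch below implements the basic un-normalized varimax iteration; it is illustrative only and is not the BMDP routine referred to in this chapter, nor does it apply Kaiser normalization.

    # Sketch: basic varimax rotation of a p x m loading matrix L,
    # maximizing the sum over factors of the variance of squared loadings.
    import numpy as np

    def varimax(L, max_iter=100, tol=1e-8):
        p, m = L.shape
        R = np.eye(m)                 # accumulated orthogonal rotation
        d_old = 0.0
        for _ in range(max_iter):
            LR = L @ R
            # gradient of the varimax criterion
            G = L.T @ (LR**3 - LR * (LR**2).sum(axis=0) / p)
            u, s, vt = np.linalg.svd(G)
            R = u @ vt                # nearest orthogonal matrix to the gradient
            d_new = s.sum()
            if d_new < d_old * (1.0 + tol):
                break                 # criterion no longer improving
            d_old = d_new
        return L @ R                  # rotated loadings

    loadings = np.array([[0.9, 0.2], [0.8, 0.3], [0.2, 0.9], [0.3, 0.8]])
    print(varimax(loadings).round(2))  # loadings pushed toward 0 or +/-1

The rotated loadings remain an orthogonal transformation of the originals, so communalities are unchanged, as noted above.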
The G matrix of equations (6.3) and (6.4) is given by the factor loadings above. For example, g11 = 0.68 and g12 = -0.27, to two decimal places. Carrying out the matrix multiplications and inversion of equation (6.4) produces the equations
[the equations for F*1, F*2 and F*3 are not legible in the scan] and

F*4 = 0.175X1 - 0.031X2 - 0.426X3 + ... + 0.088X9

for estimating factor scores from the data values (after the X variables have been standardized to have zero means and unit standard deviations). The factor scores obtained from these equations are given in Table 6.2 for the 26 European countries.

[Table 6.2 Factor scores for the 26 European countries; the numerical values are not recoverable from this scan.]

From studying the factor scores it can be seen that factor 1 emphasizes the importance of agriculture rather than services and communications in Yugoslavia, Spain and Romania. The values of factor 2 indicate that countries like Hungary and East Germany have large numbers of people employed in mining and power supplies, with the situation being reversed in countries like Turkey and Italy. Factor 3, on the other hand, is mainly indicating the difference between the communist bloc and the other countries in terms of the numbers employed in finance and service industries. Finally, the values for factor 4 mainly indicate how different Turkey is from the other countries because of its lack of manufacturing and construction workers and relatively large number of agricultural workers.

Most factor analysts would probably continue their analysis of this set of data by trying models with fewer factors and different methods of factor extraction. However, sufficient has been said already to indicate the general approach, so the example will be left at this point.
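Given a rotated loading matrix G, the factor-score equations quoted above follow from the least-squares relation F* = (G'G)⁻¹G'X of equation (6.4). A minimal numerical sketch, with an invented loading matrix and standardized data in place of the real values, might look as follows.

    # Sketch: factor scores from equation (6.4), F* = (G'G)^-1 G'X,
    # using invented G and standardized X (illustrative only).
    import numpy as np

    rng = np.random.default_rng(4)
    G = rng.normal(size=(9, 4))            # p x m rotated loadings (stand-in)
    X = rng.normal(size=(26, 9))           # n x p standardized data (stand-in)

    # each row of coef gives one factor-score equation in the X variables
    coef = np.linalg.solve(G.T @ G, G.T)   # m x p matrix, i.e. (G'G)^-1 G'
    scores = X @ coef.T                    # n x m factor scores
    print(coef.round(3))                   # rows correspond to F*1, ..., F*4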
6.5 Options in computer programs

Computer programs for factor analysis often allow many different options, and this is likely to be rather confusing for the beginner. For example, BMDP4M, which is the factor analysis option in the Biomedical Computer Programs package (Dixon, 1983), allows four different methods of initial factor extraction.

The program BMDP4M allows even more options when it comes to factor rotation, including no rotation at all. The standard option is varimax rotation. If an oblique rotation is desired, allowing correlated factors, then rotation for simple loading (DQUART in BMDP4M) is recommended.

Finally, there is the question of the number of factors. Most computer programs have an automatic option which can be changed at the user's discretion.

6.6 The value of factor analysis

Factor analysis is something of an art. It is certainly not as objective as most statistical methods. For this reason many statisticians are rather sceptical about its value. For example, Chatfield and Collins (1980, p. 89) list six problems with factor analysis and conclude that 'factor analysis should not be used in most practical situations'. Also, Kendall (1975, p. 59) states that in his opinion 'factor scores are theoretically unmeasurable'.

On the other hand, factor analysis is widely used to analyse data and, no doubt, will continue to be widely used in future. The reason for this is that the technique does seem to be useful for gaining insight into the structure of multivariate data. If it is thought of as a purely descriptive tool then it must take its place as one of the important multivariate methods.

6.7 Computational methods

This chapter has stressed factor analysis based on carrying out a principal component analysis to find initial factors. If this approach is adopted then the main part of the calculation is the finding of the eigenvalues and eigenvectors of the correlation matrix. (See the previous chapter for a suggestion of a suitable algorithm.) Other methods of initial factor extraction are not so straightforward and are probably best done using one of the standard statistical packages.

Varimax has much to recommend it as the standard method for factor rotation. It is quite easy to program since it is done iteratively, taking two factors at a time (Harman, 1976, p. 294). Note, however, that Fraenkel (1984) has shown that this may not provide a unique solution and has suggested some modifications to the usual algorithm.

6.8 Further reading

For those about to embark on a factor analysis using a computer program, particularly the Biomedical program BMDP4M, the article by Frane and Hill (1976) should prove of value. The introductory texts by Kim and Mueller (1978a, b) will also be helpful. Those interested in more details should consult one of the specialist texts such as Harman (1976).

References

Chatfield, C. and Collins, A.J. (1980) Introduction to Multivariate Analysis. Chapman and Hall, London.
Dixon, W.J. (Ed.) (1983) BMDP Statistical Software. University of California Press, Berkeley.
Fraenkel, E. (1984) Variants of the varimax rotation method. Biometrical Journal 7, 741-8.
Frane, J.W. and Hill, M. (1976) Factor analysis as a tool for data analysis. Communications in Statistics - Theory and Methods A5, 487-506.
Harman, H.H. (1976) Modern Factor Analysis. University of Chicago Press, Chicago.
Kaiser, H.F. (1958) The varimax criterion for analytic rotation in factor analysis. Psychometrika 23, 187-200.
Kendall, M.G. (1975) Multivariate Analysis. Charles Griffin, London.
Kim, J. and Mueller, C.W. (1978a) Introduction to Factor Analysis. Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-013. Sage Publications, Beverly Hills.
Kim, J. and Mueller, C.W. (1978b) Factor Analysis. Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-014. Sage Publications, Beverly Hills.
Spearman, C. (1904) 'General intelligence', objectively determined and measured. American Journal of Psychology 15, 201-93.
CHAPTER SEVEN

Discriminant function analysis

7.1 The problem of separating groups

The problem that is addressed with discriminant function analysis is how well it is possible to separate two or more groups of individuals, given measurements for these individuals on several variables. For example, with the data in Table 1.1 on five body measurements of 21 surviving and 28 non-surviving sparrows, it is interesting to consider whether it is possible to use the body measurements to separate survivors and non-survivors. Also, for the data shown in Table 1.2 on four dimensions of Egyptian skulls for samples from five time periods, it is reasonable to consider whether the measurements can be used to 'age' the skulls.

In the general case there will be m random samples from different groups, of sizes n1, n2, ..., nm, and values will be available for p variables X1, X2, ..., Xp for each sample member. Thus the data for a discriminant function analysis take the form shown in Table 7.1.

Table 7.1 The form of data for a discriminant function analysis.

             Individual    X1       X2       ...    Xp
    Group 1  1             x111     x112     ...    x11p
             2             x211     x212     ...    x21p
             ...
             n1            xn111    xn112    ...    xn11p
    Group 2  1             x121     x122     ...    x12p
             2             x221     x222     ...    x22p
             ...
             n2            xn221    xn222    ...    xn22p
    ...
    Group m  1             x1m1     x1m2     ...    x1mp
             ...
             nm            xnmm1    xnmm2    ...    xnmmp

There is no need for the variables to be standardized to have zero means and unit variances prior to the start of the analysis, as is usual with principal component and factor analysis. This is because the outcome of a discriminant function analysis is not affected in any important way by the scaling of the individual variables.

7.2 Discrimination using Mahalanobis distances

One approach to discrimination is based on Mahalanobis distances, as defined in Section 4.3. The mean vectors for the m samples can be regarded as estimates of the true mean vectors for the groups. The Mahalanobis distances of individuals to group centres can then be calculated, and each individual can be allocated to the group that it is closest to. This may or may not be the group that the individual actually came from. The percentage of correct allocations is clearly an indication of how well the groups can be separated using the available variables.

This procedure is more precisely defined as follows. Let x̄i' = (x̄1i, x̄2i, ..., x̄pi) denote the vector of mean values for the sample from the ith group, calculated using equations (2.1) and (2.5), and let Ci denote the covariance matrix for the same sample, calculated using equations (2.2), (2.3) and (2.7). Also, let C denote the pooled sample covariance matrix determined using equation (4.6). Then the Mahalanobis distance from an observation x' = (x1, x2, ..., xp) to the centre of group i is estimated as

Di² = (x - x̄i)' C⁻¹ (x - x̄i) = Σr Σs (xr - x̄ri) crs (xs - x̄si),        (7.1)

where the double sum is over r and s from 1 to p, and crs is the element in the rth row and sth column of C⁻¹. The observation x is allocated to the group for which Di² has the smallest value.
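The allocation rule of equation (7.1) is straightforward to program. The Python sketch below is illustrative only: it pools the sample covariance matrices with the usual degrees-of-freedom weights and allocates an observation to the group with the smallest estimated Mahalanobis distance; the data used are invented.

    # Sketch: allocate an observation to the group with minimum
    # Mahalanobis distance, equation (7.1), using pooled covariance C.
    import numpy as np

    def allocate(samples, x):
        # samples: list of (n_i x p) arrays, one per group; x: length-p vector
        means = [s.mean(axis=0) for s in samples]
        n, m = sum(len(s) for s in samples), len(samples)
        # pooled covariance, weighted by within-group degrees of freedom
        C = sum((len(s) - 1) * np.cov(s, rowvar=False) for s in samples) / (n - m)
        Cinv = np.linalg.inv(C)
        d2 = [(x - mu) @ Cinv @ (x - mu) for mu in means]   # the D_i^2 values
        return int(np.argmin(d2)), d2

    rng = np.random.default_rng(5)
    groups = [rng.normal(loc, 1.0, size=(20, 3)) for loc in (0.0, 2.0)]
    print(allocate(groups, np.array([1.8, 2.1, 1.9])))      # expect group 1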
7.3 Canonical discriminant functions

It is sometimes useful to be able to determine functions of the variables X1, X2, ..., Xp that in some sense separate the m groups as well as is possible. The simplest approach involves taking a linear
combination of tnc X variables Thus the ¡th canonical piscriminant function
L.
"
~
'.
;:: ' 1 i 11
h~~\Jml)\lDM
~ 1d
• I !! 1
,
can be used to test for overall differences between the means of the m groups. In addition, a test is available to see whether the canonical discriminant function Zi varies significantly from group to group. [Several lines here are illegible in the scan.] If an observation is very significantly far from the centre of its group on the basis of the chi-square distribution, then this brings into question whether the observation really comes from this group (see p. 48).

Example 7.1 The storm survival of female sparrows

As a first example, a case will be considered where a discriminant function analysis does not produce useful results. This concerns the body measurements of the surviving and non-surviving female sparrows in Table 1.1. [Most of this example is illegible in the scan apart from the calculation of the first chi-square statistic from equation (7.2),]

φ1² = {21 + 28 - 1 - ½(5 + 2)} loge(1 + 0.033).

[The rest of Example 7.1, and the start of Example 7.2 on the Egyptian skull data, are also illegible. For the skull example the within-sample sum of squares and cross-products matrix is]

         | 3061.67     5.33    11.47   291.30 |
    W =  |    5.33  3405.27   754.00   412.53 |
         |   11.47   754.00  3505.97   164.33 |
         |  291.30   412.53   164.33  1472.13 |

Table 7.2 Means and standard deviations for the canonical discriminant function Z1 for five samples of Egyptian skulls. [The numerical values are not recoverable from this scan.]

The third canonical discriminant function is

Z3 = -0.0068X1 + 0.0010X2 + 0.0000X3 + 0.0247X4.        (7.3)
Table 7.3 Results obtained when 150 Egyptian skulls are allocated to the groups for which they have the minimum Mahalanobis distance. [The allocation counts are not recoverable from this scan.]

[Several lines are illegible; the discussion continues:] however, very much an average change. If the 150 skulls are allocated to the samples to which they are closest according to the Mahalanobis distance function of equation (7.1), then only a fairly small proportion are allocated to the samples that they really belong to (Table 7.3). Thus although this discriminant function analysis has been successful in pinpointing the changes in skull dimensions over time, it has not produced a satisfactory method for 'ageing' skulls.
Example 7.3 Discriminating between groups of European countries

The data shown in Table 1.5 on employment percentages in nine groups in 26 European countries have already been examined by principal component analysis and by factor analysis (Examples 5.2 and 6.1). Here they will be considered from the point of view of the extent to which it is possible to discriminate between groups of countries on the basis of employment patterns. In particular, three natural groups existed in 1979 when the data were collected. These were: (1) the European Economic Community (EEC) countries of the time: Belgium, Denmark, France, West Germany, Ireland, Italy, Luxemburg, the Netherlands and the United Kingdom; (2) the other western European countries of Austria, Finland, Greece, Norway, Portugal, Spain, Sweden, Switzerland and Turkey; and (3) the eastern European communist countries of Bulgaria, Czechoslovakia, East Germany, Hungary, Poland, Romania, the USSR and Yugoslavia. These three groups can be used as a basis for a discriminant function analysis.

The percentages in the nine industry groups add to 100% for each of the 26 countries. This means that any one of the nine percentage variables can be expressed as 100 minus the remaining variables. It is therefore necessary to omit one of the variables from the analysis in order to calculate Mahalanobis distances and canonical discriminant functions. The last variable, TC, for transport and communications, was omitted for the analysis that will now be described.

The number of canonical variables is two in this example, this being the minimum of the number of variables (p = 8) and the number of groups minus one (m - 1 = 2). These canonical variables are

Z1 = 0.73AGR + 0.62MIN + 0.63MAN - 0.16PS + 0.50CON + 1.24SER + 0.72FIN + 0.52SPS

and

Z2 = [the first four coefficients are illegible in the scan] + 1.17CON + 0.83SER + 0.84FIN + 1.05SPS,

the corresponding eigenvalues of W⁻¹B being λ1 = 7.531 and λ2 = 1.046. The corresponding chi-square values from equation (7.2) are φ1² = 41.80, with 9 degrees of freedom, and φ2² = 13.96, with 7 degrees of freedom. The chi-square value for Z1 is significantly large at the 0.1% level. The chi-square value for Z2 is not quite significantly large at the 5% level.

From the coefficients in the equation for Z1 it can be seen that this variable will tend to be large when there are high percentages employed in everything except PS (power supplies). There is a particularly high coefficient for SER (service industries). For Z2, on the other hand, all the coefficients are positive, with that for MIN (mining) being particularly high.

A plot of the countries against their values for Z1 and Z2 is shown in Fig. 7.1. The eastern European communist countries appear on the left-hand side, the non-EEC western European countries in the centre, and the EEC countries on the right-hand side of the figure. It can be clearly seen how most separation occurs with the horizontal axis, that is, with values of Z1.

[Figure 7.1 Plot of 26 European countries against their values for two canonical discriminant functions; the plot is not reproduced here.]
As far as values of Z2 are concerned, it appears that the non-EEC western European countries tend to have lower values than the other two groups. Overall, the degree of separation of the three groups is good. The only 'odd' cases are West Germany, which appears to be more like a non-EEC western European country than an EEC country, and Sweden, which appears to be more like an EEC country than a non-EEC western European country.

The discriminant function analysis has been rather successful in this example. It is possible to separate the three groups of countries on the basis of their employment patterns. Furthermore, the separation using the two canonical discriminant functions is much clearer than the separation shown in Fig. 5.2 (p. 70) for the first two principal components.

7.6 Allowing for prior probabilities of group membership

Computer programs allow many options for varying a discriminant function analysis. One situation is that the probability of membership is inherently different for different groups. For example, if there are two groups it might be that it is known that most individuals fall into group 1 while very few fall into group 2. In that case, if an individual is to be allocated to a group it makes sense to bias the allocation procedure in favour of group 1. Thus the process of allocating an individual to the group to which it has the smallest Mahalanobis distance should be modified. To allow for this, some computer programs permit prior probabilities of group membership to be taken into account in the analysis.

7.7 Stepwise discriminant function analysis

Another possible modification of the basic analysis involves carrying it out in a stepwise manner. In this case variables are added to the discriminant functions one by one until it is found that adding extra variables does not give significantly better discrimination. There are many different criteria that can be used for deciding on which variables to include in the analysis and which to miss out.

A problem with stepwise discriminant function analysis is the bias that the procedure introduces into significance tests. Given enough variables it is almost certain that some combination of them will produce (significant) discriminant functions by chance alone. If a stepwise analysis is carried out then it is advisable to check its validity by rerunning it several times with a random allocation of individuals to groups, to see how significant are the results obtained. For example, with the Egyptian skull data the 150 skulls could be allocated completely at random to five groups of 30, the allocation being made a number of times, and a discriminant function analysis run on each random set of data. Some idea could then be gained of the probability of getting significant results through chance alone.

It should be stressed that this type of randomization to verify a discriminant function analysis is unnecessary for a standard non-stepwise analysis, providing there is no reason to suspect the assumptions behind the analysis. It could, however, be informative in cases where the data are clearly not normally distributed within groups or where the within-group covariance matrix is not the same for each group.

7.8 Jackknife classification of individuals

A moment's reflection will suggest that an allocation matrix such as that shown in Table 7.3 must tend to have a bias in favour of allocating individuals to the group that they really come from. After all, the group means are determined from the observations in that group. It is not surprising to find that an observation is closest to the centre of the group where the observation helped to determine that centre.

To overcome this bias, some computer programs carry out what is called a 'jackknife classification' of observations. This involves allocating each individual to its closest group without using that individual to help determine a group centre. In this way any bias in the allocation is avoided. In practice there is often not a great deal of difference between the straightforward classification and the jackknife classification. The jackknife classification usually gives a slightly smaller number of correct allocations.
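A jackknife (leave-one-out) classification is simple to express in code. The sketch below is illustrative only, and it reuses the hypothetical allocate() helper and the invented groups data from the earlier Mahalanobis sketch in Section 7.2.

    # Sketch: jackknife classification rate, leaving each observation
    # out of its own group before allocating it.
    import numpy as np

    def jackknife_correct_rate(samples):
        correct, total = 0, 0
        for g, s in enumerate(samples):
            for i in range(len(s)):
                reduced = list(samples)
                reduced[g] = np.delete(s, i, axis=0)   # drop observation i
                predicted, _ = allocate(reduced, s[i])
                correct += (predicted == g)
                total += 1
        return correct / total

    print(jackknife_correct_rate(groups))  # usually a little below the plain rate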
7.9 Assigning of ungrouped individuals to groups

Some computer programs allow the input of data values for a number of individuals for which the true group is not known. It is then possible to assign these individuals to the group that they are closest to, in the Mahalanobis distance sense, on the assumption that they
have to come from one of the m groups that have been sampled. [The rest of this page, including the opening of Section 7.10, is largely illegible in the scan.]

7.10 Computational methods

[Only fragments of this section are legible.] The canonical discriminant functions are found by solving the eigenvalue problem

(W⁻¹B - λI)a = 0

[a second, equivalent form of this eigenvalue equation given here is not clearly legible]. The matrix inverse C⁻¹ of equation (7.1) can be found using Algorithm 5 or Algorithm 9 from the same book.
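To make the eigenvalue formulation concrete, here is an illustrative Python sketch (not the book's algorithm) that computes canonical discriminant functions as the leading eigenvectors of W⁻¹B, with W and B the within-group and between-group sums of squares and cross-products matrices; the sample data are invented.

    # Sketch: canonical discriminant functions from the eigenvalue
    # problem (W^-1 B - lambda I) a = 0.
    import numpy as np

    def canonical_discriminants(samples):
        grand_mean = np.vstack(samples).mean(axis=0)
        p = len(grand_mean)
        W, B = np.zeros((p, p)), np.zeros((p, p))
        for s in samples:
            d = s - s.mean(axis=0)
            W += d.T @ d                                  # within-group SSCP
            g = (s.mean(axis=0) - grand_mean).reshape(-1, 1)
            B += len(s) * (g @ g.T)                       # between-group SSCP
        eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
        order = np.argsort(eigvals.real)[::-1]
        return eigvals.real[order], eigvecs.real[:, order]  # columns are the a's

    rng = np.random.default_rng(6)
    samples = [rng.normal(loc, 1.0, size=(30, 4)) for loc in (0.0, 1.0, 2.0)]
    lam, A = canonical_discriminants(samples)
    print(lam[:2])   # at most min(p, m - 1) = 2 meaningful eigenvalues here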
7.11 Further reading

[The start of this section is not legible in the scan; it ends:] ... be required. Lachenbruch and Goldstein (1979) and Fatti et al. (1982) provide references to the various procedures that have been suggested.
CHAPTER EIGHT

Cluster analysis

8.1 Uses of cluster analysis

The problem that cluster analysis is designed to solve is the following one: given a sample of n objects, each of which has a score on p variables, devise a scheme for grouping the objects into classes so that 'similar' ones are in the same class. The method must be completely numerical, and the number of classes is not known. This problem is clearly more difficult than the problem for a discriminant function analysis, since with discriminant function analysis the groups are known to begin with.

There are many reasons why cluster analysis may be worth while. Firstly, it might be a question of finding the 'true' groups. For example, in psychiatry there has been a great deal of disagreement over the classification of depressed patients, and cluster analysis has been used to define 'objective' groups. Secondly, cluster analysis may be useful for data reduction. For example, a large number of cities can potentially be used as test markets for a new product but it is only feasible to use a few. If cities can be grouped into a small number of groups of similar cities then one member from each group could be used for the test market. On the other hand, if cluster analysis generates unexpected groupings then this might in itself suggest relationships to be investigated.

8.2 Types of cluster analysis

Many algorithms have been proposed for cluster analysis. Here attention will be restricted to those following two particular approaches. Firstly, there are hierarchic techniques which produce a dendrogram such as the ones shown in Fig. 8.1. These methods start with the calculation of the distances of each individual to all other individuals. Groups are then formed by a process of agglomeration or division. With agglomeration all objects start by being alone in groups of one. Close groups are then gradually merged until finally all individuals are in a single group. With division all objects start in a single group. This is then split into two groups, the two groups are then split, and so on until all objects are in groups of their own.

The second approach to cluster analysis involves partitioning, with objects being allowed to move in and out of groups at different stages of the analysis. To begin with, some more or less arbitrary group centres are chosen and individuals are allocated to the nearest one. New centres are then calculated, where these are at the centres of the individuals in groups. An individual is then moved to a new group if it is closer to that group's centre than it is to the centre of its present group. Groups 'close' together are merged; spread-out groups are split, and so on. The process continues iteratively until stability is achieved with a predetermined number of groups. Usually a range of values is tried for the final number of groups. The results of a partitioning cluster analysis are considered in Example 8.1, and a minimal sketch of the partitioning idea is given below.
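The partitioning procedure just described is essentially the k-means algorithm. The following Python sketch is illustrative only, applying scipy's k-means routine to invented two-dimensional data rather than any data set from this book.

    # Sketch: partitioning (k-means style) cluster analysis.
    import numpy as np
    from scipy.cluster.vq import kmeans2

    rng = np.random.default_rng(7)
    # two loose clouds of points in p = 2 dimensions
    data = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

    # ask for k = 2 groups; centres and memberships are refined iteratively
    centres, labels = kmeans2(data, 2, minit='points')
    print(centres.round(2))
    print(labels)              # final group membership of each object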
8.3 Hierarchic methods

As mentioned above, agglomerative hierarchic methods start with a matrix of 'distances' between individuals. All individuals begin alone in groups of one, and groups that are 'close' together are merged. (Measures of 'distance' will be discussed later.) There are various ways to define 'close'. The simplest is in terms of nearest neighbours. For example, suppose there is the following distance matrix for five objects:

          1    2    3    4
    2     2
    3     6    5
    4    10    9    4
    5     9    8    5    3

The calculations are then as shown in the following table. Groups are merged at a given level of distance if one of the individuals in one group is that distance or closer to at least one individual in the second group.
    Distance    Groups
    0           1, 2, 3, 4, 5
    2           (1,2), 3, 4, 5
    3           (1,2), 3, (4,5)
    4           (1,2), (3,4,5)
    5           (1,2,3,4,5)

At a distance of 0 all five objects are on their own. The distance matrix shows that the smallest distance between two objects is 2, between the first and second objects. Hence at a distance level of 2 there are four groups (1,2), (3), (4) and (5). The next smallest distance between objects is 3, between objects 4 and 5. Hence at a distance of 3 there are three groups (1,2), (3) and (4,5). The next smallest distance is 4, between objects 3 and 4. Hence at this level of distance there are two groups (1,2) and (3,4,5). Finally, the next smallest distance is 5, between objects 2 and 3 and between objects 3 and 5. At this level the two groups merge into the single group (1,2,3,4,5) and the analysis is complete. The dendrogram shown in Fig. 8.1(a) illustrates how agglomeration takes place; a library check of the same calculation is sketched below.
With furthest neighbour linkage two groups merge only if the most distant members of the two groups are close enough together. With the example data this works as follows:

    Distance    Groups
    0           1, 2, 3, 4, 5
    2           (1,2), 3, 4, 5
    3           (1,2), 3, (4,5)
    5           (1,2), (3,4,5)
    10          (1,2,3,4,5)

Object 3 does not join with objects 4 and 5 until distance level 5, since this is the distance to object 3 from the furthest away of objects 4 and 5. The furthest-neighbour dendrogram is shown in Fig. 8.1(b).

With group average linkage two groups merge if the average distance between them is small enough. With the example data this gives the following result:

    Distance    Groups
    0           1, 2, 3, 4, 5
    2           (1,2), 3, 4, 5
    3           (1,2), 3, (4,5)
    4.5         (1,2), (3,4,5)
    7.8         (1,2,3,4,5)
For instance, groups (1,2) and (3,4,5) merge at distance level 7.8 since this is the average distance from objects 1 and 2 to objects 3, 4 and 5, the actual distances being:

    1-3: 6,  1-4: 10,  1-5: 9,  2-3: 5,  2-4: 9,  2-5: 8;  mean = 7.8.

The dendrogram in this case is shown in Fig. 8.1(c).

[Figure 8.1 Examples of dendrograms from cluster analyses of five objects: (a) nearest neighbour linkage; (b) furthest neighbour linkage; (c) group average linkage. The plotted dendrograms are not reproduced here.]

Divisive hierarchic methods have been used less often than agglomerative ones. The objects are all put into one group initially, and then this is split into two groups by separating off the object that is furthest on average from the other objects. Individuals from the main group are then moved to the new group if they are closer to it than they are to the main group. Further subdivisions occur as the distance that is allowed between individuals in the same group is reduced. Eventually all objects are in groups of their own.

8.4 Problems of cluster analysis

It has already been mentioned that there are many algorithms for cluster analysis. However, there is no generally accepted 'best' method. Unfortunately, different algorithms do not necessarily produce the same results on a given set of data. There is usually rather a large subjective component in the assessment of the results from any particular method.

A fair test of any algorithm is to take a set of data with a known group structure and see whether the algorithm is able to reproduce this structure. It seems to be the case that this test only works in cases where the groups are very distinct. When there is a considerable overlap between the initial groups, a cluster analysis may produce a solution that is quite different from the true situation.

In some cases difficulties will arise because of the shape of the clusters. For example, suppose that there are two variables X1 and X2 and individuals are plotted according to their values for these. Some possible patterns of points are illustrated in Fig. 8.2. Case (a) is likely to be found by any reasonable algorithm, as is case (b). In case (c) some algorithms might well fail to detect two clusters because of the intermediate points. Most algorithms would have trouble handling cases like (d), (e) and (f).

[Figure 8.2 Some possible patterns of points with two clusters; the plots are not reproduced here.]

Of course, clusters can only be based on the variables that are given in the data. Therefore they must be relevant to the classification wanted. To classify depressed patients there is presumably not much point in measuring height, weight, or length of arms. A problem here is that the clusters obtained may be rather sensitive to the particular choice of variables that is made. A different choice of variables, apparently equally reasonable, may give rather different clusters.

8.5 Measures of distance

The data for a cluster analysis usually consist of the values of p variables X1, X2, ..., Xp for n objects. For hierarchic algorithms these variable values are then used to produce an array of distances between the individuals. Measures of distance have already been discussed in Chapter 4. Here it suffices to say that the Euclidean distance function

dij = √{ Σ k=1..p (xik - xjk)² }        (8.1)
is most frequently used for quantitative variables. Here xik is the value of variable Xk for individual i and xjk is the value of the same variable for individual j. The geometrical interpretation of the distance dij from individual i to individual j is illustrated in Figs 4.1 and 4.2 for the cases of two and three variables.

Usually variables are standardized in some way before distances are calculated, so that all p variables are equally important in determining the distances. [Much of this page is illegible in the scan. The surviving text concerns the difficulty of standardizing in the presence of] group differences, since if groups are separated well by Xi then the variance of Xi will be large, and indeed it should be large. It would be best to be able to make the variances equal to one within clusters, but this is obviously not possible since the whole point of the analysis is to find the clusters in the first place.

8.6 Principal component analysis with cluster analysis

[The start of this section is not legible in the scan. Where the first two principal components] account for a high percentage of the variation in the data, a plot of individuals against these two components is certainly a useful way of looking for clusters. For example, Fig. 5.2 (p. 70) shows European countries plotted in this way for principal components based on employment percentages. The countries do seem to group in a meaningful way.

Example 8.1 Clustering of European countries

The data just mentioned on the percentages of people employed in nine industry groups in different countries of Europe (Table 1.5) can be used for a first example of cluster analysis. The analysis should show which countries have similar employment patterns and which countries are different in this respect. It may be recalled from
: 108 Clu s ter ano!ys is Pr incipa l component anaJys is wi th clust er anoJysis 109
l· :
Example 7.3that a sensible grouping into EEC, non-EEC western e N
¡:¡ European countries and eastern European countries existed in 1979, o 1 11111 2222221121
:1 A 138624905732<;675201439568
-.1 1 -
when the data were coIlected. s l 8 F N S o WF U A S N G P l U H C E R P 8 S Y T
A ertweei u woro uSuzoooupgu
:l " An analysis was carried out using the BMDP program 2M (Dixon, E8 lohen nKs f.e x S n c milos.
." ' \
l' E9n dm I f zwe mR h 009il«
1983), with pr-eassigned options for the various computationaJ
I!n d
L le ee o rO ccuyb 0 0 nnonve
uenn.Gn in ye9 .~Gid. ir
methods. Thus the first step in the analysis involved standardizing the m d k. d O d n O 9 i e o i o
d i o • o
r¡r~ nine variables so that each one had a mean of zero and a standard AMkG
deyiation of one. For example, variable 1 is AGR, the percentage OISTANCE
'* • .. .. .. .. '* .... .. '* .. .... .. .... ...... '........ ..
¡un11 ; employed in agriculture. For the 26 countries being considered this 1 135
1 479
I
-+-
r 1 -+- 1 1 1 r 1 1 I 1 11
1 1 T 1 1 1 1 1 1 1 1 1
1 1 1 1 r
1
1
1 1 1 1 1
1 1 1
lIT
1
1
r
1
[Ir: variable has a mean of 19.13 and a standard deviation of 15.55. The 1.5 37 1 1 1 1 1 1 r 1 1 r 1 r 1 1 -+- 1 r
1 1 r 1 1 1
1. 6 27 -+-- 1 1 111 1 1 1 111 1
1 1 1 1 1 1 1 1 1
data value for Be1gium for AGR is 3.3 which standardizes to (3.3- 1.6 3 1 --+--- 1 1 1 1 r 1 1 1· 1 1 r 1 T r I 1 1 I 1 1
~fI . 1 . 780 -+----- 1 1 1 I 1 1 1 1 1 1 1 1 r 1 1 1 1 1 1
¡ 1.801 1 1 111 1 1 1 1 1 1 1 1 1 I 111
I 19.13)/15.55 = - 1.02; the data. value for Denmark is 9.2 which 1 . 807 -+------ 1 1 1 1 1
-+-
1 1 1 1 1 1 1 I 1 1 1 1
lIT1 ] 1¡: I . 8 34 -+---~--- 1 1 1 1 r r 1 1 r 1 1 1 1 1 r 1
standardizes to - 0.64; and so on. The standardized values are shown 1 . 8 <; 3 I 1 1 1 1 -+- - 1 1 1 1 I 1 1 1 1 I
·1 . 1.84 3 ! 1 1 1 1 1 1 1 1 -+- 1 1 1 1 1
~I !: j . in Table 8.1. 1 . 882 r 1 1 r 1 1 111 1 - - + - 111
1 . 887 -- - ------- 1 1 1 1 1 1 1 1 1 1 1 1
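As a quick check of the arithmetic just quoted, in Python:

```python
# Standardizing the AGR values for Belgium (3.3) and Denmark (9.2)
# using the quoted mean 19.13 and standard deviation 15.55.
mean, sd = 19.13, 15.55
print(round((3.3 - mean) / sd, 2))   # -1.02
print(round((9.2 - mean) / sd, 2))   # -0.64
```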
The next step in the analysis involved calculating the Euclidean distances between all pairs of countries. This was done using equation (8.1) on the standardized data values. Finally, a dendrogram was formed by the agglomerative, nearest neighbour, hierarchic process described above.

The dendrogram is shown in Fig. 8.3, as output by the BMDP2M computer program. It can be seen that the two closest countries were Sweden and Denmark. These are distance 1.135 apart. The next closest pair of countries are Belgium and France, which are 1.479 apart. Then come Poland and Bulgaria, which are 1.537 apart. Amalgamation ended with Turkey joining the other countries at a distance of 5.019.

Figure 8.3 Dendrogram obtained from a nearest neighbour, hierarchic cluster analysis of data on employment in European countries.
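The same kind of analysis can be sketched in Python with scipy standing in for BMDP2M; the standardized array `z` from the earlier sketch and a list `countries` of the 26 country names are assumed.

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

d = pdist(z)                          # Euclidean distances, equation (8.1)
tree = linkage(d, method='single')    # nearest neighbour amalgamation
dendrogram(tree, labels=countries)    # draws a figure like Fig. 8.3
```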
Having obtained the dendrogram, we are free to decide how many clusters to take. For example, if six clusters are to be considered then these are found at an amalgamation distance of 2.459. The first cluster is the western nations of Belgium, France, Netherlands, Sweden, Denmark, West Germany, Finland, UK, Austria, Ireland, Switzerland, Norway, Greece, Portugal and Italy. The second cluster is Luxemburg on its own. Then there are the communist countries of USSR, Hungary, Czechoslovakia, East Germany, Romania, Poland and Bulgaria. The last three clusters are Spain, Yugoslavia and Turkey, each on their own. These clusters do, perhaps, make a certain amount of sense. From the standardized scores shown in Table 8.1 it can be seen that Luxemburg is unusual because of the large numbers in mining. Spain is unusual because of the large numbers in construction. Yugoslavia is unusual because of the large numbers in agriculture and finance and low numbers in construction, social and personal services, and transport and communications. Turkey has extremely high numbers in agriculture and rather low numbers in most other areas.

An alternative analysis of the same data can be carried out using the BMDPKM program for a partitioning cluster analysis. This follows the iterative procedure described in Section 8.2, which starts with arbitrary cluster centres, allocates individuals to the closest centre, recalculates cluster centres, reallocates individuals, and so on. The number of clusters to be used is a matter of choice. For the data being considered, from two to six clusters were requested. With two clusters the program produced the following ones:
(1) Belgium, Denmark, France, West Germany, Ireland, Italy, Luxemburg, Netherlands, UK, Austria, Finland, Norway, Sweden, Switzerland
(2) Greece, Spain, Turkey, Bulgaria, Czechoslovakia, East Germany, Hungary, Poland, Romania, USSR, Yugoslavia, Portugal

For six clusters the choice was:

(1) Luxemburg
(2) East Germany, Hungary, Czechoslovakia, Romania, USSR
(3) Turkey
(4) Spain, Yugoslavia, Bulgaria, Poland, Portugal, Greece
(5) Denmark, Netherlands, UK, Finland, Norway, Sweden
(6) France, West Germany, Ireland, Italy, Austria, Switzerland, Belgium
This is not the same as the six-cluster solution given by the dendrogram of Fig. 8.3, although there are some similarities. No doubt other algorithms for cluster analysis will give slightly different results again.
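A sketch of the same partitioning approach using scikit-learn's KMeans (one of many iterative partitioning algorithms, standing in for BMDPKM) is:

```python
from sklearn.cluster import KMeans

# z is the standardized 26 x 9 employment array assumed earlier.
for k in range(2, 7):                      # from two to six clusters
    km = KMeans(n_clusters=k, n_init=10).fit(z)
    print(k, km.labels_)                   # cluster membership of each country
```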
Example 8.2 Relationships between canine species

As a second example, consider the data provided in Table 1.4 for mean mandible measurements of seven canine groups. As has been explained before, these data were originally collected as part of a study on the relationship between prehistoric dogs, whose remains have been uncovered in Thailand, and the other six living groups. This question has already been considered in terms of distances between the seven groups in Example 4.1. Table 4.1 (p. 45) shows mandible measurements standardized to have means of zero and standard deviations of one. Table 4.2 (p. 46) shows Euclidean distances between the groups based on these standardized measurements.

With only seven objects to be clustered it is simple to carry out a nearest-neighbour, hierarchic cluster analysis without using a computer. Thus it can be seen from Table 4.2 that the shortest distance is between the prehistoric dog and the modern dog, so these two groups are the first to be joined into a cluster. At the next distance level the cuon joins the prehistoric dog and modern dog in a cluster of three. The next largest distance is 1.63, between the cuon and the modern dog. Since these two groups are already in the same cluster this has no effect. Continuing in this way produces the clusters at different distance levels that are shown in Table 8.2. The corresponding dendrogram is given in Figure 8.4.

Table 8.2 Clusters found at different distance levels (MD = modern dog, PD = prehistoric dog, CU = cuon, DI = dingo, GJ = golden jackal, CW = Chinese wolf, IW = Indian wolf).

Distance   Clusters                          Number of clusters
1.80       (MD, PD, CU, DI), GJ, CW, IW      4
1.84       (MD, PD, CU, DI), GJ, CW, IW      4
2.07       (MD, PD, CU, DI, GJ), CW, IW      3
2.31       (MD, PD, CU, DI, GJ), (CW, IW)    2
           (MD, PD, CU, DI, GJ, CW, IW)      1
Figure 8.4 Dendrogram produced from the clusters shown in Table 8.2.
It appears that the prehistoric dog is closely related to the modern Thai dog, with both of these being somewhat related to the cuon and dingo and less closely related to the golden jackal. The Indian and Chinese wolves are closest to each other, but the difference between them is relatively large.

It seems fair to say that in this example the cluster analysis has produced a sensible description of the relationship between the different groups.
8.7 Further reading

There are a number of books available that provide more information about clustering methods and applications. Those by Aldenderfer and Blashfield (1984) and Gordon (1981) are at a level suitable for the novice. A more comprehensive account is given by Romesburg (1984).

References

Aldenderfer, M.S. and Blashfield, R.K. (1984) Cluster Analysis. Sage University Paper Series on Quantitative Applications in the Social Sciences, Sage Publications, Beverly Hills.
Dixon, W.J. (1983) BMDP Statistical Software. University of California Press, Berkeley.
Gordon, A.D. (1981) Classification. Chapman and Hall, London.
Romesburg, H.C. (1984) Cluster Analysis for Researchers. Lifetime Learning Publications, Belmont, California.
- - - CHAPTER NINE - - -

Canonical correlation analysis

9.1 Generalizing a multiple regression analysis

… canonical correlation analysis. Another example was provided by Hotelling (1936) in one of the papers in which he described a canonical correlation analysis for the first time. This example involved the results of tests for reading speed (X1), reading power (X2), arithmetic speed (Y1) and arithmetic power (Y2) that were given to 140 seventh-grade schoolchildren. The specific question that was addressed was whether or not reading ability (as measured by X1 and X2) is related to arithmetic ability (as measured by Y1 and Y2). The approach that a canonical correlation analysis takes to answering this question is to search for a linear combination

U = a_1 X_1 + a_2 X_2

of the reading variables and a linear combination

V = b_1 Y_1 + b_2 Y_2

of the arithmetic variables, where these are chosen so that the correlation between U and V is as large as possible. This is somewhat similar to the idea in a principal component analysis, except that here a correlation is maximized instead of a variance. With X1, X2, Y1 and Y2 standardized to have unit variances, Hotelling found that the best choices for U and V are

U = -2.78 X_1 + 2.27 X_2   and   V = -2.44 Y_1 + 1.00 Y_2.

In general, if there are p variables X1, X2, …, Xp and q variables Y1, Y2, …, Yq then there can be up to the minimum of p and q pairs of canonical variates. That is to say, linear relationships

U_i = a_{i1} X_1 + a_{i2} X_2 + \cdots + a_{ip} X_p
V_i = b_{i1} Y_1 + b_{i2} Y_2 + \cdots + b_{iq} Y_q,   i = 1, 2, …, r,

can be established, where r is the smaller of p and q. These relationships are chosen so that the correlation between U1 and V1 is a maximum; the correlation between U2 and V2 is a maximum, subject to these variables being uncorrelated with U1 and V1; the correlation between U3 and V3 is a maximum, subject to these variables being uncorrelated with U1, V1, U2 and V2; and so on. Each of the pairs of canonical variates (U1, V1), (U2, V2), …, (Ur, Vr) then represents an independent 'dimension' in the relationship between the two sets of variables (X1, …, Xp) and (Y1, …, Yq). The first pair (U1, V1) have the highest possible correlation and are therefore the most important; the second pair (U2, V2) have the second highest correlation and are therefore the second most important; etc.

9.2 Procedure for a canonical correlation analysis

It is fairly easy to program the calculations for a canonical correlation analysis on a microcomputer, providing that suitable routines are available for matrix manipulations.

Assume that the (p + q) × (p + q) correlation matrix between the variables X1, X2, …, Xp, Y1, Y2, …, Yq takes the following partitioned form when it is calculated from the sample for which the variables are recorded:

R = \begin{pmatrix} A & C \\ C' & B \end{pmatrix},

where A is the p × p matrix of correlations among the X variables, B is the q × q matrix of correlations among the Y variables, and C is the p × q matrix of correlations between the X variables and the Y variables.

From this matrix a q × q matrix B^{-1} C' A^{-1} C can be calculated, and the eigenvalue problem

(B^{-1} C' A^{-1} C - \lambda I) b = 0    (9.1)

can be considered. It turns out that the eigenvalues \lambda_1 > \lambda_2 > \cdots > \lambda_r are then the squares of the correlations between the canonical variates. The corresponding eigenvectors b_1, b_2, …, b_r give the coefficients of the Y variables for the canonical variates. The coefficients of U_i, the ith canonical variate for the X variables, are given by the elements of the vector

a_i = A^{-1} C b_i.    (9.2)

In these calculations it is assumed that the original X and Y variables are in a standardized form with means of zero and standard deviations of unity. The coefficients of the canonical variates are for these standardized X and Y variables.

From equations (9.1) and (9.2) the ith pair of canonical variates are calculated as

U_i = a_i' X   and   V_i = b_i' Y,

where X and Y are vectors of standardized data values. As they stand, U_i and V_i will have variances that depend upon the scaling adopted for the eigenvector b_i. However, it is a simple matter to calculate the standard deviation of U_i for the data and divide the a_{ij} values by this standard deviation. This produces a scaled canonical variate U_i with unit variance. Similarly, if the b_{ij} values are divided by the standard deviation of V_i then this produces a scaled V_i with unit variance.

This form of standardization of the canonical variates is not essential, since the correlation between U_i and V_i is not affected by scaling. However, it may be useful when it comes to examining the numerical values of canonical variates for the individuals for which data are available.
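A sketch of equations (9.1) and (9.2) in Python is given below; the function name, and the assumption that `R` is the full (p + q) × (p + q) correlation matrix with the X variables listed first, are ours rather than the book's.

```python
import numpy as np

def canonical_correlations(R, p, q):
    A, C = R[:p, :p], R[:p, p:]
    B = R[p:, p:]
    M = np.linalg.inv(B) @ C.T @ np.linalg.inv(A) @ C   # B^-1 C' A^-1 C
    eigvals, eigvecs = np.linalg.eig(M)
    order = np.argsort(eigvals.real)[::-1]              # largest first
    lam = eigvals.real[order]        # squared canonical correlations
    b = eigvecs.real[:, order]       # Y coefficients, equation (9.1)
    a = np.linalg.inv(A) @ C @ b     # X coefficients, equation (9.2)
    return np.sqrt(np.clip(lam, 0.0, None)), a, b
```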
9.3 Tests of significance

If there are r eigenvalues from equation (9.1) then there are r pairs of canonical variates. However, some of these may reflect correlations that are too small to be statistically significant. Following Bartlett (1947), the statistic

\phi_0^2 = -\{n - 1 - (p + q + 1)/2\} \sum_{i=1}^{r} \log_e(1 - \lambda_i)

is calculated, where n is the number of cases for which data are available. This is compared with the percentage points of the chi-squared distribution with pq degrees of freedom. If \phi_0^2 is significantly large then this establishes that there is at least one significant canonical correlation. If \phi_0^2 is not significantly large then there is no evidence of any relationship between the X and Y variables.

Assuming that \phi_0^2 is significant, the next step involves removing the effect of the first canonical correlation from the test statistic and considering

\phi_1^2 = -\{n - 1 - (p + q + 1)/2\} \sum_{i=2}^{r} \log_e(1 - \lambda_i),

with (p - 1)(q - 1) degrees of freedom. If this is significantly large in comparison with the chi-squared percentage points then there are at least two significant canonical correlations. If \phi_1^2 is not significantly large then the first canonical correlation can be considered to account for all of the relationships between the X and Y variables.

If \phi_0^2 and \phi_1^2 are both significant then the effect of the first two canonical correlations can be removed from the test statistic to see if any of the remaining correlations are significant. This process can continue until it is found that the remaining correlations are no longer significant and hence can be neglected. The test statistic for the remaining correlations after the first j have been removed is

\phi_j^2 = -\{n - 1 - (p + q + 1)/2\} \sum_{i=j+1}^{r} \log_e(1 - \lambda_i),

which is compared with the percentage points of the chi-squared distribution with (p - j)(q - j) degrees of freedom.
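A sketch of this sequence of tests in Python (the function name is ours; `lam` is assumed to hold the eigenvalues of equation (9.1) in decreasing order):

```python
import numpy as np
from scipy.stats import chi2

def bartlett_tests(lam, n, p, q):
    lam = np.asarray(lam, dtype=float)
    for j in range(len(lam)):
        # Remove the first j canonical correlations from the statistic.
        stat = -(n - 1 - (p + q + 1) / 2) * np.log(1 - lam[j:]).sum()
        df = (p - j) * (q - j)
        print(j, round(stat, 2), round(chi2.sf(stat, df), 4))
```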
9.4 Interpreting canonical variates

Having determined the canonical variates for a set of data, the next problem is the interpretation of these variates: what are they measuring? At first sight it may seem that this is a relatively easy question to answer. If

U_i = a_{i1} X_1 + a_{i2} X_2 + \cdots + a_{ip} X_p

and

V_i = b_{i1} Y_1 + b_{i2} Y_2 + \cdots + b_{iq} Y_q

then it seems that U_i can be described in terms of the X variables with large coefficients a_{ij}, and V_i can be described in terms of the Y variables with large coefficients b_{ij}. 'Large' here means, of course, large positive or large negative.

Unfortunately, correlations between the X and Y variables can upset this interpretation process. For example, it can happen that a_{i1} is positive and yet the simple correlation between U_i and X1 is actually negative. This apparent contradiction can come about because X1 is highly correlated with one or more of the other X variables and part of the effect of X1 is being accounted for by the coefficients of these other X variables. In fact, if one of the X variables is almost a linear combination of the other X variables then there will be an infinite variety of linear combinations of the X variables, some of them with very different a_{ij} values, that give virtually the same U_i values. The same can be said about linear combinations of the Y variables. In short, if variables are highly correlated then there can be no way of disentangling their contributions to canonical variates. However, no doubt people will continue to try to make interpretations under these circumstances.

Some authors have suggested that it is better to describe canonical variates by looking at their correlations with the X and Y variables rather than the coefficients a_{ij} and b_{ij}. For example, if U_i is highly positively correlated with X1 then U_i can be considered to reflect X1 to a large extent. Similarly, if V_i is highly negatively correlated with Y1 then V_i can be considered to reflect the opposite of Y1 to a large extent. This approach does at least have the merit of bringing out all of the variables to which the canonical variates seem to be related.
Example 9.1 Environmental and genetic correlations for colonies of Euphydryas editha

The data in Table 1.3 can be used to illustrate the procedure for a canonical correlation analysis. Here there are 16 colonies of the butterfly Euphydryas editha in California and Oregon. These vary with respect to four environmental variables (altitude, annual precipitation, annual maximum temperature and annual minimum temperature) and six genetic variables (percentages of six phosphoglucose-isomerase genes as determined by electrophoresis). Any significant relationships between the environmental and genetic variables are interesting because they may indicate the adaptation of E. editha colonies to their local environments. It should be noted that there is no need to think of the X variables as 'causing' the Y variables, or vice versa. From this point of view the labelling of variables as X's or Y's is arbitrary.

The standardized variables are shown in Table 9.1. These produce the correlation matrix shown in Table 9.2, which is partitioned into the A, B, C and C' matrices as described in Section 9.2.

Table 9.1 Standardized values of the genetic variables X1 to X5 and the environmental variables Y1 to Y4 for the 16 colonies.

Table 9.2 Correlation matrix for the variables X1 to X5 and Y1 to Y4 for the Euphydryas editha data, partitioned into A, B, C and C' matrices.

The eigenvalues obtained from equation (9.1) are 0.7731, 0.5570, 0.1694 and 0.0472. Taking square roots gives the canonical correlations that these provide: 0.879, 0.746, 0.412 and 0.217. The corresponding canonical variates are obtained from equations (9.1) and (9.2). After standardizing to have unit variances these become:

U_1 = -0.675 X_1 + 0.909 X_2 + 0.376 X_3 + 1.442 X_4 + 0.269 X_5
V_1 = -0.114 Y_1 + 0.619 Y_2 - 0.693 Y_3 + 0.048 Y_4

U_2 = -1.087 X_1 + 3.034 X_2 + 2.216 X_3 + 3.439 X_4 + 2.928 X_5
V_2 = -0.777 Y_1 + 0.980 Y_2 - 0.562 Y_3 + 0.928 Y_4

U_3 = 1.530 X_1 + 2.049 X_2 + 2.231 X_3 + 4.916 X_4 + 3.611 X_5
V_3 = -3.654 Y_1 - 0.601 Y_2 - 0.365 Y_3 - 3.623 Y_4

U_4 = 0.284 X_1 - 2.331 X_2 - 0.867 X_3 - 1.907 X_4 - 1.133 X_5
V_4 = 1.594 Y_1 + 0.860 Y_2 + 1.599 Y_3 + 0.742 Y_4
An interpretation can now be given to the first pair of canonical variates (U1, V1). From the equation for U1 it appears that this is a contrast between X1 and the other X variables. It represents a lack of genes with mobility 0.40. On the other hand, V1 has a large positive coefficient for Y2 (precipitation) and a large negative coefficient for Y3 (maximum temperature). It would seem that the 0.40 mobility gene is lacking in colonies with high precipitation and low maximum temperatures.

The correlations between U1 and the five X variables are as follows: U1 and X1, -0.57; U1 and X2, -0.39; U1 and X3, -0.70; U1 and X4, 0.92; U1 and X5, -0.36. Thus U1 is highly positively correlated with X4 (the percentage of mobility 1.00 genes) and negatively correlated with the other X variables. This suggests that U1 is best interpreted as indicating a high frequency of mobility 1.00 genes. This is a somewhat different interpretation from the one given by a consideration of the coefficients of U1 for the X variables. On the whole, the interpretation based on correlations seems best. However, as mentioned in the previous section, there are real problems about interpreting canonical variates when the variables that they are constructed from have high correlations. Table 9.2 shows that this is indeed the case with the present example.

The correlations between V1 and the four Y variables are as follows: V1 and Y1, 0.77; V1 and Y2, 0.85; V1 and Y3, -0.86; V1 and Y4, -0.78. Thus V1 seems to be associated with high altitude and precipitation and low temperatures.

Taken together, the interpretation of U1 and V1 based on correlations suggests that the percentage of mobility 1.00 genes is high for colonies with high altitudes and precipitation and low temperatures. This does indeed show up in the data to some extent. For example, colony 16 has the highest value of X4, high values for Y1 and Y2, and low values for Y3 and Y4 (Table 9.1).

9.5 Computational methods

A canonical correlation analysis involves the inversion of a matrix and the solution of the eigenvalue problem of equation (9.1). This equation can be rewritten as

(C' A^{-1} C - \lambda B) b = 0.

Here C'A^{-1}C and B are symmetric, and hence the eigenvalues and eigenvectors can be found using Algorithm 15 of Nash (1979). This approach requires that the number of positive eigenvalues is equal to q, the number of rows and columns in the matrices C'A^{-1}C and B. This will not be the case if there are more Y variables than X variables in a canonical correlation analysis. Hence the variables should be labelled so that the number of Y variables is less than or equal to the number of X variables. This is no real problem since there is no implication in a canonical correlation analysis that one of the sets of variables is dependent on the other set, although this may sometimes be the case.

The inversion of the matrix A can be done using either Algorithm 5 or Algorithm 9 of Nash (1979).
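The same symmetric form can be sketched in Python, with scipy's generalized symmetric eigensolver taking the place of Nash's Algorithm 15 (the function name is ours):

```python
import numpy as np
from scipy.linalg import eigh

def canonical_eigen(A, B, C):
    lhs = C.T @ np.linalg.inv(A) @ C   # C'A^-1 C, symmetric
    lam, b = eigh(lhs, B)              # solves (C'A^-1 C - lambda B) b = 0
    return lam[::-1], b[:, ::-1]       # order largest eigenvalue first
```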
9.6 Further reading

Gittins (1985) has written a book on canonical correlation intended mainly for ecologists. About half of the work is devoted to theory and methods, and the remainder to a number of specific examples in the area of plant ecology. Less detailed introductions are provided by Clarke (1975) and Levine (1977).

References

Bartlett, M.S. (1947) The general canonical correlation distribution. Annals of Mathematical Statistics 18, 1-17.
Clarke, D. (1975) Understanding Canonical Correlation Analysis. Concepts and Techniques in Modern Geography 3, Geo Abstracts, Norwich, UK.
Gittins, R. (1985) Canonical Analysis: A Review with Applications in Ecology. Biomathematics 12, Springer-Verlag, Berlin.
Hotelling, H. (1936) Relations between two sets of variates. Biometrika 28, 321-77.
Levine, M.S. (1977) Canonical Analysis and Factor Comparisons. Sage University Papers on Quantitative Applications in the Social Sciences 07-006, Sage Publications, Beverly Hills.
Nash, J.C. (1979) Compact Numerical Methods for Computers. Adam Hilger, Bristol.
- - - CHAPTER TEN - - -

Multidimensional scaling

10.1 Constructing a 'map' from a distance matrix

Consider, for example, the four objects A, B, C and D shown in Fig. 10.1. Here the distances apart are given by the array:

      A     B     C     D
A     0     6.0   6.0   2.5
B     6.0   0     9.5   7.8
C     6.0   9.5   0     3.5
D     2.5   7.8   3.5   0

Figure 10.1 A map of the relationship between four objects.

It is also apparent that if more than three objects are involved then they may not lie on a plane. In that case their distance matrix will implicitly contain this information. For example, the distance array:

      A     B     C     D
A     0     1     √2    √2
B     1     0     1     1
C     √2    1     0     √2
D     √2    1     √2    0

is such that three dimensions are required to show the spatial relationships between the four objects. Unfortunately, with real data it is not usually known how many dimensions are needed for a representation. Hence a range of dimensions has to be tried.

The usefulness of multidimensional scaling comes from the fact that situations often arise where the relationship between objects is not known, but a distance matrix can be estimated. This is particularly the case in psychology, where subjects can say how similar or different
individual pairs of objects are, but they cannot draw an overall picture of the relationships between the objects. Multidimensional scaling can then provide a picture. The main applications to date have been in psychology and sociology.

At the present time there are a wide variety of data analysis techniques that go under the general heading of multidimensional scaling. Here only the simplest of these will be considered, these being the classical methods proposed by Torgerson (1952) and Kruskal (1964).

10.2 Procedure for multidimensional scaling

A classical multidimensional scaling starts with a matrix of distances between n objects which has δ_{ij}, the distance from object i to object j, in the ith row and jth column. The number of dimensions, t, for the mapping of the objects is fixed for a particular solution. Different computer programs use different methods for carrying out the analysis but, generally, something like the following steps are involved:

1. A starting configuration is set up for the n objects in t dimensions, i.e. coordinates (x1, x2, …, xt) are assumed for each object in a t-dimensional space.
2. The distances d_{ij} between the objects are calculated for this configuration.
3. A regression of the configuration distances d_{ij} on the data distances δ_{ij} is carried out, and the fitted values from this regression, the 'disparities' d̂_{ij}, are calculated.
4. The goodness of fit between the configuration distances and the disparities is measured by a statistic called 'stress formula 1', which is

STRESS1 = \left\{ \sum (d_{ij} - \hat{d}_{ij})^2 \Big/ \sum d_{ij}^2 \right\}^{1/2}    (10.1)

The description 'stress' is used since the statistic is a measure of the extent to which the spatial configuration of points has to be stressed in order to obtain the data distances δ_{ij}.
5. The coordinates (x1, x2, …, xt) of each object are changed slightly in such a way that the stress is reduced.

Steps 2 to 5 are repeated until it seems that the stress cannot be further reduced. The outcome of the analysis is then the coordinates of the n individuals in t dimensions. These coordinates can be used to draw a 'map' which shows how the individuals are related.

It is desirable that a good solution is found in three or fewer dimensions, since a graphical representation of the n objects is then straightforward. Obviously this is not always possible.
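Equation (10.1) is easily sketched in Python; `d` and `dhat` are assumed to be arrays holding corresponding configuration distances and disparities:

```python
import numpy as np

def stress1(d, dhat):
    # Kruskal's stress formula 1, equation (10.1).
    d, dhat = np.asarray(d, float), np.asarray(dhat, float)
    return np.sqrt(((d - dhat) ** 2).sum() / (d ** 2).sum())
```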
Example 10.1 Road distances between New Zealand towns
Table 10.1 Road distances between 13 towns in the South Island of New Zealand.

Figure 10.3 The south island of New Zealand. Main roads are indicated by broken lines. The 13 towns used for Example 10.1 are indicated.

The coordinates obtained for the towns on the 'map' produced in the analysis are shown in Table 10.2. A plot of the towns using these coordinates is shown in Fig. 10.4. A comparison of this figure with Fig. 10.3 indicates that the multidimensional scaling has been quite successful in recovering the map of the South Island. On the whole the towns are shown with the correct relationships to each other. An exception is Milford. Because
this can only be reached by road through Te Anau, the 'map' produced by multidimensional scaling has made Milford closest to Te Anau. In fact, Milford is geographically closer to Queenstown than it is to Te Anau.

Table 10.2 Coordinates produced by multidimensional scaling applied to the distances between 13 towns shown in Table 10.1. These are the coordinates that the towns are plotted against in Fig. 10.4.

Town            Dimension 1   Dimension 2
Alexandra           0.72        -0.32
Balclutha           0.84         0.78
Blenheim           -1.99         0.43
Christchurch       -0.92         0.34
Dunedin             0.52         0.46
Franz Josef        -0.69        -1.23
Greymouth          -1.32        -0.57
Invercargill        1.28         0.39
Milford             1.83        -0.33
Nelson             -2.33         0.07
Queenstown          0.81        -0.49
Te Anau             1.47        -0.26
Timaru             -0.19         0.64

All that is important with the configuration produced by multidimensional scaling is the relative positions of the objects being considered. This is unchanged by a rotation or a reflection. It is also unchanged by a magnification or contraction of all the scales. That is, the size of the configuration is not important. For this reason ALSCAL-4 always scales the configuration so that the average coordinate is zero in all dimensions and the sum of the squared coordinates is equal to the number of objects multiplied by the number of dimensions. Thus in Table 10.2 the sum of the coordinates is zero for each of the two dimensions and the total of the coordinates squared is 26.
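A sketch of this kind of analysis in Python, with scikit-learn's MDS standing in for ALSCAL-4 (the 13 × 13 array `road`, holding the distances of Table 10.1, is assumed):

```python
from sklearn.manifold import MDS

mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords = mds.fit_transform(road)   # one (x, y) pair per town, as in Table 10.2
```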
Example 10.2 Voting behaviour of Congressmen
For a second example of the value of multidimensional scaling, consider the distance matrix shown in Table 10.3. Here the 'distances' are between 15 New Jersey Congressmen in the United States House of Representatives. They are simply a count of the number of voting disagreements on 19 bills concerned with environmental matters. For example, Congressmen Hunt and Sandman disagreed 8 out of the 19 times, Sandman and Howard disagreed 17 out of the 19 times, etc. An agreement was considered to occur if two Congressmen both voted yes, both voted no, or both failed to vote. The table of distances was constructed from original data given by Romesburg (1984, p. 155).

Table 10.3 Numbers of voting disagreements on 19 environmental bills for 15 New Jersey Congressmen.

Two analyses were carried out using the ALSCAL-4 program. The first was a classical metric multidimensional scaling, which assumes that the distances of Table 10.3 are measured on a ratio scale. That is to say, it is assumed that doubling a distance value is equivalent to assuming that the configuration distance between two objects is doubled. This means that the regression at step 3 of the procedure described above is of the form \hat{d}_{ij} = b \delta_{ij}.
I" '~
C\
:.c found to be 0.065, 0.089 and 0.134, respecti vely. Thc di stinctly lower
C-:;:
v:: _
"'..:.:: stress values fol' non-metric scaling suggest thal this is preferable to
'-'
... <.>
c.~
- r....
10\:; metric scaling for these data. The three-dimensional non-mctric - 'o
'" e
~ g solution has only slightly more stress than the four-dimensional
'-
c';;::
... 1 '" (' 1 'D v
solution. This three-dimcnsional solution is therefore the one that will .
<J e
o
v:
-("1
be considered in more detail.
=:0- V) V)
.3-
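A sketch of the non-metric fits in several dimensions with scikit-learn (the 15 × 15 array `disagree`, holding the counts of Table 10.3, is assumed):

```python
from sklearn.manifold import MDS

for dim in (2, 3, 4):
    nmds = MDS(n_components=dim, metric=False,
               dissimilarity='precomputed', random_state=0)
    nmds.fit(disagree)
    print(dim, nmds.stress_)   # stress of the fitted non-metric solution
```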
Table 10.4 shows the coordinates of the Congressmen for the three-dimensional solution. A plot for the first two dimensions is shown in Fig. 10.5. The value for the third dimension is shown for each plotted point in Fig. 10.5, where this dimension indicates how far a three-dimensional plot would place the point above or below the two-dimensional plane. For example, Daniels should be plotted 0.52 units above the plane and Rinaldo should be plotted 0.27 units below the plane.

Table 10.4 Coordinates of the 15 Congressmen obtained from a three-dimensional non-metric multidimensional scaling of the distance matrix given in Table 10.3.

Congressman      Dimension 1   Dimension 2   Dimension 3
Hunt                 2.25          0.15          0.53
Sandman              1.74          2.06          0.64
Howard              -1.37         -0.01          0.34
Thompson            -0.85          1.42         -0.45
Frelinghuysen        1.47         -0.83         -1.23
Forsythe             0.81         -0.93         -0.43
Widnall              2.25         -0.28         -0.46
Roe                 -1.40         -0.01          0.60
Helstoski           -1.50          0.22         -0.18
Rodino              -1.09         -0.19          0.10
Minish              -1.13         -0.21         -0.24
Rinaldo             -1.27         -0.18         -0.27
Maraziti             1.20         -1.20          0.97
Daniels             -0.12         -0.16          0.52
Patten              -0.99          0.14         -0.94
Figure 10.5 Plot of Congressmen against the first two dimensions of the configuration produced by a three-dimensional classical non-metric multidimensional scaling of the data in Table 10.3. Open circles indicate Democrats, closed circles indicate Republicans. The coordinate for dimension 3 is indicated in parentheses for each point.
From Fig. 10.5 it is clear that dimension 1 is largely reflecting party differences. The Democrats fall on the left-hand side of the figure and the Republicans, other than Rinaldo, on the right-hand side.

To interpret dimension 2 it is necessary to consider what it is about the voting of Sandman and Thompson, who have the highest two scores, that contrasts with Maraziti and Forsythe, who have the two lowest scores. This points to the number of abstentions from voting. Sandman abstained from nine votes and Thompson abstained from six votes. Individuals with low scores on dimension 2 voted all or most of the time.

Dimension 3 appears to have no simple or obvious interpretation. It must reflect certain aspects of differences in voting patterns. However, these will not be considered for the present example. It suffices to say that the analysis has produced a representation of the Congressmen in three dimensions that indicates how they relate with regard to voting on environmental issues.
Figure 10.6 Plot of the distances between points on the derived configuration against the corresponding disparities.

Figure 10.7 Plot of configuration distances against the corresponding distances in the data.

Figure 10.8 Plot of the monotonic regression of estimated disparities against data distances. The letter M indicates that more than nine points fall in the same position. In fact there are twelve distances of 12 in Table 10.3 that all yield estimated disparities of about 2.85.
Three graphs output by ALSCAL-4 are helpful in assessing the accuracy of the solution that has been obtained. Figure 10.6 shows the first of these, which is a plot of the distances between points on the derived configuration, d_{ij}, against the disparities, d̂_{ij}. The figure indicates the lack of fit of the solution, since the disparities are the 105 data distances of Table 10.3 after they have been scaled to match the configuration as closely as possible. A plot like that of Fig. 10.6 would be a straight line if all of the distances and disparities were equal.

Figure 10.7 is a plot of the distances between the configuration points (d_{ij}) against the original data distances (δ_{ij}). This relationship does not have to be a straight line with non-metric scaling. However, scatter about an underlying trend line does indicate lack of fit of the model. For example, in Table 10.3 there are eight distances of 5.

References

Kruskal, J.B. (1964) Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29, 1-27.
Romesburg, H.C. (1984) Cluster Analysis for Researchers. Lifetime Learning Publications, Belmont, California.
Schiffman, S.S., Reynolds, M.L. and Young, F.W. (1981) Introduction to Multidimensional Scaling. Academic Press, Orlando, Florida.
Torgerson, W.S. (1952) Multidimensional scaling. I. Theory and method. Psychometrika 17, 401-19.
Young, F.W. and Lewyckyj, R. (1979) ALSCAL-4 User's Guide. Psychometric Laboratory, University of North Carolina, Chapel Hill.
- - - CHAPTER ELEVEN - - -

Epilogue

11.1 The next step

In writing this book my aims have purposely been rather limited. These aims will have been achieved if someone who has read the previous chapters carefully has a fair idea of what can and what cannot be achieved by the multivariate statistical methods that are most widely used. My hope is that the book will help many people take the first step in 'a journey of a thousand miles'.

For those who have taken this first step, the way to proceed further is to gain experience of multivariate methods by analysing different sets of data and seeing what results are obtained. It will be very helpful, if not essential, to get access to one of the larger statistical packages and investigate the different options that are available. Like other areas of applied statistics, competence in multivariate analysis requires practice. To this end, the Appendix contains some sets of data with suggestions about how to examine them.

11.2 Some general reminders

In developing expertise and familiarity with multivariate analyses there are a few general points that are worth keeping in mind. Actually, these points are just as relevant to univariate analyses. However, they are still worth emphasizing in the multivariate context.

First, it should be remembered that there are often alternative ways of approaching the analysis of a particular set of data, none of which is necessarily the 'best'. Indeed, several types of analysis may well be carried out to investigate different aspects of the same data. For example, the body measurements of female sparrows given in …

Second, it is important to think about what is wanted from an analysis before it is begun. Otherwise the analyst may find themselves sitting in front of a large pile of computer output with the realization that it tells them nothing that they really want to know.

Third, multivariate analysis does not always work in terms of producing a 'neat' answer. There is an obvious bias in statistical textbooks and articles towards examples where results are straightforward and conclusions are clear. In real life this does not happen quite so often. Do not be surprised if multivariate analyses fail to give satisfactory results on the data that you are really interested in! It may well be that the data have a message to give, but the message cannot be read using the somewhat simple models that standard analyses are based on. For example, it may be that variation in a multivariate set of data can be completely described by two or three underlying factors. However, these may not show up in a principal component analysis or a factor analysis because the relationship between the observed variables and the factors is not a simple linear one.

Finally, there is always the possibility that an analysis is dominated by one or two rather extreme observations. These 'outliers' can sometimes be found by simply scanning the data by eye, or by considering frequency tables for the distributions of individual variables. In some cases a more sophisticated multivariate method may be required. A large Mahalanobis distance from an observation to the mean of all observations is one indication of a multivariate outlier (see Section 4.3).

It may be difficult to decide what to do about an outlier. If it is due to a recording error or some other definite mistake then it is fair enough to exclude it from the analysis. However, if the observation is a genuine value then this is not valid. Appropriate action then depends on the particular circumstances. Hawkins (1980) has considered the problem of outliers at some length.
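A sketch of the Mahalanobis screening for multivariate outliers mentioned above (the function name is ours; X is an n × p data array):

```python
import numpy as np

def mahalanobis_squared(X):
    centred = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    # Squared Mahalanobis distance of each observation from the mean vector;
    # unusually large values point to possible multivariate outliers.
    return np.einsum('ij,jk,ik->i', centred, inv_cov, centred)
```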
11.4 Missing values

Missing values can cause more problems with multivariate data than with univariate data. The trouble is that when there are many variables being measured on each individual it is quite often the case that one or two of these variables have missing values. It may then happen that if individuals with any missing values are excluded from an analysis this means excluding quite a large proportion of individuals, which may be completely impractical. For example, in studying ancient human populations, skeletons are frequently broken and incomplete.

Texts on multivariate analysis are often remarkably silent on the question of missing values. To some extent this is because doing something about missing values is by no means a straightforward matter. Seber (1984) gives a discussion of the problem, with references. In practice, computer packages sometimes include a facility for estimating missing values. For example, the BMDP package (Dixon, 1983) allows missing values to be estimated by several different 'common sense' methods. One possible approach is therefore to estimate missing values and then analyse the data, including these estimates, as if they were complete data in the first place. It seems reasonable to suppose that this procedure will work satisfactorily providing that only a small proportion of values are missing.
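One of the simplest 'common sense' estimates is to replace each missing value by the mean of its variable; a minimal sketch of this (our own, not BMDP's particular methods) is:

```python
import numpy as np

def mean_impute(X):
    # X is an n x p array with missing values coded as NaN.
    X = np.asarray(X, dtype=float).copy()
    col_means = np.nanmean(X, axis=0)
    holes = np.isnan(X)
    X[holes] = np.take(col_means, np.nonzero(holes)[1])
    return X
```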
Appendix: example sets of data

Data Set 1: Prehistoric goblets from Thailand

Figure A.1 Measurements made on pottery goblets from Thailand.

Table A.1 Measurements made on 25 pottery goblets from Thailand, for the six dimensions X1 to X6 defined in Figure A.1 (a dash marks a value that could not be recovered).

Goblet   X1   X2   X3   X4   X5   X6
1        13   21   23   14    7    8
2        14   14   24   19    5    9
3        19    -   24   20    6   12
4        17   18   16   16   11    8
5        19   20   16   16   10    7
6        12   20   24   17    6    9
7        12   19   22   16    6   10
8        12    -   25   15    7    7
9        11   15   17   11    6    5
10        -   13   14   11    7    4
11        -   20   25   18    5   12
12       13   21   23   15    9    8
13        -   15   19   12    5    6
14       13   22   26   17    7   10
15       14   21   26   15    7    9
16       14   19   20   17    5   10
17       15   16   15   15    9    7
18       19   21   20   16    9   10
19       12   20   26   16    7   10
20       17   20   27   18    6   14
21       13   20   27   17    6    9
22        9    9   10    7    4    3
23        8    8    7    5    2    2
24        9    9    8    4    2    2
25       12   19   27   18    5   12

Data source: Professor C.F.W. Higham, University of Otago.
Possible ways to approach these questions are by cluster analysis (Chapter 8), by plotting the goblets against values for their first two principal components (Chapter 5), or by carrying out a multidimensional scaling (Chapter 10). The distance matrix for a multidimensional scaling can simply be constructed of Euclidean distances between the goblets, as defined in Section 4.2.

One point that needs consideration in this example is the extent to which differences between goblets are due to shape differences rather than size differences. It may well be considered that two goblets that are almost the same shape but have very different sizes are 'similar'. One way to remove most of the size differences is to divide each of the measurements for a goblet by one of the measurements, say the total height of the body, or by the sum of all the measurements for that goblet. This standardization will ensure that goblets with the same shape but different sizes will have similar data values.
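A minimal sketch of this size standardization in Python (the 25 × 6 array `goblets`, holding Table A.1, is assumed):

```python
import numpy as np

def shape_standardize(goblets):
    # Divide each goblet's measurements by their sum, so that two goblets
    # of the same shape but different sizes get similar rows of values.
    goblets = np.asarray(goblets, dtype=float)
    return goblets / goblets.sum(axis=1, keepdims=True)
```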
Data Set 2: Canine groups from Asia

Example 1.4 of Chapter 1 concerned the comparison between prehistoric dogs from Thailand and six other related animal groups in terms of mean mandible measurements. Table A.2 shows some more detailed data for the comparison of five of these groups. This is a part of the more extensive data discussed in the paper by Higham et al. (1980).

There are several questions that can be addressed using the data of Table A.2. Do all the animal groups display the same amount of variation in mandible measurements? Are there significant differences in mean values between the groups and, if this is so, to what extent is it possible to separate the individuals in the groups using the nine mandible measurements? Within each of the first four groups, what differences exist between males and females? Are there any outliers, i.e. individuals with measurements that appear to be anomalous?
Table A.2 Mandible measurements X1 to X9, with sex (M or F), for individual animals in the canine groups (a dash marks a value that could not be recovered).

          X1     X2    X3    X4    X5    X6    X7    X8    X9   Sex
Golden jackals
1       120.0   8.2    -    17.3  18.1   7.0  32.0    -     -    -
2       106.8   7.9  16.6   16.5    -     -     -   34.7   5.2   M
3       110.0   8.1  17.6   16.0  19.7   7.0  32.1  33.6   5.3   M
4       115.6   8.5  20.0    -    19.0   7.1  30.7  32.5   4.7   M
5       113.6   8.2  18.7    -    18.0   7.1  31.5  33.1   4.7   M
6       111.0   8.5  19.0   16.4  18.8   7.9  31.5  33.0   5.1   M
7       112.9   8.5  17.3   17.6  18.1   7.1  30.0  32.7   5.0   M
8       116.5   8.7  20.2   16.7  18.5   7.1  30.0  33.6   4.6   M
9       113.6   9.4  21.0   18.5  18.3   7.0  29.9  34.3   5.2   M
10      111.9   8.2  19.0   16.8  18.7   7.5  31.1  34.8   5.3   M
11      109.5   8.5  18.3   16.9  18.5   6.8  29.7  34.0   5.1   M
12      111.3   7.7  19.9    -    19.2   7.0  31.1  33.2   4.9   F
13      106.9   7.2  16.5   16.0  17.5   6.7  29.7  31.9   4.5   F
14      108.0   8.2  18.4   16.2  17.5   6.0  28.0  35.0   4.7   F
15      109.5   7.3  19.2   15.8  17.4    -   28.7  32.5   4.8   F
16      104.6   8.3  18.6   16.9  17.2   6.5  29.8  33.3   4.8   F
17      106.9   8.4  18.0   16.9  17.7   6.2  29.2  32.4   4.5   F
18      105.5   7.8  18.9   18.4  18.1   6.2  30.6  31.0   4.3   F
19      111.2   8.4  16.6   15.9  18.2   7.0  30.3  31.6   4.4   F
20      111.0   7.6  18.7   16.5  17.8   6.5  29.9  34.8   4.6   F
Cuons
1       123.0   9.7  21.8   20.7  20.2   7.8  26.9  36.1   6.1   M
2       135.3  11.8  24.9   21.2  22.7   8.9  30.5  37.6   7.1   M
3         -    11.4  25.4   25.0  22.4   9.0  29.8  37.8   7.3   M
4       141.3  10.8  26.0   24.7  21.3    -   28.6  39.1   6.6   M
5       134.7  11.2  25.0   24.5  21.2   8.5  28.6  39.2   6.7   M
6       135.8  11.0  22.1   24.3  21.6   8.1  31.4  39.3   6.8   M
7       131.1  10.4  22.9    -    22.5   8.7  29.8  36.1   6.8   M
8       137.3  10.6  25.4   23.8  21.3   8.3  28.0  37.8   6.5   M
9       135.0  10.5  25.0   24.5  21.0   8.4  28.6  39.2   6.9   M
10      130.7  10.9  24.5   24.0  21.0   8.5  29.3  34.9   6.2   F
11      129.7  11.3  22.3   23.1  21.1   8.7  29.2  36.5   7.0   F
12      144.0  10.8  24.2   25.9  22.2   8.9  29.6  42.0   7.1   F
13      138.5  10.9  25.6   23.2  21.7   8.7  29.5  39.2   6.9   F
14      123.0   9.8  23.2   22.2  19.7   8.1  26.1  34.0   5.6   F
15      137.1  11.3  26.7   25.6  22.5   8.7  29.8  38.8   6.5   F
16      127.9  10.0  21.6   22.7  21.8   8.7  28.6  37.0   6.6   F
17      121.8   9.9  22.1   21.7  20.0   8.2  26.4  36.0   5.7   F
Indian wolves
1       166.8  11.5  29.0   27.9  25.3   9.5  40.5  45.2   7.2   M
2       164.3  12.3  27.0   26.0  25.3  10.0  41.6  47.3   7.9   M
3       149.5  11.5    -    23.5  24.6   9.3  41.3  45.5   8.5   M
4       145.5  11.3  28.0   23.8  24.3   9.2  35.5  41.2   7.2   M
5       176.8  12.4  31.3   26.6  27.3  10.5  42.9  49.8   7.9   M
6       165.8  13.4  31.7   26.5  25.5   9.5  40.3  47.0   7.3   M
7       163.6  12.1  27.1   24.3  25.0   9.9  42.1  44.5   8.3   M
8       165.1  12.6  29.5   25.5  24.7   7.7  39.9  43.4   7.9   M
Data Set 3: Protein consumption in Europe

Table A.3 Protein consumption in European countries, divided among nine food groups.

Country          Red meat  White meat  Eggs  Milk  Fish  Cereals  Starchy foods  Pulses, nuts, oil-seeds  Fruits, vegetables
Albania            10.1       1.4      0.5    8.9   0.2   42.3        0.6               5.5                    1.7
Austria             8.9      14.0      4.3   19.9   2.1   28.0        3.6               1.3                    4.3
Belgium            13.5       9.3      4.1   17.5   4.5   26.6        5.7               2.1                    4.0
Bulgaria            7.8       6.0      1.6    8.3   1.2   56.7        1.1               3.7                    4.2
Czechoslovakia      9.7      11.4      2.8   12.5   2.0   34.3        5.0               1.1                    4.0
Denmark            10.6      10.8      3.7   25.0   9.9   21.9        4.8               0.7                    2.4
East Germany        8.4      11.6      3.7   11.1   5.4   24.6        6.5               0.8                    3.6
Finland             9.5       4.9      2.7   33.7   5.8   26.3        5.1               1.0                    1.4
France             18.0       9.9      3.3   19.5   5.7   28.1        4.8               2.4                    6.5
Greece             10.2       3.0      2.8   17.6   5.9   41.7        2.2               7.8                    6.5
Hungary             5.3      12.4      2.9    9.7   0.3   40.1        4.0               5.4                    4.2
Ireland            13.9      10.0      4.7   25.8   2.2   24.0        6.2               1.6                    2.9
Italy               9.0       5.1      2.9   13.7   3.4   36.8        2.1               4.3                    6.7
Netherlands         9.5      13.6      3.6   23.4   2.5   22.4        4.2               1.8                    3.7
Norway              9.4       4.7      2.7   23.3   9.7   23.0        4.6               1.6                    2.7
Poland              6.9      10.2      2.7   19.3   3.0   36.1        5.9               2.0                    6.6
Portugal            6.2       3.7      1.1    4.9  14.2   27.0        5.9               4.7                    7.9
Romania             6.2       6.3      1.5   11.1   1.0   49.6        3.1               5.3                    2.8
Spain               7.1       3.4      3.1    8.6   7.0   29.2        5.7               5.9                    7.2
Sweden              9.9       7.8      3.5   24.7   7.5   19.5        3.7               1.4                    2.0
Switzerland        13.1      10.1      3.1   23.8   2.3   25.6        2.8               2.4                    4.9
UK                 17.4       5.7      4.7   20.6   4.3   24.3        4.7               3.4                    3.3
USSR                9.3       4.6      2.1   16.6   3.0   43.6        6.4               3.4                    2.9
West Germany       11.4      12.5      4.1   18.8   3.4   18.6        5.2               1.5                    1.8
Yugoslavia          4.4       5.0      1.2    9.5   0.6   55.9        3.0               5.7                    3.2
References

Higham, C.F.W., Kijngam, A. and Manly, B.F.J. (1980) An analysis of prehistoric canid remains from Thailand. Journal of Archaeological Science 7, 149-65.

Author index

Gittins, R. 125
Goldstein, M. 98, 99
Gordon, A.D. 57, 113
Namboodiri, K. 25
Nash, J.C. 25, 41, 57, 58, 71, 98, 99, 125
Subject index

Graphical methods 143-4
Group average linkage 101-4
Hierarchic clustering 101-4, 108, 111
Hotelling's T² test, see T²-test
Jackknife classification 97
Kaiser normalization 75-6, 80
Levene's test 33, 34-5, 66-7
Likelihood ratio test on sample mean vectors 37-9, 41, 89-90
Mahalanobis distance, see Multivariate distances
Mantel's test 42, 53-7
Missing values 144
Multidimensional scaling 14, 126-41, 148
Multiple regression 1, 114-16, 120
Multivariate distances 42-53
  between individuals 42-7
  Euclidean 43-5, 105-6, 108, 111, 128, 148
  from proportions 52-3
  Mahalanobis 47-52, 57, 87, 90, 94, 96, 97, 98, 143
  Penrose 47-52, 56
  with cluster analysis 101, 105-6
  with multidimensional scaling 126-7, 128, 133
Multivariate normal distribution, see Normal distribution
Multivariate residuals 48, 143, 148
t-test 26-7, 28, 33, 66
T²-test 28-31, 32, 33, 35, 41, 89
Two factor theory of mental tests 73
Type one error 32
Van Valen's test 33-4, 35, 40
Varimax rotation 75, 80, 84
Vector of means 22-4