Generalized Linear Models
Copying prohibited
All rights reserved. No part of this publication may be reproduced or
transmitted in any form or by any means, electronic or mechanical,
including photocopying, recording, or any information storage and
retrieval system, without permission in writing from the publisher.
The papers and inks used in this product are environment-friendly.
Art. No 31023
eISBN10 91-44-03141-6
eISBN13 978-91-44-03141-5
© Ulf Olsson and Studentlitteratur 2002
Cover design: Henrik Hast
Printed in Sweden
Studentlitteratur, Lund
Web-address: www.studentlitteratur.se
Contents
Preface
2 Generalized Linear Models
2.1 Introduction
2.1.1 Types of response variables
2.1.2 Continuous response
2.1.3 Response as a binary variable
2.1.4 Response as a proportion
2.1.5 Response as a count
2.1.6 Response as a rate
2.1.7 Ordinal response
2.2 Generalized linear models
2.3 The exponential family of distributions
2.3.1 The Poisson distribution
2.3.2 The binomial distribution
2.3.3 The Normal distribution
2.3.4 The function b(·)
2.4 The link function
2.4.1 Canonical links
2.5 The linear predictor
2.6 Maximum likelihood estimation
2.7 Numerical procedures
2.8 Assessing the fit of the model
2.8.1 The deviance
2.8.2 The generalized Pearson χ² statistic
2.8.3 Akaike's information criterion
2.8.4 The choice of measure of fit
2.9 Different types of tests
2.9.1 Wald tests
2.9.2 Likelihood ratio tests
2.9.3 Score tests
2.9.4 Tests of Type 1 or 3
2.10 Descriptive measures of fit
2.11 An application
2.12 Exercises
3 Model diagnostics
3.1 Introduction
3.2 The Hat matrix
3.3 Residuals in generalized linear models
3.3.1 Pearson residuals
3.3.2 Deviance residuals
3.3.3 Score residuals
3.3.4 Likelihood residuals
3.3.5 Anscombe residuals
3.3.6 The choice of residuals
3.4 Influential observations and outliers
3.4.1 Leverage
3.4.2 Cook's distance and Dfbeta
3.4.3 Goodness of fit measures
3.4.4 Effect on data analysis
3.5 Partial leverage
3.6 Overdispersion
3.6.1 Models for overdispersion
3.7 Non-convergence
3.8 Applications
3.8.1 Residual plots
3.8.2 Variance function diagnostics
3.8.3 Link function diagnostics
3.8.4 Transformation of covariates
3.9 Exercises
5 Binary and binomial response variables
5.1 Link functions
5.1.1 The probit link
5.1.2 The logit link
5.1.3 The complementary log-log link
5.2 Distributions for binary and binomial data
5.2.1 The Bernoulli distribution
5.2.2 The Binomial distribution
5.3 Probit analysis
5.4 Logit (logistic) regression
5.5 Multiple logistic regression
5.5.1 Model building
5.5.2 Model building tools
5.5.3 Model diagnostics
5.6 Odds ratios
5.7 Overdispersion in binary/binomial models
5.7.1 Estimation of the dispersion parameter
5.7.2 Modeling as a beta-binomial distribution
5.7.3 An example of over-dispersed data
5.8 Exercises
7 Ordinal response
7.1 Arbitrary scoring
7.2 RC models
7.3 Proportional odds
7.4 Latent variables
7.5 A Genmod example
7.6 Exercises
8 Additional topics
8.1 Variance heterogeneity
8.2 Survival models
8.2.1 An example
8.3 Quasi-likelihood
8.4 Quasi-likelihood for modeling overdispersion
8.5 Repeated measures: the GEE approach
8.6
8.7
Bibliography
Preface
* Poisson regression
* Log-linear models
* Generalized estimating equations
...
In this book we give an overview of generalized linear models and present
examples of their use. We assume that the reader has a basic understanding
of statistical principles. Particularly important is a knowledge of statistical
model building, regression analysis and analysis of variance. Some knowledge
of matrix algebra (which is summarized in Appendix A), and knowledge of
basic calculus, are mathematical prerequisites. Since many of the examples
are based on analyses using SAS, some knowledge of the SAS system is recommended.
In Chapter 1 we summarize some results on general linear models, assuming
equal variances and normal distributions. The models are formulated in
matrix terms. Generalized linear models are introduced in Chapter 2. The
exponential family of distributions is discussed, and we discuss Maximum
Likelihood estimation and ways of assessing the fit of the model. This chapter
provides the basic theory of generalized linear models. Chapter 3 covers
model checking, which includes ways of assessing whether the data
deviate from the model in some systematic way. In Chapters 4–7 we consider
applications for different types of response variables. Response variables as
continuous variables, as binary/binomial variables, as counts and as ordinal
response variables are discussed, and practical examples using the Genmod
software of the SAS package are given. Finally, in Chapter 8 we discuss theory
and applications of a more complex nature, like quasi-likelihood procedures,
repeated measures models, mixed models and analysis of survival data.
Terminology in this area of statistics is a bit confused. In this book we will let
the acronym GLM denote General Linear Models, while we will let GLIM
denote Generalized Linear Models. This is also a way of paying homage
to two useful computer procedures, the GLM procedure of the SAS package,
and the pioneering GLIM software.
Several students and colleagues have read and commented on earlier versions
of the book. In particular, I would like to thank Gunnar Ekbohm, Jan-Eric
Englund, Carolyn Glynn, Anna Gunsjö, Esbjörn Ohlsson, Tomas Pettersson
and Birgitta Vegerfors for giving many useful comments.
Most of the data sets for the examples and exercises are available on the
Internet. They can be downloaded from the publisher's home page at
http://www.studentlitteratur.se.
1.1
Models play an important role in statistical inference. A model is a mathematical way of describing the relationships between a response variable and
a set of independent variables. Some models can be seen as a theory about
how the data were generated. Other models are only intended to provide a
convenient summary of the data. Statistical models, as opposed to deterministic models, account for the possibility that the relationship is not perfect.
This is done by allowing for unexplained variation, in the form of residuals.
(1.1)
1.2
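In a general linear model (GLM), the observed value of the dependent variable
$y$ for observation number $i$ ($i = 1, 2, \ldots, n$) is modeled as a linear function of
$(p-1)$ so-called independent variables $x_1, x_2, \ldots, x_{p-1}$ as
$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_{p-1} x_{i(p-1)} + e_i \tag{1.2}$$
or in matrix terms
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}. \tag{1.3}$$
In (1.3),
$$
\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
\mathbf{X} = \begin{pmatrix}
1 & x_{11} & \cdots & x_{1(p-1)} \\
1 & x_{21} & \cdots & x_{2(p-1)} \\
\vdots & \vdots & & \vdots \\
1 & x_{n1} & \cdots & x_{n(p-1)}
\end{pmatrix}, \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix}, \quad
\mathbf{e} = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}.
$$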
1.3
Estimation
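$$\mathbf{e}'\mathbf{e} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}). \tag{1.4}$$
Minimizing (1.4) with respect to the parameters in $\boldsymbol{\beta}$ gives the normal equations
$$\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \mathbf{X}'\mathbf{y}. \tag{1.5}$$
If $\mathbf{X}'\mathbf{X}$ is nonsingular, the normal equations have the unique solution
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}. \tag{1.6}$$
If $\mathbf{X}'\mathbf{X}$ is singular, the solution may not be unique. We can use generalized inverses (see Appendix
A) and find a solution as
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}. \tag{1.7}$$
Alternatively we can restrict the number of parameters in the model by introducing constraints that lead to a nonsingular $\mathbf{X}'\mathbf{X}$.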
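The normal equations can be solved numerically in a few lines. The following is a minimal sketch in plain Python, not the procedure used by SAS or Minitab; the helper names (`transpose`, `matmul`, `solve`, `least_squares`) and the small data set are invented for illustration. It assumes that X'X is nonsingular and raises an error otherwise.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, b):
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular; the solution is not unique")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def least_squares(X, y):
    """Solve the normal equations X'X b = X'y for b."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(x * v for x, v in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


# Hypothetical data that lie exactly on the line y = 1 + 2x.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
beta_hat = least_squares(X, y)
print(beta_hat)
```

With a design matrix whose columns are linearly dependent, `solve` raises a `ValueError`, which corresponds to the case where a generalized inverse or a constraint on the parameters is needed.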
1.4
1.4.1
When the parameters of a general linear model have been estimated you may
want to assess how well the model fits the data. This is done by subdividing
the variation in the data into two parts: systematic variation and unexplained
variation. Formally, this is done as follows.
We define the predicted value (or fitted value) of the response variable as
$$\hat{y}_i = \sum_{j=0}^{p-1} \hat{\beta}_j x_{ij} \tag{1.8}$$
or in matrix terms
$$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}. \tag{1.9}$$
The predicted values are the values that we would get on the dependent
variable if the model had been perfect, i.e. if all residuals had been zero. The
difference between the observed value and the predicted value is the observed
residual:
$$\hat{e}_i = y_i - \hat{y}_i. \tag{1.10}$$
1.4.2
The total variation in the data can be measured as the total sum of squares,
$$SS_T = \sum_i (y_i - \bar{y})^2.$$
This can be subdivided as
$$\sum_i (y_i - \bar{y})^2 = \sum_i (y_i - \hat{y}_i)^2 + \sum_i (\hat{y}_i - \bar{y})^2 + 2 \sum_i (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}). \tag{1.11}$$
The last term can be shown to be zero. Thus, the total sum of squares $SS_T$
can be subdivided into two parts:
$$SS_{Model} = \sum_i (\hat{y}_i - \bar{y})^2$$
and
$$SS_e = \sum_i (y_i - \hat{y}_i)^2.$$
SSe , called the residual (or error) sum of squares, will be small if the model
fits the data well.
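That the cross-product term vanishes follows from the normal equations. The sketch below assumes that the model contains an intercept, so that the first column of $\mathbf{X}$ consists of ones; the symbols $\hat{\mathbf{e}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ (the vector of observed residuals) and $\mathbf{1}$ (a vector of ones) are notation introduced here for illustration.

```latex
\begin{align*}
\mathbf{X}'\hat{\mathbf{e}}
  &= \mathbf{X}'\mathbf{y} - \mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{0}
  && \text{(normal equations)} \\
\mathbf{1}'\hat{\mathbf{e}}
  &= 0
  && \text{(since } \mathbf{1} \text{ is a column of } \mathbf{X}\text{)} \\
\sum_i (y_i - \hat{y}_i)(\hat{y}_i - \bar{y})
  &= \hat{\mathbf{e}}'\mathbf{X}\hat{\boldsymbol{\beta}} - \bar{y}\,\mathbf{1}'\hat{\mathbf{e}}
   = 0 - 0 = 0 .
\end{align*}
```

Without an intercept in the model the second line need not hold, and the decomposition around $\bar{y}$ is then not guaranteed.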
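The sums of squares can also be written in matrix terms. It holds that
$$SS_T = \sum_i (y_i - \bar{y})^2 = \mathbf{y}'\mathbf{y} - n\bar{y}^2 \quad \text{with } n-1 \text{ degrees of freedom (df)},$$
$$SS_{Model} = \sum_i (\hat{y}_i - \bar{y})^2 = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - n\bar{y}^2 \quad \text{with } p-1 \text{ df},$$
$$SS_e = \sum_i (y_i - \hat{y}_i)^2 = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} \quad \text{with } n-p \text{ df}.$$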
The subdivision of the total variation (the total sum of squares) into parts is
often summarized as an analysis of variance table:

Source      SS          df       MS = SS/df
Model       SSModel     p − 1    MSModel
Residual    SSe         n − p    MSe = $\hat{\sigma}^2$
Total       SST         n − 1

The coefficient of determination is
$$R^2 = \frac{SS_{Model}}{SS_T} = 1 - \frac{SS_e}{SS_T}. \tag{1.12}$$
Since $R^2$ cannot decrease when new terms are
added to the model, model comparisons are often based on the adjusted $R^2$.
The adjusted $R^2$ decreases when irrelevant terms are added to the model. It
is defined as
$$R^2_{adj} = 1 - \frac{n-1}{n-p}\left(1 - R^2\right) = 1 - \frac{MS_e}{SS_T/(n-1)}. \tag{1.13}$$
A formal test of the full model (i.e. a test of the hypothesis that $\beta_1, \ldots, \beta_{p-1}$
are all zero) can be obtained as
$$F = \frac{MS_{Model}}{MS_e}. \tag{1.14}$$
1.5
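The parameter estimators are linear functions of the observations:
$$\hat{\beta}_j = \sum_i w_{ij} y_i \tag{1.15}$$
where $w_{ij}$ are known weights. If we assume that all $y_i$ have the same
variance $\sigma^2$, this makes it possible to obtain the variance of any parameter
estimator as
$$Var(\hat{\beta}_j) = \sum_i w_{ij}^2 \sigma^2. \tag{1.16}$$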
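The variance $\sigma^2$ can be estimated as
$$\hat{\sigma}^2 = \frac{\sum_i \hat{e}_i^2}{n-p}, \tag{1.17}$$
which gives the estimated variance
$$\widehat{Var}(\hat{\beta}_j) = \sum_i w_{ij}^2 \hat{\sigma}^2. \tag{1.18}$$
A test of the hypothesis that $\beta_j$ is zero can be based on
$$t = \frac{\hat{\beta}_j}{\sqrt{\widehat{Var}(\hat{\beta}_j)}}, \tag{1.19}$$
and
$$\hat{\beta}_j \pm t_{(1-\alpha/2,\, n-p)} \sqrt{\widehat{Var}(\hat{\beta}_j)} \tag{1.20}$$
would provide a $(1-\alpha) \cdot 100\%$ confidence interval for the parameter $\beta_j$.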
1.6
$$F = \frac{(SS_{e2} - SS_{e1})/q}{SS_{e1}/(n-p)} \tag{1.21}$$
1.7
independent. In other cases there are several ways to test hypotheses. SAS
handles this problem by allowing the user to select among four different types
of tests.
* Type 1 means that the test for each parameter is calculated as the change
in SSe when the parameter is added to the model, in the order given in the
MODEL statement. If we have the model Y = A B A*B, SS(A) is calculated
first as if the experiment had been a one-factor experiment (model: Y = A).
Then SS(B|A) is calculated as the reduction in SSe when we run the model
Y = A B, and finally the interaction SS(AB|A, B) is obtained as the reduction
in SSe when we also add the interaction to the model. Type 1 SS are
sometimes called sequential sums of squares.
* Type 2 means that the SS for each parameter is calculated as if the factor
had been added last to the model except that, for interactions, all
main effects that are part of the interaction should also be included. For
the model Y = A B A*B this gives the SS as SS(A|B), SS(B|A) and
SS(AB|A, B).
* Type 3 is, loosely speaking, an attempt to calculate what the SS would
have been if the experiment had been balanced. These are often called
partial sums of squares. These SS cannot in general be computed by
comparing model SS from several models. The Type 3 SS are generally
preferred when experiments are unbalanced. One problem with them is
that the sum of the SS for all factors and interactions is generally not the
same as the Total SS. Minitab gives the Type 3 SS as "Adjusted Sum
of Squares".
* Type 4 differs from Type 3 in the method of handling empty cells, i.e.
incomplete experiments.
If the experiment is balanced, all these SS will be equal. In practice, tests in
unbalanced situations are often done using Type 3 SS (or "Adjusted Sum
of Squares" in Minitab). Unfortunately, this is not an infallible method.
1.8 Some applications
1.8.1 Simple linear regression
In regression analysis, the design matrix X often contains one column that
only contains ones (corresponding to the intercept), while the remaining
columns contain the values of the independent variables. Thus, the small regression model $y_i = \beta_0 + \beta_1 x_i + e_i$ with n = 4 observations can be written
in matrix form as
$$
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} =
\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ 1 & x_4 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} +
\begin{pmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \end{pmatrix}.
\tag{1.22}
$$
Example 1.1 An experiment has been made to study the emission of CO2
from the root zone of barley (Zagal et al, 1993). The emission of CO2 was
measured on a number of plants at different times after planting. A small
part of the data is given in the following table and graph:
Emission of CO2 as a function of time

Time    Emission
24      11.069
24      15.255
30      26.765
30      28.200
35      34.730
35      35.830
38      41.677
38      45.351

[Figure: Emission plotted against Time with the fitted line Y = -36.7443 + 2.09776X; R-Sq = 97.5 %]
One purpose of the experiment was to describe how y = CO2 emission develops
over time. The graph suggests that a linear trend may provide a reasonable
approximation to the data, over the time span covered by the experiment.
The linear function fitted to these data is $\hat{y} = -36.7 + 2.1x$. A SAS regression
output, including ANOVA table, is given below. It can be concluded that the
emission of CO2 increases significantly with time, the rate of increase being
about 2.1 units per time unit.
                             Sum of           Mean
Source             DF       Squares         Square    F Value    Pr > F
Model               1   992.3361798    992.3361798     234.63    0.0001
Error               6    25.3765201      4.2294200
Corrected Total     7  1017.7126999

R-Square        C.V.      Root MSE    EMISSION Mean
0.975065    6.887412      2.056555         29.85963

                                T for H0:                Std Error of
Parameter        Estimate     Parameter=0    Pr > |T|        Estimate
INTERCEPT    -36.74430710           -8.33      0.0002      4.40858691
TIME           2.09776164           15.32      0.0001      0.13695161
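The estimates and sums of squares in this output can be checked directly from the eight observations. The sketch below uses the closed-form expressions for simple linear regression (slope = Sxy/Sxx, intercept = ȳ − slope·x̄) rather than the matrix formulation; the variable names are chosen here for illustration.

```python
# Data from Example 1.1.
time = [24, 24, 30, 30, 35, 35, 38, 38]
emission = [11.069, 15.255, 26.765, 28.200, 34.730, 35.830, 41.677, 45.351]

n = len(time)
x_bar = sum(time) / n
y_bar = sum(emission) / n

# Corrected sums of squares and cross products.
sxx = sum((x - x_bar) ** 2 for x in time)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(time, emission))

b1 = sxy / sxx              # estimated slope
b0 = y_bar - b1 * x_bar     # estimated intercept

fitted = [b0 + b1 * x for x in time]

# Subdivision of the total sum of squares.
ss_total = sum((y - y_bar) ** 2 for y in emission)
ss_model = sum((f - y_bar) ** 2 for f in fitted)
ss_error = sum((y - f) ** 2 for y, f in zip(emission, fitted))

r_square = ss_model / ss_total
ms_error = ss_error / (n - 2)
f_value = ss_model / ms_error

print(f"intercept = {b0:.4f}, slope = {b1:.5f}")
print(f"SSModel = {ss_model:.4f}, SSe = {ss_error:.4f}, SST = {ss_total:.4f}")
print(f"R-square = {r_square:.6f}, F = {f_value:.2f}")
```

The values agree with the SAS output above to the printed precision, and SSModel + SSe reproduces SST.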
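1.8.2 Multiple regression
$$
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \end{pmatrix} =
\begin{pmatrix}
1 & x_{11} & x_{12} \\
1 & x_{21} & x_{22} \\
1 & x_{31} & x_{32} \\
1 & x_{41} & x_{42} \\
1 & x_{51} & x_{52} \\
1 & x_{61} & x_{62}
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix} +
\begin{pmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \\ e_5 \\ e_6 \end{pmatrix}.
\tag{1.23}
$$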
Example 1.2 Professor Orley Ashenfelter issues a wine magazine, "Liquid
assets", giving advice about good years. He bases his advice on multiple
regression of
y = Price of the wine at wine auctions
with meteorological data as predictors. The New York Times used the headline "Wine Equation Puts Some Noses Out of Joint" on an article about
Prof. Ashenfelter. Base material was taken from "Departures" magazine,
September/October 1990, but the data are invented. The variables in the
data set below are:
Rain_W = Amount of rain during the winter.
Av_temp = Average temperature.
Rain_W    Av_temp    Rain_H    Quality
123       23         23        89
66        21         100       70
58        20         27        77
109       26         33        87
46        22         102       73
40        19         77        70
42        18         85        60
167       25         14        92
99        28         17        87
48        24         47        79
85        24         28        84
177       27         11        93
80        22         45        75
64        25         40        82
75        25         16        88
Predictor        Coef      StDev        T        P
Constant        48.91      10.41     4.70    0.001
Rain_W        0.05937    0.02767     2.15    0.055
Av_temp        1.3603     0.4187     3.25    0.008
Rain_H       -0.11773    0.04010    -2.94    0.014

S = 3.092    R-Sq = 91.6%    R-Sq(adj) = 89.4%

Analysis of Variance
Source            DF         SS        MS        F        P
Regression         3    1152.43    384.14    40.18    0.000
Residual Error    11     105.17      9.56
Total             14    1257.60
The output indicates that the three predictor variables do indeed have a
relationship to the wine quality, as measured by the price. The variable
Rain_W is not quite significant but would be included in a predictive model.
The size and direction of this relationship is given by the estimated coefficients
of the regression equation. It appears that years with much winter rain, a
high average temperature, and only a small amount of rain at harvest time,
would produce good wine.
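The summary measures in the Minitab output follow from the sums of squares and degrees of freedom in the analysis of variance table. As a small illustration, the sketch below recomputes R-Sq, R-Sq(adj) and F using the definitions R² = SSModel/SST, adjusted R² = 1 − (n − 1)/(n − p)·(1 − R²) and F = MSModel/MSe; the variable names are chosen here for illustration.

```python
# Sums of squares and degrees of freedom from the ANOVA table of Example 1.2.
ss_model, df_model = 1152.43, 3
ss_error, df_error = 105.17, 11

ss_total = ss_model + ss_error
n = df_model + df_error + 1   # number of observations (15)
p = df_model + 1              # number of parameters, including the intercept (4)

r2 = ss_model / ss_total
r2_adj = 1 - (n - 1) / (n - p) * (1 - r2)

ms_model = ss_model / df_model
ms_error = ss_error / df_error
f_value = ms_model / ms_error

print(f"R-Sq = {100 * r2:.1f}%, R-Sq(adj) = {100 * r2_adj:.1f}%, F = {f_value:.2f}")
```

The adjusted value is lower than R-Sq because it penalizes the three parameters used in relation to the 15 observations.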
1.8.3
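$$
\begin{pmatrix} y_{11} \\ y_{12} \\ y_{13} \\ y_{21} \\ y_{22} \\ y_{23} \end{pmatrix} =
\begin{pmatrix} 1 & 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & 0 \\ 1 & 0 \\ 1 & 0 \end{pmatrix}
\begin{pmatrix} \mu \\ \beta \end{pmatrix} +
\begin{pmatrix} e_{11} \\ e_{12} \\ e_{13} \\ e_{21} \\ e_{22} \\ e_{23} \end{pmatrix}.
\tag{1.24}
$$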
Example 1.3 In a pharmacological study (Rea et al, 1984), researchers measured the concentration of Dopamine in the brains of six control rats and of
six rats that had been exposed to toluene. The concentrations in the striatum
region of the brain are given in Table 1.2.
The interest lies in comparing the two groups with respect to average Dopamine level. This is often done as a two sample t test. To illustrate that
the t test is actually a special case of a general linear model, we analyzed
these data with Minitab using regression analysis with Group as a dummy
variable. Rats in the toluene group were given the value 1 on the dummy
variable, while rats in the control group were coded as 0. The Minitab output
of the regression analysis is:
Table 1.2: Dopamine levels in the brains of rats under two treatments.

Dopamine, ng/kg
Toluene group    Control group
3.420            1.820
2.314            1.843
1.911            1.397
2.464            1.803
2.781            2.539
2.803            1.990
Regression Analysis

The regression equation is
Dopamine level = 1.90 + 0.717 Group

Predictor       Coef     StDev        T        P
Constant      1.8987    0.1830    10.38    0.000
Group         0.7168    0.2587     2.77    0.020

S = 0.4482    R-Sq = 43.4%    R-Sq(adj) = 37.8%

Analysis of Variance
Source            DF        SS        MS       F        P
Regression         1    1.5416    1.5416    7.68    0.020
Residual Error    10    2.0084    0.2008
Total             11    3.5500
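The equivalence between this regression and the two-sample t test can be verified by hand. In the sketch below, the intercept is the control group mean, the coefficient for Group is the difference between the group means, and the t statistic is the usual pooled-variance two-sample t; the variable names are chosen here for illustration.

```python
from math import sqrt

# Data from Table 1.2.
toluene = [3.420, 2.314, 1.911, 2.464, 2.781, 2.803]
control = [1.820, 1.843, 1.397, 1.803, 2.539, 1.990]


def mean(values):
    return sum(values) / len(values)


mean_toluene = mean(toluene)
mean_control = mean(control)

# With the dummy variable coded 1 for toluene and 0 for control,
# the least squares estimates are simple functions of the group means.
intercept = mean_control
slope = mean_toluene - mean_control

# Residual sum of squares and pooled variance estimate (MSe).
ss_error = (sum((v - mean_toluene) ** 2 for v in toluene)
            + sum((v - mean_control) ** 2 for v in control))
df_error = len(toluene) + len(control) - 2
ms_error = ss_error / df_error

# Standard error of the difference and the two-sample t statistic.
se_slope = sqrt(ms_error * (1 / len(toluene) + 1 / len(control)))
t_value = slope / se_slope

print(f"Constant = {intercept:.4f}, Group = {slope:.4f}")
print(f"StDev = {se_slope:.4f}, T = {t_value:.2f}, F = {t_value ** 2:.2f}")
```

The square of the t statistic equals the F value in the analysis of variance table, up to rounding in the printed output.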
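1.8.4 One-way ANOVA
The generalization of models of type (1.24) to more than two groups is rather
straightforward; we would need one more column in X (one new dummy variable) for each new group. This leads to a simple one-way analysis of variance
(ANOVA) model. Thus, a one-way ANOVA model with three treatments,
each with two observations, can be written as
$$y_{ij} = \mu + \beta_i + e_{ij}, \quad i = 1, 2, 3, \quad j = 1, 2. \tag{1.25}$$
We can introduce three dummy variables $d_i$, defined as
$$d_i = \begin{cases} 1 & \text{for group } i \\ 0 & \text{otherwise} \end{cases}.$$
The model can now be written as
$$
\begin{aligned}
y_{ij} &= \mu + \beta_1 d_1 + \beta_2 d_2 + \beta_3 d_3 + e_{ij} \\
       &= \mu + \beta_i d_i + e_{ij}, \quad i = 1, 2, 3, \quad j = 1, 2.
\end{aligned}
\tag{1.26}
$$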
Note that the third dummy variable $d_3$ is not needed. If we know the values
of $d_1$ and $d_2$ the group membership is known so $d_3$ is redundant and can
be removed from the model. In fact, any combination of two of the dummy
variables is sufficient for identifying group membership so the choice to delete
one of them is to some extent arbitrary. After removing $d_3$, the model can
be written in matrix terms as
$$
\begin{pmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \\ y_{31} \\ y_{32} \end{pmatrix} =
\begin{pmatrix}
1 & 1 & 0 \\
1 & 1 & 0 \\
1 & 0 & 1 \\
1 & 0 & 1 \\
1 & 0 & 0 \\
1 & 0 & 0
\end{pmatrix}
\begin{pmatrix} \mu \\ \beta_1 \\ \beta_2 \end{pmatrix} +
\begin{pmatrix} e_{11} \\ e_{12} \\ e_{21} \\ e_{22} \\ e_{31} \\ e_{32} \end{pmatrix}.
\tag{1.27}
$$
Although there are three treatments we have only included two dummy variables for the treatments, i.e. we have chosen the restriction $\beta_3 = 0$.
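The dummy coding used in (1.27) is easy to generate from a list of group labels. The following sketch builds a design matrix with an intercept column and one dummy variable for every group except the last one; the function name `dummy_design` is hypothetical, and dropping the last group is only one of several possible restrictions.

```python
def dummy_design(groups):
    """Build design matrix rows [1, d_1, ..., d_(k-1)] from group labels.

    Levels are taken in order of first appearance; the last level gets
    no dummy variable, i.e. its effect is restricted to zero.
    """
    levels = list(dict.fromkeys(groups))   # unique labels, original order
    kept = levels[:-1]                     # drop the last level
    return [[1] + [1 if g == level else 0 for level in kept] for g in groups]


# Three treatments with two observations each, as in (1.27).
groups = [1, 1, 2, 2, 3, 3]
X = dummy_design(groups)
for row in X:
    print(row)
```

Each row of the result corresponds to one observation, and the rows for the third group contain only the intercept.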
Follow-up analyses
One of the results from a one-way ANOVA is an over-all F test of the hypothesis that all group (treatment) means are equal. If this test is significant,
it can be followed up by various types of comparisons between the groups.
Since the ANOVA provides an estimator $\hat{\sigma}_e^2 = MS_e$ of the residual variance
$\sigma_e^2$, this estimator should be used in such group comparisons if the assumption
of equal variance seems tenable.
A pairwise comparison between two group means, i.e. a test of the hypothesis
that two groups have equal mean values, can be obtained as
$$t = \frac{\bar{y}_i - \bar{y}_{i'}}{\sqrt{MS_e \left( \frac{1}{n_i} + \frac{1}{n_{i'}} \right)}}$$
with degrees of freedom taken from $MS_e$. A confidence interval for the difference between the mean values can be obtained analogously.
In some cases it may be of interest to do comparisons which are not simple
pairwise comparisons. For example, we may want to compare treatment 1
with the average of treatments 2, 3 and 4. We can then define a contrast
in the treatment means as $L = \mu_1 - \frac{\mu_2 + \mu_3 + \mu_4}{3}$. A general way to write a
contrast is
$$L = \sum_i h_i \mu_i, \tag{1.28}$$
where $\sum_i h_i = 0$. The contrast is estimated as
$$\hat{L} = \sum_i h_i \bar{y}_i, \tag{1.29}$$
and the estimated variance of $\hat{L}$ is
$$\widehat{Var}(\hat{L}) = MS_e \sum_i \frac{h_i^2}{n_i}. \tag{1.30}$$
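As an illustration of (1.28) to (1.30), the sketch below estimates the contrast between treatment 1 and the average of treatments 2, 3 and 4. The group means, group sizes and MSe are invented numbers, and the function name `contrast` is hypothetical.

```python
from math import sqrt


def contrast(h, means, sizes, ms_error):
    """Return the estimate of a contrast and its estimated variance.

    h        -- contrast coefficients, which must sum to zero
    means    -- observed group means
    sizes    -- number of observations in each group
    ms_error -- residual mean square from the ANOVA
    """
    if abs(sum(h)) > 1e-12:
        raise ValueError("the coefficients of a contrast must sum to zero")
    estimate = sum(hi * m for hi, m in zip(h, means))
    variance = ms_error * sum(hi ** 2 / n for hi, n in zip(h, sizes))
    return estimate, variance


# Hypothetical example: four treatments with five observations each.
h = [1, -1 / 3, -1 / 3, -1 / 3]
means = [10.0, 6.0, 7.0, 8.0]
sizes = [5, 5, 5, 5]
ms_error = 4.0

estimate, variance = contrast(h, means, sizes, ms_error)
t_value = estimate / sqrt(variance)
print(f"L = {estimate:.3f}, Var(L) = {variance:.4f}, t = {t_value:.2f}")
```

The ratio of the estimate to its standard error can be compared with a t distribution with the degrees of freedom of MSe, in the same way as for a pairwise comparison.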
Table 1.3: Change in urine production following treatment with different contrast
media (n = 57).

Medium         Diff     Medium       Diff     Medium       Diff
Diatrizoate    32.92    Isovist      2.44     Ringer       0.10
Diatrizoate    25.85    Isovist      0.87     Ringer       0.40
Diatrizoate    20.75    Isovist      0.22     Mannitol     9.19
Diatrizoate    20.38    Isovist      1.52     Mannitol     0.79
Diatrizoate    7.06     Omnipaque    8.51     Mannitol     10.22
Hexabrix       6.47     Omnipaque    16.11    Mannitol     4.78
Hexabrix       5.63     Omnipaque    7.22     Mannitol     14.64
Hexabrix       3.08     Omnipaque    9.03     Mannitol     6.98
Hexabrix       0.96     Omnipaque    10.11    Mannitol     7.51
Hexabrix       2.37     Omnipaque    6.77     Mannitol     9.55
Hexabrix       7.00     Omnipaque    1.16     Mannitol     5.53
Hexabrix       4.88     Omnipaque    16.11    Ultravist    12.94
Hexabrix       1.11     Omnipaque    3.99     Ultravist    7.30
Hexabrix       4.14     Omnipaque    4.90     Ultravist    15.35
Isovist        2.10     Ringer       0.07     Ultravist    6.58
Isovist        0.77     Ringer       0.03     Ultravist    15.68
Isovist        0.04     Ringer       0.34     Ultravist    3.48
Isovist        4.80     Ringer       0.08     Ultravist    5.75
Isovist        2.74     Ringer       0.51     Ultravist    12.18
methods for deciding which limit to use. A simple but reasonably powerful
method is to use Bonferroni adjustment. This means that each individual
test is made at the significance level $\alpha/c$, where $\alpha$ is the desired over-all level
and $c$ is the number of comparisons you want to make.
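A short sketch of the Bonferroni calculation is given below. It assumes that all pairwise comparisons among k groups are of interest, so that c = k(k − 1)/2; the choice k = 7 and the over-all level 0.05 are illustrative, and the function name is hypothetical.

```python
from math import comb


def bonferroni_level(alpha, n_comparisons):
    """Significance level for each individual test under Bonferroni adjustment."""
    return alpha / n_comparisons


k = 7                      # number of groups (illustrative)
c = comb(k, 2)             # number of pairwise comparisons among k groups
level = bonferroni_level(0.05, c)
print(f"{c} comparisons, each tested at level {level:.5f}")
```

With many groups the individual level becomes small quickly, which is the price paid for controlling the over-all level.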
Example 1.4 Liss et al (1996) studied the effects of seven contrast media
(used in X-ray investigations) on different physiological functions of 57 rats.
One variable that was studied was the urine production. Table 1.3 shows the
change in urine production of each rat before and after treatment with each
medium. It is of interest to compare the contrast media with respect to the
change in urine production.
This analysis is a one-way ANOVA situation. The procedure GLM in SAS
produced the following result:
Dependent Variable: DIFF
                             Sum of            Mean
Source             DF       Squares          Square    F Value    Pr > F
Model               6  1787.9722541     297.9953757      16.46    0.0001
Error              50   905.1155428      18.1023109
Corrected Total    56  2693.0877969

R-Square        C.V.      Root MSE    DIFF Mean
0.663912    61.95963     4.2546811    6.8668596

Source     DF     Type III SS     Mean Square    F Value    Pr > F
MEDIUM      6    1787.9722541     297.9953757      16.46    0.0001
There are clearly significant differences between the media (p < 0.0001). To
find out more about the nature of these differences we requested Proc GLM
to print estimates of the parameters, i.e. estimates of the coefficients $\beta_i$ for
each of the dummy variables. The following results were obtained:
                                         T for H0:                Std Error of
Parameter                  Estimate    Parameter=0    Pr > |T|        Estimate
INTERCEPT                9.90787500 B         6.59      0.0001      1.50425691
MEDIUM  Diatrizoate     11.48412500 B         4.73      0.0001      2.42554139
        Hexabrix        -5.94731944 B        -2.88      0.0059      2.06740338
        Isovist         -8.24365278 B        -3.99      0.0002      2.06740338
        Mannitol        -2.21920833 B        -1.07      0.2882      2.06740338
        Omnipaque       -1.51817500 B        -0.75      0.4554      2.01817243
        Ringer          -9.69787500 B        -4.40      0.0001      2.20200665
        Ultravist        0.00000000 B          .         .           .

NOTE: The X'X matrix has been found to be singular and a generalized inverse
was used to solve the normal equations. Estimates followed by the
letter 'B' are biased, and are not unique estimators of the parameters.
Table 1.4: Least squares means, and pairwise comparisons between treatments, for
the contrast media experiment.

               Mean     Diatrizoate  Ultravist  Omnipaque  Mannitol  Hexabrix  Isovist
Diatrizoate    21.39
Ultravist       9.91    *
Omnipaque       8.39    *            n.s.
Mannitol        7.69    *            n.s.       n.s.
Hexabrix        3.96    *            n.s.       n.s.       n.s.
Isovist         1.66    *            *          *          n.s.      n.s.
Ringer          0.21    *            *          *          *         n.s.      n.s.
media. This boxplot is given in Figure 1.1. The plot indicates that the
variation is quite different for the different media, with a large variation for
Diatrizoate and a small variation for Ringer (which is actually a placebo).
This suggests that one assumption underlying the analysis, the assumption
of equal variance, may be violated. We will return to these data later to see
if we can make a better analysis.
1.8.5
The ideas used above can be extended to factorial experiments that include
more than one factor and possible interactions. The dummy variables that
correspond to the interaction terms would then be constructed by multiplying
the corresponding main effect dummy variables with each other.
This feature can be illustrated by considering a factorial experiment with
factor A (two levels) and factor B (three levels), and where we have two
observations for each factor combination. The model is
$$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + e_{ijk}, \quad i = 1, 2, \quad j = 1, 2, 3, \quad k = 1, 2. \tag{1.31}$$
The number of dummy variables that we have included for each factor is
equal to the number of factor levels minus one, i.e. the last dummy variable
for each factor has been excluded. The number of non-redundant dummy
variables equals the number of degrees of freedom for the effect. In matrix
terms,
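$$
\begin{pmatrix}
y_{111} \\ y_{112} \\ y_{121} \\ y_{122} \\ y_{131} \\ y_{132} \\
y_{211} \\ y_{212} \\ y_{221} \\ y_{222} \\ y_{231} \\ y_{232}
\end{pmatrix} =
\begin{pmatrix}
1 & 1 & 1 & 0 & 1 & 0 \\
1 & 1 & 1 & 0 & 1 & 0 \\
1 & 1 & 0 & 1 & 0 & 1 \\
1 & 1 & 0 & 1 & 0 & 1 \\
1 & 1 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
\begin{pmatrix}
\mu \\ \alpha_1 \\ \beta_1 \\ \beta_2 \\ (\alpha\beta)_{11} \\ (\alpha\beta)_{12}
\end{pmatrix} +
\begin{pmatrix}
e_{111} \\ e_{112} \\ e_{121} \\ e_{122} \\ e_{131} \\ e_{132} \\
e_{211} \\ e_{212} \\ e_{221} \\ e_{222} \\ e_{231} \\ e_{232}
\end{pmatrix}.
\tag{1.32}
$$

[Figure 1.1: Boxplot of change in urine production (Diff) for different contrast media (Medium).]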
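The construction of the interaction columns in (1.32), as products of main effect dummy variables, can be sketched as follows. The function name `factorial_design` and its arguments are hypothetical; the last level of each factor is excluded, as in the text.

```python
def factorial_design(levels_a, levels_b, replicates):
    """Design matrix for a two-factor model with interaction.

    Columns: intercept, dummies for A (last level dropped), dummies for B
    (last level dropped), and products of every A dummy with every B dummy.
    """
    rows = []
    for i in range(1, levels_a + 1):
        for j in range(1, levels_b + 1):
            a = [1 if i == level else 0 for level in range(1, levels_a)]
            b = [1 if j == level else 0 for level in range(1, levels_b)]
            ab = [da * db for da in a for db in b]   # interaction dummies
            for _ in range(replicates):
                rows.append([1] + a + b + ab)
    return rows


# Factor A with two levels, factor B with three levels, two observations per cell.
X = factorial_design(2, 3, 2)
for row in X:
    print(row)
```

The matrix has 1 + 1 + 2 + 2 = 6 columns, matching the degrees of freedom for the intercept, A, B and the A × B interaction.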
Example 1.5 Lindahl et al (1999) studied certain reactions of fungal mycelia
on pieces of wood by using radioactively labeled $^{32}$P. In one of the
experiments, two species of fungus (Paxillus involutus and Suillus variegatus)
were used, along with two sizes of wood pieces (Large and Small); the response
was a certain chemical measurement denoted by C. The data are reproduced
in Table 1.5.
These data were analyzed as a factorial experiment with two factors. Part of
the Minitab output was:

Table 1.5

Species    Size     C         Species    Size     C
H          Large    0.0010    S          Large    0.0021
H          Large    0.0011    S          Large    0.0001
H          Large    0.0017    S          Large    0.0016
H          Large    0.0008    S          Large    0.0046
H          Large    0.0010    S          Large    0.0035
H          Large    0.0028    S          Large    0.0065
H          Large    0.0003    S          Large    0.0073
H          Large    0.0013    S          Large    0.0039
H          Small    0.0061    S          Small    0.0007
H          Small    0.0010    S          Small    0.0011
H          Small    0.0020    S          Small    0.0019
H          Small    0.0018    S          Small    0.0022
H          Small    0.0033    S          Small    0.0011
H          Small    0.0015    S          Small    0.0012
H          Small    0.0040    S          Small    0.0009
H          Small    0.0041    S          Small    0.0040

Source          DF       Seq SS       Adj SS       Adj MS        F        P
Species          1    0.0000025    0.0000025    0.0000025     0.93    0.342
Size             1    0.0000002    0.0000002    0.0000002     0.09    0.772
Species*Size     1    0.0000287    0.0000287    0.0000287    10.82    0.003
Error           28    0.0000742    0.0000742    0.0000027
Total           31    0.0001056
The main conclusion from this analysis is that the interaction Species ×
Size is highly significant. This means that the effect of Size is different for
different species. In such cases, interpretation of the main effects is not
very meaningful. As a tool for interpreting the interaction effect, a so-called
interaction plot can be prepared. Such a plot for these data is given in
Figure 1.2. The mean value of the response for species S is higher for large
wood pieces than for small wood pieces. For species H the opposite is true:
the mean value is larger for small wood pieces. This is an example of an
interaction.
[Figure 1.2: Interaction plot of the mean response (Mean) against Size (Large, Small) for species H and S.]
1.8.6 Analysis of covariance
In regression analysis models the design matrix X contains quantitative variables. In ANOVA models, the design matrix only contains dummy variables
corresponding to treatments, design structure and possible interactions. It
is quite possible to include a mixture of quantitative variables and dummy
variables in the design matrix. Such models are called covariance analysis,
or ANCOVA, models.
Let us look at a simple case where there are two groups and one covariate.
Several different models can be considered for the analysis of such data even
in the simple case where we assume that all relationships are linear:
1. There is no relationship between x and y in any of the groups and the
groups have the same mean value.
2. There is a relationship between x and y; the relationship is the same in
the groups.
3. There is no relationship between x and y but the groups have different
levels.
4. There is a relationship between x and y; the lines are parallel but at
different levels.
Model 5 is the most general of the models, allowing for different intercepts
($\mu + \alpha_i$) and different slopes $\beta + \gamma d_i$, where $d$ is a dummy variable indicating
group membership. If it can be assumed that the term $\gamma d_i$ is zero for all $i$,
then we are back at model 4. If, in addition, all $\alpha_i$ are zero, then model 2 is
correct. If, on the other hand, $\beta$ is zero, we would use model 3. If finally $\beta$
is zero in model 2, then model 1 describes the situation. This is an example
of a set of models where some of the models are nested within other models.
The model choice can be made by comparing any model to a simpler model
which only differs in terms of one factor.
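The nesting of the five covariance analysis models can be made concrete by writing out one row of the design matrix for each model. The following is a minimal Python sketch and is not taken from the text; the function name `design_row` and the column order (intercept, group dummy d, covariate x, product d*x) are illustrative assumptions.

```python
def design_row(model, x, d):
    """Return one design-matrix row for covariance analysis models 1-5.

    x is the covariate value and d is a 0/1 dummy for group membership.
    The column order is an illustrative choice, not prescribed by the text.
    """
    rows = {
        1: [1.0],                # common mean only
        2: [1.0, x],             # one regression line for both groups
        3: [1.0, d],             # different levels, no slope
        4: [1.0, d, x],          # parallel lines at different levels
        5: [1.0, d, x, d * x],   # different intercepts and different slopes
    }
    return rows[model]


# Each simpler model is obtained by dropping columns from a more general one.
for model in range(1, 6):
    print(model, design_row(model, x=12.0, d=1))
```

Dropping the last column of model 5 gives model 4, dropping the dummy column of model 4 gives model 2, and dropping the covariate column of model 4 gives model 3, which mirrors the comparisons described above.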
1.8.7 Non-linear models
$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 e^{x_i} + e_i$   (1.33)
Such models are simple to analyze using general linear models. Formally,
each transformation of x is treated as a new variable. Thus, if we denote
$u_i = x_i^2$ and $v_i = e^{x_i}$ then the model (1.33) can be written as
$y_i = \beta_0 + \beta_1 x_i + \beta_2 u_i + \beta_3 v_i + e_i$   (1.34)
1.9 Estimability
(1.35)
(1.36)
Let us denote with $\mu_{ij}$ the mean value for the treatment combination that
has factor A at level i and factor B at level j. It holds that
$\mu_{ij} = E(y_{ijk}) = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$   (1.37)
$\frac{\mu_{11} + \mu_{12} + \mu_{13}}{3}$.
This is a linear function of cell means. Since the cell means are estimable,
this function is also estimable.
1.10
The classical application of general linear models rests on the following set
of assumptions:
• The model used for the analysis is assumed to be correct.
• The residuals are assumed to be independent.
• The residuals are assumed to follow a Normal distribution.
• The residuals are assumed to have the same variance $\sigma_e^2$, independent of X,
i.e. the residuals are homoscedastic.
Different diagnostic tools have been developed to detect departures from
these assumptions. Since similar tools are used for generalized linear models,
reference is made to Chapter 3 for details.
1.11 Model building
1.11.1
There are many options for fitting general linear models to data. One option
is to use a regression package and leave it to the user to construct appropriate
dummy variables for class variables. However, most statistical packages have
routines for general linear models that automatically construct the appropriate set of dummy variables.
Let us use letters at the end of the alphabet (X, Y , Z) to denote numeric
variables. Y will be used for the dependent variable. Letters in the beginning
of the alphabet (A, B) will symbolize class variables (groups, treatments,
blocks, etc.)
Computer software requires the user to state the model in symbolic terms.
The model statement contains operators that specify different aspects of the
model. In the following table we list the operators used by SAS. Examples
of the use of the operators are given below.
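Operator   Meaning in a SAS model statement
*          Interaction (crossed effect), e.g. A*B
(none)     Effects listed side by side enter as main effects, e.g. A B
|          Crossing with all lower-order terms: A|B expands to A B A*B
()         Nested effect, e.g. B(A)
@          Limits the order of the interactions generated by |, e.g. A|B|C@2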
The kinds of models that we have discussed in this chapter can symbolically
be written in SAS language as indicated in the following table.
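Model                              SAS model statement
Simple linear regression           MODEL Y = X;
Multiple regression                MODEL Y = X Z;
t tests, oneway ANOVA              MODEL Y = A;
Two-way ANOVA with interaction     MODEL Y = A B A*B;
Covariance analysis model 1        MODEL Y = ;
Covariance analysis model 2        MODEL Y = X;
Covariance analysis model 3        MODEL Y = A;
Covariance analysis model 4        MODEL Y = A X;
Covariance analysis model 5        MODEL Y = A X A*X;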
1.11.2
1.11.3
The difference between the two programs is that in the t test, the independent
variable (group) is given as a CLASS variable. This asks SAS to build
appropriate dummy variables.
1.12 Exercises
Leucine (ng)
0.02
0.25
0.54
0.69
1.07
1.50
1.74
b
8.3
3.8
3.9
7.8
9.1
15.4
7.7
6.5
5.7
13.6
c
10.2
9.2
9.6
53.8
15.8
A. Make an analysis of these data that can answer the question whether there
are any differences in cortisol level between the groups. A complete solution
should contain hypotheses, calculations, test statistic, and a conclusion. A
Low
A. Analyze the data in a way that treats Days from germination as a quantitative factor. Treat level of nitrogen as a dummy variable, and assume that
all regressions are linear.
i) Fit a model that assumes that the two regression lines are parallel.
ii) Fit a model that does not assume that the regression lines are parallel.
iii) Test the hypothesis that the regressions are parallel.
B. What is the expected rate of CO2 emission for a plant with a high level
of nitrogen, 35 days after germination? The same question for a plant with
a low level of nitrogen? Use the model you consider the best of the models
you have fitted under A. and B. above. Make the calculation by hand, using
the computer printouts of model equations.
C. Graph the data. Include both the observed data and the fitted Y values
in your graph.
D. According to your best analysis above, is there any significant effect of:
i) Interaction
ii) Level of nitrogen
iii) Days from germination
Exercise 1.4 Gowen and Price, quoted from Snedecor and Cochran (1980),
counted the number of lesions of Aucuba mosaic virus after exposure to X-rays for various times. The results were:
Exposure   Count
0          271
15         108
30         59
45         29
60         12
It was assumed that the Count (y) depends on the exposure time (x) through
an exponential relation of type $y = Ae^{Bx}$. A convenient way to estimate
the parameters of such a function is to make a linear regression of log(y) on
x.
A. Perform a linear regression of log(y) on x.
B. What assumptions are made regarding the residuals in your analysis in
A.?
C. Plot the data and the fitted function in the same graph.
2. Generalized Linear Models
2.1 Introduction
2.1.1 Types of response variables
This book is concerned with statistical models for data. In these models,
the concept of a response variable is crucial. In general linear models, the
response variable Y is often assumed to be quantitative and normally distributed. But this is by no means the only type of response variable that
we might meet in practice. Some examples of different types of response
variables are:
2.1.2 Continuous response
Models where the response variable is considered to be continuous are common in many application areas. In fact, since measurements cannot be made
to infinite precision, few response variables are truly continuous, but continuous models are still often used as approximations. Many response variables of
this type are modeled as general linear models, often assuming normality and
homoscedasticity. It is common for response variables to be restricted to positive values. Physical measurements in cm or kg are examples of this. Since
the Normal distribution is defined on $(-\infty, \infty)$, the normality assumption
cannot hold exactly for such data, and one has to resort to approximations.
We may illustrate the concept of continuous response using data of a type
often used in general linear models; other examples will be discussed in later
chapters.
Example 2.1 In the pharmacological study discussed in Example 1.3 the
concentration of Dopamine was measured in the brains of six control rats
and of six rats that had been exposed to toluene. The results were given on
page 13. In this example the response variable may be regarded as essentially
continuous.
2.1.3 Response as a binary variable
Acid    X-ray    Tumour   Tumour   Nodal
level   result   size     grade    involvement
0.48    0        0        0        0
0.46    1        0        0        0
0.50    0        1        0        0
0.48    1        1        0        1
0.84    1        1        1        1
...     ...      ...      ...      ...
In this type of data, the response Y has value 1 if nodal involvement has
occurred and 0 otherwise. This is called a binary response. Even some of
the independent variables (X-ray results, Tumour size and Tumour grade)
are binary variables, taking on only the values 0 or 1. These data will be
analyzed in Chapter 5.
2.1.4 Response as a proportion
phoniella sanborni, in batches of about fifty. The results are given in the
following table.
Conc   Log(Conc)   No. of insects   No. affected   % affected
10.2   1.01        50               44             88
7.7    0.89        49               42             86
5.1    0.71        46               24             52
3.8    0.58        48               16             33
2.6    0.41        50               6              12
One aim with this experiment was to find a model for the relation between the
probability p that an insect is affected and the dose, i.e. the concentration.
Such a model can be written, in general terms, as
g(p) = f(Concentration).
The functions g and f should be chosen such that the model cannot produce
a predicted probability that is larger than 1. These data will be discussed
later on page 89.
2.1.5 Response as a count
Counts are measurements where the response indicates how many times a
specific event has occurred. Counts are often recorded in the form of frequency tables or crosstabulations. Count data are restricted to integers $\ge 0$.
Models for counts should take this limitation into account.
Example 2.4 Sokal and Rohlf (1973) reported some data on the color of
Tiger beetles (Cicindela fulgida) collected during different seasons. The results are:
Season          Red   Other   Total
Early spring    29    11      40
Late spring     273   191     464
Early summer    8     31      39
Late summer     64    64      128
Total           374   297     671
The data may be used to study how the color of the beetle depends on season.
A common approach is to test whether there is independence between season
and color through a $\chi^2$ test. We will return to the analysis of these data later
(page 117).
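The $\chi^2$ test of independence mentioned above can be sketched in a few lines of Python. This is only an illustration of the standard Pearson statistic, computed from the beetle table; the function name `chi_square_independence` is an assumption of this sketch, and the analysis the book gives later may differ.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Rows: early spring, late spring, early summer, late summer.
# Columns: red, other.
beetles = [[29, 11], [273, 191], [8, 31], [64, 64]]
stat, df = chi_square_independence(beetles)
print(round(stat, 2), df)  # roughly 27.7 on 3 degrees of freedom
```

The statistic is far above 7.81, the 5% critical value of a $\chi^2$ distribution on 3 degrees of freedom, so independence between season and color would be rejected at that level.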
2.1.6 Response as a rate
Females   175   17.3
Males     320   21.4
Accident data can often be modeled using the Poisson distribution. In this
case, we have to account for the fact that males and females have different
observation periods, in terms of number of person years. Accident rate can
be measured as (no. of accidents)/(no. of person years). In a later chapter
(page 131), we will discuss how this type of data can be modeled.
2.1.7 Ordinal response
                            Snoring
Heart problems   Never   Sometimes   Often   Always   Total
Yes              24      35          21      30       110
No               1355    603         192     224      2374
Total            1379    638         213     254      2484
The main interest lies in studying possible dependence between snoring and
heart problems. Analysis of ordinal data is discussed in Chapter 7.
2.2 Generalized linear models
(2.1)
Let us denote
$\eta = X\beta$   (2.2)
as the linear predictor part of the model (1.3). Generalized linear models are
a generalization of general linear models in the following ways:
1. An assumption often made in a general linear model (GLM) is that the components of y are
independently normally distributed with constant variance. We can relax
this assumption to permit the distribution to be any distribution that
belongs to the exponential family of distributions. This includes distributions such as the Normal, Poisson, gamma and binomial distributions.
2. Instead of modeling $\mu = E(y)$ directly as a function of the linear predictor
$X\beta$, we model some function $g(\mu)$ of $\mu$. Thus, the model becomes
$g(\mu) = \eta = X\beta$.   (2.3)
The function $g(\cdot)$ in (2.3) is called a link function.
The specification of a generalized linear model thus involves:
1. specification of the distribution
2. specification of the link function $g(\cdot)$
3. specification of the linear predictor $X\beta$.
We will discuss these issues, starting with the distribution.
2.3 The exponential family of distributions
$f(y; \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\}$   (2.4)
where $a(\cdot)$, $b(\cdot)$ and $c(\cdot)$ are some functions. The so-called canonical parameter $\theta$ is some function of the location parameter of the distribution. Some
authors distinguish between the exponential family, which is (2.4) assuming that $a(\phi)$
is unity, and the exponential dispersion family, which includes the function $a(\phi)$
while assuming that the so-called dispersion parameter $\phi$ is a constant; see
Jørgensen (1987); Lindsey (1997, p. 10f). As examples of the usefulness of the
exponential family, we will demonstrate that some well-known distributions
are, in fact, special cases of the exponential family.
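2.3.1 The Poisson distribution
$f(y; \lambda) = \frac{\lambda^y e^{-\lambda}}{y!} = \exp\left[ y \log(\lambda) - \lambda - \log(y!) \right]$   (2.5)
We can compare this expression with (2.4). We note that $\theta = \log(\lambda)$, which
means that $\lambda = \exp(\theta)$. We insert this into (2.5) and get
$f(y; \theta) = \exp\left[ y\theta - \exp(\theta) - \log(y!) \right]$
Thus, (2.5) is a special case of (2.4) with $\theta = \log(\lambda)$, $b(\theta) = \exp(\theta)$, $c(y, \phi) =
-\log(y!)$ and $a(\phi) = 1$.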
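The identification of the Poisson distribution as a special case of (2.4) can be checked numerically. The following Python sketch is an illustration only; the function names `poisson_pmf` and `exp_family_density`, and the value 3.5 for the Poisson mean, are assumptions made for the example.

```python
import math


def poisson_pmf(y, lam):
    """Poisson probability function in its usual form."""
    return lam ** y * math.exp(-lam) / math.factorial(y)


def exp_family_density(y, theta, b, c, a_phi=1.0):
    """Density of form (2.4): exp{[y*theta - b(theta)] / a(phi) + c(y, phi)}."""
    return math.exp((y * theta - b(theta)) / a_phi + c(y))


lam = 3.5                  # hypothetical Poisson mean
theta = math.log(lam)      # canonical parameter theta = log(lambda)

for y in range(8):
    direct = poisson_pmf(y, lam)
    via_family = exp_family_density(
        y, theta,
        b=math.exp,                        # b(theta) = exp(theta)
        c=lambda v: -math.lgamma(v + 1),   # c(y, phi) = -log(y!)
    )
    assert math.isclose(direct, via_family, rel_tol=1e-12)
```

Here `math.lgamma(y + 1)` is used for log(y!), which avoids computing large factorials explicitly.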
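2.3.2 The binomial distribution
$f(y; p) = \binom{n}{y} p^y (1-p)^{n-y} = \exp\left[ y \log\frac{p}{1-p} + n \log(1-p) + \log\binom{n}{y} \right]$.   (2.6)
We use $\theta = \log\frac{p}{1-p}$, i.e. $p = \frac{\exp(\theta)}{1+\exp(\theta)}$, which gives
$f(y; \theta) = \exp\left[ y\theta + n \log\frac{1}{1+\exp(\theta)} + \log\binom{n}{y} \right]$.
It follows that the binomial distribution is an exponential family distribution
with $\theta = \log\frac{p}{1-p}$, $b(\theta) = n \log[1 + \exp(\theta)]$, $c(y, \phi) = \log\binom{n}{y}$ and $a(\phi) = 1$.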
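2.3.3 The Normal distribution
$f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\mu)^2}{2\sigma^2}} = \exp\left[ \frac{y\mu - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2) \right]$.   (2.7)
This is an exponential family distribution with $\theta = \mu$, $\phi = \sigma^2$, $a(\phi) = \phi$,
$b(\theta) = \theta^2/2$, and $c(y, \phi) = -\left[ y^2/\phi + \log(2\pi\phi) \right]/2$. (In fact, it is an
exponential dispersion family distribution; see above.)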
2.3.4 The function b(·)
The function $b(\cdot)$ is of special importance in generalized linear models because $b(\cdot)$ describes the relationship between the mean value and the variance
in the distribution. To show how this works we consider Maximum Likelihood estimation of the parameters of the model. For a brief introduction to
Maximum Likelihood estimation, reference is made to Appendix B.
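The first derivative: $b'$
We denote the log likelihood function with $l(\theta, \phi; y) = \log f(y; \theta, \phi)$. According to likelihood theory it holds that
$E\left( \frac{\partial l}{\partial \theta} \right) = 0$   (2.8)
and that
$E\left( \frac{\partial^2 l}{\partial \theta^2} \right) + E\left[ \left( \frac{\partial l}{\partial \theta} \right)^2 \right] = 0$.   (2.9)
From (2.4), the log likelihood is $l = [y\theta - b(\theta)]/a(\phi) + c(y, \phi)$, so that
$\frac{\partial l}{\partial \theta} = [y - b'(\theta)]/a(\phi)$   (2.10)
and
$\frac{\partial^2 l}{\partial \theta^2} = -b''(\theta)/a(\phi)$   (2.11)
where $b'$ and $b''$ denote the first and second derivative, respectively, of $b$ with
respect to $\theta$. From (2.8) and (2.10) we get
$E\left( \frac{\partial l}{\partial \theta} \right) = E\{[y - b'(\theta)]/a(\phi)\} = 0$   (2.12)
so that
$E(y) = \mu = b'(\theta)$.   (2.13)
Thus the mean value of the distribution is equal to the first derivative of
$b$ with respect to $\theta$. For the distributions we have discussed so far, these
derivatives are:
Poisson:  $b(\theta) = \exp(\theta)$ gives $b'(\theta) = \exp(\theta) = \lambda$
Binomial: $b(\theta) = n \log(1 + \exp(\theta))$ gives $b'(\theta) = n \frac{\exp(\theta)}{1 + \exp(\theta)} = np$
Normal:   $b(\theta) = \frac{\theta^2}{2}$ gives $b'(\theta) = \theta = \mu$
From (2.9), (2.10) and (2.11) we get
$-\frac{b''(\theta)}{a(\phi)} + \frac{Var(y)}{a(\phi)^2} = 0$   (2.14)
so that
$Var(y) = a(\phi)\, b''(\theta)$.   (2.15)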
We see that the variance of y is a product of two terms: the second derivative
of $b(\cdot)$, and the function $a(\phi)$, which is independent of $\theta$. The parameter $\phi$
is called the dispersion parameter and $b''(\theta)$ is called the variance function.
For the distributions that we have discussed so far the variance functions are
as follows:
Poisson:  $b''(\theta) = \exp(\theta) = \lambda$
Binomial: $b''(\theta) = n \frac{\exp(\theta)}{(1 + \exp(\theta))^2} = np(1 - p)$
Normal:   $a(\phi)\, b''(\theta) = \phi \cdot 1 = \sigma^2$
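The role of $b(\cdot)$ as the link between mean and variance can be verified numerically by differentiating $b$ with finite differences. The sketch below is illustrative only: the helper names and the parameter values (a Poisson mean of 4, a binomial with n = 10 and p = 0.3, a Normal with mean 2 and variance 9) are assumptions chosen for the example.

```python
import math


def derivative(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h ** 2


def mean_and_variance(b, theta, a_phi=1.0):
    """E(y) = b'(theta) and Var(y) = a(phi) * b''(theta)."""
    return derivative(b, theta), a_phi * second_derivative(b, theta)


# Poisson with lambda = 4: theta = log(lambda), b(theta) = exp(theta).
pois_mean, pois_var = mean_and_variance(math.exp, math.log(4.0))

# Binomial with n = 10, p = 0.3: theta = logit(p), b(theta) = n*log(1 + exp(theta)).
n, p = 10, 0.3
bin_mean, bin_var = mean_and_variance(
    lambda t: n * math.log(1.0 + math.exp(t)), math.log(p / (1.0 - p))
)

# Normal with mu = 2, sigma^2 = 9: theta = mu, b(theta) = theta^2/2, a(phi) = sigma^2.
norm_mean, norm_var = mean_and_variance(lambda t: t * t / 2.0, 2.0, a_phi=9.0)

print(pois_mean, pois_var)   # both close to lambda = 4
print(bin_mean, bin_var)     # close to np = 3 and np(1 - p) = 2.1
print(norm_mean, norm_var)   # close to mu = 2 and sigma^2 = 9
```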
2.4
The link function $g(\cdot)$ is a function relating the expected value of the response
Y to the predictors $X_1 \ldots X_p$. It has the general form $g(\mu) = \eta = X\beta$. The
function $g(\cdot)$ must be monotone and differentiable. For a monotone function
we can define the inverse function $g^{-1}(\cdot)$ by the relation $g^{-1}(g(\mu)) = \mu$. The
choice of link function depends on the type of data. For continuous normal-theory data an identity link may be appropriate. For data in the form of
counts, the link function should restrict $\mu$ to be positive, while data in the
form of proportions should use a link that restricts $\mu$ to the interval [0, 1].
Some commonly used link functions and their inverses are:
The identity link: $\eta = \mu$. The inverse is simply $\mu = \eta$.
The logit link: $\eta = \log[\mu/(1 - \mu)]$. The inverse $\mu = \frac{\exp(\eta)}{1 + \exp(\eta)}$ is restricted
to the interval [0, 1].
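The two links listed above, together with their inverses, can be written as small Python functions. This is a minimal sketch with illustrative function names; the inverse logit is split into two branches only to avoid numerical overflow for large negative values of the linear predictor, which is an implementation detail not discussed in the text.

```python
import math


def identity_link(mu):
    return mu


def identity_inverse(eta):
    return eta


def logit_link(mu):
    """eta = log[mu / (1 - mu)], defined for 0 < mu < 1."""
    return math.log(mu / (1.0 - mu))


def logit_inverse(eta):
    """mu = exp(eta) / (1 + exp(eta)), written to avoid overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


# Round trip g^{-1}(g(mu)) = mu, and the inverse stays inside [0, 1]
# whatever the value of the linear predictor.
for mu in (0.05, 0.33, 0.5, 0.88):
    assert math.isclose(logit_inverse(logit_link(mu)), mu, rel_tol=1e-12)
for eta in (-1000.0, -3.0, 0.0, 3.0, 1000.0):
    assert 0.0 <= logit_inverse(eta) <= 1.0
```

An identity link, by contrast, places no restriction on the mean: a linear predictor of 1.7 would be returned unchanged as a "proportion" of 1.7.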
Figure 2.1:
2.4.1 Canonical links
2.5
The linear predictor $X\beta$ plays the same role in generalized linear models
as in general linear models. In regression settings, X contains values of
independent variables. In ANOVA settings, X contains dummy variables
corresponding to qualitative predictors (treatments, blocks etc). In general,
the model states that some function of the mean of y is a linear function of
the predictors: $\eta = X\beta$. As noted in Chapter 1, X is called a design matrix.
2.6
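that maximize the log likelihood, which for a single observation can be written
$l = \log[L(\theta, \phi; y)] = \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)$.   (2.16)
By the chain rule,
$\frac{\partial l}{\partial \beta_j} = \frac{\partial l}{\partial \theta} \frac{d\theta}{d\mu} \frac{d\mu}{d\eta} \frac{\partial \eta}{\partial \beta_j}$.   (2.17)
From $\eta = \sum_j x_j \beta_j$ we obtain $\frac{\partial \eta}{\partial \beta_j} = x_j$. Putting things together,
$\frac{\partial l}{\partial \beta_j} = \frac{(y - \mu)}{a(\phi)} \frac{1}{V} \frac{d\mu}{d\eta} x_j = \frac{W}{a(\phi)} (y - \mu) \frac{d\eta}{d\mu} x_j$   (2.18)
where $V = b''(\theta) = d\mu/d\theta$ and $W$ is defined from
$W^{-1} = \left( \frac{d\eta}{d\mu} \right)^2 V$.   (2.19)
So far, we have written the likelihood for one single observation. By summing
over the observations, the likelihood equation for one parameter $\beta_j$ is given
by
$\sum_i \frac{W_i (y_i - \mu_i)}{a(\phi)} \frac{d\eta_i}{d\mu_i} x_{ij} = 0$.   (2.20)
We can solve (2.20) with respect to $\beta_j$ since the $\mu_i$:s are functions of the
parameters $\beta_j$. Asymptotic variances and covariances of the parameter estimates are obtained through the inverse of the Fisher information matrix (see
Appendix B). Thus,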
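$\begin{pmatrix}
Var(\hat\beta_0) & Cov(\hat\beta_0, \hat\beta_1) & \cdots & Cov(\hat\beta_0, \hat\beta_{p-1}) \\
Cov(\hat\beta_1, \hat\beta_0) & Var(\hat\beta_1) & & \vdots \\
\vdots & & \ddots & \\
Cov(\hat\beta_{p-1}, \hat\beta_0) & \cdots & & Var(\hat\beta_{p-1})
\end{pmatrix}
= \left[ -E \begin{pmatrix}
\frac{\partial^2 l}{\partial \beta_0^2} & \frac{\partial^2 l}{\partial \beta_0 \partial \beta_1} & \cdots & \frac{\partial^2 l}{\partial \beta_0 \partial \beta_{p-1}} \\
\frac{\partial^2 l}{\partial \beta_1 \partial \beta_0} & \frac{\partial^2 l}{\partial \beta_1^2} & & \vdots \\
\vdots & & \ddots & \\
\frac{\partial^2 l}{\partial \beta_{p-1} \partial \beta_0} & \cdots & & \frac{\partial^2 l}{\partial \beta_{p-1}^2}
\end{pmatrix} \right]^{-1}$   (2.21)
2.7 Numerical procedures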
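$\eta = g(\mu)$.
Form the adjusted dependent variate $z_0 = \hat\eta_0 + (y - \hat\mu_0) \frac{d\eta}{d\mu}$, where the
derivative of the link is evaluated at $\hat\mu_0$.
Define the weights from $W_0^{-1} = \left( \frac{d\eta}{d\mu} \right)^2 V_0$, where $V$ is the variance function.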
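The adjusted dependent variate and the weights can be combined into an iteratively reweighted least squares loop. The Python sketch below is a standard textbook version of that scheme for a Poisson model with log link and one covariate, written under stated assumptions rather than as a transcription of the procedure in this section: the function name `irls_poisson_log`, the starting values, and the fixed number of iterations are all illustrative choices. The exposure and count values are those listed in Exercise 1.4 and are reused here purely as convenient input.

```python
import math


def irls_poisson_log(x, y, iterations=25):
    """Fit log(mu) = b0 + b1*x to counts y by iteratively reweighted least squares."""
    # Start from the data themselves: mu0 = y (kept away from zero), eta0 = log(mu0).
    eta = [math.log(max(yi, 0.5)) for yi in y]
    b0 = b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(e) for e in eta]
        # Log link: d(eta)/d(mu) = 1/mu. Poisson variance function: V = mu.
        # Adjusted dependent variate z = eta + (y - mu) * d(eta)/d(mu).
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        # Weights from W^{-1} = (d(eta)/d(mu))^2 * V = 1/mu, i.e. W = mu.
        w = mu
        # Weighted least squares of z on (1, x) through the 2 x 2 normal equations.
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx ** 2
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
        eta = [b0 + b1 * xi for xi in x]
    return b0, b1


exposure = [0, 15, 30, 45, 60]
count = [271, 108, 59, 29, 12]
b0, b1 = irls_poisson_log(exposure, count)
fitted = [math.exp(b0 + b1 * xi) for xi in exposure]

# At convergence the likelihood equations hold: sum(y - mu) = 0 and sum(x*(y - mu)) = 0.
score_intercept = sum(c - f for c, f in zip(count, fitted))
score_slope = sum(xi * (c - f) for xi, c, f in zip(exposure, count, fitted))
```

For the log link with Poisson variance, the factor $W \, d\eta/d\mu$ in the likelihood equations equals 1, which is why the two sums above reduce to plain residual sums.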
2.8
2.8.1
The fit of a generalized linear model to data may be assessed through the
deviance. The deviance is also used to compare nested models.
Different models can have different degrees of complexity. The null model
has only one parameter that represents a common mean value for all observations. In contrast, the full (or saturated) model has n parameters, one for
each observation. For the saturated model, each observation fits the model
perfectly, i.e. $y = \hat y$. The full model is used as a benchmark for assessing the
fit of any model to the data. This is done by calculating the deviance. The
deviance is defined as follows:
Let $l(\hat\mu, \phi; y)$ be the log likelihood of the current model at the Maximum
Likelihood estimate, and let $l(y, \phi; y)$ be the log likelihood of the full model.
The deviance D is defined as
$D = 2\left( l(y, \phi; y) - l(\hat\mu, \phi; y) \right)$.   (2.22)
It can be noted that for a Normal distribution, the deviance is just the residual sum of squares. The Genmod procedure in SAS presents two deviance
statistics: the deviance and the scaled deviance. For distributions that have
a scale parameter $\phi$, the scaled deviance is $D^* = D/\phi$. It is actually the
scaled deviance that is used for inference. For distributions such as Binomial
and Poisson, the deviance and the scaled deviance are identical.
If the model is true, the deviance will asymptotically tend towards a $\chi^2$ distribution as n increases. This can be used as an over-all test of the adequacy
of the model. The degree of approximation to a $\chi^2$ distribution is different
for different types of data.
A second, and perhaps more important use of the deviance is in comparing
competing models. Suppose that a certain model gives a deviance $D_1$ on $df_1$
degrees of freedom (df), and that a simpler model produces deviance $D_2$ on
$df_2$ degrees of freedom. The simpler model would then have a larger deviance
and more df. To compare the two models we can calculate the difference in
deviance, $(D_2 - D_1)$, and relate this to the $\chi^2$ distribution with $(df_2 - df_1)$
degrees of freedom. This would give a large-sample test of the significance
of the parameters that are included in model 1 but not in model 2. This, of
course, requires that the parameters included in model 2 are a subset of the
parameters of model 1, i.e. that the models are nested.
2.8.2
An alternative to the deviance for testing and comparing models is the Pearson $\chi^2$, which can be defined as
$\chi^2 = \sum_i (y_i - \hat\mu_i)^2 / \hat V(\hat\mu_i)$.   (2.23)
Here, $\hat V(\hat\mu)$ is the estimated variance function. For the Normal distribution,
this is again the residual sum of squares of the model, so in this case, the deviance and Pearson's $\chi^2$ coincide. In other cases, the deviance and Pearson's
$\chi^2$ have different asymptotic properties and may produce different results.
Maximum likelihood estimation of the parameters in generalized linear models seeks to minimize the deviance, which may be one reason to prefer the
deviance over the Pearson $\chi^2$. Another reason is that the Pearson $\chi^2$ does
not have the same additive properties as the deviance for comparing nested
models. Computer packages for generalized linear models often produce both
the deviance and the Pearson $\chi^2$. Large differences between these may be an
indication that the $\chi^2$ approximation is bad.
2.8.3
An idea that has been put forward by several authors is to penalize the
likelihood functions such that simpler models are being preferred. A general
expression of this idea is to measure the fit of a model to data by a measure
such as
DC = D q.
(2.24)
2.8.4
The deviance, and the Pearson $\chi^2$, can provide large-sample tests of the fit
of the model. The usefulness of these tests depends on the kind of data being
analyzed. For example, Collett (1991) concludes that for binary data with all
$n_i = 1$, the deviance cannot be used to assess the over-all fit of the model (p.
65). For Normal models the deviance is equal to the residual sum of squares
which is not a model test by itself.
The advantage of the deviance, as compared to the Pearson $\chi^2$, is that it is
a likelihood-based test that is useful for comparing nested models. Akaike's
information criterion is often used as a way of comparing several competing
models, without necessarily making any formal inference.
2.9
2.9.1 Wald tests
bb
(2.25)
2.9.2
Likelihood ratio tests are based on the following principle. Denote with L1 the
likelihood function maximized over the full parameter space, and denote with
L0 the likelihood function maximized over parameter values that correspond
to the null hypothesis being tested.
The likelihood ratio statistic is
$-2 \log(L_0/L_1) = -2\left[ \log(L_0) - \log(L_1) \right] = -2(l_0 - l_1)$.   (2.26)
2.9.3 Score tests
We will illustrate score tests (also called efficient score tests) based on arguments taken from Agresti (1996). In Figure 2.2, we illustrate a hypothetical
likelihood function. $\hat\theta$ is the Maximum Likelihood estimator of some parameter $\theta$. We are testing a hypothesis $H_0: \theta = 0$. $L_1$ and $L_0$ denote the
likelihood under $H_1$ and $H_0$, respectively.
The Wald test uses the behavior of the likelihood function at the ML estimate
$\hat\theta$. The asymptotic standard error of $\hat\theta$ depends on the curvature of the
likelihood function close to $\hat\theta$.
The score test is based on the behavior of the likelihood function close to 0,
the value stated in $H_0$. If the derivative at $H_0$ is large, this would be an
indication that $H_0$ is wrong, while a derivative close to 0 would be a sign that
we are close to the maximum. The score test is calculated as the square of
the ratio of this derivative to its asymptotic standard error. It can be treated
as an asymptotic $\chi^2$ variate on 1 df.
The likelihood ratio test uses information on the log likelihood both at $\hat\theta$
and at 0. It compares the likelihoods $L_1$ and $L_0$ using the asymptotic $\chi^2$
distribution of $-2(\log L_0 - \log L_1)$. Thus, in a sense, the LR statistic uses
more information than the Wald and score statistics. For this reason, Agresti
(1996) suggests that the likelihood ratio statistic may be the most reliable of
the three.
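The difference between the three statistics is easy to see in a case where all of them have closed forms. The Python sketch below tests a hypothesized binomial proportion $p_0$; this is a deliberately simple stand-in for the general situation described above, and the function name `binomial_tests` and the numbers (30 successes in 50 trials, $p_0 = 0.5$) are hypothetical.

```python
import math


def binomial_tests(y, n, p0):
    """Wald, score and likelihood ratio statistics for H0: p = p0 (requires 0 < y < n)."""
    p_hat = y / n

    def loglik(p):
        # Binomial log likelihood up to a constant that does not depend on p.
        return y * math.log(p) + (n - y) * math.log(1.0 - p)

    # Wald: curvature (standard error) evaluated at the ML estimate.
    wald = (p_hat - p0) ** 2 / (p_hat * (1.0 - p_hat) / n)
    # Score: slope and curvature evaluated at the hypothesized value.
    score = (p_hat - p0) ** 2 / (p0 * (1.0 - p0) / n)
    # Likelihood ratio: log likelihood at both points.
    lr = -2.0 * (loglik(p0) - loglik(p_hat))
    return wald, score, lr


wald, score, lr = binomial_tests(y=30, n=50, p0=0.5)
print(round(wald, 3), round(score, 3), round(lr, 3))  # 2.083 2.0 2.014
```

All three are referred to a $\chi^2$ distribution on 1 df. They are close here, and they differ only in where on the likelihood function the information is taken from.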
Figure 2.2: A likelihood function indicating information used in Wald, LR and score tests.
2.9.4 Tests of Type 1 or 3
Tests in generalized linear models have the same sequential property as tests
in general linear models. Proc Genmod in SAS offers Type 1 or Type 3 tests.
The interpretation of these tests is the same as in general linear models. In
a Type 1 analysis the result of a test depends on the order in which terms
are included in the model. A Type 3 analysis does not depend on the order
in which the Model statement is written: it can be seen as an attempt to
mimic the analysis that would be obtained if the data had been balanced. In
general linear models the Type 1 and Type 3 tests are obtained through sums
of squares. In generalized linear models the tests are likelihood ratio tests,
but there is an option in Genmod to use Wald tests instead. See Chapter 1
for a discussion on Type 1 and Type 3 tests.
2.10
In general linear models, the fit of the model to data can be summarized as
$R^2 = \frac{SS_{Model}}{SS_T}$. It holds that $0 \le R^2 \le 1$. A value of $R^2$ close to 1 would
indicate a good fit. An adjusted version of $R^2$ has been proposed to account
for the fact that $R^2$ increases even when irrelevant factors are added to the
model; see Chapter 1.
Similar measures of fit have been proposed also for generalized linear models.
$R^2 = 1 - \left( \frac{L_0}{L_{Max}} \right)^{2/n}$   (2.27)
where L is the likelihood. This measure equals the usual $R^2$ for Normal
models, but has the disadvantage that it is always smaller than 1. In fact,
$R^2_{max} = 1 - (L_0)^{2/n}$.   (2.28)
For example, in a binomial model with half of the observations in each category, this maximum equals 0.75 even if there is perfect agreement between
the variables. Nagelkerke (1991) therefore suggested the modification
$\bar R^2 = R^2 / R^2_{max}$.   (2.29)
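The three quantities (2.27) to (2.29) are most conveniently computed from log likelihoods, since likelihoods themselves quickly underflow. The following Python sketch is illustrative; the function name `r2_measures` is an assumption, and the example reproduces the binomial case mentioned above with a hypothetical sample of n = 40 binary observations, half in each category, and a model that predicts every observation perfectly.

```python
import math


def r2_measures(loglik_null, loglik_model, n):
    """Return (R^2, R^2_max, adjusted R^2) of (2.27)-(2.29) from log likelihoods."""
    # (L0 / LMax)^(2/n) = exp{(2/n) * (l0 - lMax)}
    r2 = 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)
    # (L0)^(2/n) = exp{(2/n) * l0}
    r2_max = 1.0 - math.exp(2.0 * loglik_null / n)
    return r2, r2_max, r2 / r2_max


n = 40
loglik_null = n * math.log(0.5)   # every observation has probability 0.5 under the null model
loglik_perfect = 0.0              # a perfectly predicting model has likelihood 1

r2, r2_max, r2_adjusted = r2_measures(loglik_null, loglik_perfect, n)
print(r2, r2_max, r2_adjusted)    # 0.75 0.75 1.0
```

The unadjusted measure stops at 0.75 even for a perfect fit, whereas the modified measure reaches 1.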
2.11 An application
        Mechanical   Manual
        2966         186
        269          107
        59           65
        1887         126
        3452         123
        189          164
        93           408
        618          324
        130          548
        2493         139
ȳ       1216         219
s       1343         156
This is close to a textbook example that may be used for illustrating two-sample t tests. However, closer scrutiny of the data reveals that the variation
is quite different in the two groups. In fact, some kind of relationship between
the mean and the variance may be at hand.
                      Value
Data Set              WORK.SHEEP
Distribution          POISSON
Link Function         LOG
Dependent Variable    COUNT
Observations Used     20

Criteria For Assessing Goodness Of Fit
Criterion             DF    Value         Value/DF
Deviance              18    14862.7182    825.7066
Scaled Deviance       18    14862.7182    825.7066
Pearson Chi-Square    18    14355.5077    797.5282
Scaled Pearson X2     18    14355.5077    797.5282
Log Likelihood        .     83800.0507    .
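The deviance and the Pearson statistic in this output can be recomputed directly from the data. The Python sketch below assumes, as the parameter structure of the model implies, that the fitted value for each observation is simply the mean of its milking group; the function name `poisson_fit_statistics` is illustrative.

```python
import math

mechanical = [2966, 269, 59, 1887, 3452, 189, 93, 618, 130, 2493]
manual = [186, 107, 65, 126, 123, 164, 408, 324, 548, 139]


def poisson_fit_statistics(groups):
    """Poisson deviance and Pearson chi-square when each group is fitted by its own mean.

    All counts are assumed to be positive, as they are in these data.
    """
    deviance = 0.0
    pearson = 0.0
    for counts in groups:
        mu = sum(counts) / len(counts)
        for y in counts:
            deviance += 2.0 * (y * math.log(y / mu) - (y - mu))
            pearson += (y - mu) ** 2 / mu
    return deviance, pearson


deviance, pearson = poisson_fit_statistics([mechanical, manual])
df = len(mechanical) + len(manual) - 2
print(round(deviance, 1), round(pearson, 1), df)  # 14862.7 14355.5 18
print(round(deviance / df, 1))                    # 825.7
```

A ratio of deviance to degrees of freedom of this size, against a value near 1 expected under a Poisson model, is what motivates the second analysis with an estimated dispersion parameter.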
Analysis Of Parameter Estimates
Parameter           DF   Estimate   Std Err   ChiSquare    Pr>Chi
INTERCEPT            1    7.1030    0.0091    613300.717   0.0001
MILKING     Man      1   -1.7139    0.0232    5451.1201    0.0001
MILKING     Mech     0    0.0000    0.0000    .            .
SCALE                0    1.0000    0.0000    .            .
NOTE:
However, since the ratio deviance/df is so large, a second analysis was made
where the program estimated the dispersion parameter $\phi$. This approach,
which is related to a phenomenon called overdispersion, is discussed in the
next chapter. In the analysis where the scale parameter was used, the scaled
deviance was 18.0 and the p value for milking was 0.010.
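The effect of estimating the dispersion parameter can be traced with a few lines of arithmetic on the printed output. The Python sketch below assumes that the dispersion is estimated as deviance/df and that the Wald chi-square for milking is divided by this estimate; both are assumptions of this illustration, chosen because they reproduce the reported scaled deviance and p value.

```python
import math

deviance = 14862.7182     # from the goodness-of-fit output
df = 18
wald_chi2 = 5451.1201     # chi-square for MILKING in the unscaled analysis

phi_hat = deviance / df                  # assumed estimate of the dispersion parameter
scaled_deviance = deviance / phi_hat     # equals df by construction
scaled_chi2 = wald_chi2 / phi_hat

# Upper tail probability of a chi-square variate on 1 df:
# P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(scaled_chi2 / 2.0))

print(round(scaled_deviance, 1), round(scaled_chi2, 2), round(p_value, 4))  # 18.0 6.6 0.0102
```

The unscaled chi-square of several thousand shrinks to about 6.6 once the large dispersion is taken into account, which is why the p value moves from below 0.0001 to about 0.010.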
Two other distributions were also tested: the gamma distribution and the
inverse Gaussian distribution. These were used with their respective canonical
links. In addition, a Wilcoxon-Mann-Whitney test was made. The results
for all these models can be summarized as follows:
Model                  p value
Normal, Glim           0.0140
Normal, t test         0.0440
Log normal             0.0405
Gamma                  0.0086
Inverse Gaussian       0.0610
Poisson                <0.0001
Poisson with $\phi$    0.0102
Wilcoxon               0.1620
2.12 Exercises
Exercise 2.1 The distribution of waiting times (for example the time you
wait in line at a bank) can sometimes be approximated by an exponential
distribution. The density of the exponential distribution is $f(x) = \lambda e^{-\lambda x}$ for
$x > 0$. Does the exponential distribution belong to the exponential family?
If so, what are the functions $b(\cdot)$ and $c(\cdot)$? What is the variance function?
Exercise 2.2 Sometimes data are collected which are essentially Poisson,
but where it is impossible to observe the value y = 0. For example, if data
are collected by interviewing occupants in houses on "how many occupants
are there in this house", it would be impossible to get an answer from houses
that are not occupied.
The truncated Poisson distribution can sometimes be used to model such
data. It has probability function
$p(y_i | \lambda) = \frac{e^{-\lambda} \lambda^{y_i}}{(1 - e^{-\lambda})\, y_i!}$
for $y_i = 1, 2, 3, \ldots$
A. Investigate whether the truncated Poisson distribution is a member of the
Exponential family.
B. Derive the variance function.
Exercise 2.3 Aanes (1961) studied the effect of a certain type of poisoning
in sheep. The survival time and the weight were recorded for 13 sheep that
had been poisoned. Results:
Weight   Survival   Weight   Survival
46       44         59       44
55       27         64       120
61       24         67       29
75       24         60       36
64       36         63       36
75       36         66       36
71       44
Find a generalized linear model for the relationship between survival time
and weight. Try several different distributions and link functions. Do not
use only the canonical link for each distribution. Plot the data and the fitted
models.
3. Model diagnostics
3.1 Introduction
In general linear models, the fit of the model to data can be explored by
using residual plots and other diagnostic tools. For example, the normality
assumption can be examined using normal probability plots, the assumption
of homoscedasticity can be checked by plotting residuals against $\hat y$, and so
on. Model diagnostics in Glim:s can be performed in similar ways.
In this chapter we will discuss different model diagnostic tools for generalized
linear models. Our discussion will be fairly general. We will return to these
issues in later chapters when we consider analysis of different types of response
variables.
The purpose of model diagnostics is to examine whether the model provides
a reasonable approximation to the data. If there are indications of systematic
deviations between data and model, the model should be modified.
The diagnostic tools that we consider are the following:
• Residual plots (similar to the residual plots used in GLM:s) can be used
to detect various deviations between data and model: outliers, problems
with distributions or variances, dependence, and so on.
• Some observations may have an unusually large impact on the results.
We will discuss tools to identify such influential observations.
• Overdispersion means that the variance is larger than would be expected
for the chosen distribution. We will discuss ways to detect overdispersion
and to modify the model to account for overdispersion.
3.2
$H = X (X'X)^{-1} X'$.   (3.1)
The hat matrix may have a more complex form in other models than GLM:s.
Except in degenerate cases, it is possible to compute the hat matrix. We will
not give general formulae here; however, most computer software for Glim:s
have options to print the hat matrix.
3.3
In general linear models, the observed residuals are simply the difference
between the observed values of y and the values $\hat y$ that are predicted from
the model: $\hat e = y - \hat y$. In generalized linear models, the variance of the
residuals is often related to the size of $\hat y$. Therefore, some kind of scaling
mechanism is needed if we want to use the residuals for plots or other model
diagnostics. Several suggestions have been made on how to achieve this.
3.3.1 Pearson residuals
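The raw residual for an observation $y_i$ can be defined as $\hat e_i = y_i - \hat y_i$. The
Pearson residual is the raw residual standardized with the standard deviation
of the fitted value:
$e_{i,Pearson} = \frac{y_i - \hat y_i}{\sqrt{\widehat{Var}(\hat y_i)}}$.
For a Poisson model, where the variance equals the mean, this gives $\frac{y_i - \hat y_i}{\sqrt{\hat y_i}}$, and for a binomial model with $\hat y_i = n_i \hat p_i$ it gives
$\frac{y_i - \hat y_i}{\sqrt{n_i \hat p_i (1 - \hat p_i)}}$.   (3.2)
The Pearson residuals are related to the Pearson $\chi^2$ through $\chi^2 = \sum_i e^2_{i,Pearson}$.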
If the model holds, Pearson residuals can often be considered to be approximately normally distributed with a constant variance, in large samples. However, even when they are standardized with the standard error of $\hat y$, the
variance of Pearson residuals cannot be assumed to be 1. This is since we
have standardized the residuals using estimated standard errors. Still, the
standard errors of Pearson residuals can be estimated. It can be shown that
adjusted Pearson residuals can be obtained as
$e_{i,adj,P} = \frac{e_{i,Pearson}}{\sqrt{1 - h_{ii}}}$   (3.3)
where $h_{ii}$ are diagonal elements from the hat matrix. The adjusted Pearson
residuals can often be considered to be standard Normal, which means that
e.g. residuals outside $\pm 2$ will occur in about 5% of the cases. This can be
used to detect possible outliers in the data.
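3.3.2 Deviance residuals
$e_{i,Deviance} = \mathrm{sign}(y_i - \hat y_i) \sqrt{d_i}$   (3.4)
where $d_i$ is the contribution of observation i to the deviance.
The deviance residuals can also be written in standardized form, i.e. such
that their variance is close to unity. This is obtained as
$e_{i,adj,D} = \frac{e_{i,Deviance}}{\sqrt{1 - h_{ii}}}$   (3.5)
where $h_{ii}$ are again diagonal elements from the hat matrix.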
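For a Poisson model both kinds of residuals are simple to compute. The Python sketch below is an illustration under stated assumptions: the function name `poisson_residuals` is hypothetical, the hat value $h_{ii}$ is passed in as an argument rather than computed, and the example uses the manually milked group from the application in Chapter 2 with the group mean as fitted value.

```python
import math


def poisson_residuals(y, mu, h=0.0):
    """Adjusted Pearson and deviance residuals for one Poisson observation.

    h is the corresponding diagonal element of the hat matrix; h = 0 gives
    the unadjusted residuals.
    """
    pearson = (y - mu) / math.sqrt(mu)
    # Contribution of this observation to the Poisson deviance.
    d = 2.0 * ((y * math.log(y / mu) if y > 0 else 0.0) - (y - mu))
    deviance = math.copysign(math.sqrt(max(d, 0.0)), y - mu)
    scale = math.sqrt(1.0 - h)
    return pearson / scale, deviance / scale


manual = [186, 107, 65, 126, 123, 164, 408, 324, 548, 139]
mu_hat = sum(manual) / len(manual)   # fitted value: the group mean, 219

residuals = [poisson_residuals(y, mu_hat) for y in manual]
pearson_sum = sum(p * p for p, _ in residuals)
deviance_sum = sum(d * d for _, d in residuals)
print(round(pearson_sum, 2), round(deviance_sum, 1))
```

The squared unadjusted residuals add up to this group's share of the Pearson $\chi^2$ and of the deviance, respectively. Residuals of this magnitude, many of them far outside $\pm 2$, are another sign of the overdispersion noted in that example.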
3.3.3 Score residuals
The Wald tests, Likelihood ratio tests and Score tests, presented in the previous chapter, provide different ways of testing hypotheses about parameters
of the model. The score residuals are related to the score tests.
In Maximum Likelihood estimation, the parameter estimates are obtained by
solving the score equations, which are of type
$U = \frac{\partial l}{\partial \theta} = 0$   (3.6)
where $\theta$ is some parameter. The score equations involve sums of terms $U_i$, one
for each observation. These terms can, properly standardized, be interpreted
as residuals, i.e. as the contribution from each observation to the score. The
standardized score residuals are obtained from
$e_{i,adj,S} = \frac{U_i}{\sqrt{(1 - h_{ii})\, v_i}}$   (3.7)
where $h_{ii}$ are diagonal elements of the hat matrix, and $v_i$ are elements of a
certain weight matrix.
3.3.4
Likelihood residuals
Theoretically it would be possible to compare the deviance of a model that comprises all the data with the deviance of a model with observation $i$ excluded.
However, this procedure would require heavy computations. An approximation to the residuals that would be obtained using this procedure is
$$e_{i,Likelihood} = \operatorname{sign}(y_i - \hat{y}_i)\sqrt{h_{ii}\,(e_{i,Score})^2 + (1 - h_{ii})\,(e_{i,Deviance})^2}$$ (3.8)
where $h_{ii}$ are diagonal elements of the hat matrix. This is a kind of weighted
average of the deviance and score residuals.
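Formula (3.8) is simple to evaluate once the score residual, the deviance residual and the leverage are available. The sketch below uses hypothetical input values and a hypothetical function name; it only illustrates how the leverage weights the two residual types.

```python
import math

def likelihood_residual(y, y_hat, h_ii, e_score, e_deviance):
    """Weighted combination of score and deviance residuals, as in (3.8)."""
    sign = 1.0 if y >= y_hat else -1.0
    return sign * math.sqrt(h_ii * e_score ** 2 + (1.0 - h_ii) * e_deviance ** 2)

# Hypothetical values: with a low leverage the result stays close to the
# deviance residual; with a high leverage it moves towards the score residual.
print(likelihood_residual(5, 3.0, 0.05, 2.0, 1.0))
print(likelihood_residual(5, 3.0, 0.90, 2.0, 1.0))
```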
3.3.5
Anscombe residuals
The types of residuals discussed so far have distributions that may not always be close to Normal, in samples of the sizes we often meet in practice.
Anscombe (1953) suggested that the residuals may be defined based on some
transformation of observed data and fitted values. The transformation would
be chosen such that the calculated residuals are approximately standard Normal. Anscombe defined the residuals as
$$e_{i,Anscombe} = \frac{A(y_i) - A(\hat{y}_i)}{\sqrt{\widehat{Var}\left(A(y_i) - A(\hat{y}_i)\right)}}$$ (3.9)
In general, the Anscombe residuals are rather difficult
to calculate, which may explain why they have not reached widespread use.
3.3.6
The types of residuals discussed above are related to the types of tests and
other model building tools that are used:
The deviance residuals are related to the deviance as a measure of fit of the
model and to Likelihood ratio tests.
The Pearson residuals are related to the Pearson $\chi^2$ and to the Wald tests.
The score residuals are related to score tests.
The likelihood residuals are a compromise between score and deviance residuals.
The Anscombe residuals, although theoretically appealing, are not often
used in practice in programs for fitting generalized linear models.
In the previous chapter we suggested that the likelihood ratio tests may be
preferred over Wald tests and score tests for hypothesis testing in Glim:s. By
extending this argument, the deviance residuals may be the preferred type of
residuals to use for model diagnostics. Collett (1991) suggested that either
the deviance residuals or the likelihood residuals should be used.
3.4
Some of the observations in the data may have an unduly large impact on the
parameter estimates. If such so-called influential observations are changed by
a small amount, or if they are deleted, the estimates may change drastically.
An outlier is an observation for which the model does not give a good approximation. Outliers can often be detected using different types of plots. Note
that influential observations are not necessarily outliers. An observation can
be influential and still be close to the main bulk of the data. Diagnostic tools
are needed to detect influential observations and outliers.
3.4.1
Leverage
3.4.2
It can be shown that this yields the so-called Cook's distance $C_i$. In principle,
the calculation of $C_i$ (or $D_i$) requires extensive re-fitting of the model which
may take time even on fast computers. However, an approximation to $C_i$ can
be obtained as
Ci
(3.10)
where p is the number of parameters and hii are elements of the hat matrix.
3.4.3
3.4.4
3.5
Partial leverage
linear model to this design matrix and calculate the residuals. Also, fit a
model with variable xj as the response and the remaining variables X[j] as
regressors. Calculate the residuals from this model as well. A partial leverage
plot is a plot of these two sets of residuals. It shows how much the residuals
change between models with and without variable xj . Partial leverage plots
can be produced in procedure Insight (SAS, 2000b).
3.6
Overdispersion
A generalized linear model can sometimes give a good summary of the data,
in the sense that both the linear predictor and the distribution are correctly
chosen, and still the fit of the full model may be poor. One possible reason
for this may be a phenomenon called over-dispersion.
Over-dispersion occurs when the variance of the response is larger than would
be expected for the chosen distribution. For example, if we use a Poisson
distribution to model the data we would expect the variance to be equal
to the mean value: $\mu = \sigma^2$. Similarly, for data that are modelled using a
binomial distribution, the variance is a function of the response probability:
$\sigma^2 = np(1 - p)$. Thus, for many distributions it is possible to infer what the
variance should be, given the mean value. In Chapter 2 we noted that for
distributions in the exponential family, the variance is some function of the
mean: $\sigma^2 = V(\mu)$.
Under-dispersion, i.e. a too small variance, is theoretically possible but
rather unusual in practice. Interesting examples of under-dispersion can be
found in the analysis of Mendel's classical genetic data; these data are better
than would be expected by chance.
In models that do not contain any scale parameter, over-dispersion can be detected as a poor model fit, as measured by deviance/df . Note, however, that
a poor model fit can also be caused by the wrong choice of linear predictor
or wrong choice of distribution or link. Thus, a poor fit does not necessarily
mean that we have over-dispersion.
Over-dispersion may have many different reasons. However, the main reason
is often some type of lack of homogeneity. This lack of homogeneity may occur
between groups of individuals; between individuals; and within individuals.
As an example, consider a dose-response experiment where the same dose of
an insecticide is given to two batches of insects. In one of the batches, 50 out
of 100 insects die, while in the other batch 65 out of 100 insects die. Formally,
this means that the response probabilities in the two batches are significantly
different (the reader may wish to confirm that a textbook Chi-square test
gives $\chi^2 = 4.6$, $p = 0.032$). This may indicate that the batches of insects are
not homogeneous with respect to tolerance to the insecticide. If these data are
part of some larger dose-response experiment, using more batches of animals
and more doses, this type of inhomogeneity would result in a poor model fit
because of overdispersion.
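The Chi-square test for the two batches can be checked with a few lines of Python. This is a sketch using only the standard library: the function name is hypothetical, no continuity correction is applied, and the p-value relies on the fact that for 1 degree of freedom $P(\chi^2 > x) = \operatorname{erfc}(\sqrt{x/2})$.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Upper tail probability of a chi-square variate on 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Batch 1: 50 dead, 50 alive. Batch 2: 65 dead, 35 alive.
chi2, p_value = chi_square_2x2(50, 50, 65, 35)
print(round(chi2, 1), round(p_value, 3))
```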
3.6.1
Before any attempts are made to model the over-dispersion, you have to
examine all other possible reasons for poor model fit. These include:
Wrong choice of linear predictor. For example, you may have to add terms
to the predictor, such as new covariates, interaction terms or nonlinear
terms.
Wrong choice of link function.
There may be outliers in the data.
When the data are sparse, the assumptions underlying the large-sample
theory may not be fulfilled, thus causing a poor model fit.
A common effect of over-dispersion is that estimates of standard errors are
under-estimates. This leads to test statistics which are too large: it becomes
too easy to get a significant result. A simple way to model over-dispersion is
to introduce a scale parameter $\phi$ into the variance function. Thus, we would
assume that $Var(Y) = \phi\sigma^2$. For binomial data this means that we would use
the variance $\phi np(1 - p)$, and for Poisson data we would use $\phi\mu$ as variance.
The parameter $\phi$ is often called the over-dispersion parameter. A simple, but
somewhat rough, way to estimate $\phi$ is to fit a maximal model$^1$ to the data,
and to use the mean deviance (i.e. Deviance/df), or Pearson $\chi^2$/df, from
that model as an estimator of $\phi$. We can then re-fit the model, using the
obtained value of the over-dispersion parameter. Williams (1982) suggested
a more sophisticated iterative procedure for estimating $\phi$; see Collett (1991)
for details.
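The rough estimate of the over-dispersion parameter can be sketched as follows. The numbers are hypothetical, and the sketch assumes, as is usual when a scale parameter multiplies the variance function, that standard errors are inflated by the square root of the estimated parameter.

```python
import math

def estimate_dispersion(pearson_chi2, df):
    """Rough estimate of the over-dispersion parameter: Pearson chi-square / df."""
    return pearson_chi2 / df

def rescale_standard_error(std_err, phi_hat):
    """Inflate a standard error from the unscaled fit by sqrt(phi_hat)."""
    return std_err * math.sqrt(phi_hat)

# Hypothetical maximal model: Pearson chi-square 37.5 on 15 degrees of freedom.
phi_hat = estimate_dispersion(37.5, 15)
# Hypothetical standard error of 0.20 from the fit without a scale parameter.
se_adjusted = rescale_standard_error(0.20, phi_hat)
print(phi_hat, se_adjusted)
```

With an estimated parameter of 2.5 the standard errors grow by about 58%, so the test statistics shrink accordingly.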
A more satisfactory approach would be to model the over-dispersion based
on some specific model. One possible model is to assume that the mean parameter has a separate value for each individual. Thus, the mean parameter
would be assumed to follow some random distribution over individuals while
the response follows a second distribution, given the mean value. This would
1 Note that this maximal model is not the same as the saturated model, which has
deviance $D = 0$. Instead, the maximal model is a somewhat subjectively chosen large model
which includes all effects that can reasonably be included.
3.7
Non-convergence
When using packages like Genmod for fitting generalized linear models, it
may happen that the program reports that the procedure has not converged.
Sometimes the convergence is slow and the procedure reports estimates of
standard errors that are very large. Typical error messages might be
WARNING: The negative of the Hessian is not positive definite. The
convergence is questionable.
WARNING: The procedure is continuing but the validity of the model fit is
questionable.
WARNING: The specified model did not converge.
Note that in SAS, the error messages are given in the program log. You can
get some output even if these warnings have been given.
Non-convergence occurs because of the structure of the data in relation to the
model that is being fitted. A common problem is that the number of observed
data values is small relative to the number of parameters in the model. The
model is then under-identified. This can easily happen in the analysis of
multidimensional crosstables. For example, a crosstable of dimension $4 \times 3 \times 3 \times 3$
contains 108 cells. If the sample size is moderate, say n = 100, the average
number of observations per cell will be less than 1. It is then easy to imagine
that many of the cells will be empty. Convergence problems are likely in such
cases.
When the data are binomial, the procedure may fail to converge when it tries
to fit estimated proportions close to 0 or 1. This may happen when many
observed proportions are 0 or 1.
As general advice: when the procedure does not converge, try to simplify
the model as much as possible by removing, in particular, interaction terms.
Make tables and other summaries of the data to find out the reasons for the
failure to converge.
3.8
Applications
3.8.1
Residual plots
In this section we will discuss a number of useful ways to check models, using
the statistics we have discussed in this chapter. As illustrations of the various
plots we use the example on somatic cells in the milk of sheep, discussed in
the previous chapter (page 50). For the illustrations we use a model with a
Normal distribution and an identity link, and a model with a Poisson distribution
and a log link. The following types of residual plots are often useful:
1. A plot of residuals against the fitted values $\hat{\mu}$ should show a pattern
where the residuals have a constant mean value of 0 and a constant range.
Deviations from this random pattern may arise because of incorrect link
function; wrong choice of scale of the covariates; or omission of non-linear
terms in the linear predictor.
2. A plot of residuals against covariates should show the same pattern as the
previous plot. Deviations from this pattern may indicate the wrong link
function, incorrect choice of scale or omission of non-linear terms.
3. Plotting the residuals in the order the observations are given in the data
may help to detect possible dependence between observations.
4. A normal probability plot of the residuals plots the sorted residuals against
their expected values. These are given by
$\Phi^{-1}[(i - 3/8)/(n + 1/4)]$
where $\Phi^{-1}$ is the inverse of the standard Normal distribution function,
i is the order of the observation, and n is the sample size. This plot
should yield a straight line, as long as we can assume that the residuals
are approximately Normal.
5. The residuals can also be plotted to detect an omitted covariate u. This
is done as follows: fit a model with u as response, using the same model
as for y. Obtain unstandardized residuals from both these models, and
plot these against each other. Any systematic pattern in this plot may
indicate that u should be used as a covariate.
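The expected values used in the normal probability plot (item 4 above) can be computed with the Python standard library. This is a sketch only; the function names and the residuals in the example are hypothetical.

```python
from statistics import NormalDist

def expected_normal_scores(n):
    """Expected values Phi^{-1}[(i - 3/8) / (n + 1/4)] for i = 1, ..., n."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf((i - 3 / 8) / (n + 1 / 4)) for i in range(1, n + 1)]

def normal_plot_points(residuals):
    """Pairs (expected score, sorted residual) for a normal probability plot."""
    ordered = sorted(residuals)
    return list(zip(expected_normal_scores(len(ordered)), ordered))

# Hypothetical residuals.
points = normal_plot_points([0.3, -1.2, 0.8, -0.1, 1.9])
for expected, observed in points:
    print(round(expected, 3), observed)
```

If the residuals are approximately Normal, the plotted points fall close to a straight line.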
Plots of residuals against predicted values for the data in the example on Page
50 are given in Figure 3.1 and Figure 3.2 for Normal and Poisson distributions,
respectively. The plots of residuals against predicted values indicate that the
variation is larger for larger predicted values. This tendency is strongest for
the Normal model.
Figure 3.1: Plot of residuals against predicted values for example data. Normal
distribution and identity link.
Figure 3.2: Plot of residuals against predicted values for example data. Poisson
distribution and log link.
Figure 3.3: Normal probability plot for the example data. Normal distribution with
an identity link.
Normal probability plots for these data are given in Figures 3.3 and 3.4,
for the Normal and Poisson models, respectively. The distribution of the
residuals is closer to Normal for the Poisson model, but the fit is not perfect.
3.8.2
McCullagh and Nelder (1989) suggest the following procedure for checking
the variance function. Assume that the variance is proportional to $\mu^{\zeta}$, where
$\zeta$ is some constant. Fit the model for different values of $\zeta$, and plot the
deviance against $\zeta$. The value of $\zeta$ for which the deviance is as small as
possible is suggested by the data.
Figure 3.4: Normal probability plot for the example data. Poisson distribution with
a log link.
3.8.3
To check the link function we need the so-called adjusted dependent variable
$z$. This is defined as $z_i = g(\hat{\mu}_i)$. This can be plotted against $\hat{\eta}$. If the link is
correct this should result in an essentially linear plot.
3.8.4
Transformation of covariates
So-called partial residual plots can be used to detect whether any of the covariates need to be transformed. The partial residual is defined as $u = z - \hat{\eta} + \hat{\beta}x$,
where $z$ is the adjusted dependent variable, $\hat{\eta}$ is the fitted linear predictor, $x$
is a covariate and $\hat{\beta}$ is the parameter estimate for the covariate. The partial
residuals can be plotted against $x$. The plot should be approximately linear
if no transformation is needed. Curvature in the plot is an indication that $x$
may need to be transformed.
3.9
Exercises
4.1
GLM:s as GLIM:s
General linear models, such as regression models, ANOVA, t tests etc. can
be stated as generalized linear models by using a Normal distribution and
an identity link. We will illustrate this on some of the GLM examples we
discussed in Chapter 1.
Throughout this chapter, we will use the SAS (2000b) procedure Genmod for
data analysis.
4.1.1
The identity link is the default link for the Normal distribution. We used
this program on the regression data given on page 9. The results are:
                      Value
Data Set              WORK.EMISSION
Distribution          NORMAL
Link Function         IDENTITY
Dependent Variable    EMISSION
Observations Used     8
Criterion             DF       Value    Value/DF
Deviance               6     25.3765      4.2294
Scaled Deviance        6      8.0000      1.3333
Pearson Chi-Square     6     25.3765      4.2294
Scaled Pearson X2      6      8.0000      1.3333
Log Likelihood         .    -15.9690           .
NOTE:
Parameter    DF    Estimate    Std Err    ChiSquare    Pr>Chi
INTERCEPT     1    -36.7443     3.8179      92.6233    0.0001
TIME          1      2.0978     0.1186     312.8360    0.0001
SCALE         1      1.7810     0.4453            .         .
$$z = \frac{\hat{\beta}_j}{\sqrt{\widehat{Var}(\hat{\beta}_j)}},$$
$$t = \frac{\hat{\beta}_j - 0}{\sqrt{\widehat{Var}(\hat{\beta}_j)}}$$
2
6
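The ChiSquare column in the Genmod output for the regression data appears to be the square of the ratio between each estimate and its standard error. The following Python sketch checks this against the printed values; the function names are hypothetical, and small discrepancies are expected because the output is rounded to four decimals.

```python
import math

def wald_chi_square(estimate, std_err):
    """Squared Wald statistic z^2 = (estimate / standard error)^2."""
    z = estimate / std_err
    return z * z

def chi_square_1df_p_value(chi2):
    """Upper tail probability of a chi-square variate on 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Estimates and standard errors copied from the output above.
for name, estimate, std_err in (("INTERCEPT", -36.7443, 3.8179),
                                ("TIME", 2.0978, 0.1186)):
    chi2 = wald_chi_square(estimate, std_err)
    print(name, round(chi2, 2), chi_square_1df_p_value(chi2))
```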
4.1.2
Simple ANOVA
A Genmod program for an ANOVA model (of which the simple t test is a
special case) can be written as
PROC GENMOD DATA=lissdata;
CLASS medium;
MODEL diff = medium /
DIST=normal
LINK=identity ;
RUN;
The output from this program, using the data on page 16, contains the following information:
Criteria For Assessing Goodness Of Fit
Criterion             DF        Value    Value/DF
Deviance              50     905.1155     18.1023
Scaled Deviance       50      57.0000      1.1400
Pearson Chi-Square    50     905.1155     18.1023
Scaled Pearson X2     50      57.0000      1.1400
Log Likelihood         .    -159.6823           .
Parameter                  DF    Estimate    Std Err    ChiSquare    Pr>Chi
INTERCEPT                   1      9.9079     1.4089      49.4563    0.0001
MEDIUM     Diatrizoate      1     11.4841     2.2717      25.5554    0.0001
MEDIUM     Hexabrix         1     -5.9473     1.9363       9.4340    0.0021
MEDIUM     Isovist          1     -8.2437     1.9363      18.1257    0.0001
MEDIUM     Mannitol         1     -2.2192     1.9363       1.3136    0.2518
MEDIUM     Omnipaque        1     -1.5182     1.8902       0.6451    0.4219
MEDIUM     Ringer           1     -9.6979     2.0624      22.1116    0.0001
MEDIUM     Ultravist        0      0.0000     0.0000            .         .
SCALE                       1      3.9849     0.3732            .         .
NOTE:
           DF    ChiSquare    Pr>Chi
MEDIUM             62.1517    0.0001
4.2
ships between some common distributions. Note, however, that not all these
distributions are members of the exponential family.
4.3
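$$\Gamma(\alpha) = \int_0^{\infty} t^{\alpha - 1} e^{-t}\,dt$$ (4.1)
(4.2)
$$\Gamma(n) = (n - 1)!$$ (4.3)
and that
$$f(y; \alpha, \beta) = \frac{1}{\Gamma(\alpha)\,\beta^{\alpha}}\, y^{\alpha - 1} e^{-y/\beta}.$$ (4.4)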
4.3.1
The $\chi^2$ distribution is a special case of the gamma distribution. A $\chi^2$ distribution with $p$ degrees of freedom can be obtained as a gamma distribution
with parameters $\alpha = p/2$ and $\beta = 2$. The $\chi^2$ distribution has mean value
4.3.2
$$f(y; \beta) = \frac{1}{\beta}\, e^{-y/\beta}$$ (4.5)
4.3.3
Example 4.1 Hurn et al (1945), quoted from McCullagh and Nelder (1989),
studied the clotting time of blood. Two different clotting agents were com-
   Clotting time
Agent 1    Agent 2
    118         69
     58         35
     42         26
     35         21
     27         18
     25         16
     21         13
     19         12
     18         12
Duration data can often be modeled using the gamma distribution. The
canonical link of the gamma distribution is minus the inverse link, $-1/\mu$.
Preliminary analysis of the data suggested that the relation between clotting
time and concentration was better approximated by a linear function if the
concentrations were log-transformed. Thus, the models that were fitted to
the data were of type
$$\frac{1}{\mu} = \beta_0 + \beta_1 d + \beta_2 x + \beta_3 dx$$
where $x = \log(conc)$ and $d$ is a dummy variable with $d = 1$ for lot 1 and
$d = 0$ for lot 2. This is a kind of covariance analysis model (see Chapter 1). A Genmod analysis of the full model gave the following output:
Criteria For Assessing Goodness Of Fit
Criterion             DF       Value    Value/DF
Deviance              14      0.0294      0.0021
Scaled Deviance       14     17.9674      1.2834
Pearson Chi-Square    14      0.0298      0.0021
Scaled Pearson X2     14     18.2205      1.3015
Log Likelihood         .    -26.5976           .
           DF    Estimate     Std Err    ChiSquare    Pr>Chi
            1     -0.0239      0.0013     359.9825    0.0001
      1     1      0.0074      0.0015      24.9927    0.0001
      2     0      0.0000      0.0000            .         .
            1      0.0236      0.0005    1855.0452    0.0001
      1     1     -0.0083      0.0006     164.0704    0.0001
      2     0      0.0000      0.0000            .         .
            1    611.1058    203.6464            .         .
We can see that all parameters are significantly different from zero, which
means that we cannot simplify the model any further. The scaled deviance
is 17.97 on 14 df. A plot of the fitted model, along with the data, is given in
Figure 4.3. The fit is good, but McCullagh and Nelder note that the lowest
concentration value might have been misrecorded.
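To see how the fitted model translates into predicted clotting times, the inverse link can be applied to the linear predictor. The sketch below rests on several assumptions that are not stated in the output: that the four non-zero estimates correspond, in order, to $\beta_0$, $\beta_1$ (the dummy $d$), $\beta_2$ ($x$) and $\beta_3$ ($dx$); that log denotes the natural logarithm; and that a concentration of 5 is a meaningful input. The function name is hypothetical.

```python
import math

# Estimates read from the Genmod output above. Assumption: their order matches
# beta_0, beta_1 (d), beta_2 (x) and beta_3 (d*x) in the model equation.
B0, B1, B2, B3 = -0.0239, 0.0074, 0.0236, -0.0083

def predicted_clotting_time(conc, lot):
    """Predicted mean from 1/mu = b0 + b1*d + b2*x + b3*d*x, with x = log(conc)."""
    x = math.log(conc)          # assumption: natural logarithm
    d = 1 if lot == 1 else 0    # d = 1 for lot 1, d = 0 for lot 2
    eta = B0 + B1 * d + B2 * x + B3 * d * x
    if eta <= 0:
        raise ValueError("the inverse link requires a positive linear predictor")
    return 1.0 / eta

# Hypothetical concentration of 5.
print(round(predicted_clotting_time(5, 1), 1))
print(round(predicted_clotting_time(5, 2), 1))
```

The guard on the linear predictor reflects a practical drawback of the inverse link: nothing in the model forces the predicted mean to be positive.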
4.4
The inverse Gaussian distribution, also called the Wald distribution, has its
roots in models for random movement of particles, called Brownian motion
after the British botanist Robert Brown. The density function is
$$f(x; \mu, \lambda) = \left[\frac{\lambda}{2\pi x^3}\right]^{1/2} \exp\left[-\frac{\lambda (x - \mu)^2}{2\mu^2 x}\right]$$ (4.6)
The distribution has two parameters, $\mu$ and $\lambda$. It has mean value $\mu$ and variance $\mu^3/\lambda$. It belongs to the exponential family and is available in procedure
Genmod. The distribution is skewed to the right, and resembles the lognormal and gamma distributions. A graph of the shape of inverse Gaussian
distributions is given in Figure 4.4.
In a so-called Wiener process for a particle, the time $T$ it takes for the particle
to reach a barrier for the first time has an inverse Gaussian distribution. The
distribution has also been used to model the length of time a particle remains
in the blood; maternity data; crop field size; and length of stay in hospitals.
See Armitage and Colton (1998) for references.
4.5
Model diagnostics
For the example data on page 76, the deviance residuals and the predicted
values were stored in a file for further analysis. In this section we will present
some examples of model diagnostics based on these data.
4.5.1
The residuals can be plotted against the predicted values. In Normal theory
models this kind of plot can be used to detect heteroscedasticity. Such a plot
for our example data is given in Figure 4.5.
The plot does not show the even, random pattern that would be expected.
Two observations in the lower right corner are possible outliers.
Figure 4.5: Plot of residuals against predicted values for the Gamma regression data
4.5.2
The Normal probability plot can be used to assess the distributional properties of the residuals. For most generalized linear models the residuals can be
regarded as asymptotically Normal. However, the distributional properties
of the residuals in finite samples depend upon the type of model. Still, the
normal probability plot is a useful tool for detecting anomalies in the data.
A Normal probability plot for the gamma regression data is given in Figure
4.6.
4.5.3
Figure 4.6: Normal probability plot for the gamma regression data
Figure 4.7: Plot of deviance residuals against Log(conc) for the gamma regression
data
4.5.4
Influence diagnostics
The value of Dfbeta with respect to log(conc) was calculated for all observations and plotted against log(conc). The resulting plot is given in Figure 4.8.
The figure shows that observations with the lowest value of log(conc) have
the largest influence on the results. These are the same observations that
were noted in other diagnostic plots above; the two possible outliers noted
earlier are actually placed on top of each other in this plot.
The diagonal elements of the Hat matrix were computed using Proc Insight
(SAS, 2000b). These values are plotted against the sequence numbers of the
observations in Figure 4.9. Since there are four parameters and n = 18, the
average leverage is 4/18 = 0.222. As noted in Chapter 3, observations with
a leverage above twice that amount, i.e. here above $2 \times 0.222 = 0.44$, should
be examined. For these data the first two observations have a high leverage;
these are the observations that have been noted in the other diagnostic plots.
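The rule of thumb used here, flagging observations whose leverage exceeds twice the average $p/n$, can be sketched as follows. The leverage values in the example are hypothetical (they are not the values computed by Proc Insight) and are chosen only so that they sum to $p = 4$; the function name is also hypothetical.

```python
def high_leverage(hat_diagonal, n_params):
    """Return the cutoff 2*p/n and the 1-based indices of observations above it."""
    n = len(hat_diagonal)
    cutoff = 2.0 * n_params / n
    flagged = [i + 1 for i, h in enumerate(hat_diagonal) if h > cutoff]
    return cutoff, flagged

# Hypothetical leverages for n = 18 observations and p = 4 parameters.
hat_values = [0.60, 0.55] + [0.178125] * 16
cutoff, flagged = high_leverage(hat_values, 4)
print(round(cutoff, 2), flagged)
```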
Figure 4.8: Dfbeta plotted against log(conc) for the gamma regression data.
4.6
Exercises
Exercise 4.1 The following data, taken from Box and Cox (1964), show
the survival times (in 10 hour units) of a certain variety of animals. The experiment is a two-way factorial experiment with factors Poison (three levels)
and Treatment (four levels).
                    Treatment
Poison       A       B       C       D
I         0.31    0.82    0.43    0.45
          0.45    1.10    0.45    0.71
          0.46    0.88    0.63    0.66
          0.43    0.72    0.76    0.62
II        0.36    0.92    0.44    0.56
          0.29    0.61    0.35    1.02
          0.40    0.49    0.31    0.71
          0.23    1.24    0.40    0.38
III       0.22    0.30    0.23    0.30
          0.21    0.37    0.25    0.36
          0.18    0.38    0.24    0.31
          0.23    0.29    0.22    0.33
Analyze these data to find possible effects of poison, treatment, and interactions. The analysis suggested by Box and Cox was a standard two-way
ANOVA on the data transformed as z = 1/y. Make this analysis, and
also make a generalized linear model analysis assuming that the data can
be approximated with a gamma distribution. In both cases, make residual
diagnostics and influence diagnostics.
Exercise 4.2 The data given below are the time intervals (in seconds) between successive pulses along a nerve fibre. Data were extracted from Cox
and Lewis (1966), who gave credit to Drs. P. Fatt and B. Katz. The original
data set consists of 799 observations; we use the first 200 observations only.
If pulses arrive in a completely random fashion one would expect the distribution of waiting times between pulses to follow an exponential distribution.
Fit an exponential distribution to these data by applying a generalized linear
model with an appropriate distribution and link, and where the linear predictor only contains an intercept. Compare the observed data with the fitted
distribution using different kinds of plots. The data are as follows:
0.21
0.18
0.02
0.15
0.15
0.24
0.02
0.06
0.55
0.05
0.38
0.01
0.06
0.09
0.08
0.38
0.74
0.17
0.05
0.30
0.49
0.01
0.96
0.23
0.74
0.01
0.09
0.05
0.26
0.05
0.24
0.26
0.16
0.15
0.03
0.55
0.14
0.08
0.09
0.29
0.15
0.51
0.28
0.07
0.38
0.16
0.06
0.04
0.01
0.08
0.15
0.64
0.34
0.07
0.07
0.35
0.14
0.31
0.30
0.51
0.20
0.08
0.07
0.03
0.08
0.06
0.78
0.29
0.05
0.37
0.09
0.24
0.03
0.16
0.12
0.11
0.04
0.11
0.01
0.05
0.06
0.27
0.70
0.32
0.07
0.61
0.07
0.12
0.11
0.45
1.38
0.05
0.09
0.12
0.03
0.04
0.68
0.40
0.23
0.40
0.04
0.11
0.09
0.05
0.16
0.21
0.07
0.26
0.28
0.01
0.38
0.06
0.10
0.11
0.50
0.04
0.39
0.26
0.15
0.10
0.01
0.35
0.07
0.15
0.05
0.02
0.12
0.05
0.09
0.15
0.04
0.10
0.51
0.27
0.59
0.14
0.15
0.06
0.02
0.07
0.15
0.36
0.94
0.21
0.13
0.16
0.44
0.25
0.08
0.58
0.25
0.26
0.09
0.16
1.21
0.93
0.01
0.29
0.19
0.43
0.13
0.10
0.01
0.21
0.19
0.15
0.35
0.06
0.19
0.23
0.11
0.14
0.04
0.33
0.14
0.73
0.49
0.06
0.06
0.05
0.25
0.16
0.56
0.01
0.03
0.02
0.14
0.17
0.04
0.05
0.01
0.47
0.32
0.15
0.10
0.27
0.29
0.20
1.10
0.71
In binary and binomial models, we model the response probabilities as functions of the predictors. A probability has range $0 \le p \le 1$. Since the linear
predictor $X\beta$ can take on any value on the real line, we would like the model
to use a link $g(p)$ that transforms a probability to the range $(-\infty, \infty)$. Three
different functions are often used for this purpose: the probit link; the logit
link; and the complementary log-log link. We will briefly discuss some arguments related to the choice of link for binary and binomial data.
5.1
Link functions
5.1.1
$$\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-u^2/2}\,du.$$ (5.1)
5.1.2
$$\operatorname{logit}(p) = \log\frac{p}{1 - p}.$$ (5.2)
The ratio $\frac{p}{1-p}$ is the odds of success, so the logit is often called the log odds.
The logit function is a sigmoid function that is symmetric around 0. The
logistic link is rather close to the probit link, and since it is easier to handle
mathematically, some authors prefer it to the probit link. The logit link
is the canonical link for the binomial distribution so it is often the natural
choice of link for binary and binomial data. The logit link corresponds to a
tolerance distribution that is called the logistic distribution. This distribution
has density
$$f(y) = \frac{\beta e^{\alpha + \beta y}}{\left[1 + e^{\alpha + \beta y}\right]^2}.$$ (5.3)
5.1.3
This is the complementary log-log link: $\log[-\log(1 - p_i)]$. As opposed to the
probit and logit links, this function is asymmetric around 0. The tolerance
distribution that corresponds to the complementary log-log link is called the
extreme value distribution, or a Gumbel distribution, and has density
$$f(y) = \beta \exp\left[(\alpha + \beta y) - e^{(\alpha + \beta y)}\right].$$
The probit, logit and complementary log-log links are compared in Figure
5.2.
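The three links are easy to evaluate numerically, which makes their symmetry properties visible. The following Python sketch uses only the standard library; the function names are chosen for this illustration.

```python
import math
from statistics import NormalDist

def probit(p):
    """Inverse of the standard Normal distribution function."""
    return NormalDist().inv_cdf(p)

def logit(p):
    """Log odds: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def cloglog(p):
    """Complementary log-log: log(-log(1 - p))."""
    return math.log(-math.log(1.0 - p))

for p in (0.1, 0.5, 0.9):
    print(p, round(probit(p), 3), round(logit(p), 3), round(cloglog(p), 3))
```

The probit and logit values at $p = 0.1$ and $p = 0.9$ differ only in sign, while the complementary log-log values do not, and the complementary log-log link is not 0 at $p = 0.5$.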
5.2
5.2.1
A binary random variable that takes on the values 1 and 0 with probabilities
$p$ and $1 - p$, respectively, is said to follow a Bernoulli distribution. The
probability function of a Bernoulli random variable $y$ is
$$f(y) = p^y (1 - p)^{1 - y} = \begin{cases} 1 - p & \text{if } y = 0 \\ p & \text{if } y = 1 \end{cases}$$ (5.4)
The Bernoulli distribution has mean value $E(y) = p$ and variance $Var(y) = p(1 - p)$.
Figure 5.2: Transformed value plotted against probability for the probit, logit and complementary log-log links.
5.2.2
If a Bernoulli trial is repeated $n$ times such that the trials are independent,
then $y$ = the number of successes (1:s) among the $n$ trials follows a binomial distribution with parameters $n$ and $p$. The probability function of the
binomial distribution is
$$f(y) = \binom{n}{y} p^y (1 - p)^{n - y}.$$ (5.5)
The binomial distribution has mean $E(y) = np$ and variance $Var(y) = np(1 - p)$.
The proportion of successes, $\hat{p} = y/n$, follows the same distribution, except for
a scale factor: $f(y) = f(\hat{p})$. It holds that $E(\hat{p}) = p$ and $Var(\hat{p}) = p(1 - p)/n$.
As was demonstrated in formula (2.6) on page 37, the binomial distribution
is a member of the exponential family. Since the Bernoulli distribution is
a special case of the binomial distribution with $n = 1$, the Bernoulli
distribution is also an exponential family distribution.
When the binomial distribution is applied for modeling real data, the crucial
assumption is the assumption of independence. If independence does not
hold, this can often be diagnosed as over-dispersion.
5.3
Probit analysis
Log(Conc)    No. of insects    No. affected    % affected
     1.01                50              44            88
     0.89                49              42            86
     0.71                46              24            52
     0.58                48              16            33
     0.41                50               6            12
A plot of the relation between the proportion of affected insects and Log(Conc)
is given below. A fitted distribution is also included.
[Figure: proportion of affected insects plotted against Log(Conc), with a fitted distribution]
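DATA probit;
  INPUT conc n x;           /* concentration, insects exposed, insects affected */
  logconc=log10(conc);      /* dose on the log10 scale */
  CARDS;
10.2 50 44
7.7 49 42
5.1 46 24
3.8 48 16
2.6 50 6
;
PROC GENMOD DATA=probit;
  MODEL x/n = logconc /     /* events/trials form of the response */
    LINK = probit
    DIST = bin
  ;
RUN;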
                      Value
Data Set              WORK.PROBIT
Distribution          BINOMIAL
Link Function         PROBIT
Dependent Variable    X
Dependent Variable    N
Observations Used     5
Number Of Events      132
Number Of Trials      243
This part simply gives us confirmation that we are using a Binomial distribution and a Probit link.
Criteria For Assessing Goodness Of Fit
Criterion             DF        Value    Value/DF
Deviance               3       1.7390      0.5797
Scaled Deviance        3       1.7390      0.5797
Pearson Chi-Square     3       1.7289      0.5763
Scaled Pearson X2      3       1.7289      0.5763
Log Likelihood         .    -120.0516           .
This section gives information about the fit of the model to the data. The
deviance can be interpreted as a $\chi^2$ variate on 3 degrees of freedom, if the
sample is large. In this case, the value is 1.74 which is clearly non-significant,
indicating a good fit. Collett (1991) states that a useful rule of thumb is
Parameter    DF    Estimate    Std Err    ChiSquare    Pr>Chi
INTERCEPT     1     -2.8875     0.3501      68.0085    0.0001
LOGCONC       1      4.2132     0.4783      77.5919    0.0001
SCALE         0      1.0000     0.0000            .         .
$-\frac{\hat{\beta}_0}{\hat{\beta}_1} = \frac{2.8875}{4.2132} = 0.68535$ giving
5.4
Example 5.2 Since the logit and probit links are very similar, we can alternatively analyze the data in Table 5.1 using a binomial distribution with
a logit link function. The program and part of the output are similar to the
probit analysis. The fit of the model is excellent, as for the probit analysis
case:
Criteria For Assessing Goodness Of Fit
Criterion             DF        Value    Value/DF
Deviance               3       1.4241      0.4747
Scaled Deviance        3       1.4241      0.4747
Pearson Chi-Square     3       1.4218      0.4739
Scaled Pearson X2      3       1.4218      0.4739
Log Likelihood         .    -119.8942           .
The parameter estimates are given by the last part of the output:
Analysis Of Parameter Estimates
Parameter    DF    Estimate    Std Err    ChiSquare    Pr>Chi
INTERCEPT     1     -4.8869     0.6429      57.7757    0.0001
LOGCONC       1      7.1462     0.8928      64.0744    0.0001
SCALE         0      1.0000     0.0000            .         .
$$\log\frac{\hat{p}}{1 - \hat{p}} = -4.8869 + 7.1462 \log(conc).$$
The estimated model parameters permit us to estimate e.g. the dose that
gives a 50% effect ($ED_{50}$) as the value of log(conc) for which $p = 0.5$. Since
$\log\frac{0.5}{1 - 0.5} = 0$, this value is
$$ED_{50} = -\frac{\hat{\beta}_0}{\hat{\beta}_1} = \frac{4.8869}{7.1462} = 0.6839$$
which, on the dose scale, is $10^{0.6839} = 4.83$. This is similar to the estimate
provided by the probit analysis. Note that the estimated proportion affected
at a given concentration can be obtained from
$$\hat{p} = \frac{\exp(-4.8869 + 7.1462 \log(conc))}{1 + \exp(-4.8869 + 7.1462 \log(conc))}.$$
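The calculation of $ED_{50}$ and of predicted proportions from the logit model can be sketched in Python. The estimates are those printed in the output above; the function names are hypothetical, and log is taken to be the base-10 logarithm, as in the program that created logconc.

```python
import math

# Parameter estimates from the logit analysis above.
B0, B1 = -4.8869, 7.1462

def log_ed50(b0, b1):
    """Value of log10(conc) at which the predicted proportion is 0.5."""
    return -b0 / b1

def predicted_proportion(conc, b0=B0, b1=B1):
    """Inverse logit of the linear predictor b0 + b1 * log10(conc)."""
    eta = b0 + b1 * math.log10(conc)
    return 1.0 / (1.0 + math.exp(-eta))

log_dose = log_ed50(B0, B1)
ed50 = 10.0 ** log_dose
print(round(log_dose, 4), round(ed50, 2))
print(round(predicted_proportion(10.2), 2))
```

Evaluating the predicted proportion at the estimated $ED_{50}$ returns 0.5, which is a useful check on the arithmetic.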
5.5
5.5.1
Model building in multiple logistic regression models can be done in essentially the same way as in standard multiple regression.
Example 5.3 The data in Table 5.1, taken from Collett (1991), were collected to explore whether it was possible to diagnose nodal involvement in
The data analytic task is to explore whether the independent variables can be
used to predict the probability of nodal involvement. We have two continuous
covariates and three covariates coded as dummy variables. Initial analysis of
the data suggests that the value of Acid should be log-transformed prior to
the analysis.
There are 32 possible linear logistic models, excluding interactions. As a first
step in the analysis, all these models were fitted to the data. A summary of
the results is given in Table 5.2.
A useful rule-of-thumb in model building is to keep in the model all terms
that are significant at, say, the 20% level. In this case, a kind of backward
elimination process would start with the full model. We would then delete
Grade from the model (p = 0.29). In the model with Age, log(acid), x-ray
and size, age is not significant (p = 0.26). This suggests a model that includes
log(acid), x-ray and size; in this model, all terms are significant (p < 0.05).
There are no indications of non-linear relations between log(acid) and the
probability of nodal involvement. It remains to investigate whether any interactions between the terms in the model would improve the fit. To check
this, interaction terms were added to the full model. Since there are five
variables, the model was tested with all 10 possible pairwise interactions.
The interactions size*grade (p = 0.01) and logacid*grade (p = 0.10) were
judged to be large enough for further consideration. Note that grade was not
suggested by the analysis until the interactions were included. We then tried
a model with both these interactions. Age could be deleted. The resulting
model includes logacid (p = 0.06), x-ray (p = 0.03), size (p = 0.21), grade
(p = 0.19), logacid*grade (p = 0.11), and size*grade (p = 0.02). Part of the
output is:
Criteria For Assessing Goodness Of Fit

Criterion             DF       Value    Value/DF
Deviance              46     36.2871      0.7889
Scaled Deviance       46     36.2871      0.7889
Pearson Chi-Square    46     42.7826      0.9301
Scaled Pearson X2     46     42.7826      0.9301
Log Likelihood         .    -18.1436           .
[Table 5.1: columns Age, Acid, Xray, Size, Grade, Involv. Only the Age and Acid columns are legible here.]

Age   Acid      Age   Acid
 66   0.48       64   0.40
 68   0.56       61   0.50
 66   0.50       64   0.50
 56   0.52       63   0.40
 58   0.50       52   0.55
 60   0.49       66   0.59
 65   0.46       58   0.48
 60   0.62       57   0.51
 50   0.56       65   0.49
 49   0.55       65   0.48
 61   0.62       59   0.63
 58   0.71       61   1.02
 51   0.65       53   0.76
 67   0.67       67   0.95
 67   0.47       53   0.66
 51   0.49       65   0.84
 56   0.50       50   0.81
 60   0.78       60   0.76
 52   0.83       45   0.70
 56   0.98       56   0.78
 67   0.52       46   0.70
 63   0.75       67   0.67
 59   0.99       63   0.82
 64   1.87       57   0.67
 61   1.36       51   0.72
 56   0.82       64   0.89
                 68   1.26
Terms                                    Deviance    df
(Intercept only)                            70.25    52
Age                                         69.16    51
log(acid)                                   64.81    51
Xray                                        59.00    51
Size                                        62.55    51
Grade                                       66.20    51
Age, log(acid)                              63.65    50
Age, x-ray                                  57.66    50
Age, size                                   61.43    50
Age, grade                                  65.24    50
log(acid), x-ray                            55.27    50
log(acid), size                             56.48    50
log(acid), grade                            59.55    50
x-ray, size                                 53.35    50
x-ray, grade                                56.70    50
size, grade                                 61.30    50
Age, log(acid), x-ray                       53.78    49
Age, log(acid), size                        55.22    49
Age, log(acid), grade                       58.52    49
Age, x-ray, size                            52.09    49
Age, x-ray, grade                           55.49    49
Age, size, grade                            60.28    49
log(acid), x-ray, size                      48.99    49
log(acid), x-ray, grade                     52.03    49
log(acid), size, grade                      54.51    49
x-ray, size, grade                          52.78    49
age, log(acid), x-ray, size                 47.68    48
age, log(acid), x-ray, grade                50.79    48
age, log(acid), size, grade                 53.38    48
log(acid), x-ray, size, grade               47.78    48
age, x-ray, size, grade                     51.57    48
age, log(acid), x-ray, size, grade          46.56    47
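The deviances in the table above can be used to compare nested models: the difference in deviance between two models that differ by one term is approximately chi-square distributed with 1 df. The following Python sketch illustrates this for the backward elimination steps described in the text. It uses likelihood ratio tests, so the p-values need not agree exactly with Wald-type p-values; the helper name chi2_sf_1df is introduced here for illustration.

```python
import math

def chi2_sf_1df(x):
    """Upper tail probability of a chi-square variate with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

# Deviances copied from the table of fitted models.
dev_full = 46.56       # age, log(acid), x-ray, size, grade (47 df)
dev_no_grade = 47.68   # age, log(acid), x-ray, size        (48 df)
dev_final = 48.99      # log(acid), x-ray, size             (49 df)

# Step 1: is Grade needed in the full model?
p_grade = chi2_sf_1df(dev_no_grade - dev_full)
# Step 2: is Age needed once Grade has been removed?
p_age = chi2_sf_1df(dev_final - dev_no_grade)

print(round(p_grade, 2), round(p_age, 2))
```

Both p-values exceed the 20% limit used as a rule of thumb, which agrees with the decision to drop Grade and then Age.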
Parameter                 DF    Estimate    Std Err    ChiSquare    Pr>Chi
INTERCEPT                  1      7.2391     3.4133       4.4980    0.0339
LOGACID                    1     12.1345     6.5154       3.4686    0.0625
XRAY             0         1     -2.3404     1.0845       4.6571    0.0309
XRAY             1         0      0.0000     0.0000            .         .
SIZE             0         1      2.5098     2.0218       1.5410    0.2145
SIZE             1         0      0.0000     0.0000            .         .
GRADE            0         1     -4.3134     3.2696       1.7404    0.1871
GRADE            1         0      0.0000     0.0000            .         .
LOGACID*GRADE    0         1    -10.4260     6.6403       2.4652    0.1164
LOGACID*GRADE    1         0      0.0000     0.0000            .         .
SIZE*GRADE       0    0    1     -5.6477     2.4346       5.3814    0.0204
SIZE*GRADE       0    1    0      0.0000     0.0000            .         .
SIZE*GRADE       1    0    0      0.0000     0.0000            .         .
SIZE*GRADE       1    1    0      0.0000     0.0000            .         .
SCALE                      0      1.0000     0.0000            .         .
The model fits well, with Deviance/df=0.79. Since the size*grade interaction
is included in the model, the main effects of size and of grade should also
be included. The output suggests the following models for grade 0 and 1,
respectively:
Grade 0: logit($\hat{p}$) = 2.93 + 1.71 log(acid) - 2.34 x-ray - 3.14 size
Grade 1: logit($\hat{p}$) = 7.24 + 12.13 log(acid) - 2.34 x-ray + 2.51 size
The probability of nodal involvement increases with increasing acid level.
The increase is higher for patients with serious (grade 1) tumors.
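The two grade-specific equations follow from adding the relevant interaction estimates to the main-effect estimates. The Python sketch below reproduces this arithmetic from the parameter table; the dictionary keys are labels chosen here, and it is assumed that each listed coefficient refers to level 0 of the factor, with level 1 as the reference level (as the zero estimates in the output indicate).

```python
# Estimates copied from the parameter table; level 1 is the reference level,
# so each coefficient below applies to level 0 of the corresponding factor.
est = {
    "intercept": 7.2391,
    "logacid": 12.1345,
    "xray0": -2.3404,
    "size0": 2.5098,
    "grade0": -4.3134,
    "logacid*grade0": -10.4260,
    "size0*grade0": -5.6477,
}

def coefficients(grade):
    """Intercept and slopes of the fitted logit for one grade (0 or 1)."""
    g0 = 1 if grade == 0 else 0  # indicator of the non-reference grade
    return {
        "intercept": est["intercept"] + g0 * est["grade0"],
        "logacid": est["logacid"] + g0 * est["logacid*grade0"],
        "xray0": est["xray0"],
        "size0": est["size0"] + g0 * est["size0*grade0"],
    }

for grade in (0, 1):
    print(grade, {k: round(v, 2) for k, v in coefficients(grade).items()})
```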
5.5.2
A set of tools for model building in logistic regression has been developed.
These tools are similar to the tools used in multiple regression analysis. The
Logistic procedure in the SAS package includes the following variable selection methods:
Forward selection: Starting with an empty model, the procedure adds, at
each step, the variable that would give the lowest p-value of the remaining
variables. The procedure stops when all variables have been added, or
when no variables meet the pre-specified limit for the p-value.
Backward selection: Starting with a model containing all variables, variables
are step by step deleted from the model until all variables remaining in
the model meet a specified limit for their p-values. At each step, the
variable with the largest p-value is deleted.
[Figure 5.3: Minitab residual plots. Panels: Residual against Normal Score; I Chart of Residuals (Residual against Observation Number); Histogram of Residuals (Frequency against Residual); Residual against Fit.]
5.5.3
Model diagnostics
As an illustration of model diagnostics for logistic regression models, the predicted values and the residuals were stored as new variables for the multiple
logistic regression data (Table 5.1). Based on these, a set of standard diagnostic plots was prepared using Minitab. These plots are reproduced in
Figure 5.3.
5.6
Odds ratios
If an event occurs with probability p, then the odds in favor of the event is
\[ \text{Odds} = \frac{p}{1 - p}. \qquad (5.6) \]
For example, if an event occurs with probability p = 0.75, then the odds in
favor of that event is 0.75/(1 - 0.75) = 3. This means that a success is
three times as likely as a failure. If the odds are known, the probability p
can be calculated as $p = \frac{\text{Odds}}{\text{Odds} + 1}$.
A comparison between two events, or a comparison between e.g. two groups
of individuals with respect to some event, can be made by computing the
odds ratio
\[ OR = \frac{p_1 / (1 - p_1)}{p_2 / (1 - p_2)}. \qquad (5.7) \]
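As a small illustration of (5.6) and (5.7), the conversions between probabilities, odds and odds ratios can be written as three one-line functions. This is a sketch in Python; the function names are chosen here for illustration.

```python
def odds(p):
    """Odds in favor of an event with probability p, as in (5.6)."""
    return p / (1.0 - p)

def prob_from_odds(o):
    """Probability recovered from the odds."""
    return o / (o + 1.0)

def odds_ratio(p1, p2):
    """Odds ratio comparing two probabilities, as in (5.7)."""
    return odds(p1) / odds(p2)

print(odds(0.75))
print(prob_from_odds(3.0))
print(odds_ratio(0.75, 0.5))
```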
               Survived
Smoking       Yes      No
Yes           499      15
No           4327      74
Odds ratios can be estimated using logistic regression. Note that in logistic
regression we use the model $\log \frac{p}{1-p} = \alpha + \beta x$ where, in this case, x is a
dummy variable with value 1 for the smokers and 0 for the nonsmokers. This
is the log of the odds, so the odds is $\exp(\alpha + \beta x)$. Using x = 1 in the
numerator and x = 0 in the denominator gives the odds ratio as
\[ OR = \frac{\exp(\alpha + \beta)}{\exp(\alpha)} = e^{\beta} \]
Thus, the odds ratio can be obtained by exponentiating the regression coefficient in a logistic regression. We will use the same data as in the previous
example to illustrate this. A SAS program for this analysis can be written
as follows:
DATA babies;
INPUT smoking $ survival $ n;
CARDS;
Yes Yes 499
Yes No 15
No Yes 4327
No No 74
;
PROC GENMOD DATA=babies order=data;
CLASS smoking;
FREQ n;
MODEL survival = smoking /
dist=bin link=logit ;
RUN;
Criterion             DF        Value    Value/DF
Deviance            4913     886.9891      0.1805
Scaled Deviance     4913     886.9891      0.1805
Pearson Chi-Square  4913    4914.9964      1.0004
Scaled Pearson X2   4913    4914.9964      1.0004
Log Likelihood              -443.4945

                                  Standard        Wald 95%
Parameter         DF   Estimate      Error   Confidence Limits   ChiSquare   Pr > ChiSq
Intercept          1    -4.0686     0.1172   -4.2983   -3.8388     1204.34       <.0001
smoking    Yes     1     0.5640     0.2871    0.0013    1.1267        3.86       0.0495
smoking    No      0     0.0000     0.0000    0.0000    0.0000           .            .
Scale              0     1.0000     0.0000    1.0000    1.0000
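The estimate for smoking can be checked against the raw counts. The sketch below (Python, illustrative variable names) computes the odds ratio directly from the four frequencies and compares it with the exponentiated coefficient and its Wald limits. Since the intercept equals log(74/4327), the fitted model describes the odds of non-survival, and the same event is used here.

```python
import math

# Counts copied from the DATA step: (smoking, survival) -> n
counts = {("Yes", "Yes"): 499, ("Yes", "No"): 15,
          ("No", "Yes"): 4327, ("No", "No"): 74}

# Odds of non-survival in each group
odds_smokers = counts[("Yes", "No")] / counts[("Yes", "Yes")]
odds_nonsmokers = counts[("No", "No")] / counts[("No", "Yes")]
or_table = odds_smokers / odds_nonsmokers

# Estimate and Wald 95% limits for "smoking Yes" from the output
beta, lower, upper = 0.5640, 0.0013, 1.1267
or_model = math.exp(beta)
or_limits = (math.exp(lower), math.exp(upper))

print(round(or_table, 3), round(or_model, 3))
print(tuple(round(v, 2) for v in or_limits))
```

The lower limit for the odds ratio is only just above 1, in line with the p-value of 0.0495.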
5.7
Overdispersion in binary/binomial models
Overdispersion means that the variance of the response is larger than would
be expected for the chosen model. For binomial models, the variance of
y = number of successes is $np(1 - p)$, and the variance of $\hat{p} = y/n$ is $p(1 - p)/n$. A
simple way to illustrate overdispersion is to consider a simple dose-response
experiment where the same dose has been used on two batches of animals.
Suppose that the chosen dose has effect on 10 out of 50 animals in one
of the replications, and on 20 out of 50 animals in the other replication.
This means that there is actually a significant difference between the two
replications (p = 0.029). In other less extreme cases, there may be a tendency
for the responses to differ, even if the results are not significantly different
at any given dose. Still, when all replications are considered together, a
value of the Deviance/df statistic appreciably above unity may indicate that
overdispersion is present in the data.
5.7.1
\[ \hat{\sigma}^2 = \frac{1}{r - 1} \sum_{j=1}^{r} \frac{(y_j - n_j \hat{p})^2}{n_j \hat{p} (1 - \hat{p})} \qquad (5.9) \]
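For the two replications mentioned above (10 out of 50 and 20 out of 50 animals affected), the calculation can be sketched as follows in Python. The sketch assumes the Pearson form of (5.9), with squared deviations in the numerator; the variable names are illustrative.

```python
import math

# Replicates from the text: (number affected y_j, batch size n_j)
batches = [(10, 50), (20, 50)]

r = len(batches)
p_hat = sum(y for y, _ in batches) / sum(n for _, n in batches)

# Sum of squared deviations, each scaled by its binomial variance
x2 = sum((y - n * p_hat) ** 2 / (n * p_hat * (1 - p_hat)) for y, n in batches)
heterogeneity = x2 / (r - 1)

# With r - 1 = 1 df the chi-square tail probability has a closed form
p_value = math.erfc(math.sqrt(x2 / 2.0))

print(round(heterogeneity, 2), round(p_value, 3))
```

The p-value agrees with the value p = 0.029 quoted for the difference between the two replications.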
5.7.2
If one can suspect some form of clustering, another approach to the modeling is to assume that y follows a binomial distribution within clusters but
that the parameter p follows some random distribution over clusters. If the
distribution of p is known, the distribution of y will be a so called compound
distribution which can be derived. A rather simple case is obtained when
the distribution of p is a Beta distribution. Then, the distribution of y will
follow a distribution called the Beta-binomial distribution. However, this
distribution is not at present available in the Genmod procedure.
Estimation using Quasi-likelihood methods is an alternative approach to
modeling overdispersion. This is discussed in Chapter 8.
5.7.3
O. aegyptiaca 73
    Bean        Cucumber
  y     n       y     n
  8    16       3    12
 10    30      22    41
  8    28      15    30
 23    45      32    51
  0     4       3     7
It was of interest to compare the two varieties, and also to compare the two
types of host plants. An analysis of these data using a binomial distribution
with a logit link revealed that an interaction term was needed. Part of the
output for a model containing Variety, Host and Variety*Host, is given below.
The model does not fit well (Deviance=33.2778 on 17 df , p = 0.01). The ratio
Deviance/df is nearly 2, indicating that overdispersion may be present.
Criteria For Assessing Goodness Of Fit

Criterion             DF        Value    Value/DF
Deviance              17      33.2778      1.9575
Scaled Deviance       17      33.2778      1.9575
Pearson Chi-Square    17      31.6511      1.8618
Scaled Pearson X2     17      31.6511      1.8618
Log Likelihood               -543.1106

Algorithm converged.

Analysis Of Parameter Estimates

                                             Standard
Parameter                     DF   Estimate     Error
Intercept                      1     0.7600    0.1250
variety        73              1    -0.6322    0.2100
variety        75              0     0.0000    0.0000
host           Bean            1    -1.3182    0.1775
host           Cucumber        0     0.0000    0.0000
variety*host   73  Bean        1     0.7781    0.3064
variety*host   73  Cucumber    0     0.0000    0.0000
variety*host   75  Bean        0     0.0000    0.0000
variety*host   75  Cucumber    0     0.0000    0.0000
Scale                          0     1.0000    0.0000

Source          DF   ChiSquare   Pr > ChiSq
variety          1        2.53       0.1121
host             1       37.48       <.0001
variety*host     1        6.41       0.0114
As a second analysis, the data were analyzed using the automatic feature in
Genmod to estimate the scale parameter from the data using the Maximum
Likelihood method. Part of the output was as follows:
Criteria For Assessing Goodness Of Fit

Criterion             DF        Value    Value/DF
Deviance              17      33.2778      1.9575
Scaled Deviance       17      17.0000      1.0000
Pearson Chi-Square    17      31.6511      1.8618
Scaled Pearson X2     17      16.1690      0.9511
Log Likelihood               -277.4487
The procedure now uses a scaled deviance of 1.00. The parameter estimates
are identical to those of the previous analysis, but the estimated standard
errors are larger when we include a scale parameter. This has the effect that
the Variety*Host interaction is no longer significant.
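The effect of the scale parameter can be traced numerically. In the output of the second analysis, the reported scale value coincides with the square root of Deviance/df from the first analysis, the standard errors are multiplied by this value, and the test statistics are divided by its square. The Python sketch below reproduces these relations from the printed (rounded) values; the dictionary labels are chosen here for illustration.

```python
import math

deviance, df = 33.2778, 17
dispersion = deviance / df
scale = math.sqrt(dispersion)

# Standard errors from the first analysis (scale fixed at 1)
se_unscaled = {"Intercept": 0.1250, "variety 73": 0.2100,
               "host Bean": 0.1775, "variety*host 73 Bean": 0.3064}
se_scaled = {name: se * scale for name, se in se_unscaled.items()}

# Chi-square statistics from the first analysis, rescaled by the dispersion
chisq_unscaled = {"variety": 2.53, "host": 37.48, "variety*host": 6.41}
stat_scaled = {name: x / dispersion for name, x in chisq_unscaled.items()}

print(round(scale, 4))
print({k: round(v, 4) for k, v in se_scaled.items()})
print({k: round(v, 2) for k, v in stat_scaled.items()})
```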
Analysis Of Parameter Estimates

                                             Standard
Parameter                     DF   Estimate     Error
Intercept                      1     0.7600    0.1748
variety        73              1    -0.6322    0.2938
variety        75              0     0.0000    0.0000
host           Bean            1    -1.3182    0.2483
host           Cucumber        0     0.0000    0.0000
variety*host   73  Bean        1     0.7781    0.4287
variety*host   73  Cucumber    0     0.0000    0.0000
variety*host   75  Bean        0     0.0000    0.0000
variety*host   75  Cucumber    0     0.0000    0.0000
Scale                          0     1.3991    0.0000

Source          Num DF   Den DF   F Value   Pr > F   ChiSquare   Pr > ChiSq
variety              1       17      1.29   0.2718        1.29       0.2561
host                 1       17     19.15   0.0004       19.15       <.0001
variety*host         1       17      3.27   0.0881        3.27        0.070
5.8
Exercises
Exercise 5.1
                            Species A       Species B
Exposure  Rel. Hum  Temp   Deaths    N     Deaths    N
   1        60.0     10       0     20        0     20
   1        60.0     15       0     20        0     20
   1        60.0     20       0     20        0     20
   1        65.8     10       0     20        0     20
   1        65.8     15       0     20        0     20
   1        65.8     20       0     20        0     20
   1        70.5     10       0     20        0     20
   1        70.5     15       0     20        0     20
   1        70.5     20       0     20        0     20
   1        75.8     10       0     20        0     20
   1        75.8     15       0     20        0     20
   1        75.8     20       0     20        0     20
   2        60.0     10       0     20        0     20
   2        60.0     15       1     20        3     20
   2        60.0     20       1     20        2     20
   2        65.8     10       0     20        0     20
   2        65.8     15       1     20        2     20
   2        65.8     20       0     20        1     20
   2        70.5     10       0     20        0     20
   2        70.5     15       0     20        0     20
   2        70.5     20       0     20        1     20
   2        75.8     10       0     20        1     20
   2        75.8     15       0     20        0     20
   2        75.8     20       0     20        1     20
   3        60.0     10       1     20        7     20
   3        60.0     15       4     20       11     20
   3        60.0     20       5     20       11     20
   3        65.8     10       0     20        4     20
   3        65.8     15       2     20        5     20
   3        65.8     20       4     20        9     20
   3        70.5     10       0     20        2     20
   3        70.5     15       2     20        4     20
   3        70.5     20       3     20        6     20
   3        75.8     10       0     20        2     20
   3        75.8     15       1     20        3     20
   3        75.8     20       2     20        5     20
   4        60.0     10       7     20       12     20
   4        60.0     15       7     20       14     20
   4        60.0     20       7     20       16     20
   4        65.8     10       4     20       10     20
   4        65.8     15       4     20       12     20
   4        65.8     20       7     20       12     20
   4        70.5     10       3     20        5     20
   4        70.5     15       3     20        7     20
   4        70.5     20       5     20        9     20
   4        75.8     10       2     20        4     20
   4        75.8     15       3     20        5     20
   4        75.8     20       3     20        7     20
The data set given above contains data from an experiment studying the
survival of snails. Groups of 20 snails were held for periods of 1, 2, 3 or 4 weeks
under controlled conditions, where temperature and humidity were kept at
assigned levels. The snails were of two species (A or B). The experiment was
Snail species A or B
Exposure in weeks (1, 2, 3 or 4)
Relative humidity (four levels)
Temperature in degrees Celsius (3 levels)
Number of deaths
Number of snails exposed
Analyze these data to find whether Exposure, Humidity, Temp, or interactions between these have any effects on survival probability. Also, carry out
residual diagnostics and leverage diagnostics.
Exercise 5.2 The file Ex5_2.dat gives the following information about passengers travelling on the Titanic when it sank in 1912. Background material
for the data can be found on http://www.encyclopedia-titanica.org.
Name
PClass
Age
Sex
Survived
Find a model that can predict probability of survival as functions of the given
covariates, and possible interactions. Note that some age data are missing.
Exercise 5.3 Finney (1947) reported some data on the relative potencies of
Rotenone, Deguelin, and a mixture of these. Batches of insects were subjected
to these treatments, in different concentrations, and the number of dead
insects was recorded. The raw data are:
Treatment    ln(dose)     n    Dead
Rotenone       1.01      50     44
               0.89      49     42
               0.71      46     24
               0.58      48     16
               0.41      50      6
Deguelin       1.70      48     48
               1.61      50     47
               1.48      49     47
               1.31      48     34
               1.00      48     18
               0.71      49     16
Mixture        1.40      50     48
               1.31      46     43
               1.18      48     38
               1.00      46     27
               0.71      46     22
               0.40      47      7
Analyze these data. In particular, examine whether the regression lines can
be assumed to be parallel.
Exercise 5.4 Fahrmeir & Tutz (2001) report some data on the risk of infection from births by Caesarian section. The response variable of interest
is the occurrence of infections following the operation. Three dichotomous
covariates that might affect the risk of infection were studied:
planned
risk
antibio
The data are included in the following SAS program that also gives the value
of the variable infection (1=infection, 0=no infection). The variable wt is the
number of observations with a given combination of the other variables. Thus,
for example, there were 17 un-infected cases (infection=0) with planned=1,
risk=1, and antibio=1.
Criterion             DF        Value    Value/DF
Deviance               8     226.5177     28.3147
Scaled Deviance        8     226.5177     28.3147
Pearson Chi-Square     8     257.2508     32.1563
Scaled Pearson X2      8     257.2508     32.1563
Log Likelihood              -113.2588

Model 2
Criteria For Assessing Goodness Of Fit

Criterion             DF        Value    Value/DF
Deviance               7     226.4393     32.3485
Scaled Deviance        7     226.4393     32.3485
Pearson Chi-Square     7     254.7440     36.3920
Scaled Pearson X2      7     254.7440     36.3920
Log Likelihood              -113.2196

Model 3
Criteria For Assessing Goodness Of Fit

Criterion             DF        Value    Value/DF
Deviance               7     216.4759     30.9251
Scaled Deviance        7     216.4759     30.9251
Pearson Chi-Square     7     261.4010     37.3430
Scaled Pearson X2      7     261.4010     37.3430
Log Likelihood              -108.2380
One problem with Model 3 is that no standard error, and no test, of the
parameter for the planned*risk interaction is given by SAS. This is because
the likelihood is rather flat which, in turn, depends on cells with observed
count = 0. Therefore, Model 4 used the same model as Model 3, but with
all values where Wt=0 replaced by Wt=0.5.
Model 4
Criteria For Assessing Goodness Of Fit

Criterion             DF        Value    Value/DF
Deviance              11     231.5512     21.0501
Scaled Deviance       11     231.5512     21.0501
Pearson Chi-Square    11     446.4616     40.5874
Scaled Pearson X2     11     446.4616     40.5874
Log Likelihood              -115.7756

Algorithm converged.

Analysis Of Parameter Estimates

                               Standard   Wald 95%
Parameter       DF   Estimate     Error   Upper Limit   ChiSquare   Pr > ChiSq
Intercept        1     2.1440    1.0568        4.2152        4.12       0.0425
planned          1    -0.8311    1.1251        1.3742        0.55       0.4601
antibio          1     3.4991    0.5536        4.5840       39.95       <.0001
risk             1    -3.7172    1.1637       -1.4364       10.20       0.0014
planned*risk     1     2.4394    1.2477        4.8848        3.82       0.0506
Scale            0     1.0000    0.0000        1.0000
The blood pressure (Z) was measured at the beginning of the experiment on
each patient.
A binary response variable (Y) was measured on each patient at the end
of the experiment. It was modeled as $g(\mu) = XB$, using some computer
package. The final model included the main effects of A and B and their
interaction, and the effect of the covariate Z. The slope for Z was different
for different treatments, but not for the different patient groups or for the
A*B interaction.
A. Write down the complete design matrix X. You should include all dummy
variables, even those that are redundant. Covariate values should be represented by some symbol. Also write down the corresponding parameter vector
B.
B. The link function used to analyze these data was the logit link $g(p) = \log \frac{p}{1-p}$. What is the inverse $g^{-1}$ of this link function?
6. Response variables as counts
6.1
Count data can be summarized in the form of frequency tables or as contingency tables. The data are then given as the number of observations with
each combination of values of some categorical variables. We will first look
at a simple example with a contingency table of dimension 2 × 2.
Example 6.1 Norton and Dunn (1985) studied possible relations between
snoring and heart problems. For 2484 persons it was recorded whether the
person had any heart problems and whether the person was a snorer. An
interesting question is then whether there is any relation between snoring
and heart problems. The data are as follows:
                       Snores
Heart problems    Seldom    Often    Total
Yes                   59       51      110
No                  1958      416     2374
Total               2017      467     2484
We assume that the persons in the sample constitute a random sample from
some population. Denote with pij the probability that a randomly selected
person belongs to row category i and column category j of the table. This
can be summarized as follows:
                       Snores
Heart problems    Seldom    Often    Total
Yes                  p11      p12      p1·
No                   p21      p22      p2·
Total                p·1      p·2        1
6.1.1
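\[
\begin{pmatrix} \log(\mu_{11}) \\ \log(\mu_{12}) \\ \log(\mu_{21}) \\ \log(\mu_{22}) \end{pmatrix}
=
\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}
\begin{pmatrix} \mu \\ \alpha_1 \\ \beta_1 \end{pmatrix}.
\qquad (6.2)
\]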
6.1.2
If independence does not hold we need to include in the model terms of type
$(\alpha\beta)_{ij}$ that account for the dependence. The terms $(\alpha\beta)_{ij}$ represent interaction between the factors A and B, i.e. the effect of one variable depends on
the level of the other variable. Then the model becomes
\[ \log \mu_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}. \qquad (6.3) \]
6.2
So far, we have seen that a model for the expected frequencies in a crosstable
can be formulated as a log-linear model. This model has the following properties:
The predictor is a linear predictor of the same type as in ANOVA.
The link function is a log function.
It remains to discuss what distributional assumptions to use.
6.2.1
\[ \frac{n!}{n_1!\, n_2! \cdots n_k!}\; p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k} \qquad (6.4) \]
6.2.2
A contingency table may have some of its totals fixed by the design of the
data collection. For example, 500 males and 500 females might have been
interviewed in a survey. In such cases it is not meaningful to talk about the
random distribution of the gender variable. For such data each slice of
the table subdivided by gender may be seen as one realization of a multinomial distribution. The joint distribution of all cells of the table is then the
product of several multinomial distributions, one for each slice. This joint
distribution is called the product multinomial distribution.
6.2.3
\[ \frac{e^{-\mu_i} \mu_i^{n_i}}{n_i!} \qquad (6.5) \]
6.2.4
Contingency tables can be of many different types. In some cases, the total
sample size is fixed; an example is when it has been decided that n = 1000
individuals will be interviewed about some political question. In some cases
even some of the margins of the table may be fixed. An example is when
500 males and 500 females will participate in a survey. A table with a fixed
total sample size would suggest a multinomial distribution; if in addition one
or more of the margins are fixed we would assume a product multinomial
distribution. However, as noted by Agresti (1996), "For most analyses, one
need not worry about which sampling model makes the most sense. For the
primary inferential methods in this text, the same results occur for the Poisson, multinomial and independent binomial/multinomial sampling models"
(p. 19).
6.3
Example 6.2 We analyzed the data on page 111 using the Genmod procedure with Poisson distribution and a log link. The program was:
DATA snoring;
INPUT snore heart count;
CARDS;
1 1 51
1 0 416
0 1 59
0 0 1958
;
PROC GENMOD DATA=snoring;
CLASS snore heart;
MODEL count = snore heart /
LINK = log
DIST = poisson
;
RUN;
Criterion             DF         Value    Value/DF
Deviance               1       45.7191     45.7191
Scaled Deviance        1       45.7191     45.7191
Pearson Chi-Square     1       57.2805     57.2805
Scaled Pearson X2      1       57.2805     57.2805
Log Likelihood         .    15284.0145           .

Parameter        DF    Estimate    Std Err    ChiSquare    Pr>Chi
INTERCEPT         1      3.0292     0.1041     847.2987    0.0001
SNORE      0      1      1.4630     0.0514     811.6746    0.0001
SNORE      1      0      0.0000     0.0000            .         .
HEART      0      1      3.0719     0.0975     992.0240    0.0001
HEART      1      0      0.0000     0.0000            .         .
SCALE             0      1.0000     0.0000            .         .
\[ \chi^2 = \sum_{i,j} \frac{(n_{ij} - \hat{\mu}_{ij})^2}{\hat{\mu}_{ij}}. \]
The conclusion is the same, in this case, but the tests are not identical.
The output also gives us estimates of the three parameters of the model:
$\hat{\mu} = 3.0292$, $\hat{\alpha}_1 = 1.4630$ and $\hat{\beta}_1 = 3.0719$. An analysis of the saturated model
would give an estimate of the interaction parameter as $\widehat{(\alpha\beta)}_{11} = 1.4033$. From
this we can calculate the odds ratio OR as
OR = exp(1.4033) = 4.07.
Patients who snore have four times larger odds of having heart problems.
Odds ratios in log-linear models are further discussed in a later section.
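In a 2 × 2 table the interaction parameter of the saturated log-linear model equals the log of the cross-product ratio, so the value 1.4033 can be verified from the observed counts. A minimal Python sketch, with illustrative variable names:

```python
import math

# Observed counts from Example 6.1: heart problems by snoring habit
heart_yes = {"often": 51, "seldom": 59}
heart_no = {"often": 416, "seldom": 1958}

# Odds of heart problems within each snoring group
odds_often = heart_yes["often"] / heart_no["often"]
odds_seldom = heart_yes["seldom"] / heart_no["seldom"]

odds_ratio = odds_often / odds_seldom
log_odds_ratio = math.log(odds_ratio)

print(round(odds_ratio, 2), round(log_odds_ratio, 4))
```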
6.4
Season          Red    Other    Total
Early spring     29       11       40
Late spring     273      191      464
Early summer      8       31       39
Late summer      64       64      128
Total           374      297      671
A standard analysis of these data would be to test whether there is independence between season and color through a χ² test. The corresponding GLIM
approach is to model the expected number of beetles as a function of season
and color. The observed numbers in each cell are assumed to be generated
from an underlying Poisson distribution. The canonical link for the Poisson
distribution is the log link. Thus, a Genmod program for these data is
DATA chisq;
INPUT season $ color $ no;
CARDS;
Early_spring red 29
Early_spring other 11
Late_spring red 273
Late_spring other 191
Early_summer red 8
Early_summer other 31
Late_summer red 64
Late_summer other 64
;
PROC GENMOD DATA=chisq;
CLASS season color;
MODEL no = season color /
DIST=poisson
LINK=log ;
run;
Criterion             DF        Value    Value/DF
Deviance               3      28.5964      9.5321
Scaled Deviance        3      28.5964      9.5321
Pearson Chi-Square     3      27.6840      9.2280
Scaled Pearson X2      3      27.6840      9.2280
Log Likelihood         .    2628.7264           .
The deviance is 28.6 on 3 df which is highly significant. The Pearson chi-square is again the same value as would be obtained from a standard chi-square test; it is also highly significant. Formally, independence is tested
by comparing the deviance of this model with the deviance that would be
obtained if the Season*Color interaction was included in the model. This
saturated model has deviance 0.00 on 0 df . Thus, the deviance 28.6 is a
large-sample test of independence between color and season.
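Both statistics can be reproduced without fitting software, since the fitted values under independence are (row total × column total)/n. The following Python sketch does this for the beetle data; the function name independence_tests is introduced here for illustration.

```python
import math

def independence_tests(table):
    """Deviance and Pearson chi-square for independence in a two-way table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    pearson = 0.0
    deviance = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            fitted = row_tot[i] * col_tot[j] / n
            pearson += (obs - fitted) ** 2 / fitted
            # The (obs - fitted) part of the Poisson deviance sums to zero
            # here because the fitted margins equal the observed margins.
            if obs > 0:
                deviance += 2.0 * obs * math.log(obs / fitted)
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return deviance, pearson, df

# Rows: early spring, late spring, early summer, late summer; columns: red, other
beetles = [[29, 11], [273, 191], [8, 31], [64, 64]]
deviance, pearson, df = independence_tests(beetles)
print(round(deviance, 2), round(pearson, 2), df)
```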
6.5
Higher-order tables
6.5.1
A three-way table
Example 6.4 The table below contains data from a survey from Wright
State University in 1992¹. 2276 high school seniors were asked whether they
had ever used Alcohol (A), Cigarettes (C) and/or Marijuana (M). This is a
three-way contingency table of dimension 2 × 2 × 2.

Alcohol    Cigarette     Marijuana use
use        use            Yes       No
Yes        Yes            911      538
           No              44      456
No         Yes              3       43
           No               2      279

¹ Data quoted from Agresti (1996) who credited the data to Professor Harry Khamis.
6.5.2
Types of independence
Models for data of the type given in the last example can include the main
effects of A, C and M and different interactions containing these. The presence of an interaction, for example A*C, means that students who use alcohol
have a higher (or lower) probability of also using cigarettes. One way of interpreting interactions is to calculate odds ratios; we will return to this topic
soon.
A model of type A C M A*C A*M would permit interaction between A and C,
and between A and M, but not between C and M. C and M are then said to
be conditionally independent, controlling for A.
A model that only contains the main effects, i.e. the model A C M, is called
a mutual independence model. In this example this would mean that use of
one drug does not change the risk of using any other drug.
A model that contains all interactions up to a certain level, but no higher-order interactions, is called a homogeneous association model.
6.5.3
The saturated model that contains all main effects and all two- and three-way
interactions was fitted to the data as a baseline. The three-way interaction
A*C*M was not significant (p = 0.53). The output for the homogeneous association model containing all two-way interactions was as follows:
Criteria For Assessing Goodness Of Fit

Criterion             DF         Value    Value/DF
Deviance               1        0.3740      0.3740
Scaled Deviance        1        0.3740      0.3740
Pearson Chi-Square     1        0.4011      0.4011
Scaled Pearson X2      1        0.4011      0.4011
Log Likelihood         .    12010.6124           .
The fit of this model is good; a simple rule of thumb is that Value/df should
not be too much larger than 1. The parameter estimates for this model are
as follows:
Parameter               DF    Estimate    Std Err     ChiSquare    Pr>Chi
INTERCEPT                1      6.8139     0.0331    42312.0532    0.0001
A        No              1     -5.5283     0.4522      149.4518    0.0001
A        Yes             0      0.0000     0.0000             .         .
C        No              1     -3.0158     0.1516      395.6463    0.0001
C        Yes             0      0.0000     0.0000             .         .
M        No              1     -0.5249     0.0543       93.4854    0.0001
M        Yes             0      0.0000     0.0000             .         .
A*C      No    No        1      2.0545     0.1741      139.3180    0.0001
A*C      No    Yes       0      0.0000     0.0000             .         .
A*C      Yes   No        0      0.0000     0.0000             .         .
A*C      Yes   Yes       0      0.0000     0.0000             .         .
A*M      No    No        1      2.9860     0.4647       41.2933    0.0001
A*M      No    Yes       0      0.0000     0.0000             .         .
A*M      Yes   No        0      0.0000     0.0000             .         .
A*M      Yes   Yes       0      0.0000     0.0000             .         .
C*M      No    No        1      2.8479     0.1638      302.1409    0.0001
C*M      No    Yes       0      0.0000     0.0000             .         .
C*M      Yes   No        0      0.0000     0.0000             .         .
C*M      Yes   Yes       0      0.0000     0.0000             .         .
SCALE                    0      1.0000     0.0000             .         .
All remaining interactions in the model are highly significant which means
that no further simplification of the model is suggested by the data.
6.5.4
In our drug use example, the chosen model does not contain any three-way
interaction, and only one parameter is estimable for each interaction. Thus,
the partial odds ratios for the two-way interactions can be estimated as:
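As an illustration, the partial odds ratios can be obtained by exponentiating the three non-zero interaction estimates in the output above. The Python sketch below assumes that these estimates belong, in the order listed, to the A*C, A*M and C*M interactions.

```python
import math

# Non-zero interaction estimates from the parameter table; the assignment to
# A*C, A*M and C*M assumes the terms are listed in that order.
log_or = {"A*C": 2.0545, "A*M": 2.9860, "C*M": 2.8479}

partial_or = {pair: math.exp(b) for pair, b in log_or.items()}

for pair, value in partial_or.items():
    print(pair, round(value, 2))
```

Under that assumption, the odds of using cigarettes are estimated to be about eight times higher among alcohol users than among non-users, at each level of marijuana use.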
6.6
6.6.1
DATA snoring;
INPUT x n snoring $;
CARDS;
51 467 Yes
59 2017 No
;
RUN;
PROC GENMOD DATA=snoring ORDER=data;
CLASS snoring;
MODEL x/n = snoring /
DIST=Binomial LINK=logit;
RUN;
                                 Standard   Wald 95%
Parameter         DF   Estimate     Error   Upper Limit   ChiSquare   Pr > ChiSq
Intercept          1    -3.5021    0.1321       -3.2432      702.47       <.0001
snoring    Yes     1     1.4033    0.1987        1.7927       49.89       <.0001
snoring    No      0     0.0000    0.0000        0.0000           .            .
Scale              0     1.0000    0.0000        1.0000
We note that the parameter estimate for snoring is 1.4033. This is the same as
the estimate of the interaction parameter for the saturated log-linear model.
6.6.2
6.7
Capture-recapture data
Capture-recapture data provide an interesting application of log-linear models. Suppose that there are M individuals in a population; M is unknown
and we want to estimate M . We capture and mark n1 of the individuals.
After some time we capture another n2 individuals. It turns out that s of
(6.10)
If the individuals are captured on three occasions, the data can be written as
a three-way contingency table. There are eight dierent capture patterns:
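Notation                         Captured at occasion
$n_{123}$                        1, 2 and 3
$n_{\bar{1}23}$                  2 and 3
$n_{1\bar{2}3}$                  1 and 3
$n_{\bar{1}\bar{2}3}$            3
$n_{12\bar{3}}$                  1 and 2
$n_{\bar{1}2\bar{3}}$            2
$n_{1\bar{2}\bar{3}}$            1
$n_{\bar{1}\bar{2}\bar{3}}$      None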
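If we assume independence between occasions, the probability that an individual is never captured is
\[ p_{\bar{1}\bar{2}\bar{3}} = \left(1 - \frac{n_1}{M}\right)\left(1 - \frac{n_2}{M}\right)\left(1 - \frac{n_3}{M}\right) \qquad (6.11) \]
Thus, an estimator of the number of individuals that have never been captured is
\[ \hat{n}_{\bar{1}\bar{2}\bar{3}} = M p_{\bar{1}\bar{2}\bar{3}} = M \left(1 - \frac{n_1}{M}\right)\left(1 - \frac{n_2}{M}\right)\left(1 - \frac{n_3}{M}\right) \qquad (6.12) \]
An estimate of M is obtained by solving the equation
\[ M = n + M \left(1 - \frac{n_1}{M}\right)\left(1 - \frac{n_2}{M}\right)\left(1 - \frac{n_3}{M}\right) \qquad (6.13) \]
for M, where n is the number of individuals that have been captured at least
once.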
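An equation of this kind has no convenient closed-form solution but is easily solved numerically. The following Python sketch uses bisection. It assumes that the estimating equation sets M equal to n plus the expected number of never-captured individuals from (6.12); the capture counts in the example call are hypothetical and chosen only so that the solution is a round number.

```python
def never_captured(M, n1, n2, n3):
    """Expected number of individuals missed on all three occasions, as in (6.12)."""
    return M * (1 - n1 / M) * (1 - n2 / M) * (1 - n3 / M)

def estimate_population(n1, n2, n3, n, upper=1e7, tol=1e-9):
    """Solve M = n + never_captured(M) for M by bisection on [n, upper]."""
    def g(M):
        return M - n - never_captured(M, n1, n2, n3)

    lo, hi = float(n), float(upper)
    if g(lo) > 0 or g(hi) < 0:
        raise ValueError("no sign change on [n, upper]")
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical data: 200, 250 and 300 captures; 580 distinct individuals seen
m_hat = estimate_population(200, 250, 300, 580)
print(round(m_hat, 1))
```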
There are some drawbacks with the method outlined so far. We have to
assume independence between occasions, and we only use the information in
the margins of the table. A more flexible analysis of this kind of data can be
obtained by using log-linear models.
If the occasions are independent, it would hold that, for example, $p_{123} = p_1 p_2 p_3$. The expected number of individuals in this cell would then be
\[ \mu_{123} = M p_1 p_2 p_3 \qquad (6.14) \]
Taking logarithms,
\[ \ln(\mu_{123}) = \ln M + \ln p_1 + \ln p_2 + \ln p_3 = \mu + \alpha_1 + \beta_1 + \gamma_1 \qquad (6.15) \]
proc genmod;
class x1 x2 x3 x4 x5;
model count=x1 x2 x3 x4 x5 /
dist=Poisson obstats residuals;
weight w;
run;
In this program, x1 to x5 refer to the following sources of information:

Code    Source of information
x1      Health care
x2      Social authorities
x3      Penal system
x4      Police, customs
x5      Others
Table 6.1: Swedish drug addicts with different capture patterns in 1979.

Hospital   Social        Penal    Police,
care       authorities   system   customs   Others   Count
   0           0            0        0         0         0
   0           0            0        0         1        45
   0           0            0        1         0      2080
   0           0            0        1         1        11
   0           0            1        0         0      1056
   0           0            1        0         1        11
   0           0            1        1         0       942
   0           0            1        1         1         9
   0           1            0        0         0      1011
   0           1            0        0         1        59
   0           1            0        1         0       381
   0           1            0        1         1        15
   0           1            1        0         0       245
   0           1            1        0         1        18
   0           1            1        1         0       345
   0           1            1        1         1        13
   1           0            0        0         0       828
   1           0            0        0         1         7
   1           0            0        1         0       179
   1           0            0        1         1         1
   1           0            1        0         0       191
   1           0            1        0         1         3
   1           0            1        1         0       137
   1           0            1        1         1         1
   1           1            0        0         0       264
   1           1            0        0         1        18
   1           1            0        1         0       132
   1           1            0        1         1         9
   1           1            1        0         0       144
   1           1            1        0         1        16
   1           1            1        1         0       133
   1           1            1        1         1        15
The weight w has been set to 1 for all combinations except the combination
where all x1,...,x5 are 0. This combination cannot be observed and is a
structural zero.
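The layout of the input data, including the weight variable, can be sketched as follows in Python. The counts are those of Table 6.1 in row order; giving the unobservable all-zero pattern the weight 0 is an assumption made here, since the text only states that the weight is 1 for all other patterns.

```python
from itertools import product

# Counts from Table 6.1, in the row order of the table (x5 varies fastest)
counts = [0, 45, 2080, 11, 1056, 11, 942, 9,
          1011, 59, 381, 15, 245, 18, 345, 13,
          828, 7, 179, 1, 191, 3, 137, 1,
          264, 18, 132, 9, 144, 16, 133, 15]

rows = []
for pattern, count in zip(product((0, 1), repeat=5), counts):
    # Weight 0 (assumed) marks the structural zero: never captured by any source
    weight = 0 if sum(pattern) == 0 else 1
    rows.append((*pattern, count, weight))

observed_total = sum(count * w for *_, count, w in rows)
print(len(rows), observed_total)
```

The total of the observable cells equals the 8319 captured individuals referred to in the text.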
In the program example it is implicitly assumed that the different sources of
information are independent. However, interactions can be included in the
model as e.g. x1*x2. In this rather large data set all two-way interactions
were significant. In addition, the interaction x1*x3*x4 was also significant.
Thus, the final model was
count =x1|x2|x3|x4|x5 @2 x1*x3*x4;
This model had a good fit to the data (χ² = 17.5 on 14 df). The model estimates the number of uncaptured individuals as 8878 individuals with confidence interval 7640 to 10317 individuals. This would mean that the number
of drug addicts in Sweden in 1979 was 8319 + 8878 = 17197 individuals.
This is more than 5000 individuals higher than the published result, which
was 12000 (Socialdepartementet 1980). The published result was obtained
through capture-recapture methods but assuming that the different sources
of information are independent.
6.8
Months   Number      Months   Number
   1       15          10       10
   2       11          11        7
   3       14          12        9
   4       17          13       11
   5        5          14        3
   6       11          15        6
   7       10          16        1
   8        4          17        1
   9        8          18        4
(6.17)
This is a generalized linear model with a Poisson distribution, a log link and
a simple linear predictor. A SAS program for this model can be written as
DATA stress;
INPUT months number @@;
CARDS;
1 15 2 11 3 14 4 17 5 5 6 11 7 10 8 4 9 8 10 10
11 7 12 9 13 11 14 3 15 6 16 1 17 1 18 4
;
PROC GENMOD DATA=stress;
MODEL number = months / DIST=poisson LINK=log OBSTATS RESIDUALS;
MAKE 'obstats' out=ut;
RUN;
Criterion             DF      Value     Value/DF
Deviance              16     24.5704     1.5356
Scaled Deviance       16     24.5704     1.5356
Pearson Chi-Square    16     22.7145     1.4197
Scaled Pearson X2     16     22.7145     1.4197
Log Likelihood         .    174.8451      .
The data have a less than perfect fit to the model, with Value/DF = 1.54; the
p-value is 0.078.
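The p-value quoted above can be checked by referring the deviance to a chi-square distribution with the residual degrees of freedom. The following is a minimal sketch, not part of the original SAS analysis: the function name chi2_sf_even_df is our own, and it uses a closed form for the upper tail that only holds when the number of degrees of freedom is even.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2m the tail equals exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k          # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total

# Deviance and residual df taken from the Genmod output for the stress data.
deviance, df = 24.5704, 16
p_value = chi2_sf_even_df(deviance, df)
print(f"Value/DF = {deviance / df:.4f}, p = {p_value:.3f}")
```

The same function can be applied to any of the deviance tables in this chapter that have an even number of residual degrees of freedom.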
Analysis Of Parameter Estimates
Parameter    DF   Estimate   Std Err   ChiSquare   Pr>Chi
INTERCEPT     1     2.8032    0.1482    357.9476   0.0001
MONTHS        1    -0.0838    0.0168     24.8639   0.0001
SCALE         0     1.0000    0.0000      .         .
We find that the memory of stressful events fades away as log(μ) = 2.80 −
0.084x. A plot of the data, along with the fitted regression line, is given as
Figure 6.1. Figure 6.2 shows the data and regression line with a log scale for
the y-axis.
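To see what the fitted equation implies on the original scale, the estimates can be plugged into exp(β0 + β1 x). The sketch below is illustrative only: the variable names are ours, and the "half-life" is simply a derived quantity, log(2)/|β1|, that the original text does not report.

```python
import math

# Estimates copied from the Analysis Of Parameter Estimates table.
b0, b1 = 2.8032, -0.0838

def expected_count(months):
    """Fitted Poisson mean for a given number of months (log link inverted)."""
    return math.exp(b0 + b1 * months)

for m in (1, 6, 12, 18):
    print(f"months = {m:2d}  fitted mean = {expected_count(m):6.2f}")

# Each additional month multiplies the expected count by exp(b1).
monthly_ratio = math.exp(b1)
# Number of months after which the expected count has halved.
half_life = math.log(2) / -b1
print(f"ratio per month = {monthly_ratio:.4f}, half-life = {half_life:.2f} months")
```

Under these estimates the expected number of recalled events drops by roughly 8% per month.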
6.9
Row    Col 1   Col 2   Col 3   Col 4   Col 5
 1     P 3     M 6     O 4     N 17    K 4
 2     O 2     K 0     M 9     P 8     N 4
 3     N 5     O 6     K 1     M 8     P 2
 4     K 1     N 4     P 6     O 9     M 4
 5     M 4     P 4     N 5     K 0     O 8
We may model the number of wireworms in a certain plot as a Poisson distribution. The design includes a Row effect, a Column effect and a Treatment
effect. Thus, an ANOVA-like model for these data can be written as
g(\mu) = \beta_0 + \alpha_i + \beta_j + \tau_k \qquad (6.18)
from Snedecor
and Cochran (1980). The original analysis was an Anova on data
transformed as y = \sqrt{x + 1}.
Description           Value
Data Set              WORK.POISSON
Distribution          POISSON
Link Function         LOG
Dependent Variable    COUNT
Observations Used     25
Criterion             DF      Value     Value/DF
Deviance              12     19.5080     1.6257
Scaled Deviance       12     19.5080     1.6257
Pearson Chi-Square    12     18.0096     1.5008
Scaled Pearson X2     12     18.0096     1.5008
Log Likelihood         .     97.0980      .
The fit of the model is reasonable but not perfect; the p value is 0.077. Ideally,
Value/df should be closer to 1.
Analysis Of Parameter Estimates
Parameter        DF   Estimate   Std Err   ChiSquare   Pr>Chi
INTERCEPT         1     1.4708    0.3519     17.4670   0.0001
ROW        1      1    -0.4419    0.3404      1.6851   0.1942
ROW        2      1    -0.1751    0.3175      0.3041   0.5813
ROW        3      1     0.0451    0.2980      0.0229   0.8796
ROW        4      1     0.5699    0.2729      4.3618   0.0368
ROW        5      0     0.0000    0.0000       .        .
COL        1      1     0.3045    0.2892      1.1087   0.2924
COL        2      1    -0.0506    0.3099      0.0267   0.8703
COL        3      1    -0.0936    0.3207      0.0852   0.7704
COL        4      1    -0.0636    0.3093      0.0423   0.8370
COL        5      0     0.0000    0.0000       .        .
TREAT      K      1    -1.3797    0.4627      8.8906   0.0029
TREAT      M      1     0.2910    0.2789      1.0888   0.2967
TREAT      N      1     0.3324    0.2760      1.4502   0.2285
TREAT      O      1     0.2003    0.2854      0.4928   0.4827
TREAT      P      0     0.0000    0.0000       .        .
SCALE             0     1.0000    0.0000       .        .
Source    DF   ChiSquare   Pr>Chi
ROW        4     14.3595   0.0062
COL        4      2.8225   0.5880
TREAT      4     25.1934   0.0001
6.10
Rate data
\log\left(\frac{\mu}{t}\right) = X\beta \qquad (6.19)
\log(\mu) = \log(t) + X\beta \qquad (6.20)
The adjustment term log(t) is called an offset. The offset can easily be
included in models analyzed with e.g. Proc Genmod.
Example 6.8 The data below, quoted from Agresti (1996), are accident
rates for elderly drivers, subdivided by sex. For each sex the number of
person years (in thousands) is also given. The data refer to 16262 Medicaid
enrollees.
           No. of accidents   No. of person years (000)
Females          175                  17.30
Males            320                  21.40
From the raw data we can calculate accident rates as 175/17.30 = 10.1 per
1000 person years for females and 320/21.40 = 15.0 per 1000 person years for
males. A simple way to model these data is to use a generalized linear model
with a Poisson distribution, a log link, and to use the number of person years
as an offset. This is done with the following program:
DATA accident;
INPUT sex $ accident persyear;
logpy=log(persyear);
CARDS;
Male   320 21.400
Female 175 17.300
;
PROC GENMOD DATA=accident;
CLASS sex;
MODEL accident = sex /
      LINK   = log
      DIST   = poisson
      OFFSET = logpy
      ;
RUN;
The output is
Criterion             DF      Value      Value/DF
Deviance               0       0.0000      .
Scaled Deviance        0       0.0000      .
Pearson Chi-Square     0       0.0000      .
Scaled Pearson X2      0       0.0000      .
Log Likelihood         .    2254.7003      .
The model is a saturated model so we cannot assess the overall fit of the model
by using the deviance.
Analysis Of Parameter Estimates
Parameter           DF   Estimate   Std Err   ChiSquare   Pr>Chi
INTERCEPT            1     2.7049    0.0559   2341.3269   0.0001
SEX       Female     1    -0.3909    0.0940     17.2824   0.0001
SEX       Male       0     0.0000    0.0000       .        .
SCALE                0     1.0000    0.0000       .        .
The parameter estimate for females is −0.39. The model can be written as
\log(\mu) = \log(t) + \beta_0 + \beta_1 x
where x is a dummy variable taking the value 1 for females and 0 for males.
Thus the estimate can be interpreted such that the rate ratio is e^{−0.3909} =
0.676. The risk of having an accident for a female is 68% of the risk for men.
This difference is significant; however, other factors that may affect the risk
of accident, for example differences in driving distance, are not included in
this model.
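Because the model is saturated, its estimates can be reproduced directly from the raw rates, which makes a useful check on the interpretation. The sketch below uses only the two data lines of the example; the standard error formula sqrt(1/175 + 1/320) is the usual large-sample result for the log of a ratio of two Poisson rates and is our addition, not something printed in the text.

```python
import math

accidents = {"Female": 175, "Male": 320}
person_years = {"Female": 17.30, "Male": 21.40}   # in thousands

# Raw accident rates per 1000 person years.
rates = {sex: accidents[sex] / person_years[sex] for sex in accidents}

# With males as the reference level, the intercept is the log male rate
# and the SEX Female parameter is the log of the female/male rate ratio.
intercept = math.log(rates["Male"])
sex_female = math.log(rates["Female"] / rates["Male"])
rate_ratio = math.exp(sex_female)

# Large-sample standard error of the log rate ratio.
se_sex_female = math.sqrt(1 / accidents["Female"] + 1 / accidents["Male"])

print(f"intercept = {intercept:.4f}")
print(f"SEX Female = {sex_female:.4f} (SE {se_sex_female:.4f})")
print(f"rate ratio = {rate_ratio:.3f}")
```

The offset enters only through the denominators of the rates, which is why the counts alone determine the standard error.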
6.11
Overdispersion means that the variance of the response variable is larger than
would be expected for the chosen distribution. For Poisson data we would
expect the variance to be equal to the mean.
As noted earlier, the presence of overdispersion may be related to mistakes
in the formulation of the generalized linear model: the distribution, the link
function and/or the linear predictor. The effect of overdispersion is that p-values for tests are deflated: it becomes too easy to get significant results.
6.11.1
Output:
Criteria For Assessing Goodness Of Fit
Criterion             DF      Value     Value/DF
Deviance              12     19.5080     1.6257
Scaled Deviance       12      7.3424     0.6119
Pearson Chi-Square    12     18.0096     1.5008
Scaled Pearson X2     12      6.7784     0.5649
Log Likelihood         .     36.5456      .
Source    DF   ChiSquare   Pr>Chi
ROW        4      5.4046   0.2482
COL        4      1.0623   0.9002
TREAT      4      9.4823   0.0501
Fixing the scale parameter to 1.63 has a rather dramatic effect on the result.
In our previous analysis of these data, the treatment effect was highly significant (p = 0.0001), and the row effect was significant (p = 0.0062). In our
new analysis even the treatment effect is above the 0.05 limit. In the original analysis of these data (Snedecor and Cochran, 1980), only the treatment
effect was significant (p = 0.021). Note that the Genmod procedure has an
automatic feature to base the analysis on a scale parameter estimated by the
Maximum Likelihood method; see the SAS manual for details.
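The numbers in the two likelihood ratio tables are consistent with each unscaled statistic having been divided by the square of the scale parameter. The sketch below reproduces the scaled statistics and their p-values under that assumption; the helper chi2_sf_4df is ours and uses the closed form of the chi-square tail that is valid for 4 degrees of freedom only.

```python
import math

scale = 1.63
# Unscaled likelihood ratio statistics (4 df each) from the earlier analysis.
lr_unscaled = {"ROW": 14.3595, "COL": 2.8225, "TREAT": 25.1934}

def chi2_sf_4df(x):
    """P(X > x) for a chi-square variable with exactly 4 degrees of freedom."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

# Assumption: the scaled statistic is the unscaled one divided by scale**2.
scaled = {name: stat / scale**2 for name, stat in lr_unscaled.items()}
p_values = {name: chi2_sf_4df(stat) for name, stat in scaled.items()}

for name in lr_unscaled:
    print(f"{name:6s} scaled chi-square = {scaled[name]:7.4f}  p = {p_values[name]:.4f}")
```

Since the divisor is about 2.66, every test statistic shrinks by the same factor, which explains why the treatment effect moves from p = 0.0001 to just above 0.05.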
6.11.2
P(y) = \binom{y-1}{r-1} p^r (1-p)^{y-r} \qquad (6.21)
for y = r, (r + 1), ....
This is the binomial waiting time distribution. If r = 1, it is called a geometric distribution. Using the Gamma function, the distribution can be
defined even for non-integer values of r. When r is an integer, it is called the
Pascal distribution. The distribution has mean value E(y) = r/p and variance
Var(y) = r(1 − p)/p².
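The mean and variance stated for the waiting time distribution (6.21) can be verified numerically by summing the probability function over a long range of y. This is a small sketch with arbitrarily chosen values r = 3 and p = 0.4; the truncation point 400 is also an arbitrary choice that makes the neglected tail negligible.

```python
import math

def waiting_time_pmf(y, r, p):
    """P(y) from (6.21): probability that the r-th success occurs at trial y."""
    return math.comb(y - 1, r - 1) * p**r * (1 - p) ** (y - r)

r, p = 3, 0.4                      # illustrative values, not from the text
support = range(r, 400)            # truncated support; the tail beyond is tiny

total = sum(waiting_time_pmf(y, r, p) for y in support)
mean = sum(y * waiting_time_pmf(y, r, p) for y in support)
variance = sum((y - mean) ** 2 * waiting_time_pmf(y, r, p) for y in support)

print(f"sum of probabilities = {total:.6f}")
print(f"mean     = {mean:.4f}  (r/p = {r / p:.4f})")
print(f"variance = {variance:.4f}  (r(1-p)/p^2 = {r * (1 - p) / p**2:.4f})")
```

The variance exceeds what a Poisson distribution with the same mean would allow once the counts are shifted to start at zero, which is why this family is useful for overdispersed counts.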
6.12
Diagnostics
Model diagnostics for Poisson models follow the same lines as for other
generalized linear models.
Example 6.10 We can illustrate some diagnostic plots using data from the
Wireworm example on page 129. The residuals and predicted values were
stored in a file. The standardized deviance residuals were plotted against the
predicted values, and a normal probability plot of the residuals was prepared.
The results are given in Figure 6.3 and Figure 6.4. Both plots indicate a
reasonable behavior of the residuals. We cannot see any irregularities in the
plot of residuals against fitted values, and the normal plot is rather linear.
Figure 6.3: Plot of standardized deviance residuals against fitted values for the
Wireworm example.
6.13
Exercises
Exercise 6.1 The data in Exercise 1.4 are of a kind that can often be
approximated by a Poisson distribution. Re-analyze the data using Poisson
regression. Prepare a graph of the relation and compare the results with the
results from Exercise 1.4. The data are repeated here for your convenience:
Gowen and Price counted the number of lesions of Aucuba mosaic virus after
exposure to X-rays for various times. The results were:
Minutes exposure     0    15    30    45    60
Count              271   108    59    29    12
Mode2
25.3
14.4
32.5
20.5
97.6
53.6
56.6
87.3
47.8
Failures
15
9
14
24
27
27
23
18
22
are available:
type        Ship type A, B, C, D or E
yr_constr   Year of construction in 5-year intervals
per_op      Period of operation: 1960-74, 1975-79
mon_serv    Aggregate months service for ships in this cell
incident    Number of damage incidents

Type   yr_constr   per_op   mon_serv   incident
A         60         60        127        0
A         60         75         63        0
A         65         60       1095        3
A         65         75       1095        4
A         70         60       1512        6
A         70         75       3353       18
A         75         60          0        *
A         75         75       2244       11
B         60         60      44882       39
B         60         75      17176       29
B         65         60      28609       58
B         65         75      20370       53
B         70         60       7064       12
B         70         75      13099       44
B         75         60          0        *
B         75         75       7117       18
C         60         60       1179        1
C         60         75        552        1
C         65         60        781        0
C         65         75        676        1
C         70         60        783        6
C         70         75       1948        2
C         75         60          0        *
C         75         75        274        1
D         60         60        251        0
D         60         75        105        0
D         65         60        288        0
D         65         75        192        0
D         70         60        349        2
D         70         75       1208       11
D         75         60          0        *
D         75         75       2051        4
E         60         60         45        0
E         60         75          0        *
E         65         60        789        7
E         65         75        437        7
E         70         60       1157        5
E         70         75       2161       12
E         75         60          0        *
E         75         75        542        1
Criterion             DF      Value     Value/DF
Deviance              18     16.2676     0.9038
Scaled Deviance       18     16.2676     0.9038
Pearson Chi-Square    18     16.0444     0.8914
Scaled Pearson X2     18     16.0444     0.8914
Log Likelihood              138.2221

                           Standard    Wald 95%
Parameter   DF   Estimate    Error     Confidence Limits   ChiSquare   Pr > ChiSq
Intercept    1     1.6094    0.1414
treat        1     0.5878    0.1764
Scale        0     1.0000    0.0000
Year   Collisions   Miles      Year   Collisions   Miles
1970       3         281       1977       4         264
1971       6         276       1978       1         267
1972       4         268       1979       7         265
1973       7         269       1980       3         267
1974       6         281       1981       5         260
1975       2         271       1982       6         231
1976       2         265       1983       1         249
Case
66
141
113
129
0
Control
123
179
106
80
Case
30
59
63
102
Case
36
69
119
373
35Control
13
25
30
85
A. Analyze these data using smoking and coffee drinking as qualitative variables.
B. Assign scores to smoking and coffee drinking and re-analyze the data using
these scores as quantitative variables.
C. Compare the analyses in A. and B. in terms of fit. Perform residual
analyses.
Exercise 6.8 Even before the space shuttle Challenger exploded on January 28, 1986, NASA had collected data from 23 earlier launches. One part
of these data was the number of O-rings that had been damaged at each
launch. O-rings are a kind of gasket that prevents hot gas from leaking during takeoff. In total there were six such O-rings on the Challenger.
The data included the number of damaged O-rings, and the temperature (in
Fahrenheit) at the time of the launch. On the fateful day when the Challenger
exploded, the temperature was 31 °F.
One might ask whether the probability that an O-ring is damaged is related
to the temperature. The following data are available:
No. of defective O-rings   Temperature (°F)
           2                     53
           1                     57
           1                     58
           1                     63
           0                     66
           0                     67
           0                     67
           0                     67
           0                     68
           0                     69
           0                     70
           0                     70
           1                     70
           1                     70
           0                     72
           0                     73
           0                     75
           2                     75
           0                     76
           0                     76
           0                     78
           0                     79
           0                     81
Poisson model:
Criterion             DF      Value     Value/DF
Deviance              21     16.8337     0.8016
Scaled Deviance       21     16.8337     0.8016
Pearson Chi-Square    21     28.1745     1.3416
Scaled Pearson X2     21     28.1745     1.3416
Log Likelihood              -14.6442

                           Standard
Parameter   DF   Estimate    Error     ChiSquare   Pr > ChiSq
Intercept    1     5.9691    2.7628
Temp         1    -0.1034    0.0430
Scale        0     1.0000    0.0000
Binomial model:
Criterion             DF      Value     Value/DF
Deviance              21     18.0863     0.8613
Scaled Deviance       21     18.0863     0.8613
Pearson Chi-Square    21     29.9802     1.4276
Scaled Pearson X2     21     29.9802     1.4276
Log Likelihood              -30.1982

                           Standard
Parameter   DF   Estimate    Error     ChiSquare   Pr > ChiSq
Intercept    1     5.0850    3.0525
Temp         1    -0.1156    0.0470
Scale        0     1.0000    0.0000
A. Test whether temperature has any significant effect on the failure of O-rings using
i) the Poisson model
ii) the binomial model
B. Predict the outcome of the response variable if the temperature is 31 °F
i) for the Poisson model
ii) for the binomial model
C. Which of the two models do you prefer? Explain why!
D. Using your preferred model, calculate the probability that three or more
of the O-rings fail if the temperature is 31 °F.
Exercise 6.9 Agresti (1996) discusses analysis of a set of accident data from
Maine. Passengers in all traffic accidents during 1991 were classified by:
Gender
Location
Belt
Injury
Criterion             DF        Value       Value/DF
Deviance               5        23.3510      4.6702
Scaled Deviance        5        23.3510      4.6702
Pearson Chi-Square     5        23.3752      4.6750
Scaled Pearson X2      5        23.3752      4.6750
Log Likelihood              536762.6081

                                         Standard     Wald 95%
Parameter                 DF  Estimate    Error        Limits          ChiSquare   Pr > ChiSq
Intercept                  1    5.9599    0.0314    5.8984    6.0213    36133.0     <.0001
gender           F         1    0.6212    0.0288    0.5647    0.6777     463.89     <.0001
gender           M         0    0.0000    0.0000    0.0000    0.0000       .          .
location         R         1    0.2906    0.0290    0.2337    0.3475     100.16     <.0001
location         U         0    0.0000    0.0000    0.0000    0.0000       .          .
belt             N         1    0.7796    0.0291    0.7225    0.8367     716.17     <.0001
belt             Y         0    0.0000    0.0000    0.0000    0.0000       .          .
injury           N         1    3.3309    0.0310    3.2702    3.3916    11563.8     <.0001
injury           Y         0    0.0000    0.0000    0.0000    0.0000       .          .
gender*location  F  R      1   -0.2099    0.0161   -0.2415   -0.1783     169.50     <.0001
gender*location  F  U      0    0.0000    0.0000    0.0000    0.0000       .          .
gender*location  M  R      0    0.0000    0.0000    0.0000    0.0000       .          .
gender*location  M  U      0    0.0000    0.0000    0.0000    0.0000       .          .
gender*belt      F  N      1   -0.4599    0.0157   -0.4907   -0.4292     860.14     <.0001
gender*belt      F  Y      0    0.0000    0.0000    0.0000    0.0000       .          .
gender*belt      M  N      0    0.0000    0.0000    0.0000    0.0000       .          .
gender*belt      M  Y      0    0.0000    0.0000    0.0000    0.0000       .          .
gender*injury    F  N      1   -0.5405    0.0272   -0.5939   -0.4872     394.36     <.0001
gender*injury    F  Y      0    0.0000    0.0000    0.0000    0.0000       .          .
gender*injury    M  N      0    0.0000    0.0000    0.0000    0.0000       .          .
gender*injury    M  Y      0    0.0000    0.0000    0.0000    0.0000       .          .
location*belt    R  N      1   -0.0849    0.0162   -0.1167   -0.0532      27.50     <.0001
location*belt    R  Y      0    0.0000    0.0000    0.0000    0.0000       .          .
location*belt    U  N      0    0.0000    0.0000    0.0000    0.0000       .          .
location*belt    U  Y      0    0.0000    0.0000    0.0000    0.0000       .          .
location*injury  R  N      1   -0.7550    0.0269   -0.8078   -0.7022     784.94     <.0001
location*injury  R  Y      0    0.0000    0.0000    0.0000    0.0000       .          .
location*injury  U  N      0    0.0000    0.0000    0.0000    0.0000       .          .
location*injury  U  Y      0    0.0000    0.0000    0.0000    0.0000       .          .
belt*injury      N  N      1   -0.8140    0.0276   -0.8681   -0.7599     868.65     <.0001
belt*injury      N  Y      0    0.0000    0.0000    0.0000    0.0000       .          .
belt*injury      Y  N      0    0.0000    0.0000    0.0000    0.0000       .          .
belt*injury      Y  Y      0    0.0000    0.0000    0.0000    0.0000       .          .
Scale                      0    1.0000    0.0000    1.0000    1.0000
Calculate and interpret estimated odds ratios for the different factors.
7. Ordinal response
7.1
Arbitrary scoring
Example 7.1 Norton and Dunn (1985) studied the relation between snoring
and heart problems for a sample of 2484 patients. The data were obtained
through interviews with the patients. The amount of snoring was assessed
on a scale ranging from Never to Always, which is an ordinal variable.
An interesting question is whether there is any relation between snoring and
heart problems. The data are given in the following table:
                            Snoring
Heart problems   Never   Sometimes   Often   Always   Total
Yes                 24       35        21       30      110
No                1355      603       192      224     2374
Total             1379      638       213      254     2484
[Figure 7.1: percentage of persons with heart problems against snoring score (1 = Never, 2 = Sometimes, 3 = Often, 4 = Always)]
The main interest lies in studying a possible dependence between snoring and
heart problems. A simple approach to analyzing these data is to ignore the
ordinal nature of the data and use a simple χ² test of independence or, in
this context, the corresponding log-linear model
\log \mu_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} \qquad (7.1)
This is a saturated model. The test of the hypothesis of independence corresponds to testing the hypothesis that the parameter (αβ)ij is zero. For the
data on page 145, this gives a deviance of 21.97 on 3 df (p < 0.0001) which
is, of course, highly significant. This analysis, however, does not use the fact
that the snoring variable is ordinal.
A plot (Figure 7.1) of the percentage of persons with heart problems in each
snoring category suggests that this percentage increases nearly linearly, if we
choose the arbitrary scores (1, 2, 3, 4) for the snoring categories.
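The percentages behind this plot follow directly from the snoring table. A short sketch that recomputes them from the "Yes" and "Total" rows (the variable names are ours):

```python
# Counts from the snoring table, ordered Never, Sometimes, Often, Always.
scores = [1, 2, 3, 4]
heart_yes = [24, 35, 21, 30]
totals = [1379, 638, 213, 254]

# Percentage of persons with heart problems in each snoring category.
percent = [100.0 * yes / n for yes, n in zip(heart_yes, totals)]

for score, pct in zip(scores, percent):
    print(f"score {score}: {pct:5.2f}% with heart problems")
```

The percentages rise steadily with the score, which is what motivates treating the snoring categories as equally spaced in the next model.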
This suggests a simple way of accounting for the ordinal nature of the data.
Instead of entering the dependence between the variables as the interaction
term (αβ)ij, as in the saturated model (7.1), we write the model as
\log \mu_{ij} = \mu + \alpha_i + \beta_j + \gamma u_i v_j \qquad (7.2)
where ui are arbitrary scores for the row variable and vj are scores for the
column variable. In this model, the term γ ui vj captures the linear part
of the dependence between the scores. This model is called a linear by linear
association model (LL model; see Agresti, 1996).
Criterion             DF       Value       Value/DF
Deviance               2        6.2398      3.1199
Scaled Deviance        2        6.2398      3.1199
Pearson Chi-Square     2        6.3640      3.1820
Scaled Pearson X2      2        6.3640      3.1820
Log Likelihood         .    13733.2247       .

Parameter (level)             DF   Estimate   Std Err   ChiSquare   Pr>Chi
Intercept                      1     1.9833    0.2258     77.1306   0.0001
Heart problems   No            1     4.4319    0.2256    386.0188   0.0001
Heart problems   Yes           0     0.0000    0.0000       .        .
Snoring          Always        1    -1.0289    0.0773    177.1304   0.0001
Snoring          Never         1     0.7912    0.0479    272.4822   0.0001
Snoring          Often         1    -1.1353    0.0794    204.5545   0.0001
Snoring          Sometime      0     0.0000    0.0000       .        .
Score product (u*v)            1     0.6545    0.0825     62.8977   0.0001
Scale                          0     1.0000    0.0000       .        .
7.2
RC models
The method of arbitrary scoring is often useful, but it is subjective in the sense
that different choices of scores for the ordinal variables may result in different
conclusions. An approach that has been suggested (see e.g. Andersen, 1980)
is to include the row and column scores as parameters of the model. Thus,
the model can be written as
\log \mu_{ij} = \mu + \alpha_i + \beta_j + \gamma \mu_i \nu_j \qquad (7.3)
where μi and νj are now parameters to be estimated from the data. This
model, called an RC model, is nonlinear since it includes a product term in
the row and column scores. Thus, the model is not formally a generalized
linear model. However, Agresti (1985) suggested methods for fitting this
model using standard software. This method is iterative. The row scores are
kept fixed and the model is fitted for the column scores. These column scores
are then kept fixed and the row scores are estimated. These two steps are
continued until convergence. The method seems to converge in most cases.
7.3
Proportional odds
The proportional odds model for an ordinal response variable is a model for
cumulative probabilities of type P(Y ≤ j) = p1 + p2 + . . . + pj, where for
simplicity we index the categories of the response variable with integers. The
cumulative logits are defined as
\mathrm{logit}(P(Y \le j)) = \log \frac{P(Y \le j)}{1 - P(Y \le j)} \qquad (7.4)
The cumulative logits are defined for each of the categories of the response
except the last one, for which P(Y ≤ j) = 1. Thus, for a response variable with 5 categories we would
get 5 − 1 = 4 different cumulative logits.
The proportional odds model for ordinal response suggests that all these
cumulative logit functions can be modeled as
\mathrm{logit}(P(Y \le j)) = \alpha_j + \beta x \qquad (7.5)
i.e. the functions have different intercepts αj but a common slope β. This
means that the odds ratio, for two different values x1 and x2 of the predictor
x, has the form
\frac{P(Y \le j \mid x_2) / P(Y > j \mid x_2)}{P(Y \le j \mid x_1) / P(Y > j \mid x_1)}. \qquad (7.6)
The log of this odds ratio equals β(x2 − x1), i.e. the log odds ratio is proportional
to the difference between x2 and x1. This is why the model is called the
proportional odds model.
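The defining property of (7.5) and (7.6) is that the cumulative odds ratio does not depend on the category j. The sketch below illustrates this numerically; the intercepts and slope are hypothetical values chosen for illustration, not estimates from any data set in this chapter.

```python
import math

def inv_logit(z):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters: three intercepts (four response categories)
# and one common slope, as in model (7.5).
alphas = [0.3, 1.5, 2.3]
beta = -1.4

def cumulative_odds(j, x):
    """Odds P(Y <= j | x) / P(Y > j | x) for cut point j (0-based)."""
    p = inv_logit(alphas[j] + beta * x)
    return p / (1.0 - p)

x1, x2 = 0.0, 1.0
log_odds_ratios = [
    math.log(cumulative_odds(j, x2) / cumulative_odds(j, x1))
    for j in range(len(alphas))
]

# Every cut point gives the same log odds ratio, beta * (x2 - x1).
for j, value in enumerate(log_odds_ratios, start=1):
    print(f"cut point {j}: log odds ratio = {value:.4f}")
```

If the printed values differed between cut points, the common slope assumption would be violated and a model with category-specific slopes would be needed.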
The proportional odds model is not formally a (univariate) generalized linear
model, although it can be seen as a kind of multivariate Glim. The model
states that the different cumulative logits, for the different ordinal values of
the response, are all parallel but with different intercepts. Thus, the model
gives, in a sense, a set of k − 1 related models if the response has k scale
steps. The Genmod procedure in SAS version 8 (SAS 2000b), as well as the
Logistic procedure in SAS, can handle this type of model.
Example 7.2 We continue with the analysis of the data on page 145. For
the sake of illustration we use the ordinal snoring variable as the response, and
analyze the data to explore whether the risk of snoring depends on whether
the patient has heart problems. A simple way of analyzing this type of data
is to use the Logistic procedure of the SAS package:
PROC LOGISTIC DATA=snoring;
FREQ freq;
MODEL v = u;
RUN;
                Intercept    Intercept and
Criterion         Only        Covariates
AIC             5568.351       5505.632
SC              5585.804       5528.903
-2 LOG L        5562.351       5497.632
Score               .              .
                 Parameter   Standard      Wald         Pr >      Standardized    Odds
Variable   DF    Estimate     Error     Chi-Square   Chi-Square     Estimate      Ratio
INTERCP1    1      0.2824     0.0414      46.5871      0.0001         .             .
INTERCP2    1      1.5545     0.0534     846.5227      0.0001         .             .
INTERCP3    1      2.2792     0.0687    1101.5470      0.0001         .             .
U           1     -1.4209     0.1774      64.1619      0.0001      -0.161188      0.242
A similar analysis can be done using Proc Genmod in SAS version 8 or later.
The program can be written as
PROC GENMOD data=snoring order=data;
FREQ freq;
CLASS heart;
MODEL v = u
/dist=multinomial
link=cumlogit;
RUN;
The information is essentially the same as in the Logistic procedure but the
standard error estimates are slightly different. Also, Proc Genmod does not
automatically test the common slope assumption.
Analysis Of Parameter Estimates
                            Standard       Wald 95%
Parameter    DF   Estimate    Error     Confidence Limits    ChiSquare   Pr > ChiSq
Intercept1    1     0.2824    0.0414     0.2012    0.3635       46.53      <.0001
Intercept2    1     1.5545    0.0535     1.4497    1.6594      844.88      <.0001
Intercept3    1     2.2792    0.0686     2.1446    2.4137     1102.40      <.0001
u             1    -1.4208    0.1742    -1.7624   -1.0793       66.49      <.0001
Scale         0     1.0000    0.0000     1.0000    1.0000
7.4
Latent variables
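Figure 7.2: Ordinal variable with three scale steps (y = 1, 2, 3) generated by cutting a continuous
variable at two thresholds (τ1, τ2)
y = \begin{cases} 1 & \text{if } \eta < \tau_1 \\ 2 & \text{if } \tau_1 \le \eta < \tau_2 \\ \vdots \\ s & \text{if } \tau_{s-1} \le \eta \end{cases} \qquad (7.7)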
An example of this point of view is illustrated in Figure 7.2, where the latent
variable is assumed to have a symmetric distribution, for example a logistic
or a Normal distribution.
Although (7.7) can be formally seen as a kind of link function, modelling
the data by assuming a latent variable underlying the ordinal response is not
formally a generalized linear model.
However, it can be shown (see e.g. McCullagh and Nelder, 1989) that the
latent variable approach gives a model that is identical to the proportional
odds model with a logit link, for the case where the latent variable has a logistic distribution. The estimated intercepts would be the estimated thresholds
for the latent variables.
In a similar way, a proportional odds model using a complementary log-log link corresponds to a latent variable having a so-called extreme value
distribution. This is the well-known proportional hazards model used in
survival analysis (Cox 1972).
[Figure 7.3: ordinal regression model; the scale of the latent variable is divided into regions y = 1, y = 2 and y = 3]
A similar approach can also be used for the case where the latent variable
is assumed to follow a Normal distribution. In the Genmod or Logistic procedures in SAS it is possible to specify the form of the link function to be
logistic, complementary log-log, or Normal. This leads to a class of models
called ordinal regression models, for example ordinal logit regression or ordinal probit regression. The concept of an ordinal regression model can be
illustrated as in Figure 7.3. We observe the ordinal variable y that has values
1, 2 or 3. y = 1 is observed if the latent variable η is smaller than the lowest
threshold, which has a value close to 8. We observe y = 2 if, approximately,
8 ≤ η < 11.5 and we observe y = 3 if η > 11.5. In practice the scale of η
cannot be determined. The scale of η can be chosen arbitrarily, for example
such that the distribution of η is standard Normal for one of the values of x.
Note that probit models and logistic regression models can also be derived
as models with latent variables. In these cases it is assumed that the observations are generated by a latent variable: if this latent variable is smaller
than a threshold we observe Y = 1, else Y = 0; see Figure 5.1 on page 86.
As a comparison with the results given on page 149 we have analyzed the data
on page 145 using the Logistic procedure and a Normal link. The following
results were obtained:
                Intercept    Intercept and
Criterion         Only        Covariates
AIC             5568.351       5507.436
SC              5585.804       5530.706
-2 LOG L        5562.351       5499.436
Score               .              .

                 Parameter   Standard      Wald         Pr >      Standardized
Variable   DF    Estimate     Error     Chi-Square   Chi-Square     Estimate
INTERCP1    1      0.1749     0.0258      46.0795      0.0001         .
INTERCP2    1      0.9355     0.0300     974.8408      0.0001         .
INTERCP3    1      1.3266     0.0352    1420.1155      0.0001         .
U           1     -0.8415     0.1071      61.6773      0.0001      -0.173150
The fit of the model, and the conclusions, are similar to the logistic model.
The three thresholds are estimated to be 0.17, 0.94 and 1.33. For x = 0 this
would give the probabilities as 0.5694, 0.2558, 0.0825 and 0.0923. For x = 1
the mean value of η is 0.8415 so the probabilities are 0.8463, 0.1159, 0.0227
and 0.0151.
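The category probabilities for x = 0 are obtained by evaluating the standard Normal distribution function at the three estimated thresholds and taking successive differences. A minimal sketch of that calculation, using the unrounded intercepts from the output (norm_cdf is our own helper built on the error function):

```python
import math

def norm_cdf(z):
    """Standard Normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Estimated thresholds (INTERCP1 to INTERCP3) from the Normal-link output.
thresholds = [0.1749, 0.9355, 1.3266]

# Cumulative probabilities for x = 0, with 1.0 appended for the last category.
cumulative = [norm_cdf(t) for t in thresholds] + [1.0]

# Category probabilities are successive differences of the cumulative ones.
probs = [cumulative[0]] + [
    cumulative[i] - cumulative[i - 1] for i in range(1, len(cumulative))
]

for category, prob in enumerate(probs, start=1):
    print(f"category {category}: probability = {prob:.4f}")
```

For other values of x the thresholds are shifted by the linear predictor before the distribution function is applied.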
7.5
A Genmod example
Example 7.3 Koch and Edwards (1988) considered analysis of data from a
clinical trial on the response to treatment for arthritis pain. The data are as
follows:
                           Response
Gender   Treatment   Marked   Some   None
Female   Active        16       5      6
Female   Placebo        6       7     19
Male     Active         5       2      7
Male     Placebo        1       0     10
cumulative probits, using the Genmod procedure of SAS (2000b). The data
were input in a form where the data lines had the form
F A 3 16
F A 2
...
The program was written as follows:
PROC GENMOD data=Koch order=formatted;
CLASS gender treat;
FREQ count;
MODEL response = gender treat gender*treat/
LINK=cumlogit aggregate=response TYPE3;
RUN;
                                  Standard       Wald 95%
Parameter           DF  Estimate    Error     Confidence Limits   ChiSquare
Intercept1           1    3.6746    1.0125     1.6901    5.6591      13.17
Intercept2           1    4.5251    1.0341     2.4983    6.5519      19.15
gender        F      1   -3.2358    1.0710    -5.3350   -1.1366       9.13
gender        M      0    0.0000    0.0000     0.0000    0.0000        .
treat         A      1   -3.7826    1.1390    -6.0150   -1.5503      11.03
treat         P      0    0.0000    0.0000     0.0000    0.0000        .
gender*treat  F  A   1    2.1110    1.2461    -0.3312    4.5533       2.87
gender*treat  F  P   0    0.0000    0.0000     0.0000    0.0000        .
gender*treat  M  A   0    0.0000    0.0000     0.0000    0.0000        .
gender*treat  M  P   0    0.0000    0.0000     0.0000    0.0000        .
Scale                0    1.0000    0.0000     1.0000    1.0000
Source          DF   ChiSquare   Pr > ChiSq
gender           1     18.01      <.0001
treat            1     28.15      <.0001
gender*treat     1      3.60      0.0579
We can note that there is a slight (but not significant) interaction; that there
are significant gender differences and that the treatment has a significant
effect. The signs of the parameters indicate that patients on active treatment
experienced a higher degree of pain relief and that the females experienced
better pain relief than the males. The cumulative probit model gave similar
results except that the interaction term was further from being significant
(p = 0.11).
7.6
Exercises
Exercise 7.1 Ezdinli et al (1976) studied two treatments against lymphocytic lymphoma. After the experiment the tumour of each patient was graded
on an ordinal scale from "Complete response" to "Progression". Examine
whether the treatments differ in their efficacy by fitting an appropriate ordinal regression model. You are also free to analyze the data using other
methods that you may have met during your training.
                       Treatment
                       BP     CP    Total
Complete response      26     31      57
Partial response       51     59     110
No change              21     11      32
Progression            40     34      74
Total                 138    135     273
Exercise 7.2 The following data, from Hosmer and Lemeshow (1989), come
from a survey on women's attitudes towards mammography. The women
were asked the question "How likely is it that mammography could find a
new case of breast cancer". They were also asked about recent experience of
mammography. Results:
Mammography
experience
Never
Over 1 year ago
Within the past year
8. Additional topics
8.1
Variance heterogeneity
(8.1)
(8.2)
Mean model
OBS   PARM        LEVEL1     DF   ESTIMATE   STDERR      CHISQ     PVAL
 1    INTERCEPT               1     9.9075   0.3536    785.2685   0.0001
 2    MEDIUM      Diatrizo    1    11.4845   0.5701    405.8269   0.0001
 3    MEDIUM      Hexabrix    1    -5.9475   0.4859    149.8140   0.0001
 4    MEDIUM      Isovist     1    -8.2431   0.4859    287.7796   0.0001
 5    MEDIUM      Mannitol    1    -2.2197   0.4859     20.8680   0.0001
 6    MEDIUM      Omnipaqu    1    -1.5175   0.4743     10.2347   0.0014
 7    MEDIUM      Ringer      1    -9.6975   0.5175    351.0883   0.0001
 8    MEDIUM      Ultravis    0     0.0000   0.0000       .        .
 9    SCALE                   0     1.0000   0.0000       .        .
PARM        LEVEL1     DF   ESTIMATE   STDERR     CHISQ     PVAL
INTERCEPT               1     2.9560   0.5000    34.9524   0.0001
MEDIUM      Diatrizo    1     1.3196   0.8062     2.6788   0.1017
MEDIUM      Hexabrix    1    -1.4736   0.6872     4.5982   0.0320
MEDIUM      Isovist     1    -2.1732   0.6872    10.0016   0.0016
MEDIUM      Mannitol    1    -0.3517   0.6872     0.2620   0.6087
MEDIUM      Omnipaqu    1     0.0905   0.6708     0.0182   0.8927
MEDIUM      Ringer      1    -6.2914   0.7319    73.8867   0.0001
MEDIUM      Ultravis    0     0.0000   0.0000      .        .
SCALE                   0     0.5000   0.0000      .        .
8.2
Survival models
Survival data are data for which the response is the time a subject has survived a certain treatment or condition. Survival models are used in epidemiology, as well as in lifetime testing in industry. Censoring is a special feature
of survival data. Censoring means that the survival time is not known for
all individuals when the study is finished. For right censored observations
we only know that the survival time is at least the time at which censoring
occurred. Left censoring, i.e. observations for which we do not know e.g. the
duration of disease when the study started, is also possible.
Denote the density function for the survival time with f(t), and let the
corresponding distribution function be F(t) = \int_0^t f(s) ds. The survival
function is defined as
S(t) = 1 - F(t) \qquad (8.3)
The hazard function is
h(t) = \frac{f(t)}{S(t)} = -\frac{d \log(S(t))}{dt} \qquad (8.4)
The hazard function measures the instantaneous risk of dying, i.e. the probability of dying in the next small time interval of duration dt. The cumulative
hazard function is
H(t) = \int_0^t h(s) ds \qquad (8.5)
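Relations (8.3) to (8.5) imply that S(t) = exp(−H(t)), since h(t) is minus the derivative of log S(t) and S(0) = 1. The sketch below checks this numerically for a Weibull distribution; the shape and scale values are hypothetical, and the cumulative hazard is obtained by a simple trapezoidal rule rather than analytically.

```python
import math

# Hypothetical Weibull parameters, chosen only for illustration.
shape, scale = 1.5, 10.0

def density(t):
    """Weibull density f(t)."""
    return (shape / scale) * (t / scale) ** (shape - 1) * math.exp(-((t / scale) ** shape))

def survival(t):
    """Weibull survival function S(t) = 1 - F(t)."""
    return math.exp(-((t / scale) ** shape))

def hazard(t):
    """Hazard function h(t) = f(t) / S(t), as in (8.4)."""
    return density(t) / survival(t)

def cumulative_hazard(t, steps=20000):
    """H(t) from (8.5), integrating h(s) from 0 to t by the trapezoidal rule."""
    width = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    for k in range(1, steps):
        total += hazard(k * width)
    return total * width

t = 8.0
H = cumulative_hazard(t)
print(f"H({t}) = {H:.5f}")
print(f"exp(-H) = {math.exp(-H):.5f},  S({t}) = {survival(t):.5f}")
```

The agreement of the last two printed values is what allows a model for the hazard to be translated into a model for survival probabilities.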
8.2.1
An example
AG
   WBC     Surv.
   4400     56
   3000     65
   4000     17
   1500      7
   9000     16
   5300     22
  10000      3
  19000      4
  27000      2
  28000      3
  31000      8
  26000      4
  21000      3
  79000     30
 100000      4
 100000     43
(WBC) for each patient is also given. The data are reproduced in Table
8.1.
As a first attempt, we model the data using a Gamma distribution. The log
of the WBC was used. The interaction ag*logwbc was not significant. The
program is
PROC GENMOD data=feigl;
CLASS ag;
MODEL survival = ag logwbc /
DIST=gamma obstats residuals;
MAKE obstats out=ut;
RUN;
[Figure 8.1: Residual Model Diagnostics. Panels: Normal Plot of Residuals (residual against normal score); I Chart of Residuals (residual against observation number); Histogram of Residuals (frequency against residual); residuals against fit.]
Criterion             DF       Value      Value/DF
Deviance              30      40.0440      1.3348
Scaled Deviance       30      38.2985      1.2766
Pearson Chi-Square    30      29.6222      0.9874
Scaled Pearson X2     30      28.3310      0.9444
Log Likelihood         .    -146.3814       .
Parameter        DF   Estimate   Std Err   ChiSquare   Pr>Chi
INTERCEPT         1    -0.0020    0.0262      0.0056   0.9403
AG         +      1    -0.0344    0.0150      5.2495   0.0220
AG         -      0     0.0000    0.0000       .        .
LOGWBC            1     0.0061    0.0024      6.5899   0.0103
SCALE             1     0.9564    0.2065       .        .
The fit of the model is reasonable with a scaled deviance of 38.3 on 30 df,
Deviance/df = 1.28. The effects of WBC and AG are both significant. We
asked the procedure to output predicted values and standardized Deviance
residuals to a file. A residual plot based on these data is given in Figure 8.1.
The distribution seems reasonable; one very large fitted value stands out.
8.3
Quasi-likelihood
In general linear models, the assumption that the observations come from a
Normal distribution is not crucial. Estimation of parameters in GLMs is often done using some variation of Least squares, for which certain optimality
properties are valid even under non-normality. Thus, we can estimate parameters in, for example, regression models, without being too much worried
by non-normality.
Quasi-likelihood can give a similar peace of mind to users of generalized linear
models. In principle, we need to specify a distribution (Poisson, binomial,
Normal etc.) when we fit generalized linear models. However, Wedderburn
(1974) noted the following property of generalized linear models.
The score equations for the regression coecients have the form
X
i 1
vi (yi i ()) = 0
(8.6)
Note that this expression only contains the first two moments, the mean i
and the variance vi . Wedderburn (1974) suggested that this can be used
to define a class of estimators that do not require explicit expressions for
the distributions. A type of generalized linear models can be constructed by
specifying the linear predictor and the way the variance v depends on .
The integral of (8.6) can be seen as a kind of likelihood function. This integral
is
Q (yi , i ) =
Zi
yi t
dt + f (yi )
vi
(8.7)
where $f(y_i)$ is some arbitrary function of $y_i$. $Q(y_i, \mu_i)$ is called a
quasi-likelihood. Maximizing (8.7) with respect to the parameters of the model
yields quasi-likelihood (QL) estimators. QL estimators can be shown to have
nice asymptotic properties. First, they are consistent, regardless of whether
the variance assumption $v_i = V(\mu_i)$ is true, as long as the linear predictor is
correctly specified. Second, QL estimators are asymptotically unbiased and
efficient among the class of estimating equations which are linear functions
of the data (McCullagh, 1983).
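As a small numerical illustration of the quasi-likelihood integral, the sketch below evaluates it for the variance function V(t) = t, the Poisson-type choice. Two things are assumptions made for the illustration and are not taken from the text: the lower integration limit is set to y (which fixes the arbitrary function f so that Q = 0 when the mean equals y), and the variance is treated as a function of the integration variable. The function name `quasi_likelihood` and the values of y and mu are hypothetical.

```python
import math

def quasi_likelihood(y, mu, variance, steps=2000):
    """Integrate (y - t) / V(t) from t = y to t = mu with Simpson's rule."""
    if steps % 2:
        steps += 1                      # Simpson's rule needs an even number of steps
    h = (mu - y) / steps

    def integrand(t):
        return (y - t) / variance(t)

    total = integrand(y) + integrand(mu)
    for k in range(1, steps):
        weight = 4 if k % 2 else 2
        total += weight * integrand(y + k * h)
    return total * h / 3

# Illustrative values (not from the book).
y, mu = 4.0, 6.5

# Numerical quasi-likelihood for V(t) = t.
q_numeric = quasi_likelihood(y, mu, variance=lambda t: t)

# Closed form of the same integral: y*log(mu/y) - (mu - y).
q_closed = y * math.log(mu / y) - (mu - y)

print(q_numeric, q_closed)
```

Up to terms that do not involve the mean, the closed form is the Poisson log likelihood y log(mu) - mu, which is why this variance function leads to the same estimating equations as a Poisson model.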
Estimators of the variances of QL estimators can be obtained in different
ways. The matrix $I$ of second-order derivatives of (8.7) gives the QL equivalent
of the Fisher information matrix. The inverse $I^{-1}$ is an estimator of the
covariance matrix of the parameter estimates. This is called the model-based
estimator of $\mathrm{Cov}(\hat\beta)$. An alternative approach is to use the so-called
empirical, or robust, estimator, which is less sensitive to assumptions regarding
variances and covariances. This is also called the sandwich estimator. It has
the general form
$$\widehat{\mathrm{Cov}}(\hat\beta) = I_0^{-1} I_1 I_0^{-1}$$
where
$$I_0 = \sum_{i=1}^{k} \frac{\partial \mu_i'}{\partial \beta}\, V_i^{-1}\, \frac{\partial \mu_i}{\partial \beta}$$
is the model-based information matrix and $I_1$ is the corresponding matrix in
which the assumed covariances are replaced by empirical ones.
8.4
The quasi-likelihood approach is sometimes useful when the data show signs
of over-dispersion. Since the empirical variance estimates obtained in QL
estimation are rather robust against the variance assumption, QL estimation
is a viable alternative to the methods for modeling over-dispersion presented
in earlier chapters, at least if the sample is reasonably large. We will illustrate
this idea based on a set of data from Liang and Hanfelt (1994).
Example 8.3 Two groups of rats, each consisting of sixteen pregnant females,
were fed different diets during pregnancy and lactation. The control
diet was a standard food whereas the treatment diet contained a certain
chemical agent. After three weeks it was recorded how many of the live-born
pups were still alive. The data are given as x/n, where x is the number
of surviving pups and n is the total litter size.
Control:  13/13  12/12   9/9    9/9    8/8    8/8   12/13  11/12
           9/10   9/10   8/9   11/13   4/5    5/7    7/10   7/10
Treated:  12/12  11/11  10/10   9/9   10/11   9/10   9/10   8/9
           8/9    4/5    7/9    4/7    5/10   3/6    3/10   0/7
A standard logistic model fits rather badly, with a deviance of 86.19 on
30 df, p < 0.0001. In this model the treatment effect is significant, both
when we use a Wald test (p = 0.0036) and when we use a likelihood ratio
test (p = 0.0027).
The bad fit may be caused by heterogeneity among the females: different
females may have different ability to take care of their pups. If it can be
assumed that the dispersion parameter is the same in both groups, this can
be modeled by including a dispersion parameter in the model, as discussed
in Chapter 5. Such a model gives a non-significant treatment effect (Wald
test: p = 0.0855; LR test: p = 0.0765).
The quasi-likelihood estimates can be obtained in Proc Genmod by using the
following trick. Proc Genmod can use QL, but only in repeated-measures
models. We can then request a repeated-measures analysis but with only one
measurement per female. The program can be written as
PROC GENMOD data=tera;
CLASS treat litter;
MODEL x/n=treat /
DIST=bin LINK=logit type3;
REPEATED subject=litter;
RUN;
The output is given in two parts. The first part uses the model-based estimates of variances and these results are identical to the first output. The
second part presents the QL results:
Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

                         Standard     95% Confidence
Parameter     Estimate      Error         Limits            Z   Pr > |Z|
Intercept       1.2220     0.3813    0.4747    1.9693    3.20     0.0014
treat     c     0.9612     0.4751    0.0300    1.8925    2.02     0.0431
treat     t     0.0000     0.0000    0.0000    0.0000     .        .

Source     DF     ChiSquare     Pr > ChiSq
treat                  2.89         0.0890
In these results the Wald test is significant (p = 0.043) but the Score test is
not (p = 0.0890).
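Because the model contains only a group indicator, the empirical standard errors can be checked by hand. The sketch below is our own illustration, not the Genmod algorithm: it assumes that each litter is one cluster, that the working model is binomial with a logit link, and that no small-sample correction is applied. The function name `group_logit` is hypothetical; the data are the litters of Example 8.3.

```python
import math

# Litters as (survivors x, litter size n), copied from Example 8.3.
control = [(13, 13), (12, 12), (9, 9), (9, 9), (8, 8), (8, 8), (12, 13), (11, 12),
           (9, 10), (9, 10), (8, 9), (11, 13), (4, 5), (5, 7), (7, 10), (7, 10)]
treated = [(12, 12), (11, 11), (10, 10), (9, 9), (10, 11), (9, 10), (9, 10), (8, 9),
           (8, 9), (4, 5), (7, 9), (4, 7), (5, 10), (3, 6), (3, 10), (0, 7)]

def group_logit(litters):
    """Return (logit, model-based variance, empirical variance) for one group."""
    x_total = sum(x for x, _ in litters)
    n_total = sum(n for _, n in litters)
    p = x_total / n_total
    info = n_total * p * (1 - p)                       # I0: binomial information
    meat = sum((x - n * p) ** 2 for x, n in litters)   # I1: squared litter residuals
    return math.log(p / (1 - p)), 1 / info, meat / info ** 2

logit_c, model_var_c, emp_var_c = group_logit(control)
logit_t, model_var_t, emp_var_t = group_logit(treated)

# The treated group is the reference level, as in the output above.
intercept = logit_t
treat_effect = logit_c - logit_t
se_intercept = math.sqrt(emp_var_t)
se_treat = math.sqrt(emp_var_c + emp_var_t)   # groups share no litters, so variances add

print(round(intercept, 4), round(se_intercept, 4))
print(round(treat_effect, 4), round(se_treat, 4), round(treat_effect / se_treat, 2))
```

Under these assumptions the estimates and empirical standard errors should agree with the GEE table to about three decimals, while the model-based variances (`model_var_c`, `model_var_t`) are considerably smaller, which is the over-dispersion showing up.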
In the paper by Liang and Hanfelt (1994), a simulation study compared the
performances of different methods for allowing for overdispersion in this type
of data. The methods included modeling as a beta-binomial distribution, and
two QL approaches. In the simulations, the overdispersion parameter was
8.5
data.
The main problem with repeated measures data is that observations within
one individual are correlated. There are several ways to model this correlation. We will here only consider the Generalized estimating equations
approach of Liang and Zeger (1986); see also Diggle, Liang and Zeger (1994).
This approach is available in the Genmod procedure in SAS (2000b). The
GEE approach can be seen as an extension of the quasi-likelihood approach
to a multivariate mean vector.
Models for repeated measures data have the same basic components as other
generalized linear models. We need to specify a link function, a distribution
and a linear predictor. But in addition we need to consider how observations
within individuals are correlated.
Suppose that we store all data for individual i in the vector $Y_i$ that has
elements $Y_i = [Y_{i1}, \ldots, Y_{in_i}]'$. The corresponding vector of mean values is
$\mu_i = [\mu_{i1}, \ldots, \mu_{in_i}]'$. Let $V_i$ be the covariance matrix of $Y_i$. The values of
the independent variables for individual i at measurement (occasion) j are
collected in the vector $x_{ij} = [x_{ij1}, \ldots, x_{ijp}]'$.
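The vector $\beta$ contains the parameters to be estimated. The GEE approach
means that we estimate the parameters by solving the GEE equation
$$\sum_{i=1}^{k} \frac{\partial \mu_i'}{\partial \beta}\, V_i^{-1} \left( Y_i - \mu_i(\beta) \right) = 0 \qquad (8.8)$$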
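m-dependent: $\mathrm{Corr}(Y_{ij}, Y_{i,j+t}) = 1$ if $t = 0$, $\rho_t$ if $t = 1, \ldots, m$, and $0$ if $t > m$.
The correlation depends on the time span between the observations. The
correlation is 0 for occasions more than m time units apart.
Exchangeable: $\mathrm{Corr}(Y_{ij}, Y_{ik}) = 1$ if $j = k$ and $\rho$ if $j \neq k$. All correlations are equal.
Unstructured: $\mathrm{Corr}(Y_{ij}, Y_{ik}) = 1$ if $j = k$ and $\rho_{jk}$ if $j \neq k$.
Autoregressive, AR(1): $\mathrm{Corr}(Y_{ij}, Y_{i,j+t}) = \rho^t$ for $t = 0, 1, \ldots, n_i - j$.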
As usual, the choice of model for the covariance structure is a compromise
between realism and parsimony. A model with more parameters is often
more realistic, but may be more difficult to interpret and may give convergence
problems. The fixed structure means that the user enters all correlations,
so there are no parameters to estimate. The exchangeable structure
includes only one parameter. The AR(1) structure also has only one parameter
but it is often intuitively appealing since the correlations decrease with
increasing distance. If we assume unstructured correlations we need to estimate
k(k - 1)/2 correlations, while the m-dependent correlation structure
includes fewer correlations.
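To make the differences between the structures concrete, the following sketch builds working correlation matrices as nested lists. It is only an illustration: the function names, the number of occasions and the correlation values are our own choices and do not come from the Genmod procedure.

```python
def exchangeable(n, rho):
    """All off-diagonal correlations equal rho (one parameter)."""
    return [[1.0 if j == k else rho for k in range(n)] for j in range(n)]

def ar1(n, rho):
    """Correlation rho**t for occasions t time units apart (one parameter)."""
    return [[rho ** abs(j - k) for k in range(n)] for j in range(n)]

def m_dependent(n, rhos):
    """rhos[t - 1] is the correlation at lag t = 1..m; zero beyond lag m."""
    m = len(rhos)

    def corr(lag):
        if lag == 0:
            return 1.0
        return rhos[lag - 1] if lag <= m else 0.0

    return [[corr(abs(j - k)) for k in range(n)] for j in range(n)]

# Four occasions per individual, with illustrative correlation values.
for name, matrix in [("exchangeable", exchangeable(4, 0.3)),
                     ("AR(1)", ar1(4, 0.5)),
                     ("2-dependent", m_dependent(4, [0.6, 0.2]))]:
    print(name)
    for row in matrix:
        print("  ", row)
```

With four occasions the unstructured alternative would need 4(4 - 1)/2 = 6 separate correlations, compared with one parameter for the exchangeable and AR(1) structures and two for the 2-dependent structure.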
Example 8.4 Sixteen children (taken from the data of Lipsitz et al., 1994)
were followed from the age of 9 to the age of 12. The children were from
two different cities. The binary response variable was the wheezing status of
the child. The explanatory variables were city, age, and maternal smoking
status. The structure of the data is given in Table 8.2; the complete data set
is available from the publisher's home page as the file Wheezing.dat.
Child   City       Age   Smoke   Wheeze
1       Portage      9       0        1
1       Portage     10       0        1
1       Portage     11       0        1
1       Portage     12       0        0
2       Kingston     9       1        1
2       Kingston    10       2        1
2       Kingston    11       2        0
Criterion             DF       Value    Value/DF
Deviance              60     76.9380      1.2823
Scaled Deviance       60     76.9380      1.2823
Pearson Chi-Square    60     63.9651      1.0661
Scaled Pearson X2     60     63.9651      1.0661
Log Likelihood         .    -38.4690       .
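Covariance matrix, model-based (covariances above the diagonal, correlations below):
           PRM1        PRM2        PRM4        PRM5
PRM1    5.71511    -0.22386    -0.53133     0.01658
PRM2   -0.13847     0.45733   -0.002411     0.01877
PRM4   -0.96838    -0.01553     0.05268    -0.01658
PRM5    0.01587     0.06353    -0.16530     0.19088

Covariance matrix, empirical (covariances above the diagonal, correlations below):
           PRM1        PRM2        PRM4        PRM5
PRM1    9.33891    -0.85121    -0.83232    -0.16667
PRM2   -0.40467     0.47378     0.05737     0.04007
PRM4   -0.97676     0.29893     0.07775   -0.002201
PRM5   -0.15108     0.16125    -0.02187     0.13032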
                                  Empirical   Upper 95%
Parameter             Estimate      Std Err       Limit         Z    Pr>|Z|
INTERCEPT               1.2754       3.0560      7.2650    0.4174    0.6764
CITY       Kingston     0.1219       0.6883      1.4709    0.1771    0.8595
CITY       Portage      0.0000       0.0000      0.0000    0.0000    0.0000
AGE                    -0.2036       0.2788      0.3429    -.7302    0.4652
SMOKE                  -0.0928       0.3610      0.6147    -.2571    0.7971
Scale                   0.9991        .           .         .
Estimates of the parameters of the model are given, along with their empirical
standard error estimates. None of the parameters is significantly different
from 0. This, of course, may be related to the rather small sample size.
8.6
Mixed models are models where some of the independent variables are assumed
to be fixed, i.e. chosen beforehand, while others are seen as randomly
sampled from some population or distribution. Mixed models have proven to
be very useful in modeling different phenomena. An example of an application
of mixed models is when several measurements have been taken on the
same individual. In such cases the effect of individual can often be included
in the model as a random effect.
A mixed linear model for a continuous response variable y can be written,
for each individual i, as
$$y_i = X_i \beta + Z_i u_i + e_i \qquad (8.9)$$
There are situations where the response is of a type amenable to GLIM
estimation but where there is a need to assume that some of the independent
variables are random. Breslow and Clayton (1993) and Wolfinger
and O'Connell (1993) have explored a pseudo-likelihood approach to fitting
models such as (8.9) but where the distributions are free to be any member
of the exponential family, and where a link function is used to model the
expected response as a function of the linear predictor. A SAS macro, Glimmix,
has been written to do the estimation. Essentially, the macro iterates
between Proc Mixed and Proc Genmod. The method and the macro are
described in Littell et al. (1996).
Example 8.5 Thirty-three children between the ages of 6 and 16 years,
all suffering from monosymptomatic nocturnal enuresis, were enrolled in a
study. The study was carried out with a double-blind randomized three-period
cross-over design. The children received 0.4 mg Desmopressin, 0.8
mg Desmopressin, or placebo tablets at bedtime for five consecutive nights
with each dosage. A wash-out period of at least 48 hours without any medication
was interspersed between treatment periods. Wet and dry nights were
documented; for more details about the study and its analysis see Nevéus
et al. (1999), and Olsson and Nevéus (2000). The data consisted of nightly
recordings, where a dry night was recorded as 1 and a wet night as 0. The
nights were grouped into sets of five nights where the same treatment had
been given. The structure of the data is given in Table 8.3. Only one patient
is listed; the original data set contained 33 patients.
Table 8.3: Raw data for one patient in the enuresis study
Patient   Period   Dose   Night   Dry
1         1        1      1       1
1         1        1      2       1
1         1        1      3       1
1         1        1      4       1
1         1        1      5       1
1         2        0      1       0
1         2        0      2       0
1         2        0      3       1
1         2        0      4       0
1         2        0      5       0
1         3        2      1       1
1         3        2      2       1
1         3        2      3       1
1         3        2      4       1
1         3        2      5       1
Following Jones and Kenward (1989), the linear predictor part of a general
model for our data may be written as
$$\eta_{ijk} = \mu + s_{ik} + \pi_j + \tau_{d[i,j]} + \lambda_{d[i,j-1]} \qquad (8.10)$$
ijk
1 ijk
(8.11)
Effects included
Dose
Dose
Dose,
Dose,
Dose,
Dose,
Dose,
Dose,
.0001
.0001
.0001
.0001
.0001
.0001
.0001
Seq
Period
After eff.
After eff., Period
After eff., Seq.
Period, Seq
Period
Sequence
After
effect
.7442
.0938
.0759
.0898
.7762
.7577
.6272
.8713
.6573
8.7 Exercises
Exercise 8.1 Survival times in weeks were recorded for patients with acute
leukemia. For each patient the white cell count (wbc, in thousands) and
the AG factor were also recorded. Patients with a positive AG factor had Auer
rods and/or granulation of the leukemia cells in the bone marrow at diagnosis,
while the AG-negative patients had not.
Time     wbc   AG      Time     wbc   AG
  65     2.3    1        65     3.0    0
 108    10.5    1         3    10.0    0
  56     9.4    1         4    26.0    0
   5    52.0    1        22     5.3    0
 143     7.0    1        17     4.0    0
 156     0.8    1         4    19.0    0
 121    10.0    1         3    21.0    0
  26    32.0    1         8    31.0    0
  65   100.0    1         7     1.5    0
   1   100.0    1         2    27.0    0
 100     4.3    1        30    79.0    0
   4    17.0    1        43   100.0    0
  22    35.0    1        16     9.0    0
  56     4.4    1         3    28.0    0
 134     2.6    1         4   100.0    0
  39     5.4    1
   1   100.0    1
  16     6.0    1
The four largest wbc values are actually larger than 100. Construct a model
that can predict survival time based on wbc and ag. Note that the wbc
value may need to be transformed. Also note that there are no censored
observations so a straight-forward application of generalized linear models is
possible. Try a survival distribution based on the Gamma distribution.
Exercise 8.2 The data in Exercise 1.2 show some signs of being heteroscedastic.
Re-analyze these data using the method discussed in Section 8. The data
are as follows:
The level of cortisol has been measured for three groups of patients with
different syndromes: a) adenoma, b) bilateral hyperplasia, c) carcinoma. The
Air flow
Variety 1
Variety 2
b:   8.3   3.8   3.9   7.8   9.1  15.4   7.7   6.5   5.7  13.6
c:  10.2   9.2   9.6  53.8  15.8
Tube   x2   n2   Var1   Var2
1       9   10   F      K
2       7   10   F      K
3       7   10   F      K
4       8   10   F      K
5       9   10   F      K
6      10   10   F      K
7      10   10   F      K
8      10   10   F      K
9      10   10   F      K
10     10   10   F      K
11      9   10   F      K
Pot indicates pot number and Tube indicates tube number. n2 is the number
of lice in the tube, and x2 is the number of lice eating after two hours. Var1
and Var2 are codes for Variety 1 and Variety 2, respectively.
Formulate a model for these data that can answer the question whether the
eating preferences of the lice depend on Variety 1, Variety 2 or a combination
of these. Hint: Since repeated observations are made on the same plants, it
may be reasonable to include Pot as a random factor in the model.
Pot
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
5
5
5
5
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
5
5
5
5
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
5
5
5
5
1
1
1
1
2
2
2
2
3
3
3
3
4
4
Tube
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
x2
9
7
7
8
9
10
10
10
10
10
9
7
8
9
8
9
9
9
10
10
8
7
10
7
9
8
10
8
10
10
10
10
7
10
7
10
8
7
10
7
9
7
7
8
7
10
9
8
8
9
7
9
10
8
6
9
9
8
8
9
8
8
8
8
7
8
8
7
9
8
9
6
7
8
n2
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
Var1
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
K
K
K
K
K
K
K
K
K
K
K
K
K
K
Var2
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
H
H
H
H
H
H
H
H
H
H
H
H
H
H
4
4
5
5
5
5
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
5
5
5
5
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
5
5
5
5
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
5
5
5
5
1
1
1
1
2
2
2
2
15
16
17
18
19
20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
1
2
3
4
5
6
7
8
8
9
6
9
7
8
9
7
9
8
7
9
8
8
7
7
7
6
8
8
8
7
7
7
2
7
9
10
9
9
7
9
8
7
7
7
10
7
8
9
10
8
10
9
8
7
10
8
9
7
8
6
8
7
8
8
10
9
7
8
10
8
7
10
8
7
7
9
10
8
8
8
8
7
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
K
K
K
K
K
K
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
K
K
K
K
K
K
K
K
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
A
A
A
A
A
A
A
A
3
3
3
3
4
4
4
4
5
5
5
5
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
5
5
5
5
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
5
5
5
5
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
5
5
5
5
1
1
9
10
11
12
13
14
15
16
17
18
19
20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
1
2
6
3
7
8
4
8
9
8
6
5
10
7
8
5
5
6
8
10
3
7
6
9
6
8
8
5
8
6
5
6
9
7
8
10
9
9
8
8
7
7
7
8
10
6
10
9
8
9
9
8
8
9
7
7
6
4
8
7
5
7
7
5
6
6
10
9
8
8
6
7
5
8
7
6
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
K
K
K
K
K
K
K
K
K
K
K
K
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
H
H
A
A
A
A
A
A
A
A
A
A
A
A
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
A
A
1
1
2
2
2
2
3
3
3
3
4
4
4
4
5
5
5
5
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
5
5
5
5
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
5
5
5
5
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
9
8
9
10
8
5
5
9
8
8
10
9
9
9
8
7
9
5
9
9
9
5
10
8
9
7
8
10
7
10
9
7
8
8
8
7
10
9
9
10
10
10
10
6
9
8
8
10
10
8
4
8
8
8
7
9
9
8
9
8
9
8
9
9
8
9
8
8
8
7
8
8
8
9
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
8
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
K
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
5
5
5
5
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
5
5
5
5
17
18
19
20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
10
9
10
8
9
7
8
7
8
6
8
7
10
7
9
8
8
10
8
9
10
10
8
9
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
F
F
F
F
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
F
F
F
F
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
H
Appendix A: Introduction to matrix algebra
y1
y2
Example: $A = \begin{pmatrix} 1 & 2 & 4 \\ 1 & 6 & 3 \end{pmatrix}$ is a matrix with two rows and three columns.
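Example: $B = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1c} \\ b_{21} & b_{22} & \cdots & b_{2c} \\ \vdots & \vdots & & \vdots \\ b_{r1} & b_{r2} & \cdots & b_{rc} \end{pmatrix}$ is a matrix with r rows and c
columns. The general element of the matrix B is $b_{ij}$. The first index denotes
row, the second index denotes column.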
Vectors are often written using lowercase symbols like x, while matrices are
often written using uppercase letters like A. Both matrices and vectors are
written in bold.
aij = aji
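If $x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$ is a column vector with n elements, then
$x' = \begin{pmatrix} x_1 & x_2 & \cdots & x_n \end{pmatrix}$ is a row vector with n elements.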
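The transpose of $A = \begin{pmatrix} 1 & 2 & 4 \\ 1 & 6 & 3 \end{pmatrix}$ is $A' = \begin{pmatrix} 1 & 1 \\ 2 & 6 \\ 4 & 3 \end{pmatrix}$.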
Example: The matrix $A = \begin{pmatrix} 3 & 0 & 1 \\ 0 & 1 & 2 \\ 1 & 2 & 4 \end{pmatrix}$ is square and symmetric.
Definition: The elements aii in a square matrix are called the diagonal
elements.
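Definition: An identity matrix I is a symmetric matrix
$$I = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}$$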
a1 0
0
0 a2
0
ar
The transpose is 10 =
1 1
1 .
1
1
..
.
1
Calculations on matrices
Addition, subtraction and multiplication can be defined for matrices.
Definition: Equality: Two matrices A and B with the same dimension r × c
are equal if and only if $a_{ij} = b_{ij}$ for all i and j, i.e. if all elements are equal.
Definition: Addition: The sum of two matrices A and B that have the same
dimension is the matrix that consists of the sum of the elements of A and B.
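Example: If $A = \begin{pmatrix} 1 & 2 & 4 \\ 1 & 6 & 3 \end{pmatrix}$ and $B = \begin{pmatrix} 3 & 9 & 6 \\ 4 & 2 & 1 \end{pmatrix}$ then
$A + B = \begin{pmatrix} 4 & 11 & 10 \\ 5 & 8 & 4 \end{pmatrix}$.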
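Example: If $A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1c} \\ a_{21} & a_{22} & \cdots & a_{2c} \\ \vdots & \vdots & & \vdots \\ a_{r1} & a_{r2} & \cdots & a_{rc} \end{pmatrix}$ and $B = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1c} \\ b_{21} & b_{22} & \cdots & b_{2c} \\ \vdots & \vdots & & \vdots \\ b_{r1} & b_{r2} & \cdots & b_{rc} \end{pmatrix}$ then
$A + B = \begin{pmatrix} a_{11} + b_{11} & a_{12} + b_{12} & \cdots & a_{1c} + b_{1c} \\ a_{21} + b_{21} & a_{22} + b_{22} & \cdots & a_{2c} + b_{2c} \\ \vdots & \vdots & & \vdots \\ a_{r1} + b_{r1} & a_{r2} + b_{r2} & \cdots & a_{rc} + b_{rc} \end{pmatrix}$.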
For matrices that do not have the same dimensions, addition and subtraction
are not defined.
Matrix multiplication
Multiplication by a scalar
To multiply a matrix A by a scalar (= a number) c means that all elements
in A are multiplied by c.
Example: If $A = \begin{pmatrix} 1 & 2 & 4 \\ 1 & 6 & 3 \end{pmatrix}$ then $4 \cdot A = \begin{pmatrix} 4 & 8 & 16 \\ 4 & 24 & 12 \end{pmatrix}$.
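Example: $k \cdot A = \begin{pmatrix} k a_{11} & k a_{12} & \cdots & k a_{1c} \\ k a_{21} & k a_{22} & \cdots & k a_{2c} \\ \vdots & \vdots & & \vdots \\ k a_{r1} & k a_{r2} & \cdots & k a_{rc} \end{pmatrix}$.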
Multiplication by a matrix
Matrix multiplication of type $C = A \cdot B$ is defined only if the number of
columns in A is equal to the number of rows in B. If A has dimension p × r
and B has dimension r × q then the product $A \cdot B$ will have dimension p × q.
The elements of C are calculated as
$$c_{ij} = \sum_{k=1}^{r} a_{ik} b_{kj}.$$
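Example: If $A = \begin{pmatrix} 1 & 2 & 3 \\ 1 & 0 & -1 \end{pmatrix}$ and $B = \begin{pmatrix} 6 & 5 & 4 \\ -1 & 1 & -1 \\ 0 & 2 & 0 \end{pmatrix}$ then
$$AB = \begin{pmatrix} 1 \cdot 6 + 2 \cdot (-1) + 3 \cdot 0 & 1 \cdot 5 + 2 \cdot 1 + 3 \cdot 2 & 1 \cdot 4 + 2 \cdot (-1) + 3 \cdot 0 \\ 1 \cdot 6 + 0 \cdot (-1) - 1 \cdot 0 & 1 \cdot 5 + 0 \cdot 1 - 1 \cdot 2 & 1 \cdot 4 + 0 \cdot (-1) - 1 \cdot 0 \end{pmatrix} = \begin{pmatrix} 4 & 13 & 2 \\ 6 & 3 & 4 \end{pmatrix}.$$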
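The element formula for a matrix product translates directly into a short program. The sketch below is our own illustration (the helper name `matmul` is hypothetical) and uses a 2 × 3 matrix and a 3 × 3 matrix, with matrices stored as lists of rows.

```python
def matmul(a, b):
    """Multiply a p x r matrix by an r x q matrix, both given as lists of rows."""
    if any(len(row) != len(b) for row in a):
        raise ValueError("number of columns in A must equal number of rows in B")
    # c[i][j] is the sum over k of a[i][k] * b[k][j]
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

A = [[1, 2, 3],
     [1, 0, -1]]
B = [[6, 5, 4],
     [-1, 1, -1],
     [0, 2, 0]]

C = matmul(A, B)
print(C)  # [[4, 13, 2], [6, 3, 4]]
```

The dimension check mirrors the rule that the product is defined only when the number of columns in the first matrix equals the number of rows in the second; here `matmul(B, A)` would raise an error because B has three columns and A has two rows.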
Idempotent matrices
Definition: A matrix A is idempotent if A A = A.
Example: The inverse of the matrix $A = \begin{pmatrix} 5 & 10 \\ 3 & 2 \end{pmatrix}$ is
$A^{-1} = \begin{pmatrix} -0.1 & 0.5 \\ 0.15 & -0.25 \end{pmatrix}$.
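$$A A^{-1} = \begin{pmatrix} 5 & 10 \\ 3 & 2 \end{pmatrix} \begin{pmatrix} -0.1 & 0.5 \\ 0.15 & -0.25 \end{pmatrix} = \begin{pmatrix} 1.0 & 0 \\ 0 & 1.0 \end{pmatrix} = I.$$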
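For a 2 × 2 matrix the inverse can be written down explicitly, which gives a quick way of checking such calculations. The sketch below uses the standard closed-form expression for a 2 × 2 inverse; that formula is not stated in this appendix, and the function name `inverse_2x2` is our own.

```python
def inverse_2x2(m):
    """Invert [[a, b], [c, d]] as (1 / (ad - bc)) * [[d, -b], [-c, a]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular, so the inverse does not exist")
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[5, 10],
     [3, 2]]
A_inv = inverse_2x2(A)

# Multiply A by its inverse; the result should be the identity matrix
# up to floating point rounding.
product = [[sum(A[i][k] * A_inv[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]

print(A_inv)
print(product)
```

A matrix such as [[1, 2], [2, 4]], whose second row is twice the first, has determinant 0 and makes the function raise an error, which corresponds to the singular case.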
It is possible that the inverse $A^{-1}$ does not exist. A is then said to be
singular. The following relations hold for inverses:
The inverse of a symmetric matrix is symmetric.
$(A')^{-1} = (A^{-1})'$.
Generalized inverses
A matrix B is said to be a generalized inverse of the matrix A if ABA = A.
The generalized inverse of a matrix A is denoted by $A^-$. If A is nonsingular
then $A^- = A^{-1}$. When A is singular, $A^-$ is not unique. A generalized
inverse of a matrix A can be calculated as
$$A^- = (A'A)^- A'.$$
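Example: The vectors $t' = \begin{pmatrix} 1 & 0 & 0 \end{pmatrix}$, $u' = \begin{pmatrix} 0 & 1 & 0 \end{pmatrix}$ and
$v' = \begin{pmatrix} 0 & 0 & 1 \end{pmatrix}$ are linearly independent.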
Definition: The degree of linear independence among a set of vectors is
called the rank of the matrix that is composed by the vectors.
The following properties hold for the rank of a matrix:
The rank of $A^{-1}$ is equal to the rank of A.
The rank of $A'A$ is equal to the rank of A (it is also true that the rank of
$AA'$ is equal to the rank of A).
The rank of a matrix A does not change if A is pre- or postmultiplied with
a nonsingular matrix.
Determinants
To each square matrix A belongs a unique scalar that is called the determinant
of A. The determinant of A is written as |A|. The determinant of
a matrix of dimension n can be calculated as
$$|A| = \sum_{\pi(n)} (-1)^{\#(\pi(n))} \prod_{i=1}^{n} a_{\pi_i, i}.$$
Here, $\pi(n)$ denotes any permutation of the numbers 1, 2, ..., n, and the sum
is taken over all such permutations. $\#(\pi(n))$ denotes the number of inversions
of a permutation $\pi(n)$. This is the number
of exchanges of pairs of the numbers in $\pi(n)$ that are needed to bring them
back into natural order. Determinants of small matrices can be calculated
by hand, but for larger matrices we prefer to leave the work to computers.
If A is singular, then the determinant |A| = 0.
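The permutation formula can be programmed literally for small matrices. The following sketch is meant only to make the definition concrete: the function names are our own, and since the number of permutations grows as n!, practical software uses other methods.

```python
from itertools import permutations

def inversions(perm):
    """Count the pairs that appear in the wrong order in a permutation."""
    return sum(1
               for i in range(len(perm))
               for j in range(i + 1, len(perm))
               if perm[i] > perm[j])

def determinant(a):
    """Sum of signed products over all permutations of the row indices."""
    n = len(a)
    total = 0
    for perm in permutations(range(n)):
        term = -1 if inversions(perm) % 2 else 1   # sign (-1)^(number of inversions)
        for i in range(n):
            term *= a[perm[i]][i]                  # element in row perm[i], column i
        total += term
    return total

print(determinant([[5, 10], [3, 2]]))                   # -20
print(determinant([[3, 0, 1], [0, 1, 2], [1, 2, 4]]))   # -1
print(determinant([[1, 2], [2, 4]]))                    # 0, a singular matrix
```

Only the parity of the number of inversions matters for the sign, so counting inversions or counting pairwise exchanges gives the same result.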
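$$x'x = \begin{pmatrix} x_1 & x_2 & \cdots & x_n \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \sum_{i=1}^{n} x_i^2$$
$$x'y = \begin{pmatrix} x_1 & x_2 & \cdots & x_n \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \sum_{i=1}^{n} x_i y_i$$
$$1'y = \sum_{i=1}^{n} y_i$$
$$1'1 = n$$
$$(1'1)^{-1} 1'y = n^{-1}\, 1'y = \bar{y}$$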
Further reading
This chapter has only given a very brief and sketchy introduction to matrix
algebra. A more complete treatment can be found in textbooks such as Searle
(1982).
Appendix B: Inference using likelihood methods
observation vector $x' = \begin{pmatrix} x_1 & x_2 & \ldots & x_n \end{pmatrix}$.
The likelihood function of our sample is defined as
$$L = \prod_{i=1}^{n} f(x_i; \theta) \qquad (B.1)$$
$$l = \log(L) = \sum_{i=1}^{n} \log\left(f(x_i; \theta)\right) \qquad (B.2)$$
$$\frac{dl}{d\theta} = \frac{d \sum_{i=1}^{n} \log\left(f(x_i; \theta)\right)}{d\theta} = \sum_{i=1}^{n} \frac{f'(x_i; \theta)}{f(x_i; \theta)} = 0. \qquad (B.3)$$
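where
$$I_\theta = E\left[\left(\frac{d \log\left(L(\theta; x)\right)}{d\theta}\right)^2\right] = E\left[\left(\frac{dl}{d\theta}\right)^2\right] \qquad (B.5)$$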
$I_\theta^{-1}$ is called the Cramér-Rao lower bound. $I_\theta$ is called the Fisher
information about $\theta$. The connection between variance and information is that
an estimator that has small variance gives us more information about the
parameter.
$$\hat\theta \sim N\left(\theta, I_\theta^{-1}\right).$$
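$$l = \sum_{i=1}^{n} \log\left(f(x_i; \theta)\right). \qquad (B.6)$$
$$\frac{\partial l}{\partial \theta_j} = \frac{\partial \sum_{i=1}^{n} \log\left(f(x_i; \theta)\right)}{\partial \theta_j} = 0 \qquad (B.7)$$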
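The asymptotic covariance matrix of $\hat\theta$ is the inverse of the Fisher information
matrix $I_\theta$ that has as its (j, k):th element
$$I_{j,k} = E\left[\frac{\partial l}{\partial \theta_j} \frac{\partial l}{\partial \theta_k}\right] \qquad (B.8)$$
The maximum likelihood estimator $\hat\theta$ of the parameter vector is asymptotically
multivariate Normal with mean $\theta$ and covariance matrix $I_\theta^{-1}$.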
Numerical procedures
For complex distributions the score equations may be difficult to solve
analytically. Numerical procedures have been developed that mostly, but not
always, converge to the solution. Two commonly used procedures are the
Newton-Raphson method and Fisher's method of scoring.
$$g(\hat\theta) \approx g(\theta_0) + (\hat\theta - \theta_0)\, H(\theta_0).$$
Since $g(\hat\theta) = 0$, this leads to a new approximation
$$\theta_1 = \theta_0 - g(\theta_0)\, H^{-1}(\theta_0). \qquad (B.9)$$
Fisher's scoring
Fisher's scoring method is a variation of the Newton-Raphson method. The
basic idea is to replace the Hessian matrix H with its expected value. It holds
that $E[H(\theta)] = -I_\theta$, where $I_\theta$ is the Fisher information matrix.
There are two advantages to using the expected Hessian rather than the
Hessian itself. First, it can be shown that
$$E\left[\frac{\partial^2 l}{\partial \theta_j \partial \theta_k}\right] = -E\left[\frac{\partial l}{\partial \theta_j} \frac{\partial l}{\partial \theta_k}\right] \qquad (B.10)$$
Thus, to calculate the expected Hessian we do not need to evaluate the
second-order derivatives; it suffices to calculate the first-order derivatives of
type $\partial l / \partial \theta_j$.
A second advantage is that the information matrix, i.e. the negative of the
expected Hessian, is guaranteed to be positive definite, so some
non-convergence problems with the Newton-Raphson method
do not occur. On the other hand, Fisher's scoring method often converges
more slowly than the Newton-Raphson method. However, for distributions
in the exponential family with a canonical link, the Newton-Raphson method
and Fisher's scoring method are equivalent.
Fisher's scoring method can be regarded, at each step, as a kind of weighted
least squares procedure. In the generalized linear model context, the method
is also called iteratively reweighted least squares.
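A one-parameter example shows how the iteration works. The sketch below applies Fisher's scoring to an intercept-only logistic model for a binomial count. The function name, starting value and tolerance are our own choices; the totals 112 survivors out of 145 pups are obtained by summing the treated litters of Example 8.3. Since the logit is the canonical link, the same steps are also Newton-Raphson steps.

```python
import math

def fisher_scoring_logit(x_total, n_total, beta=0.0, tol=1e-12, max_iter=50):
    """Estimate the logit beta of a binomial proportion by Fisher's scoring."""
    for iteration in range(1, max_iter + 1):
        p = 1 / (1 + math.exp(-beta))
        score = x_total - n_total * p          # first derivative of the log likelihood
        information = n_total * p * (1 - p)    # Fisher information for beta
        step = score / information
        beta += step                           # beta_new = beta_old + I^{-1} * score
        if abs(step) < tol:
            return beta, iteration
    raise RuntimeError("Fisher scoring did not converge")

beta_hat, n_iter = fisher_scoring_logit(112, 145)

# Model-based standard error from the information at the solution.
p_hat = 1 / (1 + math.exp(-beta_hat))
std_err = 1 / math.sqrt(145 * p_hat * (1 - p_hat))

print(beta_hat, n_iter, std_err)
```

For this simple model the solution is known in closed form, log(112/33), so the iteration can be checked directly. The standard error computed here is the model-based one and ignores any over-dispersion between litters.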
Bibliography
[1] Aanes, W. A. (1961): Pingue (Hymenoxys richardsonii) poisoning in
sheep. American J. of Veterinary Research, 22, 47-52.
[2] Agresti, A. (1984): Analysis of ordered categorical data. New York, Wiley.
[3] Agresti, A. (1990): Categorical data analysis. New York, Wiley.
[4] Agresti, A. (1996): An introduction to categorical data analysis. New
York, Wiley.
[5] Aitkin, M. (1987): Modelling variance heterogeneity in normal regression
using GLIM. Applied statistics, 36, 332-339.
[6] Akaike, H. (1973): Information theory and an extension of the maximum
likelihood principle. In: Petrov, B. N. and Csáki, F. (eds): Second international
symposium on information theory, Budapest, Akadémiai Kiadó,
pp. 267-281.
[7] Andersen, E. B. (1980): Discrete statistical models with social science
applications. Amsterdam, North-Holland.
[8] Anscombe, F. J. (1953): Contribution to the discussion of H. Hotelling's
paper. J. Roy. Stat. Soc., B, 15, 229-30.
[9] Armitage, P. and Colton, T. (1998): Encyclopedia of Biostatistics. Chichester, Wiley.
[10] Ben-Akiva, M. and Lerman, S. R. (1985): Discrete choice analysis: Theory and application to travel demand. Cambridge, MIT press.
[11] Box, G. E. P. and Cox, D. R. (1964): An analysis of transformations. J.
Roy. Stat. Soc., B, 26, 211-252.
[12] Breslow, N. E. and Clayton, D. G. (1993): Approximate inference in
generalized linear mixed models. JASA, 88, 9-25.
[13] Brown, B. W. (1980): Prediction analysis for binary data. In: Biostatistics
Casebook, Eds. R. J. Miller, B. Efron, B. Brown and L. E. Moses.
New York, Wiley.
[14] Christensen, R. (1996): Analysis of variance, design and regression. London, Chapman & Hall.
[15] Cicirelli, M. F., Robinson, K. R. and Smith, L. D. (1983): Internal pH
of Xenopus oocytes: a study of the mechanism and role of pH changes
during meiotic maturation. Developmental Biology, 100, 133-146.
[16] Collett, D. (1991): Modelling binary data. London, Chapman and Hall.
[17] Cox, D. R. (1972): Regression models and life tables. J. Roy. Stat. Soc,
B, 34, 187-220.
[18] Cox, D. R. and Lewis, P. A. W. (1966): The statistical analysis of series
of events. London, Chapman & Hall.
[19] Cox, D. R. and Snell, E. J. (1989): The analysis of binary data, 2nd ed.
London, Chapman and Hall.
[20] Diggle, P. J., Liang, K. Y. and Zeger, S. L. (1994): Analysis of longitudinal data. Oxford: Clarendon press.
[21] Dobson, A. J. (2002): An introduction to generalized linear models, second edition. London: Chapman & Hall/CRC Press.
[22] Draper, N. R. and Smith, H. (1998): Applied regression analysis, 3rd Ed.
New York, Wiley.
[23] Ezdinli, E., Pocock, S., Berard, C. W. et al (1976): Comparison of intensive versus moderate chemotherapy of lymphocytic lymphomas: a
progress report. Cancer, 38, 1060-1068.
[24] Fahrmeir, L. and Tutz, G. (1994; 2001): Multivariate statistical modeling
based on generalized linear models. New York, Springer.
[25] Feigl, P. and Zelen, M. (1965): Estimation of exponential survival probabilities with concomitant information. Biometrics, 21, 826-838.
[26] Finney, D. J. (1947, 1952): Probit analysis. A statistical treatment of the
sigmoid response curve. Cambridge, Cambridge University Press.
[27] Freeman, D. H. (1987): Applied categorical data analysis. New York,
Marcel Dekker.
[28] Gill, J. and Laughton, C. D. (2000): Generalized linear models: a unified
approach. New York, Sage publications.
[29] Francis, B., Green, M. and Payne, C. (Eds.) (1993): The GLIM system
manual, Release 4. London, Clarendon press.
[30] Haberman, S. (1978): Analysis of qualitative data. Vol. 1: Introductory
topics. New York, Academic Press.
[31] Hand, D. J., Daly, F., Lunn, A. D., McConway, K. J. and Ostrowski, E.
(1994): A handbook of small data sets. London, Chapman & Hall.
[32] Horowitz, J. (1982): An evaluation of the usefulness of two standard
goodness-of-fit indicators for comparing non-nested random utility models. Trans. Research Record, 874, 19-25.
[33] Hosmer, D. W. and Lemeshow, S. (1989): Applied logistic regression.
New York, Wiley.
[34] Hurn, M. W., Barker, N. W. and Magath, T. D. (1945): The determination of prothrombin time following the administration of dicumarol with
specific reference to thromboplastin. J. Lab. Clin. Med., 30, 432-447.
[35] Hutcheson, G. D. (1999): Introductory statistics using Generalized Linear Models. New York, Sage publications.
[36] Jones, B. and Kenward, M. G. (1989): Design and analysis of cross-over trials.
London, Chapman and Hall.
[37] Jørgensen, B. (1987): Exponential dispersion models. Journal of the
Royal Statistical Society, B49, 127-162.
[38] Klein, J. and Moeschberger, M. (1997): Survival analysis: techniques for
censored and truncated data. New York, Springer.
[39] Koch, G. G. and Edwards, S. (1988): Clinical efficacy trials with categorical
data. In: Biopharmaceutical statistics for drug development, K.
E. Peace, ed. New York, Marcel Dekker, pp. 403-451.
[40] Leemis, L. M. (1986): Relationships among common univariate distributions. American Statistician, 40, 134-146.
[41] Liang, K-Y. and Zeger, S. L. (1986): Longitudinal data analysis using
generalized linear models. Biometrika, 73, 13-22.
[42] Liang, K-Y and Hanfelt, J. (1994): On the use of the quasi-likelihood
method in teratological experiments. Biometrics, 50, 872-880.
[43] Lindahl, B., Stenlid, J., Olsson, S. and Finlay, R. (1999): Translocation
of 32P between interacting mycelia of a wood-decomposing fungus and
ectomycorrhizal fungi in microcosm systems. New Phytol., 144, 183-193.
Exercise 1.1
                                Standard
Parameter        Estimate          Error    t Value    Pr > |t|
Intercept    -.0475000000     0.05719172      -0.83      0.4441
time         0.0292500000     0.00158621      18.44      <.0001

                             Sum of
Source             DF       Squares    Mean Square    F Value    Pr > F
Model               1    2.39557500     2.39557500     340.04    <.0001
Error               5    0.03522500     0.00704500
Corrected Total     6    2.43080000

It can be concluded that the leucine level increases with time. The
increase is nearly linear in the studied time range.
Exercise 1.2
                             Sum of
Source             DF       Squares    Mean Square    F Value    Pr > F
Model               2    795.692190     397.846095       4.44    0.0271
Error              18   1614.017333      89.667630
Corrected Total    20   2409.709524

R-Square    Coeff Var    Root MSE    cortisol Mean
0.330203     100.3306    9.469299         9.438095

Source    DF    Type III SS    Mean Square    F Value    Pr > F
group      2    795.6921905    397.8460952       4.44    0.0271

The results suggest that there are significant differences between the
groups (p = 0.0271). To study these differences we prepare a table of
mean values, and a box plot:

The GLM Procedure
Level of          -----------cortisol----------
group       N           Mean          Std Dev
a           6      2.9666667        0.9244818
b          10      8.1800000        3.7891072
c           5     19.7200000       19.2388149
                             Sum of
Source             DF       Squares    Mean Square    F Value    Pr > F
Model               2   2350.424299    1175.212150      44.93    <.0001
Error              21    549.234956      26.154046
Corrected Total    23   2899.659255

R-Square    Coeff Var    Root MSE    co2 Mean
0.810586     19.53056    5.114103    26.18513

Source    DF    Type III SS    Mean Square    F Value    Pr > F
level      1     264.809910     264.809910      10.13    0.0045
days       1    2085.614389    2085.614389      79.74    <.0001

Model ii)
Dependent Variable: co2
                             Sum of
Source             DF       Squares    Mean Square    F Value    Pr > F
Model               3   2445.093241     815.031080      35.86    <.0001
Error              20    454.566014      22.728301
Corrected Total    23   2899.659255

R-Square    Coeff Var    Root MSE    co2 Mean
0.843235     18.20660    4.767421    26.18513

Source        DF    Type III SS    Mean Square    F Value    Pr > F
level          1      47.785031      47.785031       2.10    0.1626
days           1    2085.614389    2085.614389      91.76    <.0001
days*level     1      94.668942      94.668942       4.17    0.0547
The test of parallelism is not significant (p = 0.0547). Still, for model
building purposes, I would prefer to retain the interaction term in the
model; see the discussion on model building strategy in the text. Thus, I
would use model ii) for interpretation and plotting.
B. Estimates of model parameters for model ii) are as follows:
Parameter              Estimate        Standard Error   t Value   Pr > |t|
Intercept          -38.11803843 B          8.34443339     -4.57     0.0002
level      High     17.11095713 B         11.80081087      1.45     0.1626
level      Low       0.00000000 B           .               .        .
days                 2.12991722 B          0.25921766      8.22     <.0001
days*level High     -0.74816925 B          0.36658913     -2.04     0.0547
days*level Low       0.00000000 B           .               .        .
D. There are strongly significant effects of time and of nitrogen level. The
interaction, although not formally significant, indicates that the increase
of CO2 emission may be somewhat faster for the low nitrogen treatment.
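The parameter estimates for model ii) can be turned into fitted lines for the two nitrogen levels. The following is a minimal Python sketch, not part of the original SAS analysis; the function name `predict_co2` is introduced here for illustration only. It shows that the slope is 2.13 per day for the Low level and 2.13 - 0.75 = 1.38 per day for the High level.

```python
# Parameter estimates for model ii), copied from the table above.
INTERCEPT = -38.11803843
LEVEL_HIGH = 17.11095713      # level = Low is the reference level (estimate 0)
DAYS = 2.12991722             # slope for the reference level (Low)
DAYS_X_HIGH = -0.74816925     # change in slope for level = High


def predict_co2(days, level):
    """Fitted CO2 emission for a given day and nitrogen level ('High' or 'Low')."""
    high = 1.0 if level == "High" else 0.0
    return INTERCEPT + LEVEL_HIGH * high + (DAYS + DAYS_X_HIGH * high) * days


for level in ("High", "Low"):
    slope = DAYS + (DAYS_X_HIGH if level == "High" else 0.0)
    print(level, "slope per day:", round(slope, 4),
          "fitted value at day 24:", round(predict_co2(24, level), 4))
```

The fitted values at day 24 and day 38 agree with the predicted values listed for these data in the residual table later in the solutions.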
Exercise 1.4
A. After taking logs of the count variable, a regression output is as follows:
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1       5.69913226    5.69913226    667.04   0.0001
Error               3       0.02563161    0.00854387
Corrected Total     4       5.72476387

R-Square   Coeff Var   Root MSE   logcount Mean
0.995523    2.286363   0.092433        4.042798

Source    DF   Type III SS   Mean Square   F Value   Pr > F
minutes    1    5.69913226    5.69913226    667.04   0.0001

Parameter        Estimate    Standard Error   t Value   Pr > |t|
Intercept     5.552649942        0.07159833     77.55     <.0001
minutes      -0.050328398        0.00194866    -25.83     0.0001
Exercise 2.1
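We write the density as $f(x) = \lambda e^{-\lambda x} = e^{-\lambda x + \log \lambda}$, which is an exponential family distribution. If we use $\theta = -\lambda$, then $b(\theta) = -\log(-\theta)$, $a(\phi) = 1$, and $c(y, \phi) = 0$. For the variance function, we find that
$b'(\theta) = \frac{d}{d\theta}\left(-\log(-\theta)\right) = -\frac{1}{\theta}$ and $b''(\theta) = \frac{d}{d\theta}\left(-\frac{1}{\theta}\right) = \frac{1}{\theta^{2}} = \mu^{2}$.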
Exercise 2.2
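The probability function can be written as
$$f(y_i) = \frac{e^{-\lambda}\lambda^{y_i}}{(1-e^{-\lambda})\,y_i!} = \frac{e^{-\lambda}\,e^{y_i \ln \lambda}}{e^{\ln(1-e^{-\lambda})}\,e^{\ln(y_i!)}} ,$$
so that, with $\theta = \ln \lambda$, we have $b(\theta) = \exp(\theta) + \ln\left(1 - e^{-e^{\theta}}\right)$. Differentiating,
$$b'(\theta) = e^{\theta} + \frac{e^{\theta}}{-1+e^{e^{\theta}}} = \frac{e^{\theta}e^{e^{\theta}}}{-1+e^{e^{\theta}}}, \qquad
b''(\theta) = \frac{e^{\theta}e^{e^{\theta}}\left(-1+e^{e^{\theta}}\right) - e^{2\theta}e^{e^{\theta}}}{\left(-1+e^{e^{\theta}}\right)^{2}} ,$$
which is the variance function.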
Exercise 2.3
It is not easy to find a well-fitting model for these data. One of the
best models is probably the one with a Gamma distribution and an inverse
link, but other models might also be considered. However, most models
we have tried do not indicate any significant relation between weight and
survival:
Criterion             DF        Value    Value/DF
Deviance              11       2.5315      0.2301
Scaled Deviance       11      13.4077      1.2189
Pearson Chi-Square    11       4.3154      0.3923
Scaled Pearson X2     11      22.8557      2.0778
Log Likelihood               -55.0956

                            Standard       Wald 95%
Parameter   DF   Estimate      Error   Confidence Limits   ChiSquare   Pr > ChiSq
Intercept    1     0.0178     0.0239   -0.0291    0.0647        0.56       0.4560
weight       1     0.0001     0.0004   -0.0006    0.0008        0.07       0.7878
Scale        1     5.2964     2.0154    2.5123   11.1655
A graph of the data and the fitted line may explain why: one observation has an unusually long survival time. Since we have no other
information about the data, deletion of this observation cannot be justified.
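The relation between the scaled and unscaled fit statistics in the output above can be checked numerically. The sketch below is in Python rather than SAS and rests on one assumption: that the Scale estimate reported for the Gamma model is the inverse of the dispersion parameter, so that a scaled statistic equals the unscaled statistic multiplied by Scale. The printed output is consistent with this reading, up to rounding.

```python
# Values copied from the Genmod output above (Gamma distribution, inverse link).
deviance = 2.5315
pearson = 4.3154
scale = 5.2964   # assumption: the reported Scale is the inverse dispersion 1/phi


def scaled(statistic, scale_estimate):
    """Scale a goodness-of-fit statistic by the estimated inverse dispersion."""
    return statistic * scale_estimate


scaled_deviance = scaled(deviance, scale)
scaled_pearson = scaled(pearson, scale)
print("Scaled deviance:", round(scaled_deviance, 3))
print("Scaled Pearson X2:", round(scaled_pearson, 3))
```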
Exercise 3.1
A. Data, predicted values and residuals are:
Obs   time   leucine      pred       res
  1      0      0.02   -0.0475    0.0675
  2     10      0.25    0.2450    0.0050
  3     20      0.54    0.5375    0.0025
  4     30      0.69    0.8300   -0.1400
  5     40      1.07    1.1225   -0.0525
  6     50      1.50    1.4150    0.0850
  7     60      1.74    1.7075    0.0325
C. The Normal probability plot was obtained using Proc Univariate in SAS:
                             Hat Diag       Cov              ------DFBETAS-----
Obs   Residual   RStudent           H     Ratio    DFFITS   Intercept      time
  1     0.0675     1.1284      0.4643    1.6783    1.0504      1.0504   -0.8740
  2   0.005000     0.0631      0.2857    2.1832    0.0399      0.0391   -0.0282
  3   0.002500     0.0294      0.1786    1.9014    0.0137      0.0119   -0.0061
  4    -0.1400    -2.7205      0.1429    0.2244   -1.1106     -0.6161    0.0000
  5    -0.0525    -0.6490      0.1786    1.5570   -0.3026     -0.0375   -0.1353
  6     0.0850     1.2694      0.2857    1.1116    0.8028     -0.1574    0.5677
  7     0.0325     0.4870      0.4643    2.5993    0.4534     -0.1744    0.3772
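The predicted values, residuals and hat diagonal values in the two tables above follow directly from the least squares formulas for simple linear regression. A minimal sketch in plain Python (not the SAS code used for the solution) that recomputes them from the listed data:

```python
# Data from the table of observations above.
time = [0, 10, 20, 30, 40, 50, 60]
leucine = [0.02, 0.25, 0.54, 0.69, 1.07, 1.50, 1.74]

n = len(time)
x_bar = sum(time) / n
y_bar = sum(leucine) / n
sxx = sum((x - x_bar) ** 2 for x in time)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(time, leucine))

# Least squares estimates of slope and intercept.
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Fitted values, raw residuals and hat diagonal values.
pred = [intercept + slope * x for x in time]
res = [y - p for y, p in zip(leucine, pred)]
hat = [1 / n + (x - x_bar) ** 2 / sxx for x in time]

for i in range(n):
    print(i + 1, round(pred[i], 4), round(res[i], 4), round(hat[i], 4))
```

The hat values depend only on the time points, which is why they are symmetric around the middle observation.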
Obs   LEVEL   days      Co2      pred        res       hat
  1     H      24     8.220   12.1549    -3.9349   0.26090
  2     H      24    12.594   12.1549     0.4391   0.26090
  3     H      24    11.301   12.1549    -0.8539   0.26090
  4     L      24    15.255   13.0000     2.2550   0.26090
  5     L      24    11.069   13.0000    -1.9310   0.26090
  6     L      24    10.481   13.0000    -2.5190   0.26090
  7     H      30    19.296   20.4454    -1.1494   0.09239
  8     H      30    31.115   20.4454    10.6696   0.09239
  9     H      30    18.891   20.4454    -1.5544   0.09239
 10     L      30    28.200   25.7795     2.4205   0.09239
 11     L      30    26.765   25.7795     0.9855   0.09239
 12     L      30    28.414   25.7795     2.6345   0.09239
 13     H      35    25.479   27.3541    -1.8751   0.11456
 14     H      35    34.951   27.3541     7.5969   0.11456
 15     H      35    20.688   27.3541    -6.6661   0.11456
 16     L      35    32.862   36.4291    -3.5671   0.11456
 17     L      35    34.730   36.4291    -1.6991   0.11456
 18     L      35    35.830   36.4291    -0.5991   0.11456
 19     H      38    31.186   31.4993    -0.3133   0.19882
 20     H      38    39.237   31.4993     7.7377   0.19882
 21     H      38    21.403   31.4993   -10.0963   0.19882
 22     L      38    41.677   42.8188    -1.1418   0.19882
 23     L      38    43.448   42.8188     0.6292   0.19882
 24     L      38    45.351   42.8188     2.5322   0.19882
Obs   weight   survival      pred        res
  1       46         44   44.4434   -0.01001
  2       55         27   42.7112   -0.42609
  3       61         24   41.6295   -0.50452
  4       75         24   39.3067   -0.45590
  5       64         36   41.1089   -0.12984
  6       75         36   39.3067   -0.08661
  7       71         44   39.9435    0.09831
  8       59         44   41.9839    0.04727
  9       64        120   41.1089    1.30216
 10       67         29   40.6013   -0.31864
 11       60         36   41.8060   -0.14589
 12       63         36   41.2810   -0.13383
 13       66         36   40.7691   -0.12188
B. In the plot of residuals against fitted values, one observation stands out
as a possible outlier:
D. The influence of each observation can be obtained via the Insight procedure. A plot of hat diagonal values against observation number is as
follows:
Source              DF   Sum of Squares   Mean Square   F Value   Pr > F
treatment            3      20.41428935    6.80476312     28.34   <.0001
poison               2      34.87711982   17.43855991     72.63   <.0001
treatment*poison     6       1.57077226    0.26179538      1.09   0.3867
Error               36       8.64308307    0.24008564
Corrected Total     47      65.50526450

R-Square   Coeff Var   Root MSE     z Mean
0.868055    18.68478   0.489985   2.622376
Criterion             DF        Value    Value/DF
Deviance              36       1.9205      0.0533
Scaled Deviance       36      48.3179      1.3422
Pearson Chi-Square    36       1.8755      0.0521
Scaled Pearson X2     36      47.1866      1.3107
Log Likelihood                50.0573

Source              DF   ChiSquare   Pr > ChiSq
treatment            3       43.76       <.0001
poison               2       59.31       <.0001
treatment*poison     6       10.04       0.1232
The conclusions are the same: significant effects of treatment and poison, no significant interaction. The residual plots for this model are:
Exercise 4.2
Criterion             DF        Value    Value/DF
Deviance             199     210.6205      1.0584
Scaled Deviance      199     210.6205      1.0584
Pearson Chi-Square   199     218.3507      1.0972
Scaled Pearson X2    199     218.3507      1.0972
Log Likelihood                98.8650
It seems that the data deviate somewhat from an exponential distribution in the upper tail of the distribution.
Exercise 5.1
One question with these data is whether we should include the factors
Exposure, Temperature and Humidity as class variables or as numeric
variables. One approach is to compare deviances for the different approaches, for a main effects model:
Types of terms             Deviance    df     D/df
All class                    30.865    86   0.3582
Temperature numeric         31.2108    87   0.3587
Humidity also numeric       32.1509    89   0.3612
Also Exposure numeric       55.0698    91   0.6052
Temp*species (p = 0.9091); and Humidity*species (p = 0.3006). There
does not seem to be any need to include interactions.
It is interesting to note that an old-fashioned Anova on $\arcsin\sqrt{\hat{p}}$ suggests that the
interactions species*exposure, temp*exposure and humidity*exposure are
indeed significant. The generalized linear model approach may suffer from
the fact that 41 of the 96 observations have $\hat{p} = 0$.
The model with only main effects, and with humidity and temperature
used as numeric variables, gives the following results:
Criteria For Assessing Goodness Of Fit

Criterion             DF        Value    Value/DF
Deviance              89      32.1509      0.3612
Scaled Deviance       89      32.1509      0.3612
Pearson Chi-Square    89      27.9761      0.3143
Scaled Pearson X2     89      27.9761      0.3143
Log Likelihood              -534.9172

                                 Standard        Wald 95%
Parameter        DF   Estimate      Error    Confidence Limits    ChiSquare   Pr > ChiSq
Intercept         1     5.6733     0.9636     3.7847     7.5618       34.67       <.0001
Species    A      1    -1.2871     0.1616    -1.6038    -0.9703       63.44       <.0001
Species    B      0     0.0000     0.0000     0.0000     0.0000         .          .
Exposure   1      1   -26.6441   32311.10   -63355.2   63301.95        0.00       0.9993
Exposure   2      1    -3.1793     0.2988    -3.7649    -2.5937      113.24       <.0001
Exposure   3      1    -0.9434     0.1633    -1.2635    -0.6232       33.35       <.0001
Exposure   4      0     0.0000     0.0000     0.0000     0.0000         .          .
Humidity          1    -0.1054     0.0138    -0.1324    -0.0784       58.56       <.0001
Temp              1     0.0930     0.0192     0.0555     0.1305       23.59       <.0001
Scale             0     1.0000     0.0000     1.0000     1.0000

Source      DF   ChiSquare   Pr > ChiSq
Species      1       69.34       <.0001
Exposure     3      385.00       <.0001
Humidity     1       63.52       <.0001
Temp         1       24.38       <.0001
The survival is highly related to all four factors. Residual plots for this
model are as follows:
It seems that survival probability for women is high, and increases with
age, whereas only the young boys were rescued (women and children
first). One modeling possibility is to use a dummy variable for children
under 10, and to use a linear age relation for ages above 10 years. If the
dummy variable for childhood is d, a model for these data can be written
as
$$\mathrm{logit}(\hat{p}) = \beta_0 + \beta_1\,\mathrm{sex} + \beta_2\,\mathrm{pclass} + d\left(\beta_3 + \beta_4\,\mathrm{age} + \beta_5\,\mathrm{sex} + \beta_6\,\mathrm{age}\cdot\mathrm{sex} + \beta_7\,\mathrm{pclass}\cdot\mathrm{sex}\right).$$
This model assumes a separate survival probability for boys and girls
below 10, and a linear change in survival probability (different for males
and females) for persons above 10 years. This model fits fairly well to the
data, as judged by Deviance/df:
Criteria For Assessing Goodness Of Fit

Criterion             DF        Value    Value/DF
Deviance             744     619.9224      0.8332
Scaled Deviance      744     619.9224      0.8332
Pearson Chi-Square   744     732.8209      0.9850
Scaled Pearson X2    744     732.8209      0.9850
Log Likelihood              -309.9612

Source          DF   ChiSquare   Pr > ChiSq
Pclass           2       20.00       <.0001
Sex              1        0.50       0.4777
d                1        3.27       0.0705
d*Age            1        4.94       0.0262
d*Sex            1        7.72       0.0055
d*Age*Sex        1        2.47       0.1158
d*Pclass*Sex     4       33.19       <.0001
As an interpretation of the parameter estimates: there is a highly significant effect of passenger class, as well as an interaction between class and
sex for persons above 10 years. Sex (which actually should be interpreted
as sex of a child) is not significant: young boys and girls have similar
survival probabilities. The fact that d*Sex is significant means that there
are differences in survival for males and females above 10 years. In this
analysis, passengers with missing age data have been excluded. However,
there seems to be a relation between missing age and passenger class: age
data are missing for 30% of first class passengers, 24% of second class
passengers but 55% of third class passengers, so the analysis should be
interpreted with care.
Exercise 5.3
A binomial model with treatment, ln(dose) and their interaction as
factors produces the following results:
Criteria For Assessing Goodness Of Fit

Criterion             DF        Value    Value/DF
Deviance              11      22.7228      2.0657
Scaled Deviance       11      22.7228      2.0657
Pearson Chi-Square    11      20.5940      1.8722
Scaled Pearson X2     11      20.5940      1.8722
Log Likelihood              -368.7886

Source              DF   ChiSquare   Pr > ChiSq
treatment            2        3.66       0.1601
lnDose               1      287.20       <.0001
lnDose*treatment     2        9.25       0.0098
LR Statistics For Type 3 Analysis

Source              Num DF   Den DF   F Value   Pr > F   ChiSquare   Pr > ChiSq
treatment                2       11      0.89   0.4395        1.77       0.4120
lnDose                   1       11    139.03   <.0001      139.03       <.0001
lnDose*treatment         2       11      2.24   0.1529        4.48       0.1066
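The two Type 3 tables above differ only by a rescaling. The following Python sketch (an illustration, not the SAS code behind the output) assumes that the second analysis divides each likelihood ratio chi-square by the overdispersion estimate Deviance/df and forms F as the rescaled chi-square divided by its numerator degrees of freedom; the printed values are consistent with that assumption. The helper name `rescale` is hypothetical.

```python
# Deviance and residual degrees of freedom from the first goodness-of-fit table.
deviance, df_resid = 22.7228, 11
dispersion = deviance / df_resid   # overdispersion estimate, about 2.0657

# Unscaled Type 3 chi-squares and their degrees of freedom.
type3 = {
    "treatment": (3.66, 2),
    "lnDose": (287.20, 1),
    "lnDose*treatment": (9.25, 2),
}


def rescale(chisq, num_df, phi=dispersion):
    """Return (scaled chi-square, F value) after adjusting for overdispersion."""
    scaled_chisq = chisq / phi
    return scaled_chisq, scaled_chisq / num_df


for source, (chisq, num_df) in type3.items():
    scaled_chisq, f_value = rescale(chisq, num_df)
    print(source, round(scaled_chisq, 2), round(f_value, 2))
```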
The p-values are rather sensitive to overdispersion. This analysis suggests that the interaction is not significant, i.e. that the slopes may be
equal. The observed proportions (on a logit scale) plotted against ln(dose)
are:
Exercise 5.4
A. $H_0: \beta_{pa} = 0$ against $H_1: \beta_{pa} \neq 0$ can be tested using the deviances.
The test statistic is $(D_1 - D_2)/(df_1 - df_2)$ which, under $H_0$, is asymptotically distributed as $\chi^2$ on $(df_1 - df_2)$ degrees of freedom. The condition
that model 1 is nested within model 2 is fulfilled. Assumptions: Independent observations; large sample. Result: $(226.5177 - 226.4393)/(8 - 7) = 0.0784$,
which should be compared with $\chi^2$ on 1 d.f. The 5% limit is 3.841;
the 1% limit is 6.635 and the 0.1% limit is 10.828. Our result is clearly
non-significant; $H_0$ cannot be rejected.
B. $H_0: \beta_{pr} = 0$ against $H_1: \beta_{pr} \neq 0$ can similarly be tested using the
deviances. The test statistic is $(D_1 - D_3)/(df_1 - df_3)$ which, under $H_0$,
is asymptotically distributed as $\chi^2$ on $(df_1 - df_3)$ degrees of freedom. The
condition that model 1 is nested within model 3 is fulfilled. Assumptions:
Independent observations; large sample. Result:
$(226.5177 - 216.4759)/(8 - 7) = 10.042$, which should be compared with
$\chi^2$ on 1 d.f. The 5% limit is 3.841; the 1% limit is 6.635 and the 0.1%
limit is 10.828. Our result is significant at the 1% level but not at the
0.1% level. $H_0$ is rejected.
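Instead of comparing with tabulated limits, the p-values of the two deviance tests can be computed directly. This is a minimal Python sketch using only the standard library; it relies on the fact that a chi-square variate on 1 d.f. is the square of a standard Normal variate, so the upper tail probability can be written with the complementary error function. The function name `chi2_sf_1df` is introduced here for illustration.

```python
import math


def chi2_sf_1df(x):
    """Upper tail probability P(X > x) for a chi-square variate on 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))


# Differences in deviance from parts A and B above (each on 1 d.f.).
tests = {
    "A": 226.5177 - 226.4393,
    "B": 226.5177 - 216.4759,
}

for part, statistic in tests.items():
    print(part, round(statistic, 4), "p =", round(chi2_sf_1df(statistic), 4))
```

The p-value for part B falls between 0.001 and 0.01, in line with the conclusion drawn from the tabulated limits.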
C. The logit link is $\log\frac{p}{1-p}$. When p is zero (or one) this is not defined.
The four cells with observed count = 0 do not contribute to the likelihood.
When we replace 0 with 0.5 in these cells they are included, so we get an
extra four d.f. compared with model 3.
D. The odds ratios of not being infected are calculated as $e^{\hat{\beta}}$. The corresponding odds ratios of being infected are the inverses of these. This
gives:
Planned: OR = $e^{-0.8311}$ = 0.436; OR of infection = 1/0.436 = 2.294
Antibio: OR = $e^{3.4991}$ = 33.086; OR of infection = 1/33.086 = 0.030
Risk: OR = $e^{-3.7172}$ = 0.024; OR of infection = 1/0.024 = 41.667
Planned*Risk: OR = $e^{2.4394}$ = 11.466; OR of infection = 1/11.466 = 0.087.
In the presence of interactions, raw odds ratios are not very informative. One might consider calculating odds ratios separately for each cell
of the 2 × 2 × 2 cross-table. All odds ratios take one cell as the baseline,
with OR = 1. We might use the cell Planned=0, Risk=0, Antibio=0 as
a baseline. The remaining cell odds ratios (of no infection) compared to
this baseline are:
                Planned = 1             Planned = 0
Antibio     Risk = 1   Risk = 0     Risk = 1   Risk = 0
   1           4.02      14.41         0.80      33.09
   0           0.12       0.43         0.02       1.00
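The cell odds ratios in the table above are obtained by adding the relevant parameter estimates on the logit scale and exponentiating. A minimal Python sketch follows; the signs of the estimates are inferred from the reported odds ratios, and the dictionary keys and the function name `cell_odds_ratio` are hypothetical labels, not names from the original analysis.

```python
import math
from itertools import product

# Parameter estimates on the logit scale (odds of NOT being infected).
BETA = {
    "planned": -0.8311,
    "antibio": 3.4991,
    "risk": -3.7172,
    "planned*risk": 2.4394,
}


def cell_odds_ratio(planned, risk, antibio):
    """Odds ratio of no infection relative to the cell Planned=0, Risk=0, Antibio=0."""
    eta = (BETA["planned"] * planned
           + BETA["antibio"] * antibio
           + BETA["risk"] * risk
           + BETA["planned*risk"] * planned * risk)
    return math.exp(eta)


for planned, risk, antibio in product((1, 0), repeat=3):
    print("Planned =", planned, "Risk =", risk, "Antibio =", antibio,
          "OR =", round(cell_odds_ratio(planned, risk, antibio), 2))
```

Because the model contains only one interaction, the Planned*Risk term enters only in the cells where both Planned and Risk equal 1.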
... corresponds to $\hat{y} = \frac{e^{3.5342}}{1+e^{3.5342}} = 0.9716$, which is the predicted value. The raw
residual is $y - \hat{y} = 1 - 0.9716 = 0.0284$.
For the second observation the predictors have the same value but y = 0,
so the raw residual is $0 - 0.9716 = -0.9716$.
The third and fourth observations have the same predicted values, obtained through $\mathrm{logit}(\hat{y}) = 2.1440 - 0.8311 + 3.4991 = 4.812$, which gives
predicted value $\hat{y} = \frac{e^{4.812}}{1+e^{4.812}} = 0.9919$ and raw residuals $1 - 0.9919 = 0.0081$ and $0 - 0.9919 = -0.9919$, respectively.
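The predicted values and raw residuals above are obtained by applying the inverse of the logit link to the linear predictor. A short Python sketch of that arithmetic, using the two linear predictor values quoted in the text:

```python
import math


def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return math.exp(eta) / (1.0 + math.exp(eta))


# Linear predictors for observations 1-2 and for observations 3-4.
for eta in (3.5342, 4.812):
    y_hat = inv_logit(eta)
    # Raw residuals for an observation with y = 1 and one with y = 0.
    print("eta =", eta,
          "predicted =", round(y_hat, 4),
          "residual (y=1) =", round(1 - y_hat, 4),
          "residual (y=0) =", round(0 - y_hat, 4))
```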
Note that the counts (Wt) are not the values to predict!
Exercise 5.5
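A. The model is $g = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \gamma z_{ijk} + (\beta\gamma)_j z_{ijk} + e_{ijk}$, $i = 1, 2$;
$j = 1, 2, 3$; $k = 1, 2, 3$. This gives the model in matrix terms as $y = XB + e$, where
$$X = \left(\begin{array}{cccccccccccccccc}
1 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & z_{1} & z_{1} & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & z_{2} & z_{2} & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & z_{3} & z_{3} & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & z_{4} & 0 & z_{4} & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & z_{5} & 0 & z_{5} & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & z_{6} & 0 & z_{6} & 0 \\
1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & z_{7} & 0 & 0 & z_{7} \\
1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & z_{8} & 0 & 0 & z_{8} \\
1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & z_{9} & 0 & 0 & z_{9} \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & z_{10} & z_{10} & 0 & 0 \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & z_{11} & z_{11} & 0 & 0 \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & z_{12} & z_{12} & 0 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & z_{13} & 0 & z_{13} & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & z_{14} & 0 & z_{14} & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & z_{15} & 0 & z_{15} & 0 \\
1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & z_{16} & 0 & 0 & z_{16} \\
1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & z_{17} & 0 & 0 & z_{17} \\
1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & z_{18} & 0 & 0 & z_{18}
\end{array}\right);$$
$$B = \left(\mu,\ \alpha_1,\ \alpha_2,\ \beta_1,\ \beta_2,\ \beta_3,\ (\alpha\beta)_{11},\ (\alpha\beta)_{12},\ (\alpha\beta)_{13},\ (\alpha\beta)_{21},\ (\alpha\beta)_{22},\ (\alpha\beta)_{23},\ \gamma,\ (\beta\gamma)_1,\ (\beta\gamma)_2,\ (\beta\gamma)_3\right)'.$$
B. The inverse of the logit link $g(p) = \log\frac{p}{1-p}$ is $g^{-1}(p) = \frac{e^{p}}{e^{p}+1}$.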
Exercise 6.1
A model with a Poisson distribution and a log link gives the following
model information:
Criteria For Assessing Goodness Of Fit

Criterion             DF        Value    Value/DF
Deviance               3       2.2906      0.7635
Scaled Deviance        3       2.2906      0.7635
Pearson Chi-Square     3       2.2453      0.7484
Scaled Pearson X2      3       2.2453      0.7484
Log Likelihood              1911.7443
The fit of the model to the data is good, as judged by deviance/df. The
parameter estimates are as follows:
Analysis Of Parameter Estimates

                            Standard       Wald 95%
Parameter   DF   Estimate      Error   Confidence Limits   ChiSquare   Pr > ChiSq
Intercept    1     5.5713     0.0567    5.4602    5.6825     9650.28       <.0001
exposure     1    -0.0513     0.0030   -0.0572   -0.0455      298.25       <.0001
Scale        0     1.0000     0.0000    1.0000    1.0000
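The Wald confidence limits in the table above can be reproduced from the estimates and their standard errors. The sketch below is plain Python, not SAS; it assumes the limits are computed as estimate ± 1.96 standard errors, and it uses the rounded values printed in the table, so results agree with the output only to about three decimals. The function name `wald_interval` is introduced for illustration.

```python
Z_975 = 1.959964   # 97.5% point of the standard Normal distribution


def wald_interval(estimate, std_err, z=Z_975):
    """Wald 95% confidence limits: estimate plus/minus z standard errors."""
    return estimate - z * std_err, estimate + z * std_err


# Estimates and standard errors as printed (rounded) in the table above.
estimates = {"Intercept": (5.5713, 0.0567), "exposure": (-0.0513, 0.0030)}

for name, (est, se) in estimates.items():
    lower, upper = wald_interval(est, se)
    print(name, round(lower, 4), round(upper, 4))
```

The Wald chi-square is the squared ratio of estimate to standard error; computed from the rounded table values it only approximates the printed chi-squares.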
Exercise 6.2
Two Poisson models were fitted to the data: one with a log link, another
with an identity link. The model with a log link fitted marginally better,
as judged by the deviance/df criterion. First the log link results:
Criteria For Assessing Goodness Of Fit

Criterion             DF        Value    Value/DF
Deviance               6       4.0033      0.6672
Scaled Deviance        6       4.0033      0.6672
Pearson Chi-Square     6       3.9505      0.6584
Scaled Pearson X2      6       3.9505      0.6584
Log Likelihood               362.7354

                            Standard       Wald 95%
Parameter   DF   Estimate      Error   Confidence Limits   ChiSquare   Pr > ChiSq
Intercept    1     2.1752     0.2555    1.6745    2.6759       72.50       <.0001
mode1        1     0.0070     0.0024    0.0023    0.0118        8.34       0.0039
mode2        1     0.0025     0.0028   -0.0030    0.0081        0.81       0.3685
Scale        0     1.0000     0.0000    1.0000    1.0000

Criteria For Assessing Goodness Of Fit (identity link)

Criterion             DF        Value    Value/DF
Deviance               6       4.1971      0.6995
Scaled Deviance        6       4.1971      0.6995
Pearson Chi-Square     6       4.1567      0.6928
Scaled Pearson X2      6       4.1567      0.6928
Log Likelihood               362.6385
Both models show a good fit; slightly better for the log link. Plots of
residuals vs. fitted values are similar for the two models. The normal
probability plot is slightly better for the identity link model:
deviance    df
  38.695    25
  14.587    13
  33.756    21
  36.908    23

It seems that a model with main effects, plus the type*yr_c interaction, would describe the data well. However, this model produces a
near-singular Hessian matrix.
Exercise 6.4
The model analyzed in the text is saturated, which means that the data
should agree perfectly with the model. The model for males is $\log(\hat{\mu}) = \log(t) + \hat{\beta}_0$, which gives $\log(320) = \log(21.4) + \hat{\beta}_0$, i.e. $\hat{\beta}_0 = \log(320) - \log(21.4) = 2.7049$. The model for females is $\log(\hat{\mu}) = \log(t) + \hat{\beta}_0 + \hat{\beta}_1$.
We get $\hat{\beta}_1 = \log(175) - \log(17.3) - 2.7049 = -0.39082$. These results
agree with the computer outputs in the text.
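The arithmetic above is easy to verify. A minimal Python sketch, using the counts (320 and 175) and the offsets t (21.4 and 17.3) quoted in the solution; the variable names are introduced here for illustration:

```python
import math

# Counts and offsets t for males and females, as quoted in the solution.
count_males, t_males = 320, 21.4
count_females, t_females = 175, 17.3

# Saturated model: the fitted values reproduce the observed counts exactly.
beta0 = math.log(count_males) - math.log(t_males)
beta1 = math.log(count_females) - math.log(t_females) - beta0

print("beta0 =", round(beta0, 4))
print("beta1 =", round(beta1, 4))

# Check that the estimates give back the observed counts.
fitted_males = t_males * math.exp(beta0)
fitted_females = t_females * math.exp(beta0 + beta1)
print(round(fitted_males, 6), round(fitted_females, 6))
```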
Exercise 6.5
$(D_1 - D_2)/(df_1 - df_2) = (27.857 - 16.2676)/(19 - 18) = 11.589$, which
is used as an asymptotic $\chi^2$ on 1 d.f. The 5% limit is 3.841; the 1% limit
is 6.635 and the 0.1% limit is 10.828. Our observed value is even larger
than 10.828; the result is clearly significant and $H_0$ can be rejected at the
0.1% level.
The Wald test of the same hypothesis uses the test statistic
$$\frac{\hat{\beta} - 0}{\mathrm{s.e.}(\hat{\beta})} = \frac{0.5878}{0.1764} \approx 3.33 .$$
The Residual vs. Fits plot shows some tendency towards an inverse
trumpet shape, with a decreasing variance for increasing $\hat{y}$. The Normal
plot is rather straight, with a couple of deviating observations at each
end.
Exercise 6.8
A. The test of the hypothesis of no relation between temperature and probability of failure is obtained by calculating the difference in deviance between the null model and the estimated model. These differences can be
interpreted as $\chi^2$ variates on 1 d.f., for which the 5% limit is 3.841 and
the 1% limit is 6.635.
i) Poisson model: $\chi^2 = 22.434 - 16.8337 = 5.600$; this is significant at
the 5% level (p = 0.018).
ii) Binomial model: $\chi^2 = 24.2304 - 18.0863 = 6.1441$; again significant
at the 5% level (p = 0.0132).
Both models indicate a significant relationship between failure risk
and temperature. Note, however, that the number of observations and, in
particular, the number of failures, is so small that the asymptotics may
not work.
B. Predicted values at 31°F are
i) Poisson model: $g(\hat{\mu}) = 5.9691 - 0.1034 \cdot 31 = 2.7637$. Since $g(\cdot)$ is a
log link, this gives $\hat{\mu} = \exp(2.7637) = 15.858$.
C. The Poisson model has the disadvantage that the expected number of failing O-rings is actually larger than the total number on board: we predict
16 failures among 6 O-rings. The Binomial model is more reasonable.
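The Poisson prediction in part B, and the problem noted in part C, can be illustrated with a few lines of Python. This is a sketch using the estimates quoted above; the function name `expected_failures` is hypothetical.

```python
import math

# Estimates for the Poisson model with log link, as quoted in part B.
INTERCEPT = 5.9691
SLOPE = -0.1034      # change in log expected count per degree Fahrenheit
N_ORINGS = 6         # total number of O-rings on board


def expected_failures(temp_f):
    """Expected number of failing O-rings at a given temperature (Poisson, log link)."""
    return math.exp(INTERCEPT + SLOPE * temp_f)


mu_31 = expected_failures(31)
print("Expected failures at 31 F:", round(mu_31, 3))
print("Exceeds the number of O-rings:", mu_31 > N_ORINGS)
```

Since the Poisson mean is unbounded, nothing in the model prevents predictions above 6, which is the reason the Binomial model is preferred.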
The odds ratios can be calculated as $\exp(\hat{\beta}_i)$. In the presence of
interactions the main effect odds ratios are not very illuminating, so we
only consider the interactions. In the table we abbreviate Gender=G;
Location=L; Injury=I and Belt use=B. We interpret the parameters by
the ordered values in the SAS printout. Since N is (alphabetically) before
Y, the odds ratio for, for example, B*I means that persons with B=No
have I=No less often. This, of course, could be stated as "Users of seat
belts are injured less often". For the different interaction effects in the
model we get:

Term   OR                      Comment
G*L    exp(-0.2099) = 0.811    Females traveled in rural areas less often than males.
G*B                            Females avoided belt use less often than males.
G*I                            Females were uninjured less often than males.
L*B                            Belts are avoided less often in rural areas.
L*I                            Passengers are uninjured less often in rural areas.
B*I                            Non-users of belts are uninjured less often.
Exercise 7.1
A generalized linear model with a multinomial distribution and a cumulative logit link gave the following result:
Analysis Of Parameter Estimates

                                 Standard       Wald 95%
Parameter        DF   Estimate      Error   Confidence Limits   ChiSquare   Pr > ChiSq
Intercept1        1    -1.1607     0.1814   -1.5163   -0.8051       40.93       <.0001
Intercept2        1    -0.6222     0.1705   -0.9564   -0.2881       13.32       0.0003
Intercept3        1     1.1782     0.1817    0.8220    1.5344       42.03       <.0001
treatment   BP    1     0.3219     0.2216   -0.1124    0.7563        2.11       0.1462
treatment   CP    0     0.0000     0.0000    0.0000    0.0000         .          .
Scale             0     1.0000     0.0000    1.0000    1.0000
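To see what the cumulative logit estimates above imply, they can be converted into category probabilities. The Python sketch below rests on assumptions that are not confirmed by the output itself: that the response has four ordered categories (three intercepts), and that the model is parameterized as logit P(Y ≤ j) = Intercept_j + treatment effect. The names `inv_logit` and `category_probabilities` are introduced here for illustration.

```python
import math

# Estimates from the table above.
INTERCEPTS = [-1.1607, -0.6222, 1.1782]     # Intercept1..Intercept3
TREATMENT = {"BP": 0.3219, "CP": 0.0}       # CP is the reference level


def inv_logit(eta):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def category_probabilities(treatment):
    """Probabilities of the ordered categories, assuming logit P(Y <= j) = a_j + effect."""
    cumulative = [inv_logit(a + TREATMENT[treatment]) for a in INTERCEPTS] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)   # difference of successive cumulative probabilities
        previous = c
    return probs


for treatment in ("BP", "CP"):
    print(treatment, [round(p, 4) for p in category_probabilities(treatment)])
```

Under this parameterization a positive treatment estimate shifts probability towards the lower categories, although here the effect is not significant (p = 0.1462).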
                                  Standard       Wald 95%
Parameter           DF   Estimate      Error   Confidence Limits   ChiSquare   Pr > ChiSq
Intercept1           1    -2.7703     0.2500   -3.2602   -2.2803      122.80       <.0001
Intercept2           1    -0.4759     0.1337   -0.7380   -0.2137       12.66       0.0004
mammo    < 1 year    1    -1.4753     0.3247   -2.1117   -0.8388       20.64       <.0001
mammo    > 1 year    1    -0.4926     0.2928   -1.0664    0.0812        2.83       0.0924
mammo    Never       0     0.0000     0.0000    0.0000    0.0000         .          .
Scale                0     1.0000     0.0000    1.0000    1.0000
An alternative model is the linear by linear association model. For
these data, this model gives:
Analysis Of Parameter Estimates

                                    Standard       Wald 95%
Parameter            DF   Estimate      Error   Confidence Limits   ChiSquare   Pr > ChiSq
Intercept             1     3.6821     0.0707    3.5435    3.8206     2711.93       <.0001
cancer    0           1    -1.0564     0.1036   -1.2595   -0.8533      103.95       <.0001
cancer    1           1     0.0171     0.0411   -0.0633    0.0976        0.17       0.6763
cancer    2           0     0.0000     0.0000    0.0000    0.0000         .          .
mammo     < 1 year    1    -3.0357     0.1375   -3.3052   -2.7662      487.41       <.0001
mammo     > 1 year    1    -2.2606     0.0688   -2.3954   -2.1258     1079.83       <.0001
mammo     Never       0     0.0000     0.0000    0.0000    0.0000         .          .
c*m                   1     0.6437     0.0348    0.5756    0.7119      343.06       <.0001
Scale                 0     1.0000     0.0000    1.0000    1.0000
                              Standard       Wald 95%
Parameter     DF   Estimate      Error   Confidence Limits   ChiSquare   Pr > ChiSq
Intercept      1     0.0057     0.0036   -0.0014    0.0128        2.44       0.1180
ag      0      1     0.0431     0.0174    0.0089    0.0773        6.09       0.0136
ag      1      0     0.0000     0.0000    0.0000    0.0000         .          .
lwbc           1     0.0061     0.0024    0.0014    0.0109        6.37       0.0116
Scale          1     0.9968     0.2160    0.6518    1.5242
A plot of observed survival times for the two groups, along with the
survival times predicted by the model, is as follows:
Exercise 8.2
The data were run using the macro for variance heterogeneity listed
in the Genmod manual. The results for the mean value model were as
follows:
Mean model

Parameter      DF   Estimate   StdErr    LowerCL   UpperCL   ChiSq   Prob ChiSq
Intercept       1    19.7200   7.6955     4.6370   34.8030    6.57       0.0104
group     a     1   -16.7533   7.7032   -31.8514   -1.6553    4.73       0.0296
group     b     1   -11.5400   7.7790   -26.7866    3.7066    2.20       0.1379
group     c     0     0.0000   0.0000     0.0000    0.0000     .          .
Scale           0     1.0000   0.0000     1.0000    1.0000     _          _

Variance model

Parameter      DF   Estimate   StdErr    LowerCL   UpperCL   ChiSq   Prob ChiSq
Intercept       1     5.6907   0.6325     4.4511    6.9303   80.96       <.0001
group     a     1    -6.0301   0.8563    -7.7085   -4.3517   49.58       <.0001
group     b     1    -3.1318   0.7746    -4.6500   -1.6136   16.35       <.0001
group     c     0     0.0000   0.0000     0.0000    0.0000     .          .
Scale           0     0.5000   0.0000     0.5000    0.5000     _          _
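The estimates in the second table can be related to the group standard deviations reported in the solution to Exercise 1.2, which appear to concern the same cortisol data (the group means agree). The Python sketch below assumes that the second table models the variance on a log scale and that the fitted variances are maximum likelihood estimates, i.e. sums of squares divided by n rather than n - 1; both assumptions are inferred from the numbers, not stated in the output.

```python
import math

# Variance model estimates (assumed to be on the log scale), group c as reference.
LOG_VAR_INTERCEPT = 5.6907
LOG_VAR_EFFECT = {"a": -6.0301, "b": -3.1318, "c": 0.0}

# Group sizes and sample standard deviations from the solution to Exercise 1.2.
N = {"a": 6, "b": 10, "c": 5}
STD_DEV = {"a": 0.9244818, "b": 3.7891072, "c": 19.2388149}


def model_variance(group):
    """Variance implied by the variance model for the given group."""
    return math.exp(LOG_VAR_INTERCEPT + LOG_VAR_EFFECT[group])


def ml_variance(group):
    """Maximum likelihood variance: sample variance rescaled from n - 1 to n."""
    n = N[group]
    return STD_DEV[group] ** 2 * (n - 1) / n


for group in ("a", "b", "c"):
    print(group, round(model_variance(group), 3), round(ml_variance(group), 3))
```

The close agreement between the two columns illustrates how strongly the variance differs between the groups, which is what the significant group effects in the variance model express.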
submitted.
%glimmix(
data=labexp,
stmts=%str(
class pot var1 var2;
model x2/n2 = var1 var2 var1*var2;
random pot*var1*var2;
),
error=binomial, link=logit );
run;
                                         Standard
Effect       Var1   Var2    Estimate        Error    DF   t Value   Pr > |t|
Intercept                     1.6849       0.2299    64      7.33     <.0001
Var1         A               -0.1987       0.3172    64     -0.63     0.5332
Var1         F                0.4159       0.3447    64      1.21     0.2320
Var1         H               -0.1333       0.3194    64     -0.42     0.6779
Var1         K                0             .         .       .        .
Var2                A        -0.6850       0.3047    64     -2.25     0.0280
Var2                F         0.1817       0.3324    64      0.55     0.5865
Var2                H        -0.4186       0.3107    64     -1.35     0.1826
Var2                K         0             .         .       .        .
Var1*Var2    A      A         0.9072       0.4402    64      2.06     0.0434
Var1*Var2    A      F        -0.9360       0.4423    64     -2.12     0.0382
Var1*Var2    A      H        -0.3074       0.4266    64     -0.72     0.4737
Var1*Var2    A      K         0             .         .       .        .
Var1*Var2    F      A         0.2112       0.4581    64      0.46     0.6464
Var1*Var2    F      F        -0.5441       0.4800    64     -1.13     0.2612
Var1*Var2    F      H        -0.6812       0.4501    64     -1.51     0.1351
Var1*Var2    F      K         0             .         .       .        .
Var1*Var2    H      A         0.4641       0.4323    64      1.07     0.2870
Var1*Var2    H      F        -0.07013      0.4600    64     -0.15     0.8793
Var1*Var2    H      H         0.4596       0.4426    64      1.04     0.3030
Var1*Var2    H      K         0             .         .       .        .
Var1*Var2    K      A         0             .         .       .        .
Var1*Var2    K      F         0             .         .       .        .
Var1*Var2    K      H         0             .         .       .        .
Var1*Var2    K      K         0             .         .       .        .

Effect       Num DF   Den DF   F Value   Pr > F
Var1              3       64      3.22   0.0285
Var2              3       64      4.37   0.0074
Var1*Var2         9       64      3.05   0.0042
There is a significant interaction between varieties, i.e. some combinations of varieties are more palatable than others to the lice. This conclusion may be followed up by comparing least squares mean values for the
different combinations.
Index
compound distribution, 101, 134
computer software, 24
conditional independence, 119
conditional odds ratio, 120
confidence interval, 7
constraints, 4
contingency table, 111
contrast, 15
Cook's distance, 60
correlation structure, 166
count data, 111
covariance analysis, 21
Cramér-Rao inequality, 188
cross-over design, 169
cumulative logits, 148
dependent variable, 2
design matrix, 2, 42
deterministic model, 1
deviance, 45
deviance residual, 57
Dfbeta, 60
dilution assay, 86
dispersion parameter, 39
dummy variable, 12, 14
Bernoulli distribution, 87
binomial distribution, 37, 88, 113
Bonferroni adjustment, 16
boxplot, 17
canonical link, 42
canonical parameter, 37
capture-recapture data, 122
censoring, 158
chi-square distribution, 73
chi-square test, 117
class variables, 26
classification variables, 12
coefficient of determination, 5
comparisonwise error rate, 15
complementary log-log link, 40, 86
ED50, 92
empirical estimator
robust estimator
sandwich estimator, 162
estimable functions, 23
exchangeable correlation structure, 166
expected frequencies, 112
experimentwise error rate, 15
exponential dispersion family, 37
exponential distribution, 53, 75
exponential family, 31, 36, 37
extreme value distribution, 87, 151
F test, 6
factorial experiments, 18
Fisher information, 188
Fisher's scoring, 190
fitted value, 4
fixed effects, 169
frequency table, 111
full model, 45
gamma distribution, 73
gamma function, 73
GEE, 165
general linear model, ix, 1, 2
Generalized estimating equations, 165
generalized inverse, 4
generalized linear model, ix, 36
geometric distribution, 134
Glimmix, 169
Gumbel distribution, 87
hat matrix, 56
hat notation, 3
hazard function, 159
Hessian matrix, 189
homogenous association, 119
homoscedasticity, 24
identity link, 40
independent variable, 2
influential observations, 55, 59
interaction, 18, 112
intercept, 3
intrinsically nonlinear models, 23
iteratively reweighted least squares, 44, 190
Kaplan-Meier estimates, 159
latent variable, 151
quasi-likelihood, 162
R-square, 5
random effects, 169
rate data, 131
RC model, 148
relative risk, 98
repeated measures data, 165
residual, 1, 3, 4, 56
residual plots, 55
residual sum of squares, 5
response variable, 31
response variables, binary, 32
response variables, binomial, 32, 33
response variables, continuous, 32
response variables, counts, 32, 34
response variables, rates, 32, 35
SAS, 8, 15, 16, 26
saturated model, 45, 113
scale parameter, 133
scaled deviance, 45
score equation, 188
score residual, 57
score test, 48
sequential sum of squares, 8
simple linear regression, 8
statistical independence, 112
statistical model, 1
sum of squares, 4
survival data, 158
survival function, 158
t test, 12
tests on subsets of parameters, 7
tolerance distribution, 85
total sum of squares, 4
truncated Poisson distribution, 53
type 1 SS, 8
Type 1 test, 49
type 2 SS, 8
type 3 SS, 8
Type 3 test, 49
type 4 SS, 8
underdispersion, 61
variance function, 39
variance heterogeneity, 157
Wald test, 47
Wilcoxon-Mann-Whitney test, 52