STATISTICAL MODELS
for the Plant and Soil Sciences
Oliver Schabenberger
Francis J. Pierce
CRC PRESS
Boca Raton London New York Washington, D.C.
© 2002 by CRC Press LLC
Schabenberger, Oliver.
Contemporary statistical models for the plant and soil sciences / Oliver Schabenberger
and Francis J. Pierce.
p. cm.
Includes bibliographical references (p. ).
ISBN 1-58488-111-9 (alk. paper)
1. Plants, Cultivated—Statistical methods. 2. Soil science—Statistical methods. I.
Pierce, F. J. (Francis J.) II. Title.
This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with
permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish
reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials
or for the consequences of their use.
Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical,
including photocopying, microfilming, and recording, or by any information storage or retrieval system, without prior
permission in writing from the publisher.
The consent of CRC Press LLC does not extend to copying for general distribution, for promotion, for creating new works,
or for resale. Specific permission must be obtained in writing from CRC Press LLC for such copying.
Direct all inquiries to CRC Press LLC, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation, without intent to infringe.
Preface
About the Authors
1 Statistical Models
1.1 Mathematical and Statistical Models
1.2 Functional Aspects of Models
1.3 The Inferential Steps — Estimation and Testing
1.4 t-Tests in Terms of Statistical Models
1.5 Embedding Hypotheses
1.6 Hypothesis and Significance Testing — Interpretation of the p-Value
1.7 Classes of Statistical Models
1.7.1 The Basic Component Equation
1.7.2 Linear and Nonlinear Models
1.7.3 Regression and Analysis of Variance Models
1.7.4 Univariate and Multivariate Models
1.7.5 Fixed, Random, and Mixed Effects Models
1.7.6 Generalized Linear Models
1.7.7 Errors in Variable Models
2 Data Structures
2.1 Introduction
2.2 Classification by Response Type
2.3 Classification by Study Type
2.4 Clustered Data
2.4.1 Clustering through Hierarchical Random Processes
2.4.2 Clustering through Repeated Measurements
2.5 Autocorrelated Data
2.5.1 The Autocorrelation Function
2.5.2 Consequences of Ignoring Autocorrelation
2.5.3 Autocorrelation in Designed Experiments
2.6 From Independent to Spatial Data — a Progression of Clustering
5 Nonlinear Models
5.1 Introduction
5.2 Models as Laws or Tools
5.3 Linear Polynomials Approximate Nonlinear Models
5.4 Fitting a Nonlinear Model to Data
5.4.1 Estimating the Parameters
5.4.2 Tracking Convergence
5.4.3 Starting Values
5.4.4 Goodness-of-Fit
5.5 Hypothesis Tests and Confidence Intervals
5.5.1 Testing the Linear Hypothesis
5.5.2 Confidence and Prediction Intervals
5.6 Transformations
5.6.1 Transformation to Linearity
5.6.2 Transformation to Stabilize the Variance
5.7 Parameterization of Nonlinear Models
5.7.1 Intrinsic and Parameter-Effects Curvature
5.7.2 Reparameterization through Defining Relationships
5.8 Applications
5.8.1 Basic Nonlinear Analysis with The SAS® System — Mitscherlich's
Yield Equation
5.8.2 The Sampling Distribution of Nonlinear Estimators —
the Mitscherlich Equation Revisited
5.8.3 Linear-Plateau Models and Their Relatives — a Study of Corn
Yields from Tennessee
5.8.4 Critical NO₃ Concentrations as a Function of Sampling Depth —
Comparing Join-Points in Plateau Models
5.8.5 Factorial Treatment Structure with Nonlinear Response
5.8.6 Modeling Hormetic Dose Response through Switching Functions
5.8.7 Modeling a Yield-Density Relationship
5.8.8 Weighted Nonlinear Least Squares Analysis with
Heteroscedastic Errors
To the Reader
Statistics is essentially a discipline of the twentieth century, and for several decades it was
keenly involved with problems of interpreting and analyzing empirical data that originate in
agronomic investigations. The vernacular of experimental design in use today bears evidence
of the agricultural connection and origin of this body of theory. Omnipresent terms, such as
block or split-plot, emanated from descriptions of blocks of land and experimental plots in
agronomic field designs. The theory of randomization in experimental work was developed
by Fisher to neutralize, in particular, the spatial effects he realized existed among field
plots. Despite its many origins in agronomic problems, statistics today is
often unrecognizable in this context. Numerous recent methodological approaches and
advances originated in other subject-matter areas and agronomists frequently find it difficult
to see their immediate relation to questions that their disciplines raise. On the other hand,
statisticians often fail to recognize the riches of challenging data analytical problems
contemporary plant and soil science provides. One could gain the impression that
• statistical methods of concern to plant and soil scientists are completely developed and
understood;
• the analytical tools of classical statistical analysis learned in a one- or two-semester
course for non-statistics majors are sufficient to cope with data analytical problems;
• recent methodological work in statistics applies to other disciplines such as human
health, sociology, or economics, and has no bearing on the work of the agronomist;
• there is no need to consider contemporary statistical methods and no gain in doing so.
These impressions are incorrect. Data collected in many investigations and the circum-
stances under which they are accrued often bear little resemblance to classically designed ex-
periments. Much of the data analysis in the plant and soil sciences is nevertheless viewed in
the experimental design framework. Ground and remote sensing technology, yield
monitoring, and geographic information systems are but a few examples where analysis cannot
necessarily be cast, nor should it be coerced, into a standard analysis of variance framework.
As our understanding of the biological/physical/environmental/ecological mechanisms in-
creases, we are more and more interested in what some have termed the space/time dynamics
of the processes we observe or set into motion by experimentation. It is one thing to collect
data in space and/or over time; it is another matter to apply the appropriate statistical tools to
infer what the data are trying to tell us. While many of the advances in statistical
methodologies in past decades have not explicitly focused on agronomic applications, it
would be incorrect to assume that these methods are not fruitfully applied there. Geostatistical
methods, mixed models for repeated measures and longitudinal data, generalized linear
models for non-normal (= non-Gaussian) data, and nonlinear models are cases in point.
To the User
Contemporary statistical models cannot be appreciated to their full potential without a
good understanding of theory. Hence, we place emphasis on that. They also cannot be applied
to their full potential without the aid of statistical software. Hence, we place emphasis on that.
The main chapters are roughly equally divided between coverage of essential theory and
applications. Additional theoretical derivations and mathematical details needed to develop a
deeper understanding of the models can be found on the companion CD-ROM. The choice to
focus on The SAS® System for calculations was simple. It is, in our opinion, the most
powerful statistical computing platform and the most widely available and accepted com-
puting environment for statistical problems in academia, industry, and government. In rare
cases when procedures in SAS® were not available and macros too cumbersome we
employed the S-PLUS® package, in particular the S+SpatialStats® module. The important
portions of the executed computer code are shown in the text along with the output. All data
sets and SAS® or S-PLUS® codes are contained on the CD-ROM.
To the Instructor
This text is both a reference and a textbook and was developed with a reader in mind who
has had a first course in statistics covering simple and multiple linear regression and analysis
of variance, who is familiar with the principles of experimental design, and who is willing to absorb a
Francis J. Pierce is the director of the Center for Precision Agricultural Systems at Washington
State University, located at the WSU Irrigated Agriculture
Research & Extension Center (IAREC) in Prosser, Washington.
He is also a professor in the departments of crop and soil sciences
and biological systems engineering and directs the WSU Public
Agricultural Weather System. Dr. Pierce received his M.S. and
Ph.D. degrees in soil science from the University of Minnesota in
1980 and 1984. He spent the next 16 years at Michigan State
University, where he served as professor of soil science in the
department of crop and soil sciences from 1995. His expertise is
in soil management and he has been involved in the development
and evaluation of precision agriculture since 1991. The Center for
Precision Agricultural Systems was funded by the Washington
Legislature as part of the University's Advanced Technology Initiative in 1999. As center
director, Dr. Pierce's mission is to advance the science and practice of precision agriculture in
Washington. The center's efforts will support the competitive production of Washington's
agricultural commodities, stimulate the state's economic development, and protect the region's
environmental and natural resources. Dr. Pierce has edited three other books, Soil
Management for Sustainability, Advances in Soil Conservation, and The State of Site-Specific
Management for Agriculture.
Chapter 1
Statistical Models
“A theory has only the alternative of being right or wrong. A model has a
third possibility: it may be right, but irrelevant.” Manfred Eigen, in The
Physicist's Conception of Nature (Jagdish Mehra, Ed.), 1973.
1.1 Mathematical and Statistical Models

The ability to represent phenomena and processes of the biological, physical, chemical, and
social world through models is one of the great scientific achievements of humankind. Scien-
tific models isolate and abstract the elementary facts and relationships of interest and provide
the logical structure in which a system is studied and from which inferences are drawn.
Identifying the important components of a system and isolating the facts of primary interest is
necessary to focus on those aspects relevant to a particular inquiry. Abstraction is necessary
to cast the facts in a logical system that is concise, deepens our insight, and is understood by
others to foster communication, critique, and technology transfer. Mathematics is the most
universal and powerful logical system, and it comes as no surprise that most scientific models
in the life sciences or elsewhere are either developed as mathematical abstractions of real phe-
nomena or can be expressed as such. A purely mathematical model is a mechanistic
(= deterministic) device in that for a given set of inputs, it predicts the output with absolute
certainty. It leaves nothing to chance. Beltrami (1998, p. 86), for example, develops the
following mathematical model for the concentration α of a pollutant in a river at point s and
time t:

    α(s,t) = α₀(s − λt)exp{−μt}.    [1.1]

Adding a random deviation e to [1.1] yields the stochastic model

    α(s,t) = α₀(s − λt)exp{−μt} + e,    [1.2]

where e is a random variable with mean 0, variance σ², and some probability distribution.
Allowing for the random deviation e, model [1.2] now states explicitly that α(s,t) is a random
variable and the expression α₀(s − λt)exp{−μt} is the expected value or average
pollutant concentration,

    E[α(s,t)] = α₀(s − λt)exp{−μt}.
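The data-generating nature of model [1.2] is easy to demonstrate by simulation. The following sketch (in Python; the initial profile α₀ and all parameter values are invented for illustration) produces the expected concentration profile under [1.1] and one noisy realization under [1.2]:

import numpy as np

rng = np.random.default_rng(1)

def alpha0(s):
    # Hypothetical initial concentration profile: a Gaussian pulse.
    return 10.0 * np.exp(-0.5 * ((s - 2.0) / 0.5) ** 2)

lam, mu, sigma = 1.5, 0.3, 0.2   # velocity, decay rate, error std. dev. (invented)
s = np.linspace(0.0, 10.0, 201)  # points along the river
t = 1.0                          # time of observation

mean = alpha0(s - lam * t) * np.exp(-mu * t)  # E[alpha(s,t)], model [1.1]
obs = mean + rng.normal(0.0, sigma, s.size)   # one realization under model [1.2]

Repeating the last line generates further data sets that share the same mean structure but differ by chance alone.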
Olkin et al. (1978, p. 4) conclude: “The assumption that chance phenomena exist and can
be described, whether true or not, has proved valuable in almost every discipline.” Of the
many reasons for incorporating stochastic elements in scientific models, an incomplete list
includes the following.
• The model is not correct for a particular observation, but correct on average.
• Omissions and assumptions are typically necessary to abstract a phenomenon.
• Even if the nature of all influences were known, it may be impossible to measure or
even observe all the variables.
• Scientists do not develop models without validation and calibration with real data. The
innate variability (nonconstancy) of empirical data stems from systematic and random
effects. Random measurement errors, observational (sampling) errors due to sampling
a population rather than measuring its entirety, experimental errors due to lack of
homogeneity in the experimental material or the application of treatments, account for
stochastic variation in the data even if all systematic effects are accounted for.
• Randomness is often introduced deliberately because it yields representative samples
from which unbiased inferences can be drawn. A random sample from a population
will represent the population (on average), regardless of the sample size. Treatments
are assigned to experimental units by a random mechanism to neutralize the effects of
unaccounted sources of variation which enables unbiased estimates of treatment
means and their differences (Fisher 1935). Replication of treatments guarantees that
experimental error variation can be estimated. Only in combination with
randomization will this estimate be free of bias.
• Stochastic models are often more parsimonious than deterministic models and easier to
study. A deterministic model for the germination of seeds from a large lot, for
example, would incorporate a plethora of factors, their actions and interactions. The
plant species and variety, storage conditions, the germination environment, amount of
non-seed material in the lot, seed-to-seed differences in nutrient content, plant-to-plant
interactions, competition, soil conditions, etc. must be accounted for. Alternatively, we
can think of the germination of a particular seed from the lot as a Bernoulli random
variable with success (germination) probability π. That is, if Y_i takes on the value 1 if
seed i germinates and the value 0 otherwise, then the probability distribution of Y_i is
simply

    p(y_i) = π      if y_i = 1
             1 − π  if y_i = 0.
Statistical models, in terminology that we adopt for this text, are stochastic models that
contain unknown constants (parameters). In the river pollution example, the model

    α(s,t) = α₀(s − λt)exp{−μt} + e,  E[e] = 0, Var[e] = σ²

is a stochastic model if all parameters (α₀, λ, μ, σ²) are known. (Note that e is not a constant
but a random variable. Its mean and variance are constants, however.) Otherwise it is a statistical
model and those constants that are unknown must be estimated from data. In the seed
germination example, the germination probability π is unknown, hence the model

    p(y_i) = π      if y_i = 1
             1 − π  if y_i = 0

is a statistical one. The parameter π is estimated based on a sample of n seeds from the lot.
This usage of the term parameter is consistent with statistical theory but not necessarily with
modeling practice. Any quantity that drives a model is often termed a parameter of the
model. We will refer to parameters only if they are unknown constants. Variables that can be
measured, such as plant density in the model of a yield-density relationship, are not parameters.
The rate of change of plant yield as a function of plant density is a parameter.
1.2 Functional Aspects of Models

• Statistical models represent a mechanism from which data with the same
statistical properties as the observed data can be generated.
Consider, as an example, the simple linear regression model

    Y_i = β₀ + β₁x_i + e_i.    [1.3]

The model errors e_i are assumed to be independent and identically distributed (iid) according
to a Gaussian distribution (we use this denomination instead of Normal distribution
throughout) with mean 0 and variance σ². As a consequence, Y_i is also distributed as a
Gaussian random variable with mean E[Y_i] = β₀ + β₁x_i and variance σ²,

    Y_i ~ G(β₀ + β₁x_i, σ²).
The Y_i are not identically distributed because their means are different, but they remain
independent (a result of drawing a random sample). Since a Gaussian distribution is completely
specified by its mean and variance, the distribution of the Y_i is completely known once
values for the parameters β₀, β₁, and σ² are known. For many statistical purposes the
assumption of Gaussian errors is more than what is required. To derive unbiased estimators of the
intercept β₀ and slope β₁, it is sufficient that the errors have zero mean. A simple linear
regression model with lesser assumptions than [1.3] would be, for example,

    Y_i = β₀ + β₁x_i + e_i,  e_i ~ iid (0, σ²).
The errors are assumed independent zero mean random variables with equal variance (homoscedastic),
but their distribution is otherwise not specified. This is sometimes referred to as
the first-two-moments specification of the model. Only the mean and variance of the Y_i can
be inferred:

    E[Y_i] = β₀ + β₁x_i
    Var[Y_i] = σ².
Even if the parameters β₀, β₁, and σ² were known, this model would be an incomplete description
of the distributional properties of the response Y_i. Implicit in the description of distributional
properties is a separation of variability into known sources, e.g., the dependency of Y on x,
and unknown sources (error), and a description of the form of the dependency. Here, Y is
assumed to depend linearly on the regressor. Expressing which regressors Y depends on individually
and simultaneously, and how this dependency can be crafted mathematically, is one
important aspect of statistical modeling.
To conceptualize what constitutes a useful statistical model, we appeal to what we con-
sider the most important functional aspect. A statistical model provides a mechanism to gene-
rate the essence of the data such that the properties of the data generated under the model are
statistically equivalent to the observed data. In other words, the observed data can be con-
sidered as one particular realization of the stochastic process that is implied by the model. If
the relevant features of the data cannot be realized under the assumed model, it is not useful.
The upper left panel in Figure 1.1 shows n = 21 yield observations as a function of the
amount of nitrogen fertilization. Various candidate models exist to model the relationship
between plant yield and fertilizer input. One class of models, the linear-plateau models
(§5.8.3, §5.8.4), consists of segmented models connecting a linear regression with a flat plateau
yield. The upper right panel of Figure 1.1 shows the distributional specification of a linear-plateau
model. If α denotes the nitrogen concentration at which the two segments connect, the
model for the average plant yield at concentration N can be written as

    E[Yield] = β₀ + β₁N   N ≤ α
               β₀ + β₁α   N > α.

If I(x) is the indicator function returning value 1 if the condition x holds and 0 otherwise, the
statistical model can also be expressed as

    Yield = (β₀ + β₁N)·I(N ≤ α) + (β₀ + β₁α)·I(N > α) + e,  e ~ G(0, σ²).    [1.4]
Figure 1.1. Yield data as a function of N input (upper left panel), linear-plateau model as an
assumed data-generating mechanism (upper right panel). Fit of linear-plateau model and
competing models are shown in the lower panels.
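Although the applications in §5.8.3 fit such models with The SAS® System, the linear-plateau model [1.4] is also easily fit by nonlinear least squares in other software. A minimal Python sketch, using 21 simulated observations and invented parameter values:

import numpy as np
from scipy.optimize import curve_fit

def lin_plateau(N, b0, b1, alpha):
    # E[Yield] = b0 + b1*N for N <= alpha; constant plateau b0 + b1*alpha beyond.
    return np.where(N <= alpha, b0 + b1 * N, b0 + b1 * alpha)

rng = np.random.default_rng(42)
N = np.repeat(np.arange(0.0, 210.0, 30.0), 3)               # hypothetical N rates
y = lin_plateau(N, 45.0, 0.25, 120.0) + rng.normal(0, 3, N.size)

(b0, b1, alpha), cov = curve_fit(lin_plateau, N, y, p0=[40.0, 0.2, 100.0])
print(b0, b1, alpha, b0 + b1 * alpha)                       # estimates and plateau yield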
To fix ideas, consider the four candidate models

    ①: Y_i = β₀ + e_i
    ②: Y_i = β₀ + β₁x_i + e_i
    ③: Y_i = β₀ + β₁x_i + β₂x_i² + e_i
    ④: Y_i = β₀ + β₁x_i + β₂z_i + e_i.
Models ① through ③ are nested models; so are ①, ②, and ④. The restriction β₂ = 0 in
model ③ produces model ②, the restriction β₁ = 0 in model ② produces the intercept-only
model ①, and β₁ = β₂ = 0 yields ① from ③. To decide among these three models, one can
commence by fitting model ③ and perform hypothesis tests for the respective restrictions.
Based on the results of these tests we are led to the best model among ①, ②, ③. Should we
have started with model ④ instead? Model ④ is a two-regressor multiple linear regression
model, and a comparison between models ③ and ④ by means of a hypothesis test is not
possible; the models are not nested. Other criteria must be employed to discriminate between
them. The appropriate criteria will depend on the intended use of the statistical model. If it is
important that the model fits well to the data at hand, one may rely on the coefficient of
determination (R²). To guard against overfitting, Mallows' C_p statistic or likelihood-based
statistics such as Akaike's information criterion (AIC) can be used. If precise predictions are
required, then one can compare models based on cross-validation criteria or the PRESS statistic.
Variance inflation factors and other collinearity diagnostics come to the fore if statistical
properties of the parameter estimates are important (see, e.g., Myers 1990, and our §§4.4,
A4.8.3). Depending on which criteria are chosen, different models might emerge as best.
Among the informal procedures of model critique are various graphical displays, such as
plots of residuals against predicted values and regressors, partial residual plots, normal proba-
bility and Q-Q plots, and so forth. These are indispensable tools of statistical analysis but they
are often overused and misused. As we will see in §4.4, the standard collection of residual
measures in linear models is ill-suited to pass judgment about whether the unobservable
model errors are Gaussian-distributed or not. For most applications, it is not the Gaussian
assumption whose violation is most damaging, but the homogeneous variance and the inde-
pendence assumptions (§4.5). In nonlinear models the behavior of the residuals in a correctly
chosen model can be very different from the textbook behavior of fitted residuals in a linear
model (§5.7.1). Plotting studentized residuals against fitted values in a linear model, one
expects a band of random scatter about zero. In a nonlinear model where intrinsic curvature is
large, one should look for a negative trend between the residuals and fitted values as a sign of
a well-fitting model.
Besides model discrimination based on statistical procedures or displays, the subject
matter hopefully plays a substantive role in choosing among competing models. Interpreta-
bility and parsimony are critical assets of a useful statistical model. Nothing is gained by
building models that are so large and complex that they are no longer interpretable as a whole
or involve factors that are impossible to observe in practice. Adding variables to a regression
model will necessarily increase V # , but can also create conditions where the relationships
among the regressor variables render estimation unstable, predictions imprecise, and
interpretation increasingly difficult (see §§4.4.4, 4.4.5 on collinearity, its impact, diagnosis,
and remedy). Medieval Franciscan monk William of Ockham (1285-1349) is credited with
coining pluralitas non est ponenda sine necessitate, or plurality should not be assumed
(posited) without necessity. Also known as Ockham's Razor, this tenet is often loosely
phrased as “among competing explanations, pick the simplest one.” When choosing among
competing statistical models, simple does not just imply the smallest possible model. The
selected model should be simple to fit, simple to interpret, simple to justify, and simple to
apply. Nonlinear models, for example, have long been considered difficult to fit to data. Even
recently, Black (1993, p. 65) refers to the “drudgery connected with the actual fitting” of non-
linear models. Although there are issues to be considered when modeling nonlinear relation-
ships that do not come to bear with linear models, the actual process of fitting a nonlinear
model with today's computing support is hardly more difficult than fitting a linear regression
model (see §5.4).
Returning to the yield-response example in Figure 1.1, of the four models plotted in the
lower panels, the straight-line regression model is ruled out because of poor fit. The other
three models, however, have very similar goodness-of-fit statistics and we consider them
competitors. Table 1.1 gives the formulas for the mean yield under these models. Each is a
three-parameter model; the first two are nonlinear, the quadratic polynomial is a linear model.
All three models are easy to fit with statistical software. The selection thus boils down to their
interpretability and justifiability.
Each model contains one parameter measuring the average yield if no N is applied. The
linear-plateau model achieves the yield maximum of β₀ + β₁α at precisely N = α; the
Mitscherlich equation approaches the yield maximum λ asymptotically (as N → ∞). The
quadratic polynomial does not have a yield plateau or asymptote, but achieves a yield maximum
at N = −γ₁/(2γ₂). Increasing yields are recorded for N < −γ₁/(2γ₂) and decreasing
yields for N > −γ₁/(2γ₂). Since the yield increase is linear in the plateau model (up to
N = α), a single parameter describes the rate of change in yield. In the Mitscherlich model,
where the transition between the yield minimum ξ and the upper asymptote λ is smooth, no single
parameter measures the rate of change, but one parameter (κ) governs it:

    ∂E[Yield]/∂N = κ(λ − ξ)exp{−κN}.
The standard interpretation of regression coefficients in a multiple linear regression equation
is to measure the change in the mean response if the associated regressor increases by one
unit while all other regressors are held constant. In the quadratic polynomial the linear and
quadratic coefficients (γ₁, γ₂) cannot be interpreted this way. Changing N while holding N²
constant is not possible. The quadratic polynomial has a linear rate of change,

    ∂E[Yield]/∂N = γ₁ + 2γ₂N.
Table 1.1. Three-parameter yield response models and the interpretation of their parameters

Linear-plateau: E[Yield] = β₀ + β₁N for N ≤ α; β₀ + β₁α for N > α (3 parameters)
    β₀: E[Yield] at N = 0
    β₁: change in E[Yield] per 1 kg/ha additional N prior to reaching the plateau
    α: N amount at which the plateau is reached

Mitscherlich: E[Yield] = λ + (ξ − λ)exp{−κN} (3 parameters)
    λ: upper E[Yield] asymptote
    ξ: E[Yield] at N = 0
    κ: governs the rate of change

Quadratic polynomial: E[Yield] = γ₀ + γ₁N + γ₂N² (3 parameters)
    γ₀: E[Yield] at N = 0
    γ₁: ∂E[Yield]/∂N at N = 0
    γ₂: ½ ∂²E[Yield]/∂N²
Interpretability of the model parameters clearly favors the nonlinear models. Biological
relationships rarely exhibit sharp transitions and kinks. Smooth, gradual transitions are more
likely. The Mitscherlich model may be more easily justifiable than the linear-plateau model.
If no decline of yields was observed over the range of N applied, resorting to a model that
will invariably have a maximum at some N amount, be it within the observed range or
outside of it, is tenuous.
One appeal of the linear-plateau model is that it estimates the amount of N at which the two
segments connect, a special case of what is known in dose-response studies as an effective (or
critical) dosage. To compute an effective dosage in the Mitscherlich model the user must
specify the response the dosage is supposed to achieve. For example, the nitrogen fertilizer
amount N_K that produces K% of the asymptotic yield in the Mitscherlich model is obtained
by solving λK/100 = λ + (ξ − λ)exp{−κN_K} for N_K,

    N_K = −(1/κ) ln{ λ(100 − K) / ((λ − ξ)100) }.
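For instance, under invented parameter values the dosage producing K = 95% of the asymptotic yield can be computed directly from this expression; a quick Python check confirms that the Mitscherlich mean at N_K equals 0.95λ:

import numpy as np

lam, xi, kappa = 80.0, 45.0, 0.02   # hypothetical Mitscherlich parameters
K = 95.0                            # target percentage of the asymptotic yield

N_K = -np.log(lam * (100 - K) / ((lam - xi) * 100)) / kappa
yield_at_NK = lam + (xi - lam) * np.exp(-kappa * N_K)
print(N_K, yield_at_NK, 0.95 * lam)  # yield_at_NK equals 0.95*lam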
1.3 The Inferential Steps — Estimation and Testing

• The two most important statistical estimation principles are the principle of
least squares and the principle of maximum likelihood. Each appears in many
different flavors.
After selecting a statistical model its parameters must be estimated. If the fitted model is
accepted as a useful abstraction of the phenomenon under study, further inferential steps in-
volve the calculation of confidence bounds for the parameters, the testing of hypotheses about
the parameters, and the calculation of predicted values. Of the many estimation principles at
our disposal, the most important ones are the least squares and the maximum likelihood prin-
ciples. Almost all of the estimation methods that are discussed and applied in subsequent
chapters are applications of these basic principles. Least squares (LS) was advanced and
maximum likelihood (ML) proposed by Carl Friedrich Gauss (1777-1855) in the early
nineteenth century. R.A. Fisher is usually credited with the (re-)discovery of the likelihood
principle.
Least Squares
Assume that a statistical model for the observed data Y_i can be expressed as

    Y_i = f_i(θ₁, …, θ_p) + e_i,    [1.5]

where the θ_j (j = 1, …, p) are parameters of the mean function f_i(·), and the e_i are zero mean
random variables. The distribution of the e_i can depend on other parameters, but not on the
θ_j. LS is a semi-parametric principle in that only the mean and variance of the e_i, as well as
their covariances (correlations), are needed to derive estimates. The distribution of the e_i can
be otherwise unspecified. In fact, least squares can be motivated as a geometric rather than a
statistical principle (§4.2.1). The assertion found in many places that the e_i are Gaussian
random variables is not needed to derive the estimators of θ₁, …, θ_p.
Different flavors of the LS principle are distinguished according to the variances and
covariances of the error terms. In ordinary least squares (OLS) estimation the e_i are uncorrelated
and homoscedastic (Var[e_i] = σ²). Weighted least squares (WLS) assumes uncorrelated
errors but allows their variances to differ (Var[e_i] = σ_i²). Generalized least squares (GLS)
accommodates correlations among the errors, and estimated generalized least squares (EGLS)
allows these correlations to be unknown. There are other varieties of the least squares
principle, but these four are of primary concern in this text. If the mean function
f_i(θ₁, …, θ_p) is nonlinear in the parameters θ₁, …, θ_p, the respective methods are referred to
as nonlinear OLS, nonlinear WLS, and so forth (see §5).
The general philosophy of least squares estimation is most easily demonstrated for the
case of OLS. The principle seeks to find those values θ̂₁, …, θ̂_p that minimize the sum of
squares

    S(θ₁, …, θ_p) = Σᵢ₌₁ⁿ (y_i − f_i(θ₁, …, θ_p))² = Σᵢ₌₁ⁿ e_i².

This is typically accomplished by taking partial derivatives of S(θ₁, …, θ_p) with respect to the
θ_j and setting them to zero. The system of equations

    ∂S(θ₁, …, θ_p)/∂θ₁ = 0
    ⋮
    ∂S(θ₁, …, θ_p)/∂θ_p = 0

is known as the normal equations. For linear and nonlinear models a solution is best
described in terms of matrices and vectors; it is deferred until the necessary linear algebra tools
have been discussed in §3. The solutions θ̂₁, …, θ̂_p to this minimization problem are called
the ordinary least squares estimators (OLSE). The residual sum of squares SSR is obtained
by evaluating S(θ₁, …, θ_p) at the least squares estimate.
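For a linear mean function the normal equations can be solved directly. A short Python sketch (simulated straight-line data, invented values) makes the mechanics concrete:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 25)
y = 2.0 + 0.5 * x + rng.normal(0, 1, x.size)     # simulated straight-line data

# Normal equations X'X theta = X'y for the mean function f_i = theta1 + theta2*x_i
X = np.column_stack([np.ones_like(x), x])
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # ordinary least squares estimates

resid = y - X @ theta_hat
SSR = resid @ resid                              # S(theta) evaluated at theta_hat
print(theta_hat, SSR)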
Least squares estimators have many appealing properties. In linear models, for example,
• the θ̂_j are linear functions of the observations y₁, …, y_n, which makes it easy to establish
statistical properties of the θ̂_j and to test hypotheses about the unknown θ_j.
• The linear combination a₁θ̂₁ + … + a_pθ̂_p is the best linear unbiased estimator (BLUE)
of a₁θ₁ + … + a_pθ_p (Gauss-Markov theorem). If, in addition, the e_i are Gaussian-distributed,
then a₁θ̂₁ + … + a_pθ̂_p is the minimum variance unbiased estimator of
a₁θ₁ + … + a_pθ_p. No other unbiased estimator can beat its performance, linear or not.
• If the e_i are Gaussian, then two nested models can be compared with a sum of squares
reduction test. If SSR_f and SSR_r denote the residual sums of squares in the full and
reduced model, respectively, and MSR_f the residual mean square in the full model,
then the statistic

    F_obs = [(SSR_r − SSR_f)/q] / MSR_f,

where q denotes the number of restrictions that reduce the full to the reduced model, can
be compared against the cutoffs of an F distribution (see §1.4 and §1.6).
A downside of least squares estimation is its focus on parameters of the mean function
f_i(·). The principle does not lend itself to estimation of parameters associated with the distribution
of the errors e_i, for example, the variance of the model errors. In least squares estimation,
these parameters must be obtained by other principles.
Maximum Likelihood
Maximum likelihood is a parametric principle; it requires that the joint distribution of the
observations Y₁, …, Y_n is known except for the parameters to be estimated. For example, one
may assume that Y₁, …, Y_n follow a multivariate Gaussian distribution (§3.7) and base ML
estimation on this fact. If the observations are statistically independent, the joint density (Y
continuous) or mass (Y discrete) function is the product of the marginal distributions of the Y_i,
and the likelihood is calculated as the product of individual contributions, one for each
sample. Consider the case where

    Y_i = 1  if a seed germinates
          0  otherwise,
a binary response variable. If a random sample of n seeds is obtained from a seed lot, then the
probability mass function of y_i is

    p(y_i; π) = π^y_i (1 − π)^(1−y_i),

where the parameter π denotes the probability of germination in the seed lot. The joint mass
function of the random sample becomes

    p(y₁, …, y_n; π) = ∏ᵢ₌₁ⁿ π^y_i (1 − π)^(1−y_i) = π^r (1 − π)^(n−r),    [1.6]

with r = Σᵢ₌₁ⁿ y_i, the number of germinated seeds. For any given value π̃, the probability
p(y₁, …, y_n; π̃) can be thought of as the probability of observing the sample y₁, …, y_n if the
germination probability is π̃. The maximum likelihood principle estimates π by that value
which maximizes p(y₁, …, y_n; π), because this is the value most likely to have generated the
data.
Since p(y₁, …, y_n; π) is now considered a function of π for a given sample y₁, …, y_n, we
write L(π; y₁, …, y_n) for the function to be maximized and call it the likelihood function.
Whatever technical device is necessary, maximum likelihood estimators (MLEs) are found as
those values that maximize L(π; y₁, …, y_n) or, equivalently, maximize the log-likelihood
function ln{L(π; y₁, …, y_n)} = ℓ(π; y₁, …, y_n). Direct maximization is often possible. Such
is the case in the seed germination example. From [1.6] the log-likelihood is computed as
ℓ(π; y₁, …, y_n) = r ln{π} + (n − r) ln{1 − π}, and taking the derivative with respect to π
yields

    ∂ℓ(π; y₁, …, y_n)/∂π = r/π − (n − r)/(1 − π).

Setting this derivative to zero and solving for π produces the maximum likelihood estimate
π̂ = r/n, the proportion of germinated seeds in the sample.
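The closed-form solution is easily verified numerically. In the Python sketch below (simulated germination data with an invented true probability), direct maximization of the log-likelihood recovers r/n:

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)
y = rng.binomial(1, 0.8, size=50)        # 50 seeds; true germination prob. 0.8 (invented)
n, r = y.size, y.sum()

def negloglik(pi):
    # Negative of l(pi; y) = r*ln(pi) + (n - r)*ln(1 - pi)
    return -(r * np.log(pi) + (n - r) * np.log(1 - pi))

opt = minimize_scalar(negloglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(opt.x, r / n)                      # numeric maximizer agrees with r/n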
Maximum likelihood estimators have several appealing properties:
• If the data are Gaussian, MLEs of mean parameters are identical to least squares
estimates.
• MLEs are functionally invariant. If π̂ = ȳ is the MLE of π, then ln{π̂/(1 − π̂)} is the
MLE of ln{π/(1 − π)}, the logit of π, for example.
• MLEs are usually asymptotically efficient. With increasing sample size their distribution
tends to a Gaussian distribution; they are asymptotically unbiased and the most
efficient estimators.
On the downside we note that MLEs do not necessarily exist, are not necessarily unique,
and are often biased estimators. Variations of the likelihood idea of particular importance for
the discussion in this text are restricted maximum likelihood (REML, §7), quasi-likelihood
(QL, §8), and composite likelihood (CL, §9).
To compare two nested models in the least squares framework, the sum of squares
reduction test is a convenient and powerful device. It is intuitive in that a restriction imposed
on a statistical model necessarily results in an increase of the residual sum of squares.
Whether that increase is statistically significant can be assessed by comparing the F_obs statistic
against appropriate cutoff values or by calculating the p-value of the F_obs statistic under
the null distribution (see §1.6). An analogous device exists in the likelihood framework. If L_f
is the likelihood in a statistical model and L_r is the likelihood if the model is reduced
according to a restriction imposed on the full model, then L_r cannot exceed L_f. In the
discrete case where likelihoods have the interpretation of true probabilities, the ratio L_f/L_r
expresses how much more likely it is that the full model generated the data compared to the
reduced model. A similar interpretation applies in the case where Y is continuous, although
the likelihood ratio then does not measure a ratio of probabilities but a ratio of densities. If
the ratio is sufficiently large, then the reduced model should be rejected. For many important
cases, e.g., when the data are Gaussian-distributed, the distribution of L_f/L_r or a function
thereof is known. In general, the likelihood ratio statistic

    Λ = 2 ln{L_f/L_r} = 2{ℓ_f − ℓ_r}    [1.7]

has an asymptotic Chi-squared distribution with q degrees of freedom, where q equals the
number of restrictions imposed on the full model. In other words, q is equal to the number of
parameters in the full model minus the number of parameters in the reduced model. The
reduced model is rejected in favor of the full model at the α×100% significance level if Λ
exceeds χ²_{α,q}, the α right-tail probability cutoff of a χ²_q distribution. In cases where an exact
likelihood-ratio test is possible, it is preferred over the asymptotic test, which is exact only as
the sample size tends to infinity.
As in the sum of squares reduction test, the likelihood ratio test requires that the models
being compared are nested. In the seed germination example the full model

    p(y_i; π) = π^y_i (1 − π)^(1−y_i)

leaves unspecified the germination probability π. To test whether the germination probability
in the seed lot takes on a given value, π₀, the model reduces to

    p(y_i; π₀) = π₀^y_i (1 − π₀)^(1−y_i)

under the hypothesis H₀: π = π₀. The log-likelihood in the full model is evaluated at the
maximum likelihood estimate π̂ = r/n. There are no unknowns in the reduced model and its
log-likelihood is evaluated at the hypothesized value π = π₀. The two log-likelihoods become

    ℓ_f = ℓ(π̂; y₁, …, y_n) = r ln{r/n} + (n − r) ln{1 − r/n}
    ℓ_r = ℓ(π₀; y₁, …, y_n) = r ln{π₀} + (n − r) ln{1 − π₀}.

After some minor manipulations the likelihood-ratio test statistic can be written as

    Λ = 2{ℓ_f − ℓ_r} = 2r ln{r/(nπ₀)} + 2(n − r) ln{(n − r)/(n − nπ₀)}.

Note that nπ₀ is the expected number of seeds germinating if the null hypothesis is true.
Similarly, n − nπ₀ is the expected number of seeds that fail to germinate under H₀. The
quantities r and n − r are the observed numbers of seeds in the two categories. The likelihood-ratio
statistic in this case takes on the familiar form (Agresti 1990)

    Λ = 2 Σ (observed count) ln{observed count / expected count},

where the sum extends over all categories.
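A sketch of the computation in Python, with invented counts, shows how little is involved; the asymptotic p-value follows from the χ²₁ distribution:

import numpy as np
from scipy.stats import chi2

n, r, pi0 = 50, 43, 0.75                  # hypothetical counts; H0: pi = 0.75
obs = np.array([r, n - r])                # germinated, not germinated
exp = np.array([n * pi0, n * (1 - pi0)])  # expected counts under H0

Lambda = 2 * np.sum(obs * np.log(obs / exp))
print(Lambda, chi2.sf(Lambda, df=1))      # statistic and asymptotic p-value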
1.4 t-Tests in Terms of Statistical Models

Consider, for example, the statistic t_obs = β̂_j/ese(β̂_j) in a linear regression model, which
is the ratio of a parameter estimate and its estimated standard error; it is appropriate to
test the hypothesis H₀: β_j = 0. This test turns out to be equivalent to a comparison of two
models: the full model containing the regressor associated with β_j, and a reduced model from
which the regressor has been removed. The comparison of nested models lurks in other
procedures, too, which on the surface do not appear to have much in common with statistical
models. In this section we formulate the well-known one- and two-sample (pooled) t-tests in
terms of statistical models and show how the comparison of two nested models is equivalent
to the standard tests.
One-Sample t-Test
The one-sample t-test of the hypothesis that the mean μ of a population takes on a particular
value μ₀, H₀: μ = μ₀, is appropriate if the data are a random sample from a Gaussian
population with unknown mean μ and unknown variance σ². The general setup of the test as
discussed in an introductory statistics course is as follows. Let Y₁, …, Y_n denote a random
sample from a G(μ, σ²) distribution. To test H₀: μ = μ₀ against the alternative H₁: μ ≠ μ₀,
compare

    t_obs = √n |ȳ − μ₀| / s,

where s is the sample standard deviation, against the α/2 (right-tailed) cutoff of a t_{n−1}
distribution. If t_obs > t_{α/2,n−1}, reject H₀ at the α significance level. We note in passing that all
cutoff values in this text are understood as cutoffs for right-tailed probabilities, e.g.,
Pr(t_n > t_{α,n}) = α for a t random variable with n degrees of freedom.

First notice that a two-sided t-test is equivalent to a one-sided F-test where the critical
value is obtained as the α cutoff from an F distribution with one numerator and n − 1 denominator
degrees of freedom and the test statistic is the square of t_obs. An equivalent test of
H₀: μ = μ₀ against H₁: μ ≠ μ₀ thus rejects H₀ at the α×100% significance level if

    F_obs = t²_obs ≥ F_{α,1,n−1} = t²_{α/2,n−1}.
The statistical models reflecting the null and alternative hypotheses are

    H₀ true:  ①: Y_i = μ₀ + e_i,  e_i ~ G(0, σ²)
    H₁ true:  ②: Y_i = μ + e_i,   e_i ~ G(0, σ²).

Model ② is the full model because μ is not specified under the alternative, and by imposing
the constraint μ = μ₀, model ① is obtained from model ②. The two models are thus nested
and we can compare how well they fit the data by calculating their respective residual sums of
squares SSR = Σᵢ₌₁ⁿ (y_i − Ê[Y_i])². Here, Ê[Y_i] denotes the mean of Y_i evaluated at the least
squares estimates. Under the null hypothesis there are no parameters to estimate since μ₀ is a known
constant, so that SSR_① = Σᵢ₌₁ⁿ (y_i − μ₀)². Under the alternative the least squares estimate
of the unknown mean μ is the sample mean ȳ. The residual sum of squares thus takes on the
familiar form SSR_② = Σᵢ₌₁ⁿ (y_i − ȳ)². Some simple manipulations yield

    SSR_① = SSR_② + n(ȳ − μ₀)².
The residual mean square in the full model, MSR_f, is the sample variance
s² = (n − 1)⁻¹ Σᵢ₌₁ⁿ (y_i − ȳ)², and the sum of squares reduction test statistic becomes

    F_obs = n(ȳ − μ₀)²/s² = t²_obs.

The critical value for an α×100% level test is F_{α,1,n−1}, and the sum of squares reduction test
is thereby shown to be equivalent to the standard one-sample t-test.
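The equivalence is easy to verify numerically. In the Python sketch below (simulated Gaussian data, invented values), F_obs computed from the two residual sums of squares equals t²_obs, and both tests return the same p-value:

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = rng.normal(10.5, 2.0, size=15)        # random sample; test H0: mu = 10
mu0, n = 10.0, y.size

t_obs = np.sqrt(n) * (y.mean() - mu0) / y.std(ddof=1)
SSR_red = np.sum((y - mu0) ** 2)          # model 1: Y_i = mu0 + e_i
SSR_full = np.sum((y - y.mean()) ** 2)    # model 2: Y_i = mu + e_i
F_obs = (SSR_red - SSR_full) / (SSR_full / (n - 1))

print(t_obs ** 2, F_obs)                                               # identical
print(2 * stats.t.sf(abs(t_obs), n - 1), stats.f.sf(F_obs, 1, n - 1))  # same p-value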
A likelihood-ratio test comparing models ① and ② can also be developed. From the
probability density function of a G(μ, σ²) random variable, the log-likelihood of the random
sample y₁, …, y_n is

    ℓ(μ, σ²; y₁, …, y_n) = −(n/2)ln{2π} − (n/2)ln σ² − (1/(2σ²)) Σᵢ₌₁ⁿ (y_i − μ)².    [1.8]

In the full model where both μ and σ² are unknown, the respective MLEs are the solutions to
∂ℓ/∂μ = 0 and ∂ℓ/∂σ² = 0, namely μ̂ = ȳ and σ̂² = n⁻¹ Σᵢ₌₁ⁿ (y_i − ȳ)². Notice that the MLE of the error
variance is not the sample variance s². It is a biased estimator related to the sample variance
by σ̂² = s²(n − 1)/n. In the reduced model the mean is fixed at μ₀ and only σ² is a
parameter of the model. The MLE in model ① becomes σ̂₀² = n⁻¹ Σᵢ₌₁ⁿ (y_i − μ₀)². The
likelihood ratio test statistic is obtained by evaluating the log-likelihoods at μ̂, σ̂² in model ②
and at μ₀, σ̂₀² in model ①. Perhaps surprisingly, Λ reduces to

    Λ = n ln{σ̂₀²/σ̂²} = n ln{(σ̂² + (ȳ − μ₀)²)/σ̂²}.
The second expression uses the fact that σ̂₀² = σ̂² + (ȳ − μ₀)². If the sample mean is far from
the hypothesized value, the variance estimate in the reduced model will be considerably larger
than that in the full model. That is the case if μ is far removed from μ₀, because ȳ is an
unbiased estimator of the true mean μ. Consequently, we reject H₀: μ = μ₀ for large values
of Λ. Based on the fact that Λ has an approximate χ²₁ distribution, the decision rule can be
formulated to reject H₀ if Λ > χ²_{α,1}. However, we may be able to determine a function of the
data in which Λ increases monotonically. If this function has a known rather than an approximate
distribution, an exact test is possible. It is sufficient to concentrate on σ̂₀²/σ̂² to this end,
since Λ is increasing in σ̂₀²/σ̂². Writing

    σ̂₀²/σ̂² = 1 + (ȳ − μ₀)²/σ̂²

and using the fact that σ̂² = s²(n − 1)/n, we obtain

    σ̂₀²/σ̂² = 1 + (ȳ − μ₀)²/σ̂² = 1 + [n(ȳ − μ₀)²/s²]·(1/(n − 1)) = 1 + F_obs/(n − 1).

Instead of rejecting for large values of Λ we can also reject for large values of F_obs. Since the
distribution of F_obs under the null hypothesis is F_{1,n−1}, an exact test is possible, and this test
is the same as the sum of squares reduction test.
Pooled t-Test
In the two-sample case the hypothesis that two populations have the same mean, H₀: μ₁ = μ₂,
can be tested with the pooled t-test under the following assumptions. Y_1j, j = 1, …, n₁, are a
random sample from a G(μ₁, σ²) distribution and Y_2j, j = 1, …, n₂, are a random sample
from a G(μ₂, σ²) distribution, drawn independently of the first sample. The common variance
σ² is unknown and can be estimated as the pooled sample variance from which the procedure
derives its name:

    s_p² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2).

The procedure for testing H₀: μ₁ = μ₂ against H₁: μ₁ ≠ μ₂ is to compare the value of the test
statistic

    t_obs = |ȳ₁ − ȳ₂| / √(s_p²(1/n₁ + 1/n₂))

against the α/2 (right-tailed) cutoff of a t distribution with n₁ + n₂ − 2 degrees of freedom.
The two-sample problem can also be cast as a statistical model. Define a dummy variable z_ij
that takes the value 0 if observation j belongs to group 1 and the value 1 if it belongs to group
2, and write Y_ij = μ₁ + z_ij β + e_ij with e_ij ~ iid G(0, σ²).
The distributional assumptions for the errors reflect the independence of samples from a
group, among the groups, and the equal variance assumption.
For the two possible values of z_ij, the model can be expressed as

    Y_ij = μ₁ + e_ij       i = 1 (group 1)
           μ₁ + β + e_ij   i = 2 (group 2).

The parameter β measures the difference between the means in the two groups,
β = E[Y_2j] − E[Y_1j] = μ₂ − μ₁. The hypothesis H₀: μ₁ − μ₂ = 0 is the same as H₀: β = 0.
The reduced and full models to be compared are

    H₀ true:  ①: Y_ij = μ₁ + e_ij,           e_ij ~ G(0, σ²)
    H₁ true:  ②: Y_ij = μ₁ + z_ij β + e_ij,  e_ij ~ G(0, σ²).

It is a nice exercise to derive the least squares estimators of μ₁ and β in the full and reduced
models and to calculate the residual sums of squares from them. Briefly, for the full model, one
obtains μ̂₁ = ȳ₁, β̂ = ȳ₂ − ȳ₁, SSR_② = (n₁ − 1)s₁² + (n₂ − 1)s₂², and MSR_② = s_p².
In the reduced model the least squares estimate of μ₁ becomes μ̂₁ =
(n₁ȳ₁ + n₂ȳ₂)/(n₁ + n₂) and

    SSR_① = SSR_② + [n₁n₂/(n₁ + n₂)](ȳ₁ − ȳ₂)².

The test statistic for the sum of squares reduction test,

    F_obs = [(SSR_① − SSR_②)/1] / MSR_② = (ȳ₁ − ȳ₂)² / [s_p²(1/n₁ + 1/n₂)],

is again the square of t_obs, so that the sum of squares reduction test is equivalent to the
pooled t-test.
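Once more the equivalence can be checked numerically; a Python sketch with simulated samples (invented group means):

import numpy as np

rng = np.random.default_rng(5)
y1 = rng.normal(12.0, 2.0, size=10)       # group 1
y2 = rng.normal(14.0, 2.0, size=12)       # group 2
n1, n2 = y1.size, y2.size

sp2 = ((n1 - 1) * y1.var(ddof=1) + (n2 - 1) * y2.var(ddof=1)) / (n1 + n2 - 2)
t_obs = (y1.mean() - y2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

SSR_full = (n1 - 1) * y1.var(ddof=1) + (n2 - 1) * y2.var(ddof=1)
y = np.concatenate([y1, y2])
SSR_red = np.sum((y - y.mean()) ** 2)     # reduced (single-mean) model
F_obs = (SSR_red - SSR_full) / (SSR_full / (n1 + n2 - 2))

print(t_obs ** 2, F_obs)                  # equal up to rounding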
1.5 Embedding Hypotheses

The idea of the sum of squares reduction test is intuitive. Impose a restriction on a model and
determine whether the resulting increase in an uncertainty measure is statistically significant.
If the change in the residual sum of squares is significant, we conclude that the restriction
does not hold. One could call the procedure a sum of squares increment test, but we prefer to
view it in terms of the reduction that is observed when the restriction is lifted from the
reduced model. The restriction is the null hypothesis, and its rejection leads to the rejection of
the reduced model. It is advantageous to formulate statistical models so that hypotheses of
interest can be tested through comparisons of nested models. We then say that the hypotheses
of interest can be embedded in the model.
As an example, consider the comparison of two simple linear regression lines, one for
each of two groups (control and treated group, for example). The possible scenarios are (i) the
same trend in both groups, (ii) different intercepts but the same slopes, (iii) same intercepts
but different slopes, and (iv) different intercepts and different slopes (Figure 1.2). Which of
the four scenarios best describes the mechanism that generated the observed data can be de-
termined by specifying a full model representing case (iv) in which the other three scenarios
are nested. We choose case (iv) as the full model because it has the most unknowns. Let Y_ij
denote the jth observation from group i (i = 1, 2) and define a dummy variable

    z_ij = 1  if observation from group 1
           0  if observation from group 2.

The full model representing scenario (iv) is then

    Y_ij = β₀ + β₁z_ij + β₂x_ij + β₃x_ij z_ij + e_ij,

where x_ij is the value of the continuous regressor for observation j from group i. To see how
the dummy variable z_ij creates two separate trends in the groups, consider

    E[Y_ij | z_ij = 1] = β₀ + β₁ + (β₂ + β₃)x_ij
    E[Y_ij | z_ij = 0] = β₀ + β₂x_ij.
The intercepts are β₀ + β₁ in group 1 and β₀ in group 2. Similarly, the slopes are β₂ + β₃
in group 1 and β₂ in group 2. The restrictions (null hypotheses) that reduce the full model to
the other three cases are

    (i): H₀: β₁ = β₃ = 0    (ii): H₀: β₃ = 0    (iii): H₀: β₁ = 0.

Notice that the term x_ij z_ij has the form of an interaction between the regressor and the
dummy variable that identifies group membership. If β₃ = 0, the lines are parallel. This is the
very meaning of the absence of an interaction: the comparison of groups no longer depends
on the value of x.
The hypotheses can be tested by fitting the full and the three reduced models and per-
forming the sum of squares reduction tests. For linear models, the results of reduction tests in-
volving only one regressor or effect are given by standard regression packages, and a good
package is capable of testing more complicated constraints such as (i) based on a fit of the full
model only (§4.2.3).
Figure 1.2. Expected values E[Y_ij] under the four scenarios: (i) same trend in both groups;
(ii) different intercepts, same slopes; (iii) same intercepts, different slopes; (iv) different
intercepts and different slopes.
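The reduction tests for these scenarios are straightforward in any regression package; the Python sketch below (simulated data, invented coefficients) tests the parallel-lines restriction (ii), H₀: β₃ = 0, by comparing the full model with the model without the interaction term:

import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
x = np.tile(np.linspace(0, 9, 10), 2)          # continuous regressor
z = np.repeat([1.0, 0.0], 10)                  # dummy: group 1 vs. group 2
y = 2 + 1.5 * z + 0.6 * x + 0.4 * x * z + rng.normal(0, 1, x.size)

X_full = np.column_stack([np.ones(x.size), z, x, x * z])   # scenario (iv)
X_red = X_full[:, :3]                                      # beta3 = 0: scenario (ii)

def ssr(X, y):
    # Residual sum of squares of the least squares fit
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

q, df_full = 1, x.size - 4
F_obs = ((ssr(X_red, y) - ssr(X_full, y)) / q) / (ssr(X_full, y) / df_full)
print(F_obs, stats.f.sf(F_obs, q, df_full))    # test of beta3 = 0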
Many statistical models can be expressed in alternative ways, and this can change the
formulation of the hypothesis. Consider a completely randomized experiment with r replications
of t treatments. The linear statistical model for this experiment can be written in at least
two ways, known as the means and effects models (§4.3.1),

    Y_ij = μ_i + e_ij  (means model)
    Y_ij = μ + τ_i + e_ij  (effects model).

The treatment effects τ_i are simply μ_i − μ, and μ is the average of the treatment means μ_i.
Under the hypothesis of equal treatment means, H₀: μ₁ = μ₂ = … = μ_t, the means model
reduces to Y_ij = μ + e_ij, where μ is the unknown mean common to all treatments. The equivalent
hypothesis in the effects model is H₀: τ₁ = τ₂ = … = τ_t. Since Σᵢ₌₁ᵗ τ_i = 0 by construction,
one can also state the hypothesis as H₀: all τ_i = 0. Notice that the two-sample t-test
problem in §1.4 is a special case of this problem with t = 2, r = n₁ = n₂.
In particular for nonlinear models, it may not be obvious how to embed a hypothesis in a
model. This is the case when the model is not expressed in terms of the quantities of interest.
Recall the Mitscherlich yield equation

    E[Y] = λ + (ξ − λ)exp{−κx},    [1.9]

where λ is the upper yield asymptote, ξ is the yield at x = 0, and κ governs the rate of
change. Imagine that x is the amount of a nutrient applied and we are interested in estimating
and testing hypotheses about the amount of the nutrient already in the soil. Call this parameter
α. Black (1993, p. 273) terms α the availability index of the nutrient in the soil. It
turns out that α is related to the three parameters in [1.9],

    α = −ln{(λ − ξ)/λ}/κ.

Once estimates of λ, ξ, and κ have been obtained, this quantity can be estimated by plugging
in the estimates. The standard error of this estimate of α will be very difficult to obtain owing
to the nonlinearity of the relationship. Furthermore, to test the restriction that α = 20, for
example, requires fitting a reduced model in which −ln{(λ − ξ)/λ}/κ = 20. This is not a
well-defined problem.
To enable estimation and testing of hypotheses in nonlinear models, the model should be
rewritten to contain the quantities of interest. This process, termed reparameterization
(§5.7), yields for the Mitscherlich equation

    E[Y] = λ(1 − exp{−κ(x + α)}).    [1.10]

The parameter ξ in model [1.9] was replaced by λ(1 − exp{−κα}), and after collecting terms,
one arrives at [1.10]. The sum of squares decomposition, residuals, and fit statistics are
identical when the two models are fit to data. Testing the hypothesis α = 20 is now
straightforward. Obtain the estimate of α and its estimated standard error from a statistical
package (see §§5.4, 5.8.1) and calculate a confidence interval for α. If it does not contain the
value 20, reject H₀: α = 20. Alternatively, fit the reduced model E[Y] =
λ(1 − exp{−κ(x + 20)}) and perform a sum of squares reduction test.
1.6 Hypothesis and Significance Testing — Interpretation of the p-Value

A distinction is made in statistical theory between hypothesis and significance testing. The
former relies on comparing the observed value of a test statistic with a critical value and
rejecting the null hypothesis if the observed value is more extreme than the critical value. Most
statistical computing packages apply the significance testing approach because it does not
involve critical values. In order to derive a critical value, one must first decide on the Type-I
error probability α of rejecting a null hypothesis that is true. The significance approach relies on
calculating the p-value of a test statistic, the probability of obtaining a value of the test statistic at
least as extreme as the observed one, provided that the null hypothesis is true. The connection
between the two approaches lies in the Type-I error rate α. If one rejects the null hypothesis
when the p-value is less than α, and fails to reject otherwise, significance and hypothesis
testing lead to the same decisions. Statistical tests done by hand are almost always performed
as hypothesis tests, and the results of tests carried out with computers are usually reported as
p-values. We will not make a formal distinction between the two approaches here and note
that p-values are more informative than decisions based on critical values. To attach *, **,
***, or some similar notation to the results of tests that are significant at the α = 0.05, 0.01, and
0.001 levels is commonplace but arbitrary. When the p-value is reported, each reader can draw
his/her own conclusion about the fate of the null hypothesis.
Even if results are reported with notations such as *, **, *** or by attaching lettering to
an ordered list of treatment means, these displays are often obtained by converting p-values
from statistical output. The ubiquitous p-values are probably the most misunderstood and
misinterpreted quantities in applied statistical work. To draw correct conclusions from the output
of statistical packages it is imperative to interpret them properly. Common misconceptions are
that (i) the p-value measures an error probability for the rejection of the hypothesis, (ii) the
p-value measures the probability that the null hypothesis is true, and (iii) small p-values imply that
the alternative hypothesis is correct. To rectify these misconceptions we briefly discuss the
rationale of hypothesis testing from a probably unfamiliar angle, Monte-Carlo testing, and
demonstrate the calculation of p-values with a spatial point pattern example.
The frequentist approach to measuring model-data agreement is based on the notion of
comparing a model against different data sets. Assume a particular model holds (is true) for
the time being. In the test of two nested models we assume that the restriction imposed on the
full model holds and we accept the reduced model unless we can find evidence to the
contrary. That is, we are working under the assumption that the null hypothesis holds until it
is rejected. Because there is uncertainty in the outcome of the experiment we do not expect
the observed data and the postulated model to agree perfectly. But if chance is the only expla-
nation for the disagreement between data and model, there is no reason to reject the model
from a statistical point of view. It may fit the data poorly because of large variability in the
data, but it remains correct on average.
The problem, of course, is that we observe only one experimental outcome and do not see
other sets of data that have been generated by the model under investigation. If that were the
case we could devise the following test procedure. Calculate a test statistic from the observed
data. Generate all possible data sets consistent with the null hypothesis if this number is finite
or generate a sufficiently large number of data sets if there are infinitely many experimental
outcomes. Denote the number of data sets so generated by k. Calculate the test statistic in
each of the k realizations. Since we assume the null hypothesis to be true, the value of the test
statistic calculated from the observed data is added to the test statistics calculated from the
generated data sets and the k + 1 values are ranked. If the data were generated by a
mechanism that does not agree with the model under investigation, the observed value of the
test statistic should be unusual, and its rank should be extreme. At this point we need to
invoke a decision rule according to which values of the test statistic are deemed sufficiently
rare or unusual to reject H₀. If the observed value is among those values considered rare
enough to reject H₀, this is the decision that follows. The critical rank is a measure of the
acceptability of the model (McPherson 1990). The decision rule cannot be the attained
rank of a test statistic alone, for example, "reject H₀ if the observed test statistic ranks fifth." If k is
large, the probability of a particular value can be very small. As k tends to infinity, the
probability of observing a particular rank tends to zero. Instead we define cases deemed
inconsistent with H₀ by a range of ranks. Outcomes at least as extreme as the critical rank
lead to the rejection of H₀. This approach of testing hypotheses is known under several
names. In the design of experiments it is termed the randomization approach. If the number
of possible data sets under H₀ is finite, it is also known as permutation testing. If a random
sample of the possible data sets is drawn, it is referred to as Monte-Carlo testing (see, e.g.,
Kempthorne 1952, 1955; Kempthorne and Doerfler 1969; Rubinstein 1981; Diggle 1983;
Hinkelmann and Kempthorne 1994; our §A4.8.2 and §9.7.3).
The reader is most likely familiar with procedures that calculate an observed value of a
test statistic and then (i) compare the value against a cutoff from a tabulated probability distri-
bution or (ii) calculate the p-value of the test statistic. The procedure based on generating data
sets under the null hypothesis as outlined above is no different from this classical approach.
The distribution table from which cutoffs are obtained, or the distribution from which p-values
are calculated, reflects the probability distribution of the test statistic if the null hypothesis is
true. The list of k test statistics obtained from data sets generated under the null hypothesis
also reflects the distribution of the test statistic under H₀. The critical value (cutoff) in a test
corresponds to the critical rank in the permutation/Monte-Carlo/randomization procedure.
To illustrate the calculation of $p$-values and the Monte Carlo approach, we consider the
data shown in Figure 1.3, which might represent the locations on a field at which 150 weeds
emerged. Section 9.7 provides an in-depth discussion of spatial point patterns and their
analysis. We wish to test whether a process that places weeds completely at random
(uniformly and independently) in the field could give rise to the distribution shown in Figure
1.3, or whether the data-generating process is clustered. In a clustered process, events exhibit
more grouping than in a spatially completely random process. If we calculate the average
distance between an event and its nearest neighbor, we expect this distance to be smaller in a
clustered process. For the observed pattern (Figure 1.3) the average nearest-neighbor distance
is $\bar{y} = 0.03964$.
[Figure 1.3: map of the 150 observed weed locations; axes are the field's X- and Y-coordinates on (0, 1).]
[Figure 1.4: histogram and density estimate of the 201 average nearest-neighbor distances (roughly 0.035 to 0.046); the observed average nearest-neighbor distance is marked, with n = 27 values to its left and n = 173 to its right.]
Two hundred point patterns were then generated from a process that is completely
spatially random: events are placed uniformly and independently of each other. This
process represents the stochastic model consistent with the null hypothesis. Each of the
$k = 200$ simulated patterns has the same bounding box as the observed pattern and the same
number of points (150). Figure 1.4 shows the histogram of the $k + 1 = 201$ average nearest-
neighbor distances. Note that the observed distance ($\bar{y} = 0.03964$) is part of the histogram.
The line in the figure is a nonparametric estimate of the probability density of the average
nearest-neighbor statistic. This density is only an estimate of the true distribution function of
the test statistic, since infinitely many realizations are possible under the null hypothesis.
The observed statistic is the 28th smallest among the 201 values. Under the alternative
hypothesis of a clustered process, the average nearest-neighbor distance should be smaller
than under the null hypothesis: plants that appear in clusters are closer to each other on
average than plants distributed completely at random. Small average nearest-neighbor dis-
tances are thus extreme under the null hypothesis. For a 5% significance test the critical rank
would be $201 \times 0.05 = 10.05$. If the observed value ranks 10th or lower, $H_0$ is rejected. This is not the case,
and we fail to reject the hypothesis that the plants are distributed completely at random. There
is insufficient evidence at the 5% significance level to conclude that a clustered process gives
rise to the spatial distribution in Figure 1.3. The $p$-value is calculated as the proportion of
values at least as extreme as the observed value, hence $p = 28/201 = 0.139$. Had we tested
whether the observed point distribution is more regular than a completely random pattern,
large average nearest-neighbor distances would be consistent with the alternative hypothesis
and the $p$-value would be $174/201 = 0.866$.
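The mechanics of this Monte Carlo test are easy to reproduce. The following is a minimal sketch (our illustration in Python with numpy; the function names and the synthetic data are assumptions, not part of the original analysis). It computes the average nearest-neighbor distance, simulates $k$ completely spatially random patterns with the same number of events, and ranks the observed statistic among the simulated values:

    import numpy as np

    rng = np.random.default_rng(42)

    def mean_nn_distance(pts):
        # Average distance from each event to its nearest neighbor.
        d = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
        np.fill_diagonal(d, np.inf)
        return d.min(axis=1).mean()

    def monte_carlo_p(observed_pts, k=200):
        # One-sided test against clustering: small distances are extreme.
        n = len(observed_pts)
        y_obs = mean_nn_distance(observed_pts)
        # k patterns under H0: uniform, independent placement.
        sims = [mean_nn_distance(rng.random((n, 2))) for _ in range(k)]
        stats = np.array(sims + [y_obs])   # k + 1 values in total
        return (stats <= y_obs).sum() / (k + 1)

    # Hypothetical data: 150 points; here H0 is true by construction.
    pts = rng.random((150, 2))
    print(monte_carlo_p(pts))

Because the observed statistic is included among the $k+1$ values, the smallest attainable $p$-value is $1/(k+1)$, which is why $k$ controls the resolution of the test.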
Because the null hypothesis is true when the $k = 200$ patterns are generated, and because
for the purpose of the statistical decision the observed value is judged as if the null hypothesis
were true, the $p$-value is not a measure of the probability that $H_0$ is wrong. The $p$-value is a
conditional probability under the assumption that $H_0$ is correct. As this probability becomes
smaller and smaller, we will eventually distrust the condition.
In a test that is not based on Monte Carlo arguments, the estimated density in Figure 1.4
is replaced by the exact probability density function of the test statistic. This distribution can
be obtained by complete enumeration in randomization or permutation tests, by deriving the
distribution of a test statistic from first principles, or by approximation. In the $t$-test examples
of §1.4, the distribution of $F_{obs}$ under $H_0$ is known to be that of an $F$ random variable and
$t_{obs}$ is known to be distributed as a $t$ random variable. The likelihood ratio test statistic [1.7] is
approximately distributed as a Chi-squared random variable.
In practice, rejection of the null hypothesis is tantamount to acceptance of the alterna-
tive hypothesis. Implicit in this step is the assumption that if an outcome is extreme under the
null hypothesis, it is not extreme under the alternative. Consider the case of a simple null and
a simple alternative hypothesis, e.g., $H_0: \mu = \mu_0$, $H_1: \mu = \mu_1$. If $\mu_1$ and $\mu_0$ are close, an experimental
outcome extreme under $H_0$ is also extreme under $H_1$. Rejection of $H_0$ should then not prompt
the acceptance of $H_1$. Such a test will most likely have a large probability of a Type II error,
i.e., of failing to reject an incorrect null hypothesis (low power). These situations can be avoided by
controlling the Type II error of the test, which ultimately implies collecting samples of
sufficient size (sufficient to achieve a desired power for a stipulated difference $\mu_1 - \mu_0$).
Finally, we remind the reader of the difference between statistical and practical significance.
Just because statistical significance is attached to a result does not imply that the result is meaning-
ful from a practical point of view. If it takes $n = 5{,}000$ samples to detect a significant
difference between two treatments, their actual difference is probably so small that hardly
anyone will be interested in knowing it.
where $x_{0i}, \ldots, x_{ki}$ are measured variables and $\theta_0, \theta_1, \ldots, \theta_p$ are parameters. The response is
typically denoted $Y$, and an appropriate number of subscripts must be added to associate a
single response with the parts of the model structure. If a single subscript is sufficient, the
basic component equation of a statistical model becomes

$$Y_i = f(x_{0i}, x_{1i}, x_{2i}, \ldots, x_{ki}, \theta_0, \theta_1, \ldots, \theta_p) + e_i.  [1.11]$$

The specification of the component equation is not complete without the means,
variances, and covariances of all random variables involved and, if possible, their distribution
laws. If these are unknown, they add additional parameters to the model. The assumption that
the user's model is correct is reflected in the zero mean assumption of the errors ($\mathrm{E}[e_i] = 0$).
Since then $\mathrm{E}[Y_i] = f(x_{0i}, x_{1i}, x_{2i}, \ldots, x_{ki}, \theta_0, \theta_1, \ldots, \theta_p)$, the function $f(\cdot)$ is often called the
mean function of the model.
The process of fitting the model to the observed responses involves estimation of the un-
known quantities in the systematic part and the parameters of the error distribution. Once the
parameters are estimated, the fitted values can be calculated:

$$\hat{Y}_i = f(x_{0i}, x_{1i}, x_{2i}, \ldots, x_{ki}, \hat{\theta}_0, \hat{\theta}_1, \ldots, \hat{\theta}_p) = \hat{f}_i.$$

A caret placed over a symbol denotes an estimated quantity. Fitted values are calculated for
the observed values of $x_{0i}, \ldots, x_{ki}$. Values calculated for any combination of the $x$ variables,
whether part of the data set or not, are termed predicted values. It is usually assumed that the
fitted residual $\hat{e}_i$ is an estimate of the unobservable model error $e_i$, which justifies model
diagnostics based on residuals. But unless the fitted values estimate the systematic part of the
model without bias and the model is correct ($\mathrm{E}[e_i] = 0$), the fitted residuals will not even
have a zero mean. And the fitted values $\hat{f}_i$ may be biased estimators of $f_i$ even if the model
is correctly specified. This is common when $f$ is a nonlinear function of the parameters.
The measured variables contributing to the systematic part of the model are termed here
covariates. In regression applications, they are also referred to as regressors or independent
variables, while in analysis of variance models the term covariate is sometimes reserved for
those variables which are measured on a continuous scale. The term independent variable
should be avoided, since it is not clear what the variable is independent of. The label is popu-
lar though, since the response is often referred to as the dependent variable. In many
regression models the covariates are in fact very highly dependent on each other; therefore,
the term independent variable is misleading. Covariates that can take on only two values, 0 or
1, are also called design variables, dummy variables, or binary variables. They are typical in
analysis of variance models. In observational studies (see §2.3) covariates are also called ex-
planatory variables. We prefer the term covariate to encompass all of the above. The precise
nature of a covariate will be clear from context.
The distinction between linear and nonlinear models is often obstructed by references to
graphs of the predicted values. If a graph of the predicted values appears to have curvature,
the underlying statistical model may still be linear. The polynomial
$$Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + e_i$$

is a linear model, but when $\hat{y}$ is graphed vs. $x$, the predicted values exhibit curvature. The
acid test for linearity is as follows: if the derivatives of the model's systematic part with
respect to the parameters do not depend on any of the parameters, the model is linear. Other-
wise, the model is nonlinear. For example,

$$Y_i = \beta_0 + \beta_1 x_i + e_i$$

has a linear mean function, since

$$\partial(\beta_0 + \beta_1 x_i)/\partial\beta_0 = 1$$
$$\partial(\beta_0 + \beta_1 x_i)/\partial\beta_1 = x_i$$

and neither of the derivatives depends on any parameters. The quadratic polynomial
$Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + e_i$ is also a linear model, since

$$\partial(\beta_0 + \beta_1 x_i + \beta_2 x_i^2)/\partial\beta_0 = 1$$
$$\partial(\beta_0 + \beta_1 x_i + \beta_2 x_i^2)/\partial\beta_1 = x_i$$
$$\partial(\beta_0 + \beta_1 x_i + \beta_2 x_i^2)/\partial\beta_2 = x_i^2.$$

Linear models with curved mean function are termed curvilinear. The model

$$Y_i = \beta_0\left(1 - e^{-\beta_1 x_i}\right) + e_i,$$

in contrast, is nonlinear, since the derivative of its mean function with respect to $\beta_1$,
$\beta_0 x_i e^{-\beta_1 x_i}$, depends on the parameters.
If a model is nonlinear in at least one parameter, the entire model is considered nonlinear.
Linearity refers to linearity in the parameters, not the covariates. Transformations of the
covariates such as $e^x$, $\ln(x)$, $1/x$, $\sqrt{x}$ do not change the linearity of the model, although they
will affect the degree of curvature seen in a plot of $y$ against $x$. Polynomial models which
raise a covariate to successively increasing powers are always linear models.
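The acid test lends itself to symbolic computation. Here is a small sketch (our illustration; it assumes Python with the sympy package) that differentiates a mean function with respect to each parameter and reports whether any parameter survives in the derivatives:

    import sympy as sp

    x, b0, b1, b2 = sp.symbols("x b0 b1 b2")

    def is_linear(mean_fn, params):
        # Linear if no derivative w.r.t. a parameter contains a parameter.
        return all(
            sp.diff(mean_fn, p).free_symbols.isdisjoint(params)
            for p in params
        )

    quadratic = b0 + b1 * x + b2 * x**2
    growth = b0 * (1 - sp.exp(-b1 * x))

    print(is_linear(quadratic, {b0, b1, b2}))  # True  -> linear
    print(is_linear(growth, {b0, b1}))         # False -> nonlinear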
where $\mu_i$ is the mean yield if the $i$th level of the growth regulator is applied. The double sub-
script is used to emphasize that multiple observations can share the same growth regulator
level (replications). This model can be expanded using a series of dummy covariates,
$z_{1j}, \ldots, z_{4j}$, say. Let $z_{ij}$ take on the value 1 if the $j$th observation received the $i$th level of the
growth regulator, and 0 otherwise. The expanded ANOVA model then becomes

$$Y_{ij} = \mu_i + e_{ij} = \mu_1 z_{1j} + \mu_2 z_{2j} + \mu_3 z_{3j} + \mu_4 z_{4j} + e_{ij}.$$
In this form, the ANOVA model is a multiple regression model with four covariates and no
intercept.
The role of the dummy covariates in ANOVA models is to select the parameters (effects)
associated with a particular response. The relationship between plot yield and growth regula-
tion is described more parsimoniously in the regression model, which contains only two para-
meters, the intercept $\theta_0$ and the slope $\theta_1$. The ANOVA model allots four parameters
$(\mu_1, \ldots, \mu_4)$ to describe the systematic part of the model. The downside of the regression
model is that if the relationship between $y$ and $x$ is not linear, the model will not apply and
inferences based on the model may be incorrect.
ANOVA and linear regression models can be cast in the same framework; they are both
linear statistical models. Classification models may contain continuous covariates in addition
to design variables, and regression models may contain binary covariates in addition to
continuous ones. An example of the first type of model arises when adjustments are made for
known systematic differences in initial conditions among experimental units to which the
treatments are applied. Assume, for example, that the soils of the plots on which growth
regulators were applied had different lime requirements. Let $u_{ij}$ be the lime requirement of
the $j$th plot receiving the $i$th rate of the regulator; then the systematic effect of lime require-
ment on plot yield can be accounted for by incorporating $u_{ij}$ as a continuous covariate in the
classification model:

$$Y_{ij} = \mu_i + \theta u_{ij} + e_{ij}.  [1.12]$$
The presence of these interactions is easily tested with a sum of squares reduction test, since
the previous two models are nested ($H_0: \theta_1 = \theta_2 = \theta_3 = \theta_4$).
The same problem can be approached from a regression standpoint. Consider the initial
model, $Y_{ij} = \theta_0 + \theta u_{ij} + e_{ij}$, linking plot yield to lime requirement. Because a distinct
number of rates was applied to the plots, the simple linear regression can be extended to
accommodate separate intercepts for the growth regulators. Replace the common intercept $\theta_0$
by

$$\mu_i = \mu_1 z_{1j} + \mu_2 z_{2j} + \mu_3 z_{3j} + \mu_4 z_{4j},$$
and model [1.12] results. Whether a regression model is enlarged to accommodate a classifi-
cation variable or a classification model is enlarged to accommodate a continuous covariate,
the same models result.
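To make the equivalence concrete, the sketch below (our illustration, assuming Python with numpy; the yields and lime requirements are invented) fits model [1.12] as a multiple regression with four dummy covariates and the continuous covariate $u_{ij}$:

    import numpy as np

    # Hypothetical data: 4 growth-regulator levels, 3 replicates each.
    level = np.repeat([0, 1, 2, 3], 3)
    u = np.array([2.1, 1.8, 2.5, 3.0, 2.7, 3.2,
                  1.5, 1.9, 2.2, 2.8, 2.4, 2.0])   # lime requirements
    y = np.array([5.2, 4.9, 5.6, 6.1, 5.8, 6.4,
                  4.3, 4.7, 5.0, 6.6, 6.2, 5.9])   # plot yields

    # Dummy covariates z_ij select mu_i; u enters as a continuous covariate.
    Z = (level[:, None] == np.arange(4)).astype(float)   # (12 x 4)
    X = np.hstack([Z, u[:, None]])                       # mu_1..mu_4, theta

    est, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("mu:", est[:4], "theta:", est[4])

Whether one starts from the classification model or from the regression model, the design matrix is the same, so the least squares fit is too.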
Most experiments produce more than just a single response. Statistical models that model one
response independently of other experimental outcomes are called univariate models, where-
as multivariate models simultaneously model several response variables. Models with more
than one covariate are sometimes incorrectly termed multivariate models. The multiple linear
regression model $Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + e_i$ is a univariate model.
The advantage of multivariate over univariate models is that multivariate models incorpo-
rate the relationships between experimental outcomes into the analysis. This is particularly
meaningful if the multivariate responses are observations of the same attribute at different
locations or time points. When data are collected as longitudinal, repeated measures, or
spatial data, the temporal and spatial dependencies among the observations must be taken into
account (§2.5). In a repeated measures study, for example, this requires modeling the obser-
vations jointly, rather than through separate analyses by time points. By separately analyzing
the outcomes by year in a multi-year study, little insight is gained into the time-dependency of
the system.
Multivariate responses in this text are confined to the special case where the same
response variable is measured repeatedly, that is, longitudinal, repeated measures, and spatial
data. Developing statistical models for such data (§§ 7, 8, 9) requires a good understanding of
the notion and consequences of clustering in data (discussed in §2.4 and §7.1).
• The distinction of fixed and random effects applies to the unknown model
components:
— a fixed effect is an unknown constant (does not vary),
— a random effect is a random variable.
• Fixed effects model: All effects are fixed (apart from the error)
• Random effects model: All effects are random (apart from intercept)
• Mixed effects model: Some effects are fixed, others are random (not count-
ing an intercept and the model error)
The distinction between fixed, random, and mixed effects models is not related to the nature
of the covariates, but the unknown quantities of the statistical model. In this text we assume
that covariate values are not associated with error. A fixed effects model contains only
constants in its systematic part and one random variable (the error term). The variance of the
error term measures residual variability. Most traditional regression models are of this type.
In designed experiments, fixed effects models arise when the levels of the treatments are
chosen deliberately by the researcher as the only levels of interest. A fixed effects model for a
randomized complete block design with one treatment factor implies that the blocks are pre-
determined as well as the factor levels.
A random effects model consists of random variables only, apart from a possible grand
mean. These arise when multiple random processes are in operation. Consider sampling two
hundred bags of seeds from a large seed lot. Fifty laboratories are randomly selected from a
list of laboratories to receive four bags each for analysis of germination percentage and seed
purity. Upon repetition of this experiment a different set of laboratories would be
selected to receive different bags of seeds. Two random processes are at work. One source of
variability is due to selecting laboratories at random from the population of all possible
laboratories that could have performed the analysis. A second source of variability stems
from randomly determining which particular four bags of seeds are sent to a laboratory. This
variability is a measure of the heterogeneity within the seed lot, and the first source repre-
sents variability among laboratories. If the two random processes are independent, the
variance of a single germination test result is the sum of two variance components,

$$\mathrm{Var}[Y_{ij}] = \sigma_l^2 + \sigma_b^2.$$

Here, $\sigma_l^2$ measures lab-to-lab variability and $\sigma_b^2$ variability in test results within a lab (seed lot
heterogeneity). A statistical model for this experiment is

$$Y_{ij} = \mu + \alpha_i + e_{ij}, \quad i = 1, \ldots, 50; \; j = 1, \ldots, 4,$$

where $\alpha_i$ is a random variable with mean 0 and variance $\sigma_l^2$, $e_{ij}$ is a random variable (inde-
pendent of the $\alpha_i$) with mean 0 and variance $\sigma_b^2$, and $Y_{ij}$ is the germination percentage repor-
ted by the $i$th lab for the $j$th bag. The grand mean is expressed by $\mu$, the true germination per-
centage of the lot. A fixed grand mean should always be included in random effects models
unless the response has zero average.
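The two variance components are easy to visualize in a simulation. The sketch below (our illustration, assuming Python with numpy; the numerical values are invented) generates many repetitions of the experiment and compares the empirical variance of a test result with $\sigma_l^2 + \sigma_b^2$:

    import numpy as np

    rng = np.random.default_rng(1)
    mu, var_lab, var_bag = 85.0, 4.0, 2.5    # hypothetical values
    labs, bags, reps = 50, 4, 2000

    alpha = rng.normal(0, np.sqrt(var_lab), size=(reps, labs, 1))
    e = rng.normal(0, np.sqrt(var_bag), size=(reps, labs, bags))
    y = mu + alpha + e                        # Y_ij = mu + alpha_i + e_ij

    print(y.var())   # close to var_lab + var_bag = 6.5
    # Two bags analyzed by the same lab share alpha_i and are correlated:
    print(np.corrcoef(y[:, 0, 0], y[:, 0, 1])[0, 1])  # close to 4.0/6.5

The second printout foreshadows a point taken up in §2.4: nested random effects induce correlations among responses that share a random effect.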
Mixed effects models arise when some of the model components are fixed, while others
are random. A mixed model contains at least two random variables (counting the model errors
$e$) and two unknown constants in the systematic part (counting the grand mean). Mixed
models can be found in multifactor experiments where levels of some factors are predeter-
mined while levels of other factors are chosen at random. If two levels of water stress (irrigat-
ed, not irrigated) are combined with six genotypes selected from a list of 30 possible geno-
types, a two-factor mixed model results:

$$Y_{ijk} = \mu + \alpha_i + \gamma_j + (\alpha\gamma)_{ij} + e_{ijk}.$$

Here, $Y_{ijk}$ is the response of genotype $j$ under water stress level $i$ in replicate $k$. The $\alpha_i$'s
denote the fixed effects of water stress, the $\gamma_j$'s the random effects of genotype with mean 0
and variance $\sigma_\gamma^2$. Interaction terms such as $(\alpha\gamma)_{ij}$ are random effects if at least one of the
factors involved in the interaction is a random factor. Here, $(\alpha\gamma)_{ij}$ is a random effect with
mean 0 and variance $\sigma_{\alpha\gamma}^2$. The $e_{ijk}$, finally, denote the experimental errors. Mixed model struc-
tures also result when treatments are allocated to experimental units by separate randomiza-
tions. A split-plot design randomly allocates levels of the whole-plot factor to large experi-
mental units (whole-plots) and independently thereof randomly allocates levels of one or
more other factors within the whole-plots. The two randomizations generate two types of
experimental errors, one associated with the whole-plots, one associated with the sub-plots. If
the levels of the whole- and sub-plot treatment factor are selected at random, the resulting
model is a random model. If the levels of at least one of the factors were predetermined, a
mixed model results.
In observational studies (see §2.3), mixed effects models have gained considerable popu-
larity for longitudinal data structures (Jones 1993; Longford 1993; Diggle, Liang, and Zeger
1994; Vonesh and Chinchilli 1997; Gregoire et al. 1997; Verbeke and Molenberghs 1997;
Littell et al. 1996). Longitudinal data are measurements taken repeatedly on observational
units without the creation of experimental conditions by the experimenter. These units are
often termed subjects or clusters. In the absence of randomization, mixed effects arise
because some "parameters" of the model (slopes, intercept) are assumed to vary at random
from subject to subject while other parameters remain constant across subjects. Chapter 7
discusses mixed models for longitudinal and repeated measures data in great detail. The
distinction between fixed and random effects and its bearing on data analysis and data
interpretation are discussed there in more detail.
Generalized linear models (GLMs) are statistical models for a large family of probability
distributions known as the exponential family (§6.2.1). This family includes such important
distributions as the Gaussian, Gamma, Chi-squared, Beta, Bernoulli, Binomial, and Poisson
distributions. We consider generalized linear models among the most important statistical
models today. The infrequency with which they are applied to problems in the
plant and soil sciences belies their importance. They are based on work by Nelder and
Wedderburn (1972) and Wedderburn (1974), subsequently popularized in the monograph by
McCullagh and Nelder (1989). If responses are continuous, modelers typically resort to linear
or nonlinear statistical models of the kind
$$Y_i = f(x_{0i}, x_{1i}, x_{2i}, \ldots, x_{ki}, \theta_0, \theta_1, \ldots, \theta_p) + e_i,$$

where it is assumed that the model residuals have zero mean, are independent, and have some
common variance $\mathrm{Var}[e_i] = \sigma^2$. For purposes of parameter estimation, these assumptions are
usually sufficient. For purposes of statistical inference, such as the test of hypotheses or the
calculation of confidence intervals, distributional assumptions about the model residuals are
added. All too often, the errors are assumed to follow a Gaussian distribution. There are many
instances in which the Gaussian assumption is not tenable, for example, when the response is not
a continuous characteristic but a frequency count, or when the error distribution is clearly
skewed. Generalized linear models allow the modeling of such data when the response distri-
bution is a member of the exponential family (§6.2.1). Since the Gaussian distribution is a
member of the exponential family, linear regression and analysis of variance methods are spe-
cial cases of generalized linear models.
Besides non-Gaussian error distributions, generalized linear models utilize a model com-
ponent known as the link function. This is a transformation which maps the expected values
of the response onto a scale where covariate effects are additive. For a simple linear regres-
sion model with Gaussian errors,

$$Y_i = \beta_0 + \beta_1 x_i + e_i; \quad e_i \sim G(0, \sigma^2),$$

the expectation of the response is already linear; the applicable link function is the identity
function. Assume instead that we are concerned with a binary response, for example, whether a parti-
cular plant disease is present or absent. The mean (expected value) of the response is the
probability $\pi$ that the disease occurs, and a model is sought that relates the mean to some
environmental factor $x$. It would be unreasonable to model this probability as a linear
function of $x$, $\pi = \beta_0 + \beta_1 x$: there is no guarantee that the predicted values are between 0
and 1, the only acceptable range for probabilities. A monotone function that maps values
between 0 and 1 onto the real line is the logit function

$$\eta = \ln\left(\frac{\pi}{1 - \pi}\right).$$

Rather than modeling $\pi$ as a linear function of $x$, it is the transformed value $\eta$ that is modeled
as a linear function of $x$,

$$\eta = \ln\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 x.$$

Since the logit function links the mean $\pi$ to the covariate, it is called the link function of the
model. For any given value of $\eta$, the mean response is calculated by inverting the
relationship,

$$\pi = \frac{1}{1 + \exp\{-\eta\}} = \frac{1}{1 + \exp\{-\beta_0 - \beta_1 x\}}.$$
In a generalized linear model, a linear function of covariates is selected in the same way
as in a regression or classification model. Under a distributional assumption for the responses
and after selecting a link function, the unknown parameters can be estimated. There are
important differences between applying a link function and transformations such as the
arcsine, square root, logarithmic transform that are frequently applied in statistical work. The
latter transformations are applied to the individual responses $Y$ in order to achieve greater
variance homogeneity and/or symmetry, usually followed by a standard linear model analysis
on the transformed scale assuming Gaussian errors. In a generalized linear model the link
function transforms the mean response $\mathrm{E}[Y]$, and the distributional properties of the response
are not changed. A Binomial random variable is analyzed as a Binomial random variable, a
Poisson random variable as a Poisson random variable.
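As a quick numeric illustration of the logit link from above (our sketch in plain Python; the coefficient values are invented), the link and its inverse move between the probability scale and the linear-predictor scale:

    import math

    def logit(pi):
        # Link function: maps a probability in (0, 1) to the real line.
        return math.log(pi / (1 - pi))

    def inv_logit(eta):
        # Inverse link: maps the linear predictor back into (0, 1).
        return 1 / (1 + math.exp(-eta))

    b0, b1 = -2.0, 0.8            # hypothetical coefficients
    for x in (0.0, 2.5, 5.0):
        eta = b0 + b1 * x         # linear on the link scale
        print(x, inv_logit(eta))  # predicted probability, always in (0, 1)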
Because $\eta$ is a linear function of the covariates, statistical inference about the model para-
meters is straightforward. Tests for treatment main effects and interactions, for example, are
simple if the outcomes of an experiment with factorial treatment structure are analyzed as a
generalized linear model. They are much more involved if the model is a general nonlinear
model. Chapter 6 provides a thorough discussion of generalized linear models and numerous
applications. We mention in passing that statistical models appropriate for ordinal outcomes
such as visual ratings of plant quality or injury can be derived as extensions of generalized
linear models (see §6.5).
Data Structures
“Modern statisticians are familiar with the notions that any finite body of
data contains only a limited amount of information, on any point under
examination; that this limit is set by the nature of the data themselves, and
cannot be increased by any amount of ingenuity expended in their
statistical examination: that the statistician's task, in fact, is limited to the
extraction of the whole of the available information on any particular
issue.” (Fisher, R.A., The Design of Experiments, 4th ed., Edinburgh:
Oliver and Boyd, 1947, p. 39)
2.1 Introduction
2.2 Classification by Response Type
2.3 Classification by Study Type
2.4 Clustered Data
2.4.1 Clustering through Hierarchical Random Processes
2.4.2 Clustering through Repeated Measurements
2.5 Autocorrelated Data
2.5.1 The Autocorrelation Function
2.5.2 Consequences of Ignoring Autocorrelation
2.5.3 Autocorrelation in Designed Experiments
2.6 From Independent to Spatial Data — a Progression of Clustering
In other instances the statistical model must be altered to accommodate a different data
structure. Consider the Mitscherlich or linear-plateau model of §1.2. We tacitly assumed there
that observations corresponding to different levels of N input were independent. If the 21 N
levels are randomly assigned to some experimental units, this assumption is reasonable. Even
if there are replications of each N level, the models fitted in §1.2 still apply if the replicate
observations for a particular N level are averaged. Now imagine the following alteration of
the experimental protocol. Each experimental unit receives 0 kg/ha N at the beginning of the
study, and every few days 5 kg/ha are added until eventually all experimental units have
received all of the 21 N levels. Whether the statistical model remains valid and, if not, how it
needs to be altered depends on the changes in the data structure that have been incurred by
the protocol alteration.
A data structure comprises three key aspects: the response type (e.g., continuous or
discrete, §2.2), the study type (e.g., designed experiment vs. observational study, §2.3), and
the degree of data clustering (the hierarchical nature of the data, §2.4). Agronomists are
mostly familiar with statistical models for continuous response from designed experiments.
The statistical models underpinning the analyses are directed by the treatment, error control,
and observational design and require comparatively little interaction between user and model.
The temptation to apply the same types of analyses, i.e., the same types of statistical models,
in other situations, is understandable and may explain why analysis of proportions or ordinal
data by analysis of variance methods is common. But if the statistical model reflects a
mechanism that cannot generate data with the same pertinent features as the data at hand, if it
generates a different kind of data, how can inferences based on these models be reliable?
The analytical task is to construct models from the classes in §1.7 that represent appropri-
ate generating mechanisms. Discrete response data, for example, will lead to generalized
linear models, continuous responses with nonlinear mean function will lead to nonlinear
models. Hierarchical structures in the data, for example, from splitting experimental units,
often call for mixed model structures. The powerful array of tools with which many (most?)
data analytic problems in the plant and soil sciences can be tackled is attained by combining
[Figure: classification of response types. Continuous: number of possible values not countable, distance between values well-defined. Discrete: number of possible values countable; counts (support consists of true counts) versus categorical (support consists of labels, possibly numbers).]
There is a transition from discrete to continuous variables, which is best illustrated using
proportions. Consider counting the number of plants $X$ out of a total of $k$ plants that die after
application of an herbicide. Since both $X$ and $k$ are integers, the support of $Y$, the proportion
of dead plants, is discrete:

$$\left\{0, \frac{1}{k}, \frac{2}{k}, \ldots, \frac{k-1}{k}, 1\right\}.$$

As $k$ increases so does the number of elements in the support, and provided $k$ is sufficiently
large, it can be justified to consider the support infinitely large and no longer countable. The
discrete proportion is then treated for analytic purposes as a continuous variable.
The two fundamental situations in which data are gathered are the designed experiment and
the observational study. Control over experimental conditions and deliberate varying of these
conditions is occasionally cited as the defining feature of a designed experiment (e.g.,
McPherson 1990, Neter et al. 1990). Observational studies then are experiments where condi-
tions are beyond the control of the experimenter and covariates are merely observed. We do
not fully agree with this delineation of designed and observational experiments. Application
of treatments is an insufficient criterion for design and the existence of factors not controlled
by the investigator does not rule out a designed experiment, provided uncontrolled effects can
be properly neutralized via randomization. Unless the principles of experimental design, ran-
domization, replication, and across-unit homogeneity (blocking) are observed, data should
not be considered generated by a designed experiment. This narrow definition is necessary
since designed experiments are understood to lead to cause-and-effect conclusions rather than
associative interpretations. Experiments are usually designed as comparative experiments
where a change in treatment levels is to be shown to be the cause of changes in the response.
Experimental control must be exercised properly, which implies that (i) treatments are
randomly allocated to experimental units to neutralize the effects of uncontrolled factors; (ii)
treatments are replicated to allow the estimation of experimental error variance; and (iii)
experimental units are grouped into homogeneous blocks prior to treatment application to
eliminate controllable factors that are related to the response. The only negotiable of the three
principles is that of blocking. If a variable by which the experimental units should be blocked
is not taken into account, the experimental design will lead to unbiased estimates of treatment
Figure 2.2. Three randomizations of two treatments $(A_1, A_2)$ in four replications. (a) and (b)
are complete random assignments, whereas (c) restricts each treatment to appear exactly twice
in the east and twice in the west strip of the experimental field.
• While dependencies and interrelations may exist among the units within a
cluster, it is often reasonable to treat observations from different clusters as
independent.
Clustering of data refers to the hierarchical structure in data. It is an important feature of the
data-generating mechanism and as such it plays a critical role in formulating statistical
models, in particular mixed models and models for spatial data. In general, a cluster repre-
sents a collection of observations that are somehow stochastically related, whereas observa-
tions from different clusters are typically independent (stochastically unrelated). The two pri-
mary situations that lead to clustered data structures are (i) hierarchical random processes and
(ii) repeated measurements (Figure 2.3). Grouping of observations into sequences of repeated
observations collected on the same entity or subject has long been recognized as a clustered
data structure. Clustering through hierarchical random processes such as subsampling or split-
ting of experimental units also gives rise to hierarchical data structures.
Figure 2.3. Frequently encountered situations that give rise to clustered data: subsampling in
time and splitting of experimental units in space.
③ $Y_{ij} = \mu + \tau_i + e_{ij}$, with $\tau_i \sim \mathrm{iid}(0, \sigma_\tau^2)$ and $e_{ij} \sim \mathrm{iid}(0, \sigma_e^2)$.

In all models the $e_{ij}$ denote experimental errors. In ② the $e_{ij}$ are the whole-plot experimental
errors and the $d_{ijk}$ are the sub-plot experimental errors. In ① the $d_{ijk}$ are subsampling (obser-
vational) errors. The $\tau_i$ in ③ are the random treatment effects. Regardless of the type of
design, the error terms in all three models are independent by virtue of the random selection
of observational units or the random assignment of treatments. Every model contains two ran-
dom effects, where the second effect has one more subscript. The clusters are formed by the
$e_{ij}$ in ① and ② and by the $\tau_i$ in ③. While the within-cluster units are uncorrelated, it turns
out that the responses $Y_{ijk}$ and $Y_{ij}$ are not necessarily independent of each other. For two sub-
samples from the same experimental unit in ① and two observations from the same whole-
plot in ②, we have $\mathrm{Cov}[Y_{ijk}, Y_{ijk'}] = \sigma_e^2$. For two replicates of the same treatment in ③, we
obtain $\mathrm{Cov}[Y_{ij}, Y_{ij'}] = \sigma_\tau^2$. This is an important feature of statistical models with multiple,
nested random effects: they induce correlations among the responses (§7.5.1).
Example 2.1. The phosphate load of pastures under rotational grazing is investigated
for two forage species, alfalfa and birdsfoot trefoil. Since each pasture contains a
watering hole where heifers may concentrate, Bray-1 soil P (Bray-P1) is measured at
various distances from the watering hole. The layout of the observational units for one
replicate of the alfalfa treatment is shown in Figure 2.4.
Figure 2.4. Alfalfa replicate in soil P study. The replicate serves as the cluster for the
three rays along which soil samples are collected in 10-ft spacing. Each ray is a cluster
of seven soil samples.
The measurements along each ray are ordered along a spatial metric. Similarly, the
distances between measurements on different rays within a replicate are defined by the
Euclidean distance between any two points of soil sampling.
A critical difference between this design and a subsampling or split-plot design is the
systematic arrangement of elements within a cluster. This lack of randomization is particu-
larly apparent in longitudinal or repeated measures structures in time. A common practice is
to analyze repeated measures data as if it arises from a split-plot experiment. The repeated
measurements made on the experimental units of a basic design such as a randomized block
design are assumed to constitute a split of these units. This practice is not appropriate unless
[Figure 2.5 plots mean pH (4.5 to 6.5) over times 0 to 10 at depths 0-2", 2-4", 4-6", and 6-8" under No Tillage, with the mean across depths overlaid.]

Figure 2.5. Depth × Time sample mean pH for the No-Tillage treatment. Cross-hairs
depict sample mean pH at a given depth.
The correlation $\rho_{xy}$ between two random variables measures the strength
(and direction) of the (linear) dependency between $X$ and $Y$:

$$\rho_{xy} = \frac{\mathrm{Cov}[X, Y]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}},$$

where $\mathrm{Cov}[X, Y]$ denotes the covariance between the two random variables. The coefficient
ranges from $-1$ to $1$ and measures the strength of the linear dependency between $X$ and $Y$. It
is related to the coefficient of determination ($R^2$) in a linear regression of $Y$ on $X$ (or $X$ on
$Y$) by $R^2 = \rho_{xy}^2$. A positive value of $\rho_{xy}$ implies that an above-average value of $X$ is likely to
be paired with an above-average value of $Y$. A negative correlation coefficient implies that
above-average values of $X$ are paired with below-average values of $Y$. Autocorrelation coef-
ficients are defined in the same fashion, a covariance divided by the square root of a variance
product. Instead of two different variables $X$ and $Y$, the covariance and variances pertain to
the same variable observed at two different points in time (or space).

The variance of the sequence at time $t_i$ is $\mathrm{Var}[Y(t_i)] = \mathrm{E}[(Y(t_i) - \mu(t_i))^2]$ and the auto-
correlation between observations at times $t_i$ and $t_j$ is measured by the correlation

$$\mathrm{Corr}[Y(t_i), Y(t_j)] = \frac{\mathrm{Cov}[Y(t_i), Y(t_j)]}{\sqrt{\mathrm{Var}[Y(t_i)]\,\mathrm{Var}[Y(t_j)]}}.  [2.2]$$
If data measured repeatedly are uncorrelated, they should scatter around the trend over
time in an unpredictable, nonsystematic fashion (open circles in Figure 2.6). Data with posi-
tive autocorrelation show long runs of positive or negative residuals. If an observation is
likely to be below (above) average at some time point, it was likely to be below (above)
average in the immediate past. While negative product-moment correlations between random
variables are common, negative autocorrelation is fairly rare. It would imply that above
(below) average values are likely to be preceded by below (above) average values. This is
usually an indication of an incorrectly specified mean function; for example, a circadian rhythm
or seasonal fluctuation was omitted.
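The characteristic runs are easy to generate. The sketch below (our illustration, assuming Python with numpy; the first-order autoregressive construction and parameter values are our choices) simulates independent and positively autocorrelated residual series with the same marginal variance:

    import numpy as np

    rng = np.random.default_rng(7)
    n, sigma2, rho = 20, 0.3, 0.8

    indep = rng.normal(0, np.sqrt(sigma2), n)

    # AR(1): e_t = rho * e_{t-1} + innovation, scaled so the
    # marginal variance remains sigma2 at every time point.
    auto = np.empty(n)
    auto[0] = rng.normal(0, np.sqrt(sigma2))
    for t in range(1, n):
        auto[t] = rho * auto[t - 1] + rng.normal(
            0, np.sqrt(sigma2 * (1 - rho**2)))

    print(np.sign(indep))  # signs alternate unpredictably
    print(np.sign(auto))   # long runs of the same sign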
Figure 2.6. Autocorrelated and independent data with the same dispersion, $\sigma^2 = 0.3$. Open
circles depict independent observations, closed circles observations with positive autocorrela-
tion. Data are shown as deviations from the mean (residuals). A run of negative residuals for
the autocorrelated data occurs between times 6 and 14.
The correlations among observations from the same whole-plot are constant. The function
$R(\cdot)$ does not depend on which particular two sub-plot observations are considered. This
structure is known as the compound-symmetric or exchangeable correlation structure. The
process of randomization makes the ordering of the sub-plot units exchangeable. If repeated
measures experiments are analyzed as split-plot designs, it is implicitly assumed that the auto-
correlation structure of the repeated measure is exchangeable.
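For concreteness, a compound-symmetric correlation matrix for four repeated measurements looks as follows (a sketch in Python with numpy; the value $\rho = 0.4$ is an arbitrary assumption):

    import numpy as np

    def compound_symmetric(n, rho):
        # Equal correlation rho between any two of the n measurements.
        return rho * np.ones((n, n)) + (1 - rho) * np.eye(n)

    print(compound_symmetric(4, 0.4))

Every off-diagonal entry is the same; the temporal ordering of the measurements plays no role.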
With temporal data it is usually more appropriate to assume that correlations (covarian-
ces) decrease with increasing temporal separation. The autocorrelation function $R(h)$
approaches 0 as the time lag $h$ increases. When modeling $R(h)$ or $C(h)$ directly, we rely on
models for the covariance function that behave in a way the user deems reasonable. The
We are interested in comparing the means of the two groups, $H_0: \mu_1 - \mu_2 = 0$. First
assume that the correlations are ignored. Then one would estimate $\mu_1$ and $\mu_2$ by the respec-
tive sample means

$$\bar{Y}_1 = \frac{1}{n}\sum_{i=1}^{n} Y_1(t_i), \qquad \bar{Y}_2 = \frac{1}{n}\sum_{i=1}^{n} Y_2(t_i)$$

and determine their variances to be $\mathrm{Var}[\bar{Y}_1] = \mathrm{Var}[\bar{Y}_2] = \sigma^2/n$. The test statistic (assuming
$\sigma^2$ known) for $H_0: \mu_1 - \mu_2 = 0$ is

$$Z_{obs} = \frac{\bar{Y}_1 - \bar{Y}_2}{\sigma\sqrt{2/n}}.$$

If the equicorrelation $\rho$ among the repeated measurements is taken into account, the correct
test statistic becomes

$$Z_{obs}^* = \frac{\bar{y}_1 - \bar{y}_2}{\sigma\sqrt{\frac{2}{n}\{1 + (n-1)\rho\}}}.$$

The incorrect test statistic will be too large, and the $p$-value of that test will be too small. One
may declare the two groups as significantly different, whereas the correct test may fail to find
significant differences in the group means. The evidence against the null hypothesis has been
overstated by ignoring the correlation.
Another way of approaching this issue is in terms of the effective sample size. One can
ask: “How many samples of the uncorrelated kind provide the same precision as a sample of
correlated observations?” Cressie (1993, p. 15) calls this the equivalent number of indepen-
dent observations. Let $n$ denote the number of samples of the correlated kind and $n'$ the equi-
valent number of independent observations. The effective sample size is calculated as

$$n' = n\,\frac{\mathrm{Var}[\bar{Y}] \text{ assuming independence}}{\mathrm{Var}[\bar{Y}] \text{ under autocorrelation}} = \frac{n}{1 + (n-1)\rho}.$$
Ten observations equicorrelated with $\rho = 0.3$ provide as much information as
$10/(1 + 9 \times 0.3) = 2.7$ independent observations. This seems like a hefty penalty for having
correlated data. Recall that
the group mean difference was estimated as the difference of the arithmetic sample means,
$\bar{Y}_1 - \bar{Y}_2$. Although this difference is an unbiased estimate of the group mean difference
$\mu_1 - \mu_2$, it is an efficient estimate only in the case of uncorrelated data. If the correlations are
taken into account, more efficient estimators of $\mu_1 - \mu_2$ are available and the increase in
efficiency works to offset the smaller effective sample size.
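A one-line check of this penalty (our sketch in plain Python):

    def effective_n(n, rho):
        # Equivalent number of independent observations
        # under equicorrelation rho.
        return n / (1 + (n - 1) * rho)

    print(effective_n(10, 0.3))   # 2.7027...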
When predicting new observations based on a statistical model with correlated errors, it
turns out that the correlations enhance the predictive ability. Geostatistical kriging methods,
for example, utilize the spatial autocorrelation among observations to predict an attribute of
interest at a location where no observation has been collected (§9.4). The autocorrelation
between the unobserved attribute and the observed values allows one to glean information
• Spatial data is a special case of clustered data where the entire data
comprises a single cluster.
— Unclustered data: $k = n$, $n_i = 1$
— Longitudinal data: $k < n$, $n_i \geq 1$
— Spatial data: $k = 1$, $n_1 = n$.
As far as the correlation in data is concerned this text considers three prominent types of data
structures. Models for independent (uncorrelated) data that do not exhibit clustering are
covered in Chapters 4 to 6, statistical models for clustered data where observations from
different clusters are uncorrelated but correlations within a cluster are possible are considered
in Chapters 7 and 8. Models for spatial data are discussed in §9. Although the statistical tools
can differ greatly from chapter to chapter, there is a natural progression in these three data
structures. Before we can make this progression more precise, a few introductory comments
about spatial data are in order. More detailed coverage of spatial data types and their
underlying stochastic processes is deferred until §9.1.
A data set is termed spatial if, along with the attribute of interest, $Y$, the spatial locations
of the attributes are recorded. Let $\mathbf{s}$ denote the vector of coordinates at which $Y$ was
observed. If we restrict discussion for the time being to observations collected in the plane,
then $\mathbf{s}$ is a two-dimensional vector containing the longitude and latitude. In a time series we
Consider a square field with sixteen equally spaced grid points. Soil samples are collected at
the sixteen points and analyzed for soil organic matter (SOM) content (Figure 2.7a). Al-
though the grid points are regularly spaced, the observations are a realization of geostatistical
data, not lattice data. Soil samples could have been collected anywhere in the field. If, how-
ever, a grid point is considered the centroid of a rectangular area represented by the point,
these would be lattice data. Because the data are spatial, every SOM measurement might be
correlated with every other measurement. If a cluster is the collection of observations that are
potentially correlated, the SOM data comprise a single cluster ($k = 1$) of size $n = 16$.
Panels (b) and (c) of Figure 2.7 show two experimental design choices to compare four
treatments. A completely randomized design with four replications is arranged in panel (b),
and a completely randomized design with two replications and two subsamples per experi-
mental unit in panel (c).
[Figure 2.7 panel layouts: a) the sixteen grid points on coordinates $S_1$ and $S_2$; b) treatments A to D in a 4 × 4 arrangement of plots; c) labeled $k = 8$; $n_i = 2$, $i = 1, \ldots, 8$.]
Figure 2.7. Relationship between clustered and spatial data. Panel a) shows the grid points at
which a spatial data set for soil organic matter is collected. A completely randomized design
(CRD) with four treatments and four replicates is shown in (b), and a CRD with two repli-
cates of four treatments and two subsamples per experimental unit is shown in (c).
Statistical models that can accommodate all three levels of clustering are particularly
appealing (to us). The mixed models of §7 and §8 have this property. They reduce to standard
models for uncorrelated data when each observation represents a cluster by itself and to
(certain) models for spatial data when the entire data is considered a single cluster. It is
reasonable to assume that the SOM data in Figure 2.7a are correlated while the CRD data in
Figure 2.7b are independent, because they appeal to different data-generating mechanisms:
randomization in Figure 2.7b and a stochastic process (a random field) in Figure 2.7a. More
and more, agronomists are faced with more than a single data structure in a given experiment.
Some responses may be continuous, others discrete. Some data have a design context, others a
spatial context. Some data are longitudinal, some are cross-sectional. The organization of the
remainder of this book by response type and level of clustering is shown in Figure 2.8.
[Figure 2.8 chart: continuous response, Gaussian assumption reasonable, linear mean: §4, §7, §9; continuous response, nonlinear mean: §5, §8, §9; continuous but non-Gaussian or discrete response: §6, §8, §9.]
Figure 2.8. Organization of chapters by response type and level of clustering in data.
3.1 Introduction
3.2 Matrices and Vectors
3.3 Basic Matrix Operations
3.4 Matrix Inversion — Regular and Generalized Inverse
3.5 Mean, Variance, and Covariance of Random Vectors
3.6 The Trace and Expectation of Quadratic Forms
3.7 The Multivariate Gaussian Distribution
3.8 Matrix and Vector Differentiation
3.9 Using Matrix Algebra to Specify Models
3.9.1 Linear Models
3.9.2 Nonlinear Models
3.9.3 Variance-Covariance Matrices and Clustering
3.1 Introduction
The discussion of the various statistical models in the following chapters requires linear
algebra tools. The reader familiar with the basic concepts and operations such as matrix addi-
tion, subtraction, multiplication, transposition, inversion and the expectation and variance of a
random vector may skip this chapter. For the reader unfamiliar with these concepts or in need
of a refresher, this chapter provides the necessary background and tools. We have compiled
the most important rules and results needed to follow the notation and mathematical opera-
tions in subsequent chapters. Texts by Graybill (1969), Rao and Mitra (1971), Searle (1982),
Healy (1986), Magnus (1988), and others provide many more details about matrix algebra
useful in statistics. Specific techniques such as Taylor series expansions of vector-valued
functions are introduced as needed.
Without using matrices and vectors, mathematical expressions in statistical model infer-
ence quickly become unwieldy. Consider a two-way ($a \times b$) factorial treatment structure in a
randomized complete block design with $r$ blocks. The experimental error (residual) sum of
squares in terms of scalar quantities is obtained as

$$SSR = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{r}\left(y_{ijk} + \bar{y}_{i\cdot\cdot} + \bar{y}_{\cdot j\cdot} + \bar{y}_{\cdot\cdot k} - \bar{y}_{ij\cdot} - \bar{y}_{i\cdot k} - \bar{y}_{\cdot jk} - \bar{y}_{\cdot\cdot\cdot}\right)^2,$$

where $y_{ijk}$ denotes an observation for level $i$ of factor $A$, level $j$ of factor $B$ in block $k$; $\bar{y}_{i\cdot\cdot}$ is
the sample mean of all observations for level $i$ of $A$, and so forth. If only one treatment factor
is involved, the experimental error sum of squares formula becomes

$$SSR = \sum_{i=1}^{t}\sum_{j=1}^{r}\left(y_{ij} - \bar{y}_{i\cdot} - \bar{y}_{\cdot j} + \bar{y}_{\cdot\cdot}\right)^2.$$

Using matrices and vectors, the residual sum of squares can be written in either case as
$SSR = \mathbf{y}'(\mathbf{I} - \mathbf{H})\mathbf{y}$ for a properly defined vector $\mathbf{y}$ and matrix $\mathbf{H}$.
Consider the following three linear regression models:

$$Y_i = \beta_0 + \beta_1 x_i + e_i$$
$$Y_i = \beta_1 x_i + e_i$$
$$Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + e_i,$$

a simple linear regression, a straight-line regression through the origin, and a quadratic poly-
nomial. If the least squares estimates are expressed in terms of scalar quantities, we get for the
simple linear regression model the formulas

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}, \qquad \hat{\beta}_1 = \sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x}) \Big/ \sum_{i=1}^{n}(x_i - \bar{x})^2,$$

and

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} - \hat{\beta}_2\bar{z}$$

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(z_i - \bar{z})y_i \sum_{i=1}^{n}(x_i - \bar{x})(z_i - \bar{z}) - \sum_{i=1}^{n}(x_i - \bar{x})y_i \sum_{i=1}^{n}(z_i - \bar{z})^2}{\left\{\sum_{i=1}^{n}(x_i - \bar{x})(z_i - \bar{z})\right\}^2 - \sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(z_i - \bar{z})^2}$$

for the quadratic polynomial ($z_i = x_i^2$). By properly defining matrices $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{e}$, and $\boldsymbol{\beta}$, we can
write either of these models as $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}$ and the least squares estimates are

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}.$$

Matrix algebra allows us to efficiently develop and discuss theory and methods of statistical
models.
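The payoff is easy to demonstrate numerically. The sketch below (our illustration, assuming Python with numpy; the data are simulated) computes the simple linear regression estimates once from the scalar formulas and once from the matrix formula, with identical results:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 25
    x = rng.uniform(0, 10, n)
    y = 1.0 + 0.5 * x + rng.normal(0, 0.3, n)

    # Scalar formulas:
    b1 = ((y - y.mean()) * (x - x.mean())).sum() / ((x - x.mean())**2).sum()
    b0 = y.mean() - b1 * x.mean()

    # Matrix formula: beta_hat = (X'X)^{-1} X'y
    X = np.column_stack([np.ones(n), x])
    beta = np.linalg.solve(X.T @ X, X.T @ y)

    print(b0, b1)
    print(beta)   # the same two estimates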
In this text matrices are rectangular arrays of real numbers (we do not consider complex num-
bers). The size of a matrix is called its order and written as (row $\times$ column). A $(3 \times 4)$
matrix has three rows and four columns. When referring to individual elements of a matrix, a
double subscript denotes the row and column of the matrix in which the element is located.
For example, if $\mathbf{A}$ is an $(n \times k)$ matrix, $a_{35}$ is the element positioned in the third row, fifth
column. We sometimes write $\mathbf{A} = [a_{ij}]$ to show that the individual elements of $\mathbf{A}$ are the $a_{ij}$.
If it is necessary to explicitly identify the order of $\mathbf{A}$, a subscript is used, for example, $\mathbf{A}_{(n \times k)}$.
Matrices (and vectors) are distinguished from scalars with bold-face lettering. Matrices
with a single row or column dimension are called vectors. A vector is usually denoted by a
lowercase boldface letter, e.g., $\mathbf{z}$, $\mathbf{y}$. If uppercase lettering such as $\mathbf{Z}$ or $\mathbf{Y}$ is used for vectors, it
is implied that the elements of the vector are random variables. This can cause a little con-
An $(n \times 1)$ vector is also called a column vector; $(1 \times k)$ vectors are referred to as row
vectors. By convention all vectors in this text are column vectors unless explicitly stated
otherwise. Any matrix can be partitioned into a series of column vectors. $\mathbf{A}_{(n \times k)}$ can be
thought of as the horizontal concatenation of $k$ $(n \times 1)$ column vectors: $\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_k]$.
$\mathbf{A}$ can also be viewed as a vertical concatenation of row vectors; for example,

$$\mathbf{A} = \begin{bmatrix} \boldsymbol{\alpha}_1' \\ \boldsymbol{\alpha}_2' \\ \boldsymbol{\alpha}_3' \\ \boldsymbol{\alpha}_4' \\ \boldsymbol{\alpha}_5' \end{bmatrix}, \quad \text{where } \boldsymbol{\alpha}_1' = [1, 2.3, 0, 1].$$
We now define some special matrices encountered frequently throughout the text.
• $\mathbf{1}_n$: the unit vector; an $(n \times 1)$ vector whose elements are 1.
• $\mathbf{I}_n$: the identity matrix of size $n$; an $(n \times n)$ matrix with ones on the diagonal and
zeros elsewhere,

$$\mathbf{I}_5 = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}.$$
If the order of these matrices is obvious from the context, the subscripts are omitted.
• The sum of two matrices is the elementwise sum of its elements. The
difference between two matrices is the elementwise difference between its
elements.
• Two matrices A and B are conformable for addition and subtraction if they
have the same order and for multiplication A*B if the number of columns in
A equals the number of rows in B.
Basic matrix operations are addition, subtraction, multiplication, transposition, and inversion.
Matrix inversion is discussed in §3.4, since it requires some additional comments. The
transpose of a matrix A is obtained by exchanging its rows and columns. The first row
becomes the first column, the second row the second column, and so forth. It is denoted by
attaching a single quote ($'$) to the matrix symbol. Symbolically, $\mathbf{A}' = [a_{ji}]$. The $(5 \times 4)$ matrix
$\mathbf{A}$ in [3.1] has transpose
$\mathbf{A}$ and $\mathbf{B}$ are both of order $(3 \times 3)$, while $\mathbf{C}$ and $\mathbf{D}$ are of order $(3 \times 2)$. The sum

$$\mathbf{C} + \mathbf{D} = \begin{bmatrix} -1+3 & 0+9 \\ 2+0 & 3+(-1) \\ -2+(-2) & -1+3 \end{bmatrix} = \begin{bmatrix} 2 & 9 \\ 2 & 2 \\ -4 & 2 \end{bmatrix}.$$

The operations $\mathbf{A} + \mathbf{C}$, $\mathbf{A} - \mathbf{C}$, and $\mathbf{B} + \mathbf{D}$, for example, are not possible since the matrices do not
conform. Addition and subtraction are commutative, i.e., $\mathbf{A} + \mathbf{B} = \mathbf{B} + \mathbf{A}$, and can be com-
bined with transposition,

$$(\mathbf{A} + \mathbf{B})' = \mathbf{A}' + \mathbf{B}'.  [3.3]$$
Multiplication of two matrices requires a different kind of conformity. The product $\mathbf{AB}$
is possible if the number of columns in $\mathbf{A}$ equals the number of rows in $\mathbf{B}$. For example, $\mathbf{AC}$
is defined if $\mathbf{A}$ has order $(3 \times 3)$ and $\mathbf{C}$ has order $(3 \times 2)$; $\mathbf{CA}$ is not possible, however. The
order of the matrix product equals the number of rows of the first and the number of columns
of the second matrix. The product of an $(n \times k)$ and a $(k \times p)$ matrix is an $(n \times p)$ matrix.
The product of a row vector $\mathbf{a}'$ and a column vector $\mathbf{b}$ is termed the inner product of $\mathbf{a}$ and $\mathbf{b}$.
Because a $(1 \times k)$ matrix is multiplied with a $(k \times 1)$ matrix, the inner product is a scalar:

$$\mathbf{a}'\mathbf{b} = [a_1, \ldots, a_k]\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_k \end{bmatrix} = a_1 b_1 + a_2 b_2 + \cdots + a_k b_k = \sum_{i=1}^{k} a_i b_i.  [3.4]$$
The square root of the inner product of a vector with itself is the (Euclidean) norm of the
vector, denoted $\|\mathbf{a}\|$:

$$\|\mathbf{a}\| = \sqrt{\mathbf{a}'\mathbf{a}} = \sqrt{\sum_{i=1}^{k} a_i^2}.  [3.5]$$

The norm of the difference of two vectors $\mathbf{a}$ and $\mathbf{b}$ measures their Euclidean distance:

$$\|\mathbf{a} - \mathbf{b}\| = \sqrt{(\mathbf{a} - \mathbf{b})'(\mathbf{a} - \mathbf{b})} = \sqrt{\sum_{i=1}^{k} (a_i - b_i)^2}.  [3.6]$$

It plays an important role in statistics for spatial data (§9), where $\mathbf{a}$ and $\mathbf{b}$ are vectors of spatial
coordinates (longitude and latitude).
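In code, [3.5] and [3.6] are one-liners (our sketch, assuming Python with numpy; the vectors are arbitrary):

    import numpy as np

    a = np.array([1.0, 2.3, 0.0, 1.0])
    b = np.array([0.5, 1.3, 2.0, 3.0])

    norm_a = np.sqrt(a @ a)               # Euclidean norm, [3.5]
    dist_ab = np.sqrt((a - b) @ (a - b))  # Euclidean distance, [3.6]

    print(norm_a, np.linalg.norm(a))        # identical
    print(dist_ab, np.linalg.norm(a - b))   # identical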
Multiplication of the matrices $\mathbf{A}_{(n \times k)}$ and $\mathbf{B}_{(k \times p)}$ can be expressed as a series of inner
products. Partition $\mathbf{A}'$ as $\mathbf{A}' = [\boldsymbol{\alpha}_1, \boldsymbol{\alpha}_2, \ldots, \boldsymbol{\alpha}_n]$ and $\mathbf{B}_{(k \times p)} = [\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_p]$. Here, $\boldsymbol{\alpha}_i$ is the
$i$th row of $\mathbf{A}$ and $\mathbf{b}_j$ is the $j$th column of $\mathbf{B}$. The elements of the matrix product $\mathbf{AB}$ can be
written as a matrix whose typical elements are the inner products of the rows of $\mathbf{A}$ with the
columns of $\mathbf{B}$:

$$\mathbf{A}_{(n \times k)}\mathbf{B}_{(k \times p)} = [\boldsymbol{\alpha}_i'\mathbf{b}_j]_{(n \times p)}.  [3.7]$$

For example, let

$$\mathbf{A} = \begin{bmatrix} 1 & 2 & 0 \\ 3 & 1 & 3 \\ 4 & 1 & 2 \end{bmatrix}, \qquad \mathbf{B} = \begin{bmatrix} 1 & 0 \\ 2 & 3 \\ 2 & 1 \end{bmatrix}.$$

Then the product can be partitioned into inner products of the rows of $\mathbf{A}$ with the columns
of $\mathbf{B}$ as

$$\mathbf{A}\mathbf{B} = [\boldsymbol{\alpha}_i'\mathbf{b}_j] = \begin{bmatrix} \boldsymbol{\alpha}_1'\mathbf{b}_1 & \boldsymbol{\alpha}_1'\mathbf{b}_2 \\ \boldsymbol{\alpha}_2'\mathbf{b}_1 & \boldsymbol{\alpha}_2'\mathbf{b}_2 \\ \boldsymbol{\alpha}_3'\mathbf{b}_1 & \boldsymbol{\alpha}_3'\mathbf{b}_2 \end{bmatrix}.$$
Matrix multiplication is not a symmetric operation. AB does not yield the same product
as BA. First, for both products to be defined, A and B must have the same row and column
order (must be square). Even then, the outcome of the multiplication might differ. The
operation AB is called postmultiplication of A by B, BA is called premultiplication of A by
B.
To multiply a matrix by a scalar, simply multiply every element of the matrix by the scalar:

$$c\mathbf{A} = [c\,a_{ij}].  [3.8]$$

Some basic rules for matrix sums and products are

$$\mathbf{C}(\mathbf{A} + \mathbf{B}) = \mathbf{CA} + \mathbf{CB}$$
$$(\mathbf{AB})\mathbf{C} = \mathbf{A}(\mathbf{BC})$$
$$c(\mathbf{A} + \mathbf{B}) = c\mathbf{A} + c\mathbf{B}  [3.9]$$
$$(\mathbf{A} + \mathbf{B})(\mathbf{C} + \mathbf{D}) = \mathbf{AC} + \mathbf{AD} + \mathbf{BC} + \mathbf{BD}.$$
Matrices are called square when their row and column dimensions are identical. If, fur-
thermore, $a_{ij} = a_{ji}$ for all $j \neq i$, the matrix is called symmetric. Symmetric matrices are en-
countered frequently in the form of variance-covariance or correlation matrices (see §3.5). A
symmetric matrix can be constructed from any matrix $\mathbf{A}$ with the operations $\mathbf{A}'\mathbf{A}$ or $\mathbf{A}\mathbf{A}'$. If
all off-diagonal cells of a matrix are zero, i.e., $a_{ij} = 0$ if $j \neq i$, the matrix is called diagonal
and is obviously symmetric. The identity matrix is a diagonal matrix with 1's on the diagonal.
On occasion, diagonal matrices are written as $\mathrm{Diag}(\mathbf{a})$, where $\mathbf{a}$ is a vector of diagonal
elements. If, for example, $\mathbf{a}' = [1, 2, 3, 4, 5]$, then

$$\mathrm{Diag}(\mathbf{a}) = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 & 0 \\ 0 & 0 & 3 & 0 & 0 \\ 0 & 0 & 0 & 4 & 0 \\ 0 & 0 & 0 & 0 & 5 \end{bmatrix}.$$
• The rank of a matrix equals the number of its linearly independent columns.
• Only square matrices of full rank have an inverse. These are called non-
singular matrices. If the inverse of A exists, it is unique.
Multiplying a scalar with its reciprocal produces the multiplicative identity, $c(1/c) = 1$,
provided $c \neq 0$. For matrices an operation such as this simple division is not so straightfor-
ward. The multiplicative identity for matrices should be the identity matrix $\mathbf{I}$. The matrix $\mathbf{B}$
which yields $\mathbf{AB} = \mathbf{I}$ is called the inverse of $\mathbf{A}$, denoted $\mathbf{A}^{-1}$. Unfortunately, an inverse does
not exist for all matrices. Matrices which are not square do not have regular inverses. If $\mathbf{A}^{-1}$
exists, $\mathbf{A}$ is called nonsingular and we have

$$\mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}.$$
To see how important it is to have access to the inverse of a matrix, consider the follow-
ing equation, which expresses the vector $\mathbf{y}$ as a linear function of $\mathbf{c}$:

$$\mathbf{y} = \mathbf{X}\mathbf{c}.  [3.10]$$

To solve for $\mathbf{c}$ we need to eliminate the matrix $\mathbf{X}$ from the right-hand side of the equation. But
$\mathbf{X}$ is not square and thus cannot be inverted. However, $\mathbf{X}'\mathbf{X}$ is a square, symmetric matrix.
Premultiply both sides of the equation with $\mathbf{X}'$:

$$\mathbf{X}'\mathbf{y} = \mathbf{X}'\mathbf{X}\mathbf{c}.  [3.11]$$

If the inverse of $\mathbf{X}'\mathbf{X}$ exists, premultiply both sides of the equation with it to isolate $\mathbf{c}$:

$$(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\mathbf{c} = \mathbf{I}\mathbf{c} = \mathbf{c}.$$
For the inverse of a matrix to exist, the matrix must be of full rank. The rank of a matrix
$\mathbf{A}$, denoted $r(\mathbf{A})$, is equal to the number of its linearly independent columns. For example,

$$\mathbf{A} = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix}$$

has three columns, $\mathbf{a}_1$, $\mathbf{a}_2$, and $\mathbf{a}_3$. But because $\mathbf{a}_1 = \mathbf{a}_2 + \mathbf{a}_3$, the columns of $\mathbf{A}$ are not linear-
ly independent. In general, let $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_k$ be a set of column vectors. If a set of scalars
$c_1, \ldots, c_k$ can be found, not all of which are zero, such that

$$c_1\mathbf{a}_1 + c_2\mathbf{a}_2 + \cdots + c_k\mathbf{a}_k = \mathbf{0},  [3.12]$$

the $k$ column vectors are said to be linearly dependent. If the only set of constants for which
[3.12] holds is $c_1 = c_2 = \cdots = c_k = 0$, the vectors are linearly independent. An $(n \times k)$
matrix $\mathbf{X}$ whose rank is less than $k$ is called rank-deficient (or singular), and $\mathbf{X}'\mathbf{X}$ does not
have a regular inverse. A few important results about the rank of a matrix and the rank of
matrix products follow:

$$r(\mathbf{A}) = r(\mathbf{A}') = r(\mathbf{A}'\mathbf{A}) = r(\mathbf{A}\mathbf{A}')$$
$$r(\mathbf{AB}) \leq \min\{r(\mathbf{A}), r(\mathbf{B})\}  [3.13]$$
$$r(\mathbf{A} + \mathbf{B}) \leq r(\mathbf{A}) + r(\mathbf{B}).$$
If X is rank-deficient, X'X will still be symmetric, but its inverse does not exist. How can we then isolate c in [3.11]? It can be shown that for any matrix A, a matrix A⁻ can be found which satisfies

    AA⁻A = A.   [3.14]

A⁻ is called the generalized inverse or pseudo-inverse of A. The terms g-inverse or conditional inverse are also in use. It can be shown that a solution of X'y = X'Xc is obtained with a generalized inverse as c = (X'X)⁻X'y. Apparently, if X is not of full rank, all we have to do is substitute the generalized inverse for the regular inverse. Unfortunately, generalized inverses are not unique. The condition [3.14] is satisfied by (infinitely) many matrices. The solution c is hence not unique either. Assume that G is a generalized inverse of X'X. Then any vector

    c = GX'y + (GX'X − I)d
is a solution where d is a conformable but otherwise arbitrary vector (Searle 1971, p. 9; Rao
and Mitra 1971, p. 44). In analysis of variance models, X contains dummy variables coding
the treatment and design effects and is typically rank-deficient (see §3.9.1). Statistical
packages that use different generalized inverses will return different estimates of these
effects. This would pose a considerable problem, but fortunately, several important properties
of generalized inverses come to our rescue in statistical inference. If G is a generalized
inverse of X'X, then

    (i) G' is also a generalized inverse of X'X;
    (ii) GX' is a generalized inverse of X;
    (iii) XGX' is invariant to the choice of G;
    (iv) X'XGX' = X' and XGX'X = X.
Consider the third result, for example. If XGX' is invariant to the choice of the particular generalized inverse, then XGX'y is also invariant. But GX'y = (X'X)⁻X'y is the solution derived above, and so Xc is invariant. In statistical models, c is often the least squares estimate of the model parameters and X a regressor or design matrix. While the estimates c depend on the choice of the generalized inverse, the predicted values do not. Two statistical packages using different generalized inverses will thus report different parameter estimates but the same fitted values.
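The invariance result is easy to verify numerically. The following proc iml sketch (our own illustration, not part of the text) builds two different generalized inverses of X'X for the rank-deficient matrix shown above and confirms that XGX' does not depend on the choice:

proc iml;
   /* rank-deficient design matrix (first column = sum of the others) */
   X  = {1 1 0, 1 1 0, 1 0 1, 1 0 1};
   A  = t(X)*X;                      /* X`X has rank 2                */
   G1 = ginv(A);                     /* Moore-Penrose g-inverse       */
   /* a second g-inverse: invert the nonsingular 2x2 minor of
      rows/columns {1,2} and border it with zeros                     */
   G2 = j(3,3,0);
   G2[{1 2},{1 2}] = inv(A[{1 2},{1 2}]);
   print (A*G1*A)[label="A*G1*A"] (A*G2*A)[label="A*G2*A"];  /* both = A */
   print (X*G1*t(X))[label="X*G1*X`"] (X*G2*t(X))[label="X*G2*X`"];
quit;

Although G1 and G2 differ, the two products X*G1*X` and X*G2*X` print the same matrix.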
For any matrix A there is one unique matrix B that satisfies the following conditions:

    (i) ABA = A   (ii) BAB = B   (iii) (BA)' = BA   (iv) (AB)' = AB.   [3.15]

Because of (i), B is a generalized inverse of A. The matrix satisfying (i) to (iv) of [3.15] is called the Moore-Penrose inverse, named after work by Penrose (1955) and Moore (1920). Different classes of generalized inverses have been defined depending on subsets of the four conditions. A matrix B satisfying (i) is the standard generalized inverse. If B satisfies (i) and (ii), it is termed a reflexive generalized inverse according to Urquhart (1968). Special cases of generalized inverses satisfying (i), (ii), (iii) or (i), (ii), (iv) are the left and right inverses.
Let A_(p×k) be of rank k. Then A'A is a (k × k) matrix of rank k by [3.13] and its inverse exists. The matrix (A'A)⁻¹A' is called the left inverse of A, since left multiplication of A by (A'A)⁻¹A' produces the identity matrix:

    (A'A)⁻¹A'A = I.

Similarly, if A_(p×k) is of rank p, then A'(AA')⁻¹ is its right inverse, since AA'(AA')⁻¹ = I. Left and right inverses are not regular inverses, but generalized inverses. Let C_(k×p) = A'(AA')⁻¹; then AC = I, but CA need not equal the (k × k) identity matrix. It is easy to verify, however, that ACA = A; hence C is a generalized inverse of A by [3.14]. Left and right inverses are sometimes called normalized generalized inverse matrices (Rohde 1966, Morris and Odell 1968, Urquhart 1968). Searle (1971, pp. 1-3) explains how to construct a generalized inverse that satisfies [3.15]. Given a matrix A and arbitrary generalized inverses B of (AA') and C of (A'A), the unique Moore-Penrose inverse can be constructed as A'BACA'.
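In SAS/IML the ginv function returns the Moore-Penrose inverse, so the four conditions in [3.15] can be checked directly (a quick sketch we add for illustration):

proc iml;
   A = {1 1 0, 1 1 0, 1 0 1, 1 0 1};
   B = ginv(A);                      /* Moore-Penrose inverse of A */
   c1 = max(abs(A*B*A - A));         /* (i)   ABA = A              */
   c2 = max(abs(B*A*B - B));         /* (ii)  BAB = B              */
   c3 = max(abs(t(B*A) - B*A));      /* (iii) (BA)` symmetric      */
   c4 = max(abs(t(A*B) - A*B));      /* (iv)  (AB)` symmetric      */
   print c1 c2 c3 c4;                /* all zero up to rounding    */
quit;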
For inverses and all generalized inverses, we note the following results (here B is rank-deficient and A is of full rank):

    (B⁻)' = (B')⁻
    (A⁻¹)' = (A')⁻¹
    (A⁻¹)⁻¹ = A
    (AB)⁻¹ = B⁻¹A⁻¹                             [3.16]
    (B⁻)⁻ = B
    r(B⁻) ≥ r(B).
Finding the inverse of a matrix is simple for certain patterned matrices such as full-rank (2 × 2) matrices, diagonal matrices, and block-diagonal matrices. The inverse of a full-rank (2 × 2) matrix

    A = [a_11 a_12
         a_21 a_22]

is

    A⁻¹ = 1/(a_11·a_22 − a_12·a_21) · [ a_22  −a_12
                                        −a_21   a_11].   [3.17]

The inverse of a full-rank diagonal matrix D_(k×k) = Diag(a) is obtained by replacing the diagonal elements by their reciprocals:

    D⁻¹ = Diag(1/a_1, 1/a_2, ..., 1/a_k).   [3.18]

A block-diagonal matrix is akin to a diagonal matrix, where matrices instead of scalars form the diagonal. For example,

    B = [B_1  0   0   0
         0   B_2  0   0
         0   0   B_3  0
         0   0   0   B_4]

is a block-diagonal matrix where the matrices B_1, ..., B_4 form the blocks. The inverse of a block-diagonal matrix is obtained by separately inverting the matrices on the diagonal, provided these inverses exist:

    B⁻¹ = [B_1⁻¹  0     0     0
           0     B_2⁻¹  0     0
           0     0     B_3⁻¹  0
           0     0     0     B_4⁻¹].
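The block-by-block inversion rule is easy to demonstrate. In the sketch below (our example, with arbitrary blocks) the block function of SAS/IML assembles the block-diagonal matrix:

proc iml;
   B1 = {2 1, 1 2};
   B2 = {4 0, 0 9};
   B  = block(B1, B2);                /* block-diagonal matrix        */
   Bi = block(inv(B1), inv(B2));      /* invert the blocks one by one */
   print (max(abs(inv(B) - Bi)))[label="max difference"];  /* zero    */
quit;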
• The expectation of a random vector is the vector of the expected values of its elements:

    E[Y] = [E[Y_i]] = [E[Y_1], E[Y_2], E[Y_3], ..., E[Y_n]]'.   [3.19]

The covariance matrix between two random vectors Y_(k×1) and U_(p×1) is a (k × p) matrix; its ij-th element is the covariance between Y_i and U_j:

    Cov[Y, U] = [Cov[Y_i, U_j]].

Covariances of linear combinations of random vectors are evaluated similarly to the scalar case (assuming W and V are also random vectors):

    Cov[AY, U] = A·Cov[Y, U]
    Cov[Y, BU] = Cov[Y, U]·B'
    Cov[AY, BU] = A·Cov[Y, U]·B'                    [3.23]
    Cov[aY + bU, cW + dV] = ac·Cov[Y, W] + bc·Cov[U, W]
                          + ad·Cov[Y, V] + bd·Cov[U, V].
The variance-covariance matrix contains on its diagonal the variances of the observations, and in the off-diagonal cells the covariances among the random vector's elements. To designate the mean and variance of a random vector Y, we use the notation Y ~ (E[Y], Var[Y]). For example, homoscedastic zero-mean errors e are designated as e ~ (0, σ²I).
The elements of a random vector Y are said to be uncorrelated if the variance-covariance matrix of Y is a diagonal matrix:

    Var[Y_(k×1)] = [Var[Y_1]    0        0      ...    0
                    0        Var[Y_2]    0      ...    0
                    0           0     Var[Y_3]  ...    0
                    ...        ...      ...     ...   ...
                    0           0        0      ... Var[Y_k]].

Two random vectors Y_1 and Y_2 are said to be uncorrelated if their variance-covariance matrix is block-diagonal:

    Var[(Y_1', Y_2')'] = [Var[Y_1]    0
                          0        Var[Y_2]].
The rules above for working with the covariance of two (or more) random vectors readily extend to variance-covariance matrices:

    Var[AY] = A·Var[Y]·A'
    Var[Y + a] = Var[Y]
    Var[a'Y] = a'·Var[Y]·a                          [3.24]
    Var[aY] = a²·Var[Y]
    Var[aY + bU] = a²·Var[Y] + b²·Var[U] + 2ab·Cov[Y, U].
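As a quick illustration of the third rule in [3.24] (an example we add), take a = (1/n)1 and Var[Y] = σ²I; then

    Var[Ȳ] = Var[(1/n)·1'Y] = (1/n²)·1'(σ²I)·1 = σ²·(n/n²) = σ²/n,

the familiar variance of a sample mean of n uncorrelated, homoscedastic observations.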
• The trace of a matrix A, denoted tr(A), equals the sum of its diagonal elements.

The trace plays an important role in determining the expected value of quadratic forms in random vectors. If y is an (n × 1) vector and A an (n × n) matrix, y'Ay is a quadratic form in y. Notice that quadratic forms are scalars. Consider a regression model Y = Xβ + e, where the elements of e have 0 mean and constant variance σ², e ~ (0, σ²I). The total sum of squares corrected for the mean,

    SST_m = Y'(I − 11'/n)Y,

is a quadratic form, as are the regression (model) and residual sums of squares:

    SSM_m = Y'(H − 11'/n)Y
    SSR = Y'(I − H)Y
    H = X(X'X)⁻¹X'.
To determine distributional properties of these sums of squares and expected mean squares, we need to know, for example, E[SSR].

If Y has mean μ and variance-covariance matrix V, the quadratic form Y'AY has expected value

    E[Y'AY] = tr(AV) + μ'Aμ.   [3.25]

In evaluating such expectations, several properties of the trace operator tr(·) are helpful:

    (i) tr(ABC) = tr(BCA) = tr(CAB)
    (ii) tr(A + B) = tr(A) + tr(B)
    (iii) y'Ay = tr(y'Ay)
    (iv) tr(A) = tr(A')                             [3.26]
    (v) tr(cA) = c·tr(A)
    (vi) tr(A) = r(A) if AA = A and A' = A.
Property (i) states that the trace is invariant under cyclic permutations and (ii) that the trace of the sum of two matrices is identical to the sum of their traces. Property (iii) emphasizes that quadratic forms are scalars and any scalar is of course equal to its trace (a scalar is a (1 × 1) matrix). We can now apply [3.25] with A = (I − H), V = σ²I, μ = Xβ to find the expected value of SSR in the linear regression model:

    E[SSR] = E[Y'(I − H)Y]
           = tr((I − H)σ²I) + β'X'(I − H)Xβ
           = σ²·tr(I − H) + β'X'Xβ − β'X'HXβ.

At this point we notice that (I − H) is symmetric and that (I − H)(I − H) = (I − H). We can apply rule (vi) and find tr(I − H) = n − r(H). Furthermore,

    β'X'HXβ = β'X'X(X'X)⁻¹X'Xβ = β'X'Xβ,

so the two terms involving β cancel and E[SSR] = σ²(n − r(H)) = σ²(n − r(X)).
The Gaussian (Normal) distributions are arguably the most important family of distributions in all of statistics. This importance does not stem from many attributes being Gaussian-distributed; most outcomes are not Gaussian, which is one reason we prefer the label Gaussian distribution over Normal distribution. Rather, there are two reasons. First, statistical methods are usually simpler and mathematically more straightforward if data are Gaussian-distributed. Second, the Central Limit Theorem (CLT) permits approximating the distribution of averages in random samples by a Gaussian distribution regardless of the distribution from which the sample was drawn, provided the sample size is sufficiently large.
A scalar random variable Y is said to be (univariate) Gaussian-distributed with mean μ and variance σ² if its probability density function is

    f(y) = (2πσ²)^(−1/2) · exp{−(y − μ)²/(2σ²)}.   [3.27]

An (n × 1) random vector Y is multivariate Gaussian-distributed with mean vector μ and variance-covariance matrix Σ if its density function is

    f(y) = |Σ|^(−1/2) (2π)^(−n/2) · exp{−(1/2)(y − μ)'Σ⁻¹(y − μ)}.   [3.28]

The term |Σ| is called the determinant of Σ. We express the fact that Y is multivariate Gaussian with the shortcut notation

    Y ~ G_n(μ, Σ).

The subscript n identifies the dimensionality of the distribution and thereby the order of μ and Σ. It can be omitted if the dimension is clear from the context. Some important properties of Gaussian-distributed random variables follow:
    (i) E[Y] = μ, Var[Y] = Σ.
    (ii) If Y ~ G_n(μ, Σ), then Y − μ ~ G_n(0, Σ).
    (iii) (Y − μ)'Σ⁻¹(Y − μ) ~ χ²_n, a Chi-squared variable with n degrees of freedom.
    (iv) If μ = 0 and Σ = σ²I, then G(0, σ²I) is called the standard multivariate Gaussian distribution.
    (v) If Y ~ G_n(μ, Σ), then U = A_(k×n)Y + b_(k×1) is Gaussian-distributed with mean Aμ + b and variance-covariance matrix AΣA' (linear combinations of Gaussian variables are also Gaussian).
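For example (our addition), applying property (v) with A = [1, −1] and b = 0 to Y = [Y_1, Y_2]' ~ G_2(μ, Σ) gives the distribution of a difference:

    Y_1 − Y_2 ~ G(μ_1 − μ_2, σ_11 + σ_22 − 2σ_12).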
Partition Y into two subvectors, Y = [Y_1', Y_2']', where Y_1 is (s × 1) and Y_2 is (t × 1), and partition μ and Σ accordingly. If Σ_12(s×t) = 0, then Y_1 and Y_2 are uncorrelated and independent. As a corollary note that the elements of Y are mutually independent if and only if Σ is a diagonal matrix. We can learn more from this partitioning of Y. The distribution of any subset of Y is itself Gaussian-distributed, for example,

    Y_1 ~ G_s(μ_1, Σ_11)
    Y_i ~ G(μ_i, σ_ii).

Another important result pertaining to the multivariate Gaussian distribution tells us that if Y = [Y_1', Y_2']' is Gaussian-distributed, the conditional distribution of Y_1 given Y_2 = y_2 is also Gaussian. Furthermore, the conditional mean E[Y_1|y_2] is a linear function of y_2 (see, e.g., Searle 1971, Ch. 2.4, for the general result).
The derivative of A with respect to θ is defined as the matrix of derivatives of its elements,

    ∂A/∂θ = [∂a_ij(θ)/∂θ].   [3.29]

Occasionally the derivative of a function with respect to an entire vector is needed. For example, let

    f(x) = x_1 + 3x_2 + 6x_2·x_3 + 4x_3².

The derivative of f(x) with respect to the vector x is the vector of partial derivatives:

    ∂f(x)/∂x = [∂f/∂x_1, ∂f/∂x_2, ∂f/∂x_3]' = [1, 3 + 6x_3, 6x_2 + 8x_3]'.
Some additional results for vector and matrix calculus follow. Again, A is a matrix whose elements depend on θ. Then

    (i) ∂tr(AB)/∂θ = tr((∂A/∂θ)B) + tr(A·∂B/∂θ)
    (ii) ∂x'A⁻¹x/∂θ = −x'A⁻¹(∂A/∂θ)A⁻¹x.
We can put these rules to the test to find the maximum likelihood estimator of β in the Gaussian linear model Y = Xβ + e, e ~ G_n(0, Σ). To this end we need to find the solution β̂ which maximizes the likelihood function. From [3.28] the density function is given by

    f(y; β) = |Σ|^(−1/2) (2π)^(−n/2) · exp{−(1/2)(y − Xβ)'Σ⁻¹(y − Xβ)}.

Consider this a function of β for a given set of data y, and call it the likelihood function L(β; y). Maximizing L(β; y) is equivalent to maximizing its logarithm. The log-likelihood function in the Gaussian linear model becomes

    ln{L(β; y)} = l(β; y) = −(1/2)·ln(|Σ|) − (n/2)·ln(2π) − (1/2)(y − Xβ)'Σ⁻¹(y − Xβ).   [3.33]

To find β̂, find the solution to ∂l(β; y)/∂β = 0. First we derive the derivative of the log-likelihood:

    ∂l(β; y)/∂β = −(1/2) ∂{(y − Xβ)'Σ⁻¹(y − Xβ)}/∂β
                = −(1/2) ∂{y'Σ⁻¹y − y'Σ⁻¹Xβ − β'X'Σ⁻¹y + β'X'Σ⁻¹Xβ}/∂β
                = −(1/2) ∂{y'Σ⁻¹y − 2y'Σ⁻¹Xβ + β'X'Σ⁻¹Xβ}/∂β
                = −(1/2) {−2X'Σ⁻¹y + 2X'Σ⁻¹Xβ}.

Setting the derivative to zero yields the maximum likelihood equations for β:

    X'Σ⁻¹y = X'Σ⁻¹Xβ.

If X is of full rank, the maximum likelihood estimator becomes β̂ = (X'Σ⁻¹X)⁻¹X'Σ⁻¹y, which is also the generalized least squares estimator of β (see §4.2 and §A4.8.1).
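A small numerical sketch of this estimator (the data values and the diagonal Σ below are our own assumptions, chosen only for illustration):

proc iml;
   X  = {1 1, 1 2, 1 3, 1 4};
   y  = {1.2, 1.9, 3.1, 3.9};
   S  = diag({1 1 2 2});                    /* assumed heteroscedastic Sigma  */
   Si = inv(S);
   b_gls = inv(t(X)*Si*X) * t(X)*Si*y;      /* (X`S^-1 X)^-1 X`S^-1 y         */
   b_ols = inv(t(X)*X) * t(X)*y;            /* special case Sigma = sigma2*I  */
   print b_gls b_ols;
quit;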
The notation for the parameters is arbitrary; we only require that parameters are denoted with Greek letters. In regression models the symbols β_1, β_2, ..., β_k are traditionally used, while α, ρ, τ, etc., are common in analysis of variance models. When parameters are combined into a parameter vector, the generic symbols β or θ are entertained. As an example, consider the linear regression model

    Y_i = β_0 + β_1·x_1i + β_2·x_2i + ... + β_k·x_ki + e_i,  i = 1, ..., n.   [3.34]

To express the linear regression model for this set of data in matrix/vector notation define:

    Xβ = [1 x_11 x_21 ... x_k1     [β_0     [β_0 + β_1·x_11 + β_2·x_21 + ... + β_k·x_k1
          1 x_12 x_22 ... x_k2   ·  β_1   =  β_0 + β_1·x_12 + β_2·x_22 + ... + β_k·x_k2
          ...                       ...      ...
          1 x_1n x_2n ... x_kn]     β_k]     β_0 + β_1·x_1n + β_2·x_2n + ... + β_k·x_kn].

The variance-covariance matrix of the model errors is

    V = σ²·[1 0 ... 0
            0 1 ... 0
            ...
            0 0 ... 1] = σ²·I_n,

where σ² is the common variance of the model disturbances. The classical linear model with homoscedastic, uncorrelated errors is finally

    Y = Xβ + e,  e ~ (0, σ²I).   [3.36]
The assumption of a Gaussian error distribution is deliberately omitted. Estimating the parameter vector β by least squares does not require a Gaussian distribution. Only if hypotheses about β are tested does a distributional assumption for the errors come into play.

Model [3.34] is a regression model consisting of fixed coefficients only. How would the notation change if the model incorporates effects (classification variables) rather than coefficients? In §1.7.3 it was shown that ANOVA models can be expressed as regression models by constructing appropriate dummy regressors, which associate an observation with elements of the parameter vector. Consider a randomized complete block design with four treatments in three blocks. Written as an effects model (§4.3.1), the linear model for the block design is

    Y_ij = μ + ρ_j + τ_i + e_ij,   [3.37]

where Y_ij is the response in the experimental unit receiving treatment i in block j, ρ_j is the effect of block j = 1, ..., 3, τ_i is the effect of treatment i = 1, ..., 4, and the e_ij are the experimental errors, assumed uncorrelated with mean 0 and variance σ². Define the column vector of parameters

    θ = [μ, ρ_1, ρ_2, ρ_3, τ_1, τ_2, τ_3, τ_4]',

and the response vector Y, design matrix P, and vector of experimental errors as
    Y = [Y_11     P = [1 1 0 0 1 0 0 0     e = [e_11
         Y_21          1 1 0 0 0 1 0 0          e_21
         Y_31          1 1 0 0 0 0 1 0          e_31
         Y_41          1 1 0 0 0 0 0 1          e_41
         Y_12          1 0 1 0 1 0 0 0          e_12
         Y_22          1 0 1 0 0 1 0 0          e_22
         Y_32          1 0 1 0 0 0 1 0          e_32
         Y_42          1 0 1 0 0 0 0 1          e_42
         Y_13          1 0 0 1 1 0 0 0          e_13
         Y_23          1 0 0 1 0 1 0 0          e_23
         Y_33          1 0 0 1 0 0 1 0          e_33
         Y_43],        1 0 0 1 0 0 0 1],        e_43].
As an example, consider

    Y_i = x_1i^α · (β_1 + β_2·x_2i),

a nonlinear model used by Cole (1975) to model forced expiratory volume of humans (Y_i) as a function of height (x_1) and age (x_2). Put x_i = [x_1i, x_2i]' and θ = [α, β_1, β_2]' and add a stochastic element to the model:

    Y_i = f(x_i, θ) + e_i.   [3.40]

To express model [3.40] for the vector of responses Y = [Y_1, Y_2, ..., Y_n]', replace the function f() with vector notation, and remove the index i from its arguments:

    Y = f(x, θ) + e.   [3.41]

[3.41] is somewhat careless notation, since f() is not a vector function. We think of it as the function f() applied to the arguments x_i, θ in turn.
where e_ij denotes the experimental error for replicate i of treatment j, and d_ijk the subsampling error for sample k of replicate i of treatment j. The indices range as follows.

Table 3.1. Indices for the models corresponding to panels (a) to (c) of Figure 2.7

    Panel of Figure 2.7    i           j           k           No. of obs.
    (a)                    1, ..., 16                          16
    (b)                    1, ..., 4   1, ..., 4               16
    (c)                    1, ..., 2   1, ..., 4   1, ..., 2   16

To complete the specification of the error structure we put E[e_i] = E[e_ij] = E[d_ijk] = 0,

    (a)       Cov[e_i, e_j] = σ²·exp{−3h_ij/α}

    (b), (c)  Cov[e_ij, e_kl] = σ_e²  if i = k, j = l
                              = 0     otherwise

    (c)       Cov[d_ijk, d_lmn] = σ_d²  if i = l, j = m, k = n
                                = 0     otherwise

              Cov[e_ij, d_ijk] = 0.
The covariance model for two spatial observations in model (a) is called the exponential model, where h_ij is the Euclidean distance between Y_a,i and Y_a,j. The parameter α measures the range at which observations are (practically) uncorrelated (see §7.5.2 and §9.2.2 for details). In models (b) and (c), the error structure states that experimental and subsampling errors are uncorrelated with variances σ_e² and σ_d², respectively. The variance-covariance matrix in each model is a (16 × 16) matrix. In the case of model (a) we have

    Var[Y_a,i] = Cov[e_i, e_i] = σ²·exp{−3·h_ii/α} = σ²·exp{0} = σ²
    Cov[Y_a,i, Y_a,j] = Cov[e_i, e_j] = σ²·exp{−3·h_ij/α}

    Var[Y_a] = σ²·[1               exp{−3h_12/α}   exp{−3h_13/α}  ...  exp{−3h_1,16/α}
                   exp{−3h_21/α}   1               exp{−3h_23/α}  ...  exp{−3h_2,16/α}
                   exp{−3h_31/α}   exp{−3h_32/α}   1              ...  exp{−3h_3,16/α}
                   ...                                            ...  ...
                   exp{−3h_16,1/α} exp{−3h_16,2/α} exp{−3h_16,3/α} ... 1].
In model (b) the errors are uncorrelated with common variance σ_e², so that

    Var[Y_b] = Var[[Y_b,11    = σ_e²·[1 0 0 0 0 0 ... 0
                    Y_b,12            0 1 0 0 0 0 ... 0
                    Y_b,13            0 0 1 0 0 0 ... 0
                    Y_b,14            0 0 0 1 0 0 ... 0
                    Y_b,21            0 0 0 0 1 0 ... 0
                    Y_b,22            0 0 0 0 0 1 ... 0
                    ...               ...
                    Y_b,44]]          0 0 0 0 0 0 ... 1].

The first four entries of Y_b correspond to the first replicates of treatments 1 to 4 and so forth.
To derive the variance-covariance matrix for model (c), we need to separately investigate the
covariances among observations from the same experimental unit and from different units.
For the former we have
    Cov[Y_c,ijk, Y_c,ijk] = Var[e_ij + d_ijk] = σ_e² + σ_d²

and

    Cov[Y_c,ijk, Y_c,ijl] = Cov[e_ij + d_ijk, e_ij + d_ijl]
                          = Cov[e_ij, e_ij] + Cov[e_ij, d_ijl] + Cov[d_ijk, e_ij] + Cov[d_ijk, d_ijl]
                          = σ_e² + 0 + 0 + 0.
If the elements of Y_c are arranged by grouping observations from the same cluster together, the variance-covariance matrix can be written as

    Var[Y_c] = (σ_e² + σ_d²)·[1 ρ 0 0 0 0 ... 0 0
                              ρ 1 0 0 0 0 ... 0 0
                              0 0 1 ρ 0 0 ... 0 0
                              0 0 ρ 1 0 0 ... 0 0
                              0 0 0 0 1 ρ ... 0 0
                              0 0 0 0 ρ 1 ... 0 0
                              ...              ...
                              0 0 0 0 0 0 ... 1 ρ
                              0 0 0 0 0 0 ... ρ 1],

where ρ = σ_e²/(σ_e² + σ_d²). Each (2 × 2) block along the diagonal collects the observations from a single cluster.
When data are clustered and clusters are uncorrelated but observations within a cluster are correlated, Var[Y] has a block-diagonal structure, and each block corresponds to a different cluster. If data are unclustered (cluster size 1) and uncorrelated, Var[Y] is diagonal. If the data consist of a single cluster of size n (spatial data), the variance-covariance matrix consists of a single block.
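To make the block-diagonal structure concrete, the following proc iml sketch (our illustration; the variance component values are arbitrary assumptions) assembles Var[Y_c] for the model (c) layout of eight clusters of size two:

proc iml;
   sige2 = 3;  sigd2 = 1;                   /* assumed variance components */
   blk = sige2*j(2,2,1) + sigd2*i(2);       /* within-cluster 2x2 block    */
   V   = block(blk,blk,blk,blk,blk,blk,blk,blk);  /* 16x16, 8 clusters     */
   rho = sige2/(sige2 + sigd2);             /* within-cluster correlation  */
   print rho (V[1:4,1:4])[label="upper-left corner of V"];
quit;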
“The usual criticism is that the formulae ... can tell us nothing new, and nothing worth knowing of the biology of the phenomenon. This appears to me to be very ill-founded. In the first place, quantitative expression in place of a vague idea ... is not merely a mild convenience. It may even be a very great convenience, and it may even be indispensable in making certain systematic and biological deductions. But further, it may suggest important ideas as to the underlying processes involved.” Huxley, J.S., Problems of Relative Growth. New York: Dial Press, 1932.
4.1 Introduction
4.2 Least Squares Estimation and Partitioning of Variation
4.2.1 The Principle
4.2.2 Partitioning Variability through Sums of Squares
4.2.3 Sequential and Partial Sums of Squares and the
Sum of Squares Reduction Test
4.3 Factorial Classification
4.3.1 The Means and Effects Model
4.3.2 Effect Types in Factorial Designs
4.3.3 Sum of Squares Partitioning through Contrasts
4.3.4 Effects and Contrasts in The SAS® System
4.4 Diagnosing Regression Models
4.4.1 Residual Analysis
4.4.2 Recursive and Linearly Recovered Errors
4.4.3 Case Deletion Diagnostics
4.4.4 Collinearity Diagnostics
4.4.5 Ridge Regression to Combat Collinearity
Two broad kinds of breakdown of the classical linear model must be distinguished: either

• the model itself does not hold,

or

• least squares estimators of the model parameters are no longer best suited although the model may be correct.

It is the many ways in which model [4.1] breaks down that we discuss here. If the model does not hold we are led to alternative model formulations; in the second case we are led to
alternative methods of parameter estimation. Figure 4.1 is an attempt to roadmap where these
breakdowns will take us. Before engaging nonlinear, generalized linear, linear mixed, non-
linear mixed, and spatial models, this chapter is intended to reacquaint the reader with the
basic concepts of statistical estimation and inference in the classical linear model and to intro-
duce some methods that have gone largely unnoticed in the plant and soil sciences. Sections
§4.2 and §4.3 are largely a review of the analysis of variance and regression methods. The
important sum of squares reduction test that will be used frequently throughout this text is
discussed in §4.2.3. Standard diagnostics for performance of regression models are discussed
in §4.4 along with some remedies for model breakdowns such as ridge regression to combat
collinearity of the regressors (§4.4.5). §4.5 concentrates on diagnosing classification models
with special emphasis on the homogeneous variance assumption. In the sections that follow
we highlight some alternative approaches to statistical estimation that (in our opinion) have
not received the attention they deserve, specifically P" - and M-Estimation (§4.6) and non-
parametric regression (§4.7). Mathematical details on these topics which reach beyond the
coverage in the main text can be found in Appendix A on the CD-ROM (§A4.8).
[Figure 4.1 depicts the classical linear model Y = Xβ + e and three avenues of breakdown: a non-conforming Xβ, a non-conforming e, and a non-conforming Y. For a non-conforming Y it lists: Y is not continuous — Generalized Linear Models (§6), Transformed Linear Models (§4.5, §6); elements of Y correlated — Mixed Models and Clustered Data Models (§7, §8), Spatial Models (§9).]
Figure 4.1. The classical linear Gaussian model and breakdowns of its components that lead
to alternative methods of estimation (underlined) or alternative statistical models (italicized).
Here, μ_ij is the mean pH on plots receiving Lime Type i at Rate of Application j. The e_ijk are experimental errors associated with the k replicates of the ij-th treatment combination.

Of interest to the modeler is to uncover the relationship among the 10 treatment means μ_11, ..., μ_25. Figure 4.2 displays the sample averages of the five replicates for each treatment; a sharp increase of pH with increasing rate of application is apparent for agricultural lime, and a weaker increase for granulated lime.
[Figure: sample average pH (about 5.6 to 6.4) plotted against rate of application (0 to 8 tons) for agricultural and granulated lime.]
Figure 4.2. Sample average pH in soil samples one week after lime application.
The linear model in Example 4.1 is determined by the experimental design. Randomi-
zation ensures independence of the errors, the treatment structure determines which treatment
effects are to be included in the model and if design effects (blocking, replication at different
locations, or time points) were present they, too, would be included in the mean structure.
Analysis focuses on the predetermined questions of interest. For example,
• Do factors Lime Type and Rate of Application interact?
• Are there significant main effects of Lime Type and Rate of Application?
• How can the trend between pH and application rates be modeled and does this trend
depend on which type of lime is applied?
• At which rate of application do the lime types differ significantly in pH?
In the next example, developing an appropriate mean structure is the focal point of the
analysis. The modeler must apply a series of hypothesis tests and diagnostic procedures to
arrive at a final model on which inference and conclusions can be based with confidence.
Example 4.2. Turnip Greens. Draper and Smith (1981, p. 406) list data from a study of Vitamin B₂ content in the leaves of turnip plants (Wakeley, 1949). For each of n = 27 plants, the concentration of B₂ vitamin (milligrams per gram) was measured as the response of interest. Along with the vitamin content, the explanatory variables were measured (Table 4.2). Only three levels of soil moisture were observed for X₂ (2.0, 7.0, 47.4), with nine plants per level, whereas only a few or no duplicate values are available for the variables Sunlight and Air Temperature.

None of the explanatory variables by itself seems very closely related to the vitamin content of the turnip leaves (Figure 4.3). Running separate linear regressions between Y and each of the three explanatory variables, only Soil Moisture seems to explain a significant amount of Vitamin B₂ variation. Should we conclude based on this finding that the amount of sunlight and the air temperature have no effect on the vitamin content of turnip plant leaves? Is it possible that Air Temperature is an important predictor of vitamin B₂ content if we simultaneously adjust for Soil Moisture? Is a linear trend in Soil Moisture reasonable even if it appears to be significant?
Analysis of the Turnip Greens data does not utilize a linear model suggested by the
processes of randomization and experimental control. It is the modeler's task to discover
the importance of the explanatory variables on the response and their interaction with
each other in building a model for these data. We use methods of multiple linear regres-
sion (MLR) to that end. The purposes of an MLR model can be any or all of the
following:
• to predict the outcome of interest for values of the explanatory variables not
in the data set.
[Figure 4.3: scatterplots of vitamin B₂ concentration against each of the explanatory variables Sunlight, Soil Moisture, and Air Temperature.]
Examples 4.1 and 4.2 are analyzed with standard analysis of variance and multiple linear
regression techniques. The parameters of the respective models will be estimated by ordinary
least squares or one of its variations (§4.2.1) because of the efficiency of least squares esti-
mates under standard conditions. Least squares estimates are not necessarily the best esti-
mates. They are easily distorted in a variety of situations. Strong dependencies among the
columns in the regressor matrix X, for example, can lead to numerical instabilities producing
least squares estimates of inappropriate sign, inappropriate magnitude, and of low precision.
Diagnosing and remedying this condition, known as multicollinearity, is discussed in §4.4.4, with additional details in §A4.8.3.

Outlying observations can also exert a strong (negative) influence on a least squares analysis; a single outlier can substantially distort the analysis. Methods resistant and/or robust to outliers were developed decades ago but are applied to agronomic data only infrequently. Deleting suspicious observations from the analysis is a common course of action, but the fact that an observation is outlying does not warrant its removal. Outliers can be the most interesting observations in a set of data and should be investigated with extra thoroughness. Outliers can also be due to a breakdown of an assumed model; in that case the model needs to be changed, not the data. One such breakdown concerns the distribution of the model errors: compared to a Gaussian distribution, outliers occur more frequently if the distribution of the errors is heavy-tailed or skewed. Another model breakdown agronomists should be particularly aware of concerns the presence of block × treatment interactions in randomized complete block designs (RCBD). The standard analysis of an RCBD is not valid if treatment comparisons do not remain constant from block to block. A single observation, often an extreme one, can induce a significant interaction.
Denote the entry for block (row) i and treatment (column) j as Y_ij. Since the data are counts (without natural denominator, §2.2), one may consider the entry in each cell as a realization of a Poisson random variable with mean E[Y_ij]. Poisson random variables with mean greater than 15, say, can be well approximated by Gaussian random variables. The entries in Table 4.3 appear sufficiently large to invoke the approximation. If one analyzes these data with a standard analysis of variance and performs hypothesis tests based on the Gaussian distribution of the model errors, will the observation for treatment 10 in block 4 negatively affect the inference? Are there any other unusual or influential observations? Are there interactions between treatments and blocks? If so, could they be induced by extreme observations? Will the answers to these important questions change if a transformation of the counts is employed?

In §4.5.3 we apply the outlier-resistant method of Median Polishing to study the potential block × treatment interactions, and in §4.6.4 we estimate treatment effects with an outlier-robust method.
Many agronomic data sets are comparatively small, and estimates of residual variation
(mean square errors) rest on only a few degrees of freedom. This is particularly true for
designed experiments but analysis in observational studies may be hampered too. Losing
additional degrees of freedom by removing suspicious observations is then particularly costly
since it can reduce the power of the analysis considerably. An outlier robust method that re-
tains all observations but reduces their negative influence on the analysis is then preferred.
Example 4.4. Prediction Efficiency. Mueller et al. (2001) investigate the accuracy and
precision of mapping spatially variable soil attributes for site-specific fertility manage-
ment. The efficiency of geostatistical prediction via kriging (§9.4) was expressed as the
ratio of the kriging mean square error relative to a whole-field average prediction which
does not take into account the spatial dependency of the data. Data were collected on a ...

[Figure: prediction efficiency (roughly 0 to 50) plotted against the range of spatial correlation for soil attributes including Prec, P, Ca, pH, Mg, CEC, and lime, overlaid with a quadratic trend.]
Figure 4.4. Prediction efficiency for kriging of various soil attributes as a function of
the range of spatial correlation overlaid with predictions from a quadratic polynomial.
Adapted from Figure 1.12 in Mueller (1998). Data kindly provided by Dr. Thomas G.
Mueller, Department of Agronomy, University of Kentucky. Used with permission.
The precision with which observations at unobserved spatial locations can be predicted
based on geostatistical methods (§9.4) is a function of the spatial autocorrelation among
the observations. The stronger the autocorrelation, the greater the precision of spatially
explicit methods compared to whole-field average prediction which does not utilize
spatial information. The degree of autocorrelation is strongly related to the range of the
spatial process. The range is defined as the spatial separation distance beyond which
measurements of an attribute can be considered uncorrelated. It is expected that the
geostatistical efficiency increases with the range of the attribute. This is clearly seen in
Figure 4.4. Finding a model that captures (on average) the dependency between range
and efficiency is of primary interest in this application. In contrast to the Turnip Greens
study (Example 4.2), tests of hypotheses about the relationship between prediction effi-
ciency and covariates are secondary in this study.
• Ordinary least squares (OLS) leads to best linear unbiased estimators in the classical model and minimum variance unbiased estimators if e ~ G(0, σ²I).

Recall the classical linear model Y = Xβ + e where the errors are uncorrelated and homoscedastic, Var[e] = σ²I. The least squares principle chooses as estimators of the parameters β = [β_0, β_1, ..., β_{k−1}]' those values that minimize the sum of the squared residuals

    S(β) = ||e||² = e'e = (Y − Xβ)'(Y − Xβ).   [4.2]

One approach is to set the derivatives of S(β) with respect to β to zero and to solve. This leads to the normal equations X'Y = X'Xβ̂ with solution β̂ = (X'X)⁻¹X'Y, provided X is of full rank k. If X is rank-deficient, a generalized inverse is used instead and the estimator becomes β̂ = (X'X)⁻X'Y.
The calculus approach somewhat disguises the geometric principle behind least squares. The simple identity

    Y = Xβ̂ + (Y − Xβ̂)   [4.3]

decomposes the observed vector Y into the vector of predicted values Xβ̂ and the residual vector Y − Xβ̂.

[Figure 4.5: the geometry of least squares; Y is projected onto the space spanned by the columns of X, Xβ̂ is the projection, and the residual Y − Xβ̂ is orthogonal to it.]
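For completeness, the differentiation step behind the normal equations (spelled out by us, not in the text):

    S(β) = Y'Y − 2β'X'Y + β'X'Xβ
    ∂S(β)/∂β = −2X'Y + 2X'Xβ = 0  ⇒  X'Xβ̂ = X'Y.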
If the classical linear model holds, the least squares estimator enjoys certain optimal properties. It is a best linear unbiased estimator (BLUE) since E[β̂] = β and no other unbiased estimator that is a linear function of Y has smaller variability. These appealing features do not require Gaussianity. It is possible, however, that some other, nonlinear estimator of β would have greater precision. If the model errors are Gaussian, i.e., e ~ G(0, σ²I), the ordinary least squares estimator of β is a minimum variance unbiased estimator (MVUE), extending its optimality beyond those estimators which are linear in Y. We will frequently denote the ordinary least squares estimator as β̂_OLS, to distinguish it from the generalized least squares estimator β̂_GLS that arises if e ~ (0, V), where V is a general variance-covariance matrix. In this case we minimize e'V⁻¹e and obtain the estimator

    β̂_GLS = (X'V⁻¹X)⁻¹X'V⁻¹Y.   [4.5]

Table 4.4 summarizes some properties of the ordinary and generalized least squares estimators. We see that the ordinary least squares estimator remains unbiased even if the error variance is not σ²I. However, Var[a'β̂_OLS] is then typically larger than Var[a'β̂_GLS]; the OLS estimator is less efficient. Additional details about the derivation and properties of β̂_GLS can be found in §A4.8.1. A third case, positioned between ordinary and generalized least squares, arises when V is a diagonal matrix and is termed weighted least squares estimation (WLS). It is the appropriate estimation principle if the model errors are heteroscedastic but uncorrelated. If Var[e] = Diag(σ) = W, where σ is a vector containing the variances of the e_i, then the weighted least squares estimator is β̂_WLS = (X'W⁻¹X)⁻¹X'W⁻¹Y. If V = σ²I, the generalized least squares estimator reduces to the ordinary least squares estimator.
Standard hypothesis tests can be derived based on the Gaussian distribution of the estimator and usually lead to t- or F-tests (§A4.8.2). If the model errors are not Gaussian, the asymptotic distribution of β̂ is Gaussian nevertheless. With sufficiently large sample size one can thus proceed as if Gaussianity of β̂ holds.
and measures the length of the vector. By the orthogonality of the residual vector Y − Xβ̂ and the vector of predicted values Xβ̂ (Figure 4.5), the length of the observed vector Y is related to the lengths of the predictions and residuals by the Pythagorean theorem:

    ||Y||² = ||Y − Xβ̂||² + ||Xβ̂||².   [4.6]

The three terms in [4.6] correspond to the uncorrected total sum of squares (SST = ||Y||²), the residual sum of squares (SSR = ||Y − Xβ̂||²), and the model sum of squares (SSM = ||Xβ̂||²).
Table 4.5. Analysis of variance table for the standard linear model (r(X) denotes rank of X)

    Source              df         SS                     MS
    Model               r(X)       SSM = β̂'X'Y            SSM/r(X)
    Residual (Error)    n − r(X)   SSR = Y'Y − β̂'X'Y      SSR/(n − r(X))
    Uncorrected Total   n          SST = Y'Y
The model sum of squares SSM measures the joint explanatory power of the variables (including the intercept). If the explanatory variables in X were unrelated to the response, we would use Ȳ to predict the mean response, which is the ordinary least squares estimate in an intercept-only model. SSM_m, which measures variability explained beyond an intercept-only model, is thus the appropriate statistic for evaluating the predictive value of the explanatory variables, and we use ANOVA tables in which the pertinent terms are corrected for the mean. The coefficient of determination (R²) is defined as

    R² = SSM_m/SST_m = (β̂'X'Y − nȲ²)/(Y'Y − nȲ²).   [4.8]
Similarly, SSM_m = SS(β_1, ..., β_{k−1}|β_0) is the joint contribution of β_1, ..., β_{k−1} after adjustment for the intercept (correction for the mean, Table 4.6). A partitioning of SSM_m into sequential one degree of freedom sums of squares is

    SSM_m = SS(β_1|β_0)
          + SS(β_2|β_0, β_1)
          + SS(β_3|β_0, β_1, β_2)               [4.9]
          + ...
          + SS(β_{k−1}|β_0, ..., β_{k−2}).

SS(β_2|β_0, β_1), for example, is the sum of squares contribution accounted for by adding the regressor X_2 to a model already containing an intercept and X_1. The test statistic SS(β_2|β_0, β_1)/MSR can be used to test whether the addition of X_2 to a model containing X_1 and an intercept provides significant improvement of fit, and hence is a gauge for the explanatory value of the regressor X_2. If the model errors are Gaussian, SS(β_2|β_0, β_1)/MSR has an F distribution with one numerator and n − r(X) denominator degrees of freedom. Since the sum of squares SS(β_2|β_0, β_1) has a single degree of freedom, we can also express this test statistic as

    F_obs = MS(β_2|β_0, β_1)/MSR,

where MS(β_2|β_0, β_1) = SS(β_2|β_0, β_1)/1 is the sequential mean square. This is a special
case of a sum of squares reduction test. Imagine we wish to test whether adding regressors X_2 and X_3 simultaneously to a model containing X_1 and an intercept improves the model fit. The change in the model sum of squares is calculated as

    SS(β_2, β_3|β_0, β_1) = SS(β_0, β_1, β_2, β_3) − SS(β_0, β_1).

To obtain SS(β_2, β_3|β_0, β_1) we fit a model containing four regressors and obtain its model sum of squares. Call it the full model. Then a reduced model is fit, containing only an intercept and X_1. The difference of the model sums of squares of the two models is the contribution of adding X_2 and X_3 simultaneously. The mean square associated with the addition of the two regressors is MS(β_2, β_3|β_0, β_1) = SS(β_2, β_3|β_0, β_1)/2. Since both models have the same (corrected or uncorrected) total sum of squares we can express the test mean square also in terms of a residual sum of squares difference. This leads us to the general version of the sum of squares reduction test.

Consider a full model M_f and a reduced model M_r, where M_r is obtained from M_f by constraining some (or all) of its parameters. Usually the constraints mean setting one or more parameters to zero, but other constraints are possible, for example, β_1 + β_2 = 6. If (i) SSR_f is the residual sum of squares obtained from fitting the full model, (ii) SSR_r is the respective sum of squares for the reduced model, and (iii) q is the number of parameters constrained in the full model to obtain M_r, then

    F_obs = {(SSR_r − SSR_f)/q} / {SSR_f/(n − r(X_f))}

is the test statistic of the sum of squares reduction test; under Gaussian errors it has an F distribution with q numerator and n − r(X_f) denominator degrees of freedom if the constraints hold. In the special case where all parameters except the intercept are constrained to zero, the numerator sum of squares SSR_r − SSR_f is SSM_m with k − 1 degrees of freedom. See §A4.8.2 for more details on the sum of squares reduction test.
The sequential sum of squares decomposition [4.9] is not unique; it depends on the order in which the regressors enter the model. Regardless of this order, the sequential sums of squares add up to the model sum of squares. This may appear to be an appealing feature, but in practice it is of secondary importance. A decomposition into one degree of freedom sums of squares that does not (necessarily) add up to anything useful, but is much more relevant in practice, is based on partial sums of squares. A partial sum of squares is the contribution made by one explanatory variable in the presence of all other regressors, not only the regressors preceding it. In a four-regressor model (not counting the intercept), for example, the partial sums of squares are

    SS(β_1|β_0, β_2, β_3, β_4)
    SS(β_2|β_0, β_1, β_3, β_4)
    SS(β_3|β_0, β_1, β_2, β_4)
    SS(β_4|β_0, β_1, β_2, β_3).

Partial sums of squares do not depend on the order in which the regressors enter a model and are usually more informative for purposes of hypothesis testing.
Example 4.2 Turnip Greens (continued). Recall the Turnip Greens data on p. 91. We need to develop a model that relates the Vitamin B₂ content in turnip leaves to the explanatory variables Sunlight (X_1), Soil Moisture (X_2), and Air Temperature (X_3). Figure 4.3 on p. 92 suggests that the relationship between vitamin content and soil moisture is probably quadratic, rather than linear. We fit the following multiple regression model to the 27 observations:

    Y_i = β_0 + β_1·x_1i + β_2·x_2i + β_3·x_3i + β_4·x_4i + e_i,

where x_4i = x_2i². Assume the following questions are of particular interest: Does Air Temperature contribute to a model containing the other regressors? Is the quadratic Soil Moisture term needed? Do Sunlight and Air Temperature contribute jointly?

In terms of the model parameters, the questions translate into the hypotheses H_0: β_3 = 0, H_0: β_4 = 0, and H_0: β_1 = β_3 = 0. The full model is the four-regressor model, and the reduced models are given in Table 4.7 along with their residual sums of squares. Notice that the reduced model for the third hypothesis is a quadratic polynomial in soil moisture.
Table 4.7. Residual sums of squares and degrees of freedom of various models along with test statistics for sum of squares reduction test

    Hypothesis       Reduced Model Contains   Residual SS   Residual df   F_obs    p-value
    β_3 = 0          X_1, X_2, X_4            1,031.18      23             5.677   0.0263
    β_4 = 0          X_1, X_2, X_3            2,243.18      23            38.206   0.0001
    β_1 = β_3 = 0    X_2, X_4                 1,180.60      24             4.843   0.0181
    Full Model                                  819.68      22
where q is the difference in residual degrees of freedom between the full and a reduced model. For example, for the test of H_0: β_3 = 0 we have

    F_obs = {(1,031.18 − 819.68)/1} / {819.68/22} = 211.50/37.259 = 5.677.
Using proc glm of The SAS® System, the full model is analyzed with the statements
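The statements themselves are lost in this copy of the text; they presumably resembled the following sketch, in which the data set name turnip and the variable names b2, sunlight, moisture, airtemp, and moist2 are our assumptions:

data turnip;  set turnip;
   moist2 = moisture*moisture;   /* X4 = X2 squared */
run;
proc glm data=turnip;
   model b2 = sunlight moisture airtemp moist2;
run; quit;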
Output 4.1. (abbreviated)

    The GLM Procedure
    [ANOVA table, Type I (sequential) and Type III (partial) sums of squares,
     and parameter estimates with standard errors, t values, and Pr > |t|]
The analysis of variance for the full model leads to SSM_m = 8,330.845, SSR = 819.684, and SST_m = 9,150.529. The mean square error estimate is MSR = 37.258. The four regressors jointly account for 91% of the variability in the vitamin B₂ content of turnip leaves. The test statistic F_obs = MSM_m/MSR = 55.90 (p < 0.0001) is used to test the global hypothesis H_0: β_1 = β_2 = β_3 = β_4 = 0. Since it is rejected, we conclude that at least one of the regressors explains a significant amount of vitamin B₂ variability in the presence of the others.

Sequential sums of squares are labeled Type I SS by proc glm (Output 4.1). For example, SS(β_1|β_0) = 97.749, SS(β_2|β_0, β_1) = 6,779.104. Also notice that the sequential sums of squares add up to SSM_m:

    97.749 + 6,779.104 + 30.492 + 1,423.500 = 8,330.845 = SSM_m.
The partial sums of squares are listed as Type III SS. The partial hypothesis H_0: β_3 = 0 can be tested directly from the output of the fitted full model without actually fitting a reduced model and calculating the sum of squares reduction test. The F_obs statistics shown as F Value and the p-values shown as Pr > F in the Type III SS table are partial tests of the individual parameters. Notice that the last sequential and the last partial sum of squares are identical; the contribution of the regressor added last is the same in either decomposition.

The last table of Output 4.1 shows the parameter estimates and their standard errors. For example, β̂_0 = 119.571, β̂_1 = 0.0337, and so forth. The t_obs statistics shown in the column t Value are the ratios of a parameter estimate and its estimated standard error, t_obs = β̂_j/ese(β̂_j). The two-sided t-tests are identical to the partial F-tests in the Type III SS table. See §A4.8.2 for the precise correspondence between partial F-tests for single variables and the t-tests.
In the Turnip Greens example, sequential and partial sums of squares are not identical. If they were, it would not matter in which order the regressors enter the model. When should we expect the two sets of sums of squares to be the same? This is the case, for example, when the explanatory variables are orthogonal, which requires that the inner products of pairs of columns of X are zero. The sum of squares contribution of one regressor then does not depend on whether the other regressors are in the model. Sequential and partial sums of squares may also coincide under conditions less stringent than orthogonality of the columns of X. In classification (ANOVA) models where the columns of X consist of dummy (design) variables, the sums of squares coincide if the data exhibit a certain balance. Hinkelmann and Kempthorne (1994, pp. 87-88) show that for the two-way classification without interaction, a sufficient condition is equal frequencies (replication) of the factor-level combinations. An analysis of variance table where sequential and partial sums of squares are identical is termed an orthogonal ANOVA. There
are other differences in the sum of squares partitioning between regression and analysis of
variance models. Most notably, in ANOVA models we usually test subsets rather than indi-
vidual parameters and the identity of the parameters in subsets reflects informative structural
relationships among the factor levels. The single degree of freedom sum of squares parti-
tioning in ANOVA classification models can be accomplished with sums of squares of ortho-
gonal contrasts, which are linear combinations of the model parameters. Because of these
subtleties and the importance of ANOVA models in agronomic data analysis, we devote §4.3
to classification models exclusively.
Sequential sums of squares are of relatively little interest unless there is a natural order in which the various explanatory variables or effects should enter the model. A case in point are polynomial regression models, where regressors reflect successively higher powers of a single variable. Consider the cubic polynomial

    Y_i = β_0 + β_1·x_i + β_2·x_i² + β_3·x_i³ + e_i

with sequential sums of squares SS(β_1|β_0), SS(β_2|β_0, β_1), and SS(β_3|β_0, β_1, β_2). Read from the bottom, these sums of squares can be used to answer the questions: Does the cubic term improve a model containing a quadratic polynomial? Does the quadratic term improve a model containing a linear trend? Does the linear trend improve over an intercept-only model?
Statisticians and practitioners do not perfectly agree on how to build a final model based on the answers to these questions. One school of thought is that if an interaction is found significant, the associated main effects (lower-order terms) should also be included in the model. Since we can think of the cubic term as the interaction between the linear and quadratic terms (x³ = x·x²), if x³ is found to make a significant contribution in the presence of x and x², these terms would not be tested further and remain in the model. The second school of thought tests all terms individually and retains only the significant ones. A third-order term without the linear or quadratic term in the model is then possible. In observational studies where the regressors are rarely orthogonal, we adopt the first philosophy. In designed experiments where the treatments are levels on a continuous scale, such as a rate of application, we prefer the second school of thought. If the design matrix is orthogonal we can then easily test the significance of a cubic term independently of the linear or quadratic terms using orthogonal polynomial contrasts.
• The means model expresses observations as random deviations from the cell means; its design matrix is of full rank.

• The effects model decomposes cell means into grand mean, main effects, and interactions; its design matrix is deficient in rank.

Two equivalent ways of representing a classification model are termed the means and the effects model. We prefer effects models in general, although on the surface means models are simpler; the study of main effects and interactions, which are of great concern in classification models, is more transparent in effects models. In the two-way classification with replications (e.g., Example 4.1) an observation Y_ijk for the k-th replicate of Lime Type i and Rate j can be expressed as a random deviation from the mean of that particular treatment combination:

    Y_ijk = μ_ij + e_ijk.   [4.11]

Here, e_ijk denotes the experimental error associated with the k-th replicate of the i-th lime type and j-th rate of application {i = 1, ..., a = 2; j = 1, ..., b = 5; k = 1, ..., r = 5}, and μ_ij denotes the mean pH of an experimental unit receiving lime type i at rate j; hence the name means model. To finish the model formulation, assume the e_ijk are uncorrelated random variables with mean 0 and common variance σ². Model [4.11] can then be written in matrix-vector notation as Y = Xμ + e.
The (5 × 1) vector Y_11, for example, contains the replicate observations for lime type 1 at the first rate of application. Since X'X is diagonal, its inverse is

    (X'X)⁻¹ = (1/5)·I_10 = Diag(0.2, 0.2, ..., 0.2),

a (10 × 10) diagonal matrix with 0.2 in every diagonal cell.
The ordinary least squares estimate of μ is thus simply the vector of sample means in the a·b = 10 groups:

    μ̂ = (X'X)⁻¹X'y = [ȳ_11·, ȳ_12·, ..., ȳ_15·, ȳ_21·, ȳ_22·, ..., ȳ_25·]'.
A different parameterization of the two-way model can be derived if we think of the μ_ij as cell means in the body of a two-dimensional table in which the factors are cross-classified (Table 4.8). The row and column averages of this table are denoted μ_i· and μ_·j, where the dot replaces the index over which averaging is carried out. We call μ_i· and μ_·j the marginal means for factor levels i of Lime Type and j of Rate of Application, respectively, since they occupy positions in the margin of the table (Table 4.8).

Table 4.8. Cell and marginal means in Lime Application (Example 4.1)

                            Rate of Application
                        0     1     2     4     8
    Agricultural lime   μ_11  μ_12  μ_13  μ_14  μ_15   μ_1·
    Granulated lime     μ_21  μ_22  μ_23  μ_24  μ_25   μ_2·
                        μ_·1  μ_·2  μ_·3  μ_·4  μ_·5

Marginal means are arithmetic averages of cell means (Yandell, 1997, p. 109), even if the data are unbalanced:

    μ_i· = (1/b)·Σ_{j=1}^{b} μ_ij,    μ_·j = (1/a)·Σ_{i=1}^{a} μ_ij.

The grand mean μ is defined as the average of all cell means, μ = Σ_i Σ_j μ_ij/(a·b). To construe marginal means as weighted means, where weighing would take into account the numbers of observations n_ij for particular cells, would define population quantities as functions of the sample design. The cell means can be decomposed as

    μ_ij = μ + (μ_i· − μ) + (μ_·j − μ) + (μ_ij − μ_i· − μ_·j + μ)
         = μ + α_i + β_j + (αβ)_ij.   [4.13]
Models based on this decomposition are termed effects models, since α_i = (μ_i· − μ) measures the effect of the i-th level of factor A, β_j = (μ_·j − μ) the effect of the j-th level of factor B, and (αβ)_ij their interaction. The nature and precise interpretation of main effects and interactions is studied in more detail in §4.3.2. For now we notice that the effects obey certain constraints by construction:

    Σ_{i=1}^{a} α_i = 0
    Σ_{j=1}^{b} β_j = 0                            [4.14]
    Σ_{i=1}^{a} (αβ)_ij = Σ_{j=1}^{b} (αβ)_ij = 0.

A two-way factorial layout coded as an effects model can be expressed as a sum of separate vectors. We obtain

    Y = 1μ + X_α·α + X_β·β + X_αβ·(αβ) + e,   [4.15]

where the matrix X_αβ is the same as the X matrix in [4.12]. Although the latter is nonsingular, it is clear that in the effects model the complete design matrix P = [1, X_α, X_β, X_αβ] is rank-deficient. The columns of X_α, the columns of X_β, and the columns of X_αβ each sum to 1. This results from the linear constraints in [4.14].
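The rank deficiency is easy to confirm numerically. In this proc iml sketch (our illustration) for a small 2 × 2 layout with one replicate, P has 1 + 2 + 2 + 4 = 9 columns but only rank 4:

proc iml;
   one = j(4,1,1);
   Xa  = {1 0, 1 0, 0 1, 0 1};                    /* factor A dummies     */
   Xb  = {1 0, 0 1, 1 0, 0 1};                    /* factor B dummies     */
   Xab = {1 0 0 0, 0 1 0 0, 0 0 1 0, 0 0 0 1};    /* interaction dummies  */
   P   = one || Xa || Xb || Xab;
   r   = round(trace(ginv(P)*P));                 /* rank of P            */
   print (ncol(P))[label="columns"] r[label="rank"];
quit;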
We now proceed to define various effect types based on the effects model.
• Simple effects are comparisons of the cell means where one factor is held
fixed.
• Main effects are contrasts among the marginal means and can be expressed
as averages of simple effects.
Hypotheses can be expressed in terms of relationships among the cell means μ_ij or in terms of the effects α_i, β_j, and (αβ)_ij in [4.13]. These relationships are classified into the following
categories (effects):
• simple effects
• interaction effects
• main effects, and
• simple main effects
A simple effect is the most elementary comparison. It is a comparison of the μ_ij where one of the factors is held fixed. For example, μ_13 − μ_23 is a comparison of lime types at rate 2 tons, and μ_22 − μ_23 is a comparison of the 1 and 2 ton application rates for granulated lime. By comparisons we do not just have pairwise tests in mind but, more generally, contrasts. A contrast is a linear function of parameters in which the coefficients of the linear function sum to zero. A simple effect of application rate for agricultural lime (i = 1) is a contrast among the μ_1j,

    ℓ = Σ_{j=1}^{b} c_j·μ_1j,

where Σ_{j=1}^{b} c_j = 0. The c_j are called the contrast coefficients. The simple effect μ_22 − μ_23 has contrast coefficients (0, 1, −1, 0, 0).
Interaction effects are contrasts among simple effects. Consider ℓ_1 = μ_11 − μ_21, ℓ_2 = μ_12 − μ_22, and ℓ_3 = μ_13 − μ_23, which are simple Lime Type effects at rates 0, 1, and 2 tons, respectively. The contrast

    ℓ_4 = 1·ℓ_1 − 2·ℓ_2 + 1·ℓ_3

is an interaction effect. It tests whether the difference in pH between lime types changes with the rate of application.
The first of these comparisons is called the slice of the Lime × Rate interaction at lime type 1, and the second the slice of the Lime × Rate interaction at lime type 2. Similarly, slices can be formed at the various application rates. How do slices relate to simple and main effects? Kirk (1995, p. 377) defines as simple main effects comparisons of the type μ_ik − μ_i· and μ_kj − μ_·j. Consider the first case. If all simple main effects are identical, then

    μ_i1 − μ_i· = μ_i2 − μ_i· = ... = μ_ib − μ_i·,

which implies μ_i1 = μ_i2 = ... = μ_ib. This is the slice at the i-th level of factor A. Schabenberger, Gregoire, and Kong (2000) made this correspondence between the various effects in a factorial structure more precise in terms of matrices and vectors. They also proved the following result, noted earlier by Winer (1971, p. 347). If you assemble all possible slices of A × B by the levels of A (for example, the two slices in [4.17]), the sum of squares associated with this assembly is identical to the sum of squares for the B main effect plus that for the A × B interaction.
We do not distinguish between balanced and unbalanced cases, since the definition of population quantities ℓ_1 and ℓ_2 should be independent of the sample design. A complete set of orthogonal contrasts for a factor with p levels is any set of (p − 1) contrasts in which the members are mutually orthogonal. We can always fall back on the following complete set:

    [1 −1  0  0 ...  0
     1  1 −2  0 ...  0
     ...
     1  1  1  1 ... −(p − 1)]    ((p−1) × p).   [4.18]

For the factor Lime Type with only two levels a complete set contains only a single contrast with coefficients 1 and −1; for the factor Rate of Application the set would be

    [1 −1  0  0  0
     1  1 −2  0  0
     1  1  1 −3  0
     1  1  1  1 −4].
The sum of squares for a contrast among the marginal means of application rate is calculated as

    SS(ℓ) = ℓ̂² / Σ_{j=1}^{b} (c_j²/n_·j),

and if contrasts are orthogonal, their sums of squares contributions are additive. If the contrast sums of squares for any four orthogonal contrasts among the marginal Rate means μ_·j are added, the resulting sum of squares is that of the Rate main effect. Using the generic complete set above, let the contrasts be

    ℓ_1 = μ_·1 − μ_·2
    ℓ_2 = μ_·1 + μ_·2 − 2μ_·3
    ℓ_3 = μ_·1 + μ_·2 + μ_·3 − 3μ_·4
    ℓ_4 = μ_·1 + μ_·2 + μ_·3 + μ_·4 − 4μ_·5.

The contrast sums of squares are calculated in Table 4.9. The only contrast defining the Lime Type main effect is ℓ = μ_1· − μ_2·.
Table 4.9. Main effects contrasts for Rate of Application and Lime Type (n_·j = 10, n_i· = 25)

    Marginal mean                      Estimate        Σc²/n   SS(ℓ)
    Rate 0 tons   ȳ_·1 = 5.733        ℓ̂_1 = −0.094    0.2     0.044
    Rate 1 ton    ȳ_·2 = 5.827        ℓ̂_2 = −0.202    0.6     0.068
    Rate 2 tons   ȳ_·3 = 5.881        ℓ̂_3 = −0.633    1.2     0.334
    Rate 4 tons   ȳ_·4 = 6.025        ℓ̂_4 = −1.381    2.0     0.954
    Rate 8 tons   ȳ_·5 = 6.212
    Agricultural lime  ȳ_1· = 6.038   ℓ̂  =  0.205     0.08    0.525
    Granulated lime    ȳ_2· = 5.833

The Rate main effect sum of squares is thus SS(Rate) = 0.044 + 0.068 + 0.334 + 0.954 = 1.40, and that for the Lime Type main effect is SS(Lime) = 0.525.
The contrast set from which to obtain the interaction sum of squares is constructed by unfolding the main effects contrasts to correspond to the cell means and multiplying the contrast coefficients element by element; since there are four contrasts defining the Rate main effect and one defining the Lime Type main effect, there are four such interaction contrasts. Sums of squares of the interaction contrasts are calculated as linear functions of the cell means (μ_ij). The divisor of 5 in these calculations stems from the fact that each cell mean is estimated as the arithmetic average of the five replications for that treatment combination. The interaction sum of squares is then finally

    SS(Lime × Rate) = SS(ℓ_1 × ℓ) + SS(ℓ_2 × ℓ) + SS(ℓ_3 × ℓ) + SS(ℓ_4 × ℓ).
This procedure of calculating main effects and interaction sums of squares is cumbersome. It demonstrates, however, that a main effect with (a − 1) degrees of freedom can be partitioned into (a − 1) mutually orthogonal single degree of freedom contrasts. It also highlights why there are (a − 1)(b − 1) degrees of freedom in the interaction between two factors with a and b levels, respectively: the interaction contrasts are obtained by crossing (a − 1) and (b − 1) main effects contrasts. In practice one obtains the sums of squares of contrasts and the sum of squares partitioning into main effects and interactions with a statistical computing package.
In the proc glm model statement, lime refers to the main effects α_i of factor Lime Type, rate to the main effects of factor Rate of Application, and lime*rate to the (αβ)_ij interaction terms. In Output 4.2 we find sequential (Type I) and partial (Type III) sums of squares and tests for the three effects listed in the model statement. Because the design is orthogonal, the two groups of sums of squares are identical. By default, proc glm produces these tests for any term listed on the right-hand side of the model statement. Also, we obtain SSR = 0.12885 and MSR = 0.00322. The main effects of Lime Type and Rate of Application are shown as sources LIME and RATE on the output. The difference between our calculation of SS(Rate) = 1.40 and the calculation by SAS® (SS(Rate) = 1.398) is due to round-off errors. There is a significant interaction between the two factors (F_obs = 24.12, p < 0.0001). Neither of the main effects is masked. Since the trends in application rates do not cross (Figure 4.2, p. 89), this is to be expected.
Output 4.2.
The GLM Procedure
Number of observations 50
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 9 2.23349200 0.24816578 77.04 <.0001
Error 40 0.12885000 0.00322125
Corrected Total 49 2.36234200
We now reconstruct the main effects and interaction tests with contrasts for expository purposes. Recall that an A factor main effect contrast is a contrast among the marginal means μ_i· and a B factor main effect is a contrast among the marginal means μ_·j. In terms of the effects model we obtain

    Σ_j c_j·μ_·j = Σ_j c_j · (1/a)·Σ_{i=1}^{a} {μ + α_i + β_j + (αβ)_ij}
                 = Σ_j c_j·β_j + (1/a)·Σ_j Σ_i c_j·(αβ)_ij,

since the c_j sum to zero and the α_i obey the constraints [4.14]. Fortunately, The SAS® System does not require that we specify coefficients for effects which contain other effects for which coefficients are given. For main effect contrasts it is sufficient to specify the contrast coefficients for lime or rate; the interaction coefficients c_i/b and c_j/a are assigned automatically. Using the generic contrast set [4.18] for the main effects and the unfolded contrast coefficients for the interaction in Table 4.10, the following contrast statements in proc glm add Output 4.3 to Output 4.2.
Output 4.3.
Contrast DF Contrast SS Mean Square F Value Pr > F
For interaction contrasts, in contrast, the user must specify the c_j coefficients for the lime*rate terms (and, where applicable, for the β_j) explicitly. Genuine interaction effects will involve only the (αβ)_ij terms.
For the lime requirement data we now return to the research questions raised in the introduction:
①: Do Lime Type and Application Rate interact?
②: Are there main effects of Lime Type and Application Rate?
③: Is there a difference between lime types at the 1 ton application rate?
④: Does the difference between lime types depend on whether 1 or 2 tons are applied?
⑤: How does the comparison of lime types change with application rate?
Questions ① and ② refer to the interaction and the main effects of the two factors. Although we have seen in Output 4.3 how to obtain the main effects with contrasts, it is of course much simpler to locate the particular sources in the Type III sum of squares table. ③ is a simple effect ([4.19]), ④ an interaction effect ([4.20]), and ⑤ slices the Lime Type × Rate interaction by application rates. The proc glm statements are as follows (Output 4.4).

proc glm data=limereq;
  class lime rate;
  model aph = lime rate lime*rate;                    /* ① and ② */
  contrast 'Lime at 1 ton (C)'
     lime 1 -1 lime*rate 0 1 0 0 0 0 -1 0 0 0;        /* ③ */
  contrast 'Lime effect at 1 vs. 2 (D)'
     lime*rate 0 1 -1 0 0 0 -1 1 0 0;                 /* ④ */
  lsmeans lime*rate / slice=(rate);                   /* ⑤ */
run; quit;
In specifying contrast coefficients for an effect in SAS® , one should pay attention to (i)
the Class Level Information table printed at the top of the output (see Output 4.2) and (ii)
the order in which the factors are listed in the class statement. The Class Level
Information table depicts how the levels of the factors are ordered internally. If factor
variables are character variables, the default ordering of the levels is alphabetical, which may
be counterintuitive in the assignment of contrast coefficients (for example, “0 tons/acre”
appears before “100 tons/acre” which appears before “50 tons/acre”).
Output 4.4.
Number of observations 50

                       Sum of
RATE        DF        Squares    Mean Square    F Value    Pr > F
The order in which variables are listed in the class statement determines how SAS® organizes the cell means. The subscript of the factor listed first varies slower than the subscript of the factor listed second, and so forth. The class statement

class lime rate;

results in cell means ordered μ₁₁, μ₁₂, μ₁₃, μ₁₄, μ₁₅, μ₂₁, μ₂₂, μ₂₃, μ₂₄, μ₂₅. Contrast coefficients are assigned in the same order to the lime*rate effect in the contrast statement. If the class statement were

class rate lime;

the cell means would be ordered μ₁₁, μ₂₁, μ₁₂, μ₂₂, μ₁₃, μ₂₃, μ₁₄, μ₂₄, μ₁₅, μ₂₅ and the arrangement of contrast coefficients for the lime*rate effect changes accordingly.
Slicing an interaction holds the levels of one or more factors fixed so that only a single factor is being compared at each combination of the other factors.
The table of effect slices at the bottom of Output 4.4 conveys no significant difference among lime types at 0 tons (F_obs = 0.38, p = 0.5434), but the F statistics increase with increasing rate of application. This represents the increasing separation of the trends in Figure 4.2. Since factor Lime Type has only two levels, the contrast comparing lime types at 1 ton is identical to the slice at that rate (F_obs = 4.72, p = 0.0358).
One could perform slices of Lime Type × Rate in the other direction, comparing the rates of application at each lime type. We notice, however, that the rate of application is a quantitative factor. It is thus more meaningful to test the nature of the trend between pH and Rate of Application with regression contrasts (orthogonal polynomials). Since five rates were applied we can test for quartic, cubic, quadratic, and linear trends. Published tables of orthogonal polynomial coefficients require that the levels of the factor are evenly spaced and that the data are balanced. In this example, the factor levels are unevenly spaced. The correct contrast coefficients can be calculated with the OrPol() function of SAS/IML®. The %orpoly macro contained on the CD-ROM finds coefficients up to the eighth degree for any particular factor level spacing. When the macro is executed for the factor level spacing 0, 1, 2, 4, 8, the coefficients shown in Output 4.5 result.
data rates;
input levels @@;
datalines;
0 1 2 4 8
;
run;
%orpoly(data=rates,var=levels);
Output 4.5.
linear quadratic cubic quartic
The coefficients are fractional numbers. Sometimes they can be converted to values that are easier to code using the following trick. Since the contrast F statistic is unaffected by a rescaling of the contrast coefficients, we can multiply the coefficients of a contrast by an arbitrary constant. Dividing the coefficients by the smallest coefficient for each trend yields the coefficients in Table 4.11.
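The coefficients can also be generated directly in SAS/IML® with the OrPol() function mentioned above; a minimal sketch for the spacing 0, 1, 2, 4, 8 (the column labels are ours):

proc iml;
  levels = {0, 1, 2, 4, 8};      /* unequally spaced application rates */
  C = orpol(levels, 4);          /* orthogonal polynomial coefficients up to degree 4 */
  print C[colname={"const" "linear" "quadr" "cubic" "quart"}];
quit;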
Since the factors interact we test the order of the trends for agricultural lime (AL) and
granulated lime (GL) separately by adding to the proc glm code (Output 4.6) the statements
contrast 'rate quart (AL)' rate 21 -64 56 -14 1
lime*rate 21 -64 56 -14 1;
contrast 'rate cubic (AL)' rate -4.5 4.05 4.95 -5.5 1
lime*rate -4.5 4.05 4.95 -5.5 1;
contrast 'rate quadr.(AL)' rate 15.5 1 -9.5 -18.5 11.5
lime*rate 15.5 1 -9.5 -18.5 11.5;
contrast 'rate linear(AL)' rate -3 -2 -1 1 5
lime*rate -3 -2 -1 1 5;
Output 4.6.
Contrast DF Contrast SS Mean Square F Value Pr > F
For either lime type, one concludes that the trend of pH is linear in application rate at the 5% significance level. For agricultural lime a slight quadratic effect is noticeable. The interaction between the two factors should be evident in a comparison of the linear trends between the lime types. This is accomplished with the contrast statement

contrast 'Linear(AL vs. GL)'
   lime*rate -3 -2 -1 1 5  3 2 1 -1 -5;

The linear trends are significantly different (F_obs = 90.36, p < 0.0001). From Figure 4.2 this is evidently due to differences in the slopes of the two lime types.
Diagnosing the model and its agreement with a particular data set is an essential step in
developing a good statistical model and sound inferences. Estimating model parameters and
drawing statistical inferences must be accompanied by sufficient criticism of the model. This
criticism should highlight whether the assumptions of the model are met and if not, to what
degree they are violated. Key assumptions of the classical linear model [4.1] are
• correctness of the model (E[e] = 0) and
• homoscedastic, uncorrelated errors (Var[e] = σ²I).
Often the assumption of Gaussian errors is added and must be diagnosed, too, not
because least squares estimation requires it, but to check the validity of exact inferences. The
importance of these assumptions, in terms of complications introduced into the analysis by
their violation, follows roughly the same order.
• The raw residual ê_i = y_i − ŷ_i is not a good diagnostic tool, since it does not mimic the behavior of the model disturbances e_i. The n residuals are correlated, heteroscedastic, and do not constitute n pieces of information.
Since the model residuals (errors) e are unobservable, it seems natural to focus model criticism on the fitted residuals ê = y − ŷ = y − Xβ̂. The fitted residuals do not behave exactly as the model residuals. If β̂ is an unbiased estimator of β, then E[ê] = E[e] = 0, but the fitted residuals are neither uncorrelated nor homoscedastic. We can write ê = y − ŷ = y − Hy, where

H = X(X′X)⁻¹X′

is called the "hat" matrix since it produces the ŷ values when postmultiplied by y (see §A4.8.3 for details). The hat matrix is a symmetric, idempotent matrix, which implies that

H′ = H,  HH = H,  and  (I − H)(I − H) = (I − H).

H is not a diagonal matrix and the entries of its diagonal are not equal. The fitted residuals thus are neither uncorrelated nor homoscedastic. Furthermore, I − H is a singular matrix of rank n − r(X). If one fits a standard regression model with intercept and k − 1 regressors, only n − k least squares residuals carry information about the model disturbances; the remaining residuals are redundant. In classification models, where the rank of X can be large relative to the sample size, only a few residuals are nonredundant.
Because of their heteroscedastic nature, we do not recommend using the raw residuals ê_i to diagnose model assumptions. First, the residual should be properly scaled. The ith diagonal value of H is called the leverage of the ith data point. Denote it h_ii; it follows that

Var[ê_i] = σ²(1 − h_ii).   [4.22]

Standardized residuals have mean 0 and variance 1 and are obtained as ê_i/(σ√(1 − h_ii)). Since the variance σ² is unknown, σ is replaced by its estimate σ̂, the square root of the model mean square error. The residual

r_i = ê_i / (σ̂√(1 − h_ii))   [4.23]

is called the studentized residual and is a more appropriate diagnostic measure. If the model errors e_i are Gaussian, the scale-free studentized residuals are akin to t random variables, which justifies — to some extent — their use in diagnosing outliers in regression models (see below and Myers 1990, Ch. 5.3). Plots of r_i against the regressor variables can be used to diagnose the equal variance assumption and the need to transform regressor variables. A graph of r_i against the fitted values ŷ_i highlights the correctness of the model. In either type of plot the residuals should appear as a stable band of random scatter around a horizontal line at zero.
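In SAS®, the studentized residuals r_i of [4.23] and the fitted values can be saved with the output statement of proc reg and then plotted. A minimal sketch for the Turnip Greens model (variable names as in the ridge regression code later in this section; the data set and plot names are ours):

proc reg data=turnip;
  model vitamin = sun moist airtemp x4;
  output out=diag predicted=yhat student=rstud;   /* r_i of [4.23] */
run; quit;
proc gplot data=diag;
  plot rstud*yhat / vref=0;    /* residuals should scatter randomly about zero */
run; quit;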
Figure 4.6. Studentized residuals in the turnip green data analysis for the model y_i = β₀ + β₁x_{1i} + β₂x_{2i} + β₃x_{3i} (full circles) and y_i = β₀ + β₁x_{1i} + β₂x_{2i} + β₃x_{3i} + β₄x²_{2i} (open circles). Panels plot the studentized residuals against the predicted value, air temperature, soil moisture, and sunlight.
For the Turnip Greens data (Example 4.2), studentized residuals for the model with and without the squared term for soil moisture are shown in Figure 4.6. The quadratic trend in the residuals for soil moisture is obvious if X₄ = X₂² is omitted.
A problem of the residual by covariate plot is the interdependency of the covariates. The
plot of <3 vs. Air Temperature, for example, assumes that the other variables are held con-
stant. But changing the amount of sunlight obviously changes air temperature. How this col-
linearity of the regressor variables impacts not only the interpretation but also the stability of
the least squares estimates is addressed in §4.4.4.
To diagnose the assumption of Gaussian model errors one can resort to graphical tools such as normal quantile and normal probability plots, to formal tests for normality, for example, the tests of Shapiro and Wilk (1965) and Anderson and Darling (1954), or to goodness-of-fit tests based on the empirical distribution function. The normal probability plot is a plot of the ranked studentized residuals (ordinate) against the expected value of the ith smallest value in a random sample of size n from a standard Gaussian distribution (abscissa). This expected value of the ith smallest value can be approximated by the (p × 100)th N(0,1) percentile z_p. The reference line in a normal probability plot has intercept −μ/σ and slope 1/σ, where μ and σ² are the mean and variance of the Gaussian reference distribution. If the reference is the N(0,1), deviations of the points in a normal probability plot from the straight line through the origin with slope 1.0 indicate the magnitude of the departure from Gaussianity. Myers (1990, p. 64) shows how certain patterned deviations from Gaussianity can be diagnosed in this plot. For studentized residuals the 45°-line is the correct reference since these residuals have mean 0 and variance 1. For the raw residuals the 45°-line is not the correct reference since their variance is not 1. The normal probability plot of the studentized residuals for the Turnip Greens analysis suggests model disturbances that are symmetric but less heavy in the tails than a Gaussian distribution.
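Normal probability plots and formal normality tests for the studentized residuals can be obtained, for example, with proc univariate; a sketch, assuming the data set diag created above:

proc univariate data=diag normal;     /* normal requests tests such as Shapiro-Wilk */
  var rstud;
  qqplot rstud / normal(mu=0 sigma=1);  /* the 45-degree line is the correct reference */
run;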
Figure 4.7. Normal probability plot of studentized residuals (ordinate) against standard Gaussian quantiles (abscissa) for the full model in the Turnip Greens analysis.
Plots of raw or studentized residuals against regressor or fitted values are helpful to
visually assess the quality of a model without attaching quantitative performance measures.
Our objection concerns using these types of residuals for methods of residual analysis that
predicate a random sample, i.e., independent observations. These methods can be graphical
(such as the normal probability plot) or quantitative (such as tests of Gaussianity). Such
diagnostic tools should be based on n − k residuals that are homoscedastic and uncorrelated.
These can be constructed as recursive or linearly recovered errors.
Recursive residuals are obtained by choosing an initial set of k data points and fitting the model to these points. The remaining n − k data points are then entered sequentially, and the jth recursive residual is the scaled difference between the next observation and its prediction from the model fit to the preceding observations. More formally, let X_{j−1} be the matrix consisting of the first j − 1 rows of X. If X′_{j−1}X_{j−1} is nonsingular and j ≥ k + 1, the parameter vector β can be estimated as

β̂_{j−1} = (X′_{j−1}X_{j−1})⁻¹X′_{j−1}y_{j−1}.   [4.24]

Now consider adding the next observation y_j. Define as the (unstandardized) recursive residual w_j the difference between y_j and the predicted value based on fitting the model to the preceding observations,

w_j = y_j − x′_jβ̂_{j−1}.

Finally, scale w_j and define the recursive residual (Brown et al. 1975) as

w*_j = w_j / √(1 + x′_j(X′_{j−1}X_{j−1})⁻¹x_j),   j = k + 1, …, n.   [4.25]

The w*_j are independent random variables with mean 0 and variance Var[w*_j] = σ², just as the model disturbances. However, only n − k of them are available. Recursive residuals are unfortunately not unique. They depend on the set of k points chosen initially and on the order of the remaining data. It is not at all clear how to compute the best set of recursive residuals.
Because data points with high leverage can have a negative impact on the analysis, one possibility is to order the data by leverage to circumvent calculation of recursive residuals for potentially influential data points. In other circumstances, for example, when the detection of outliers is paramount, one may want to produce recursive residuals for precisely these observations. On occasion a particular ordering of the data suggests itself, for example, if a covariate relates to time or distance. A second complication of initially fitting the model to k observations is the need for X′_{j−1}X_{j−1} to be nonsingular. When columns of X contain repeat values, as in models with classification variables, more than k observations may be required before the cross-product matrix becomes nonsingular.
Figure 4.8. Normal probability plots of studentized and recursive residuals for simulated data, plotted against standard Gaussian quantiles. Solid line represents the standard Gaussian distribution.
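A small SAS/IML® sketch makes the sequential construction [4.24]–[4.25] concrete; the simulated data and all names are ours:

proc iml;
  call randseed(42);
  n = 20;  k = 2;
  x1 = j(n,1,.);  call randgen(x1, "Uniform");
  e  = j(n,1,.);  call randgen(e, "Normal");
  y  = 2 + 3*x1 + e;                        /* simulated responses */
  X  = j(n,1,1) || x1;                      /* intercept plus one regressor */
  w  = j(n-k,1,.);                          /* recursive residuals */
  do jj = k+1 to n;
     Xp = X[1:jj-1,];  yp = y[1:jj-1];      /* first j-1 rows */
     b  = inv(Xp`*Xp)*Xp`*yp;               /* beta-hat of [4.24] */
     xj = X[jj,];
     w[jj-k] = (y[jj] - xj*b) / sqrt(1 + xj*inv(Xp`*Xp)*xj`);  /* [4.25] */
  end;
  print w;
quit;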
The name reflects that R(t) is a linear function of Y, has the same expectation as the model disturbances (E[R] = E[e] = 0), and has a scalar covariance matrix (Var[R(t)] = σ²I). In contrast to recursive residuals it is not necessary to fit the model to an initial set of k observations, since the process is not sequential. This allows the recovery of t = n − k uncorrelated, homoscedastic residuals in the presence of classification variables. The error recovery proceeds as follows. Let M = I − H and recall that Var[ê] = σ²M. If one premultiplies ê with a matrix Q′ such that

Var[Q′ê] = σ² [ I_t  0 ]
               [ 0   0 ],

then the first t elements of Q′ê are the LUS estimates of e. Does such a matrix Q exist? This is indeed the case and, unfortunately, there are many such matrices. The spectral decomposition of a real symmetric matrix A is PDP′ = A, where P is an orthogonal matrix containing the ordered eigenvectors of A, and D is a diagonal matrix containing the ordered eigenvalues of A. Since M is symmetric and idempotent (a projector), it has a spectral decomposition. Furthermore, since the eigenvalues of a projector are either 1 or 0, and the number of nonzero eigenvalues equals the rank of the matrix, there are t = n − k eigenvalues of value 1 and the remaining eigenvalues are 0. D thus has precisely the structure

D = [ I_t  0 ]
    [ 0   0 ].
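The recovery can again be sketched in SAS/IML®; eigen returns the eigenvalues of M in descending order, so the first t = n − k eigenvectors correspond to the unit eigenvalues (the simulated data and all names are ours):

proc iml;
  call randseed(101);
  n = 15;
  x1 = j(n,1,.);  call randgen(x1, "Uniform");
  e  = j(n,1,.);  call randgen(e, "Normal");
  y  = 1 + 2*x1 + e;
  X  = j(n,1,1) || x1;  k = ncol(X);
  M  = i(n) - X*inv(X`*X)*X`;      /* M = I - H, symmetric idempotent */
  call eigen(d, P, M);             /* spectral decomposition M = P*diag(d)*P` */
  t  = n - k;                      /* number of eigenvalues equal to 1 */
  Q  = P[, 1:t];                   /* eigenvectors for the unit eigenvalues */
  lus = Q`*(M*y);                  /* t uncorrelated, homoscedastic residuals */
  print lus;
quit;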
• A high leverage point is unusual with respect to other x values; it has the
potential to be an influential data point.
Case deletion diagnostics assess the influence of individual observations on the overall analysis. The idea is to remove a data point and refit the model without it. The change in a particular aspect of the fitted model, e.g., the residual for the deleted data point or the least squares estimates, is a measure of its influence on the analysis. Problematic are highly influential points (hips), since they have a tendency to dominate the analysis. Fortunately, in linear regression models these diagnostics can be calculated without actually fitting a regression model n times, but in a single fit of the entire data set. This is made possible by the Sherman-Morrison-Woodbury theorem, which is given in §A4.8.3. For a data point to be a hip, it must be either an outlier or a high leverage data point. Outliers are data points that are unusual relative to the other Y values. The Magnesium observation in the prediction efficiency data set (Example 4.4, p. 93) might be an outlier. The attribute outlier does not have a negative connotation. It designates a data point as unusual and does not automatically warrant deletion. An outlying data point is only outlying with respect to a particular statistical model or criterion. If, for example, the trend in y is quadratic in x, but a simple linear regression model, y_i = β₀ + β₁x_i + e_i, is fit to the data, many data points may be classified as outliers because they do not agree with the model. The reason therefore is an incorrect model, not erroneous observations. According to the commonly applied definition of what constitutes outliers based on the box-plot of a set of data, outliers are those values that are more than 1.5 times the interquartile range beyond the upper or lower quartile.
Leverage is measured through the hat matrix H = X(X′X)⁻¹X′ introduced in the previous section. This matrix does not depend on the observed responses y, only on the information in X. Its ith diagonal element, h_ii, measures the leverage of the ith observation and expresses how unusual or extreme the covariate record of this observation is relative to the other observations. In a simple linear regression, a single x value far removed from the bulk of the x-data is typically a data point with high leverage. If X has full rank k, a point is considered a high leverage point if

h_ii > 2k/n.
High leverage points deserve special attention because they may be influential. They have the potential to pull the fitted regression toward them. A high leverage point that follows the regression trend implied by the remaining observations will not exert undue influence on the least squares estimates and is of no concern. The decision whether a high leverage point is influential thus rests on combining information about leverage with the magnitude of the residual. As leverage increases, smaller and smaller residual values are needed to declare a point influential. Two statistics are particularly useful in this regard. The RStudent residual is a studentized residual that combines leverage and the fitted residual similar to r_i in [4.23],

RStudent_i = ê_i / (σ̂_{-i}√(1 − h_ii)).   [4.27]

Here, σ̂²_{-i} is the mean square error estimate obtained after removal of the ith data point.
The DFFITS (difference in fit, standardized) statistic measures the change in fit in terms of standard error units when the ith observation is deleted:

DFFITS_i = RStudent_i √(h_ii/(1 − h_ii)).   [4.28]

A DFFITS_i value of 2.0, for example, implies that the fit at y_i will change by two standard error units if the ith data point is removed. DFFITS are useful to assess whether a data point is highly influential, and RStudent residuals are good at determining outlying observations. According to [4.27] a data point may be a hip if it has a moderate residual ê_i and high leverage, or if it is not a high leverage point (h_ii small) but unusual in the y space (ê_i large). RStudent residuals and DFFITS measure changes in residuals or fitted values as the ith observation is deleted. The influence of removing the ith observation on the least squares estimate β̂ is measured by Cook's Distance D_i (Cook 1977):

D_i = r_i² h_ii / (k(1 − h_ii)).   [4.29]
When diagnosing (criticizing) a particular model, one or more of these statistics may be important. If the purpose of the model is mainly predictive, the change in the least squares estimates (D_i) is secondary to RStudent residuals and DFFITS. If the purpose of the model lies in testing hypotheses about β, Cook's distance gains importance. Rule-of-thumb (rot) values for the various diagnostics are collected in Table 4.12.

Table 4.12. Rules of thumb (rots) for leverage and case deletion diagnostics; k denotes the number of parameters (regressors plus intercept), n the number of observations

Statistic          Formula                           Rule of thumb            Conclusion
Leverage           h_ii                              h_ii > 2k/n              high leverage point
RStudent           ê_i/(σ̂_{-i}√(1 − h_ii))           |RStudent_i| > 2         outlier (hip)
DFFITS             RStudent_i √(h_ii/(1 − h_ii))     |DFFITS_i| > 2√(k/n)     hip
Cook's Distance    D_i = r_i² h_ii/(k(1 − h_ii))     D_i > 1                  hip
Let X_i denote the spatial range of the ith attribute and Y_i its prediction efficiency. We commence by fitting a quadratic polynomial,

Y_i = β₀ + β₁x_i + β₂x_i² + e_i.   [4.30]
The various case deletion diagnostics are listed in Table 4.13. The purpose of their calculation is to determine whether the Mg observation is an influential data point. The rule-of-thumb value for a high leverage point is 2k/n = 2 × 3/7 = 0.857 and that for DFFITS is 2√(3/7) = 1.309. The diagnostics were calculated with the /influence option of the model statement in proc reg. This option produces raw residuals, RStudent residuals, leverages, DFFITS, and other diagnostics:
proc reg data=range;
model eff30 = range range2 / influence;
output out=cookd cookd=cookd;
run; quit;
proc print data=cookd; run;
The two most extreme regressor values, for P and Lime, have the largest leverages. P is a borderline high leverage point (Table 4.13). The residual of the P observation is not too large, so it is not considered an outlying observation as judged by RStudent.
Table 4.13. Case deletion diagnostics for quadratic polynomial in Prediction Efficiency application (rot denotes a rule-of-thumb value)

Obs.  Attribute      ê_i        h_ii         RStudent_i   DFFITS_i     D_i
                            (rot 0.857)      (rot 2)     (rot 1.309)  (rot 1)
1     pH           3.6184      0.2157        0.3181       0.1668      0.0119
2     P            3.3750      0.9467        1.4673       6.1842      9.8956
3     Ca           9.9978      0.2309        1.0109       0.5539      0.1017
4     Mg          17.5303      0.3525        6.2102       4.5820      0.6734
5     CEC          0.7828      0.2711        0.0703       0.0429      0.0007
6     Lime         6.4119      0.6307        0.9135       1.1936      0.4954
7     Prec         6.1682      0.3525        0.6240       0.4604      0.0834
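As a check on [4.28], the Mg entries of Table 4.13 give DFFITS = 6.2102 × √(0.3525/(1 − 0.3525)) ≈ 4.58, in agreement with the tabulated value up to rounding.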
Table 4.14. Pairwise Pearson correlation coefficients for covariates in Turnip Greens data

        Sunlight   Soil Moisture   Air Temp.   Soil Moisture²
           X₁           X₂             X₃            X₄
X₁                    0.0112         0.5373        0.0380
X₂        0.0112                     0.0149        0.9965
X₃        0.5373      0.0149                       0.0702
X₄        0.0380      0.9965         0.0702
Large pairwise correlations are a sufficient but not a necessary condition for collinearity. A collinearity problem can exist even if pairwise correlations among the regressors are small and a near-linear dependency involves more than two regressors. For example, if

Σ_{j=1}^k c_j x_j = c₁x₁ + c₂x₂ + ⋯ + c_k x_k ≈ 0

holds, X will be almost rank-deficient although the pairwise correlations among the x_j may be small. Collinearity negatively affects all regression calculations that involve the (X′X)⁻¹ matrix: the least squares estimates, their standard error estimates, the precision of predictions, test statistics, and so forth. Least squares estimates tend to be unstable (large standard errors), large in magnitude, and have signs at odds with the subject matter.
Example 4.2. Turnip Greens (continued). Collinearity can be diagnosed in a first step
by removing columns from X. The estimates of the remaining parameters will change
unless the columns of X are orthogonal. The table that follows shows the coefficient
estimates and their standard errors when X₄, the quadratic term for Soil Moisture, is in the model (full model) and when it is removed.
A simple but efficient collinearity diagnostic is based on the variance inflation factors (VIFs) of the regression coefficients. The VIF of the jth regression coefficient is obtained from the coefficient of determination (R²) in a regression model where the jth covariate is the response and all other covariates form the X matrix (Table 4.16). If the coefficient of determination from this regression is R_j², then

VIF_j = 1/(1 − R_j²).

Table 4.16. Variance inflation factors for the full model in the Turnip Greens example

Response    Regressors      j     R_j²     VIF_j
X₁          X₂, X₃, X₄      1    0.3888     1.636
X₂          X₁, X₃, X₄      2    0.9967     303.0
X₃          X₁, X₂, X₄      3    0.4749     1.904
X₄          X₁, X₂, X₃      4    0.9966     294.1
If a covariate is orthogonal to all other columns, its variance inflation factor is 1. With increasing linear dependence, the VIFs increase. As a rule of thumb, variance inflation factors greater than 10 are an indication of a collinearity problem. VIFs greater than 30 indicate a severe problem.
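Variance inflation factors (and the intercept-adjusted condition indices discussed below) are available as model options in proc reg; a sketch for the full Turnip Greens model:

proc reg data=turnip;
  model vitamin = sun moist airtemp x4 / vif collinoint;
run; quit;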
Other collinearity diagnostics are based on the principal value decomposition of a centered and scaled regressor matrix (see §A4.8.3 for details). A variable is centered and scaled by subtracting its sample mean and dividing by its sample standard deviation,

x*_ij = (x_ij − x̄_j)/s_j,

where x̄_j is the sample mean of the x_ij and s_j is the standard deviation of the jth column of X. Let X* denote the regressor matrix of the centered and scaled variables,

X* = [x*₁, …, x*_k].

The model written in terms of X* is called the centered regression model and its coefficient vector is termed the vector of standardized coefficients. Collinearity diagnostics examine the conditioning of X*′X*. If the columns of X* are orthogonal, then its eigenvalues λ_j are all unity. If an exact linear dependency exists, at least one eigenvalue is zero (the rank of a matrix is equal to the number of nonzero eigenvalues). If eigenvalues of X*′X* are close to zero, a collinearity problem exists. The jth condition index of X*′X* is the square root of the ratio between the largest eigenvalue and λ_j,

r_j = √(max{λ_l}/λ_j).   [4.33]
The collin option of the model statement also calculates eigenvalues and condition indices, but does not adjust them for the intercept as the collinoint option does. We prefer the latter. With the intercept-adjusted collinearity diagnostics, one obtains as many eigenvalues and condition indices as there are explanatory variables in X. They do not stand in a one-to-one correspondence with the regressors, however. If the first condition index is larger than 30, this does not imply that the first regressor needs to be removed. It means that there is at least one near-linear dependency among the regressors that may or may not involve X₁.
Table 4.17. Eigenvalues and condition indices for the full model in Turnip Greens data and for the reduced model without X₄

                      Eigenvalues                    Condition Indices
                 λ₁     λ₂     λ₃     λ₄         r₁     r₂     r₃     r₄
Full Model      2.00   1.53   0.46   0.001      1.00   1.14   2.08   34.86
Reduced Model   1.54   1.00   0.46              1.00   1.24   1.82
For the Turnip Greens data the (intercept-adjusted) eigenvalues of X*′X* and the condition indices are shown in Table 4.17. The full model containing X₄ has one eigenvalue that is close to zero, and the associated condition index is greater than the rule-of-thumb value of 30. Removing X₄ leads to a model without a collinearity problem, but early on we found that the model without X₄ does not fit the data well. To keep the four-regressor model and reduce the negative impact of collinearity on the least squares estimates, we now employ ridge regression.
• The method has an ad-hoc character because the user must choose the ridge
factor, a small number by which to shrink the least squares estimates.
where best implies that the variance of the linear combination a′β̂_OLS is smaller than the variance of a′β̃, where β̃ is any other estimator that is linear in Y. In general, the mean square error (MSE) of an estimator f(Y) for a target parameter θ is defined as

MSE[f(Y), θ] = E[(f(Y) − θ)²] = E[(f(Y) − E[f(Y)])²] + (E[f(Y)] − θ)²
             = Var[f(Y)] + Bias[f(Y), θ]².   [4.34]
The MSE has a variance and a (squared) bias component that reflect the estimator's precision and accuracy, respectively (low variance = high precision). The mean square error is a more appropriate measure for the performance of an estimator than the variance alone since it takes both precision and accuracy into account. If f(Y) is unbiased for θ, then MSE[f(Y), θ] = Var[f(Y)] and choosing among unbiased estimators on the basis of their variances is reasonable. But unbiasedness is not necessarily the most desirable property. An estimator that is highly variable and unbiased may be less preferable than an estimator that has a small bias and high precision. Being slightly off-target if one is always close to the target should not bother us as much as being on target on average but frequently far from it.
The relative efficiency of f(Y) compared to g(Y) as an estimator for θ is measured by the ratio of their respective mean square errors:

RE[f(Y), g(Y) | θ] = MSE[g(Y), θ]/MSE[f(Y), θ]
                   = (Var[g(Y)] + Bias[g(Y), θ]²)/(Var[f(Y)] + Bias[f(Y), θ]²).   [4.35]

If RE[f(Y), g(Y) | θ] > 1 then f(Y) is preferred, and g(Y) is preferred if the efficiency is less than 1. Assume that f(Y) is an unbiased estimator and that we choose g(Y) by shrinking f(Y) by some multiplicative factor c (0 < c < 1). Then

MSE[f(Y), θ] = Var[f(Y)]
MSE[g(Y), θ] = c²Var[f(Y)] + (c − 1)²θ².

The efficiency of the unbiased estimator f(Y) relative to the shrinkage estimator g(Y) = cf(Y) is

RE[f(Y), cf(Y) | θ] = (c²Var[f(Y)] + (c − 1)²θ²)/Var[f(Y)] = c² + (c − 1)²θ²/Var[f(Y)].

If c is chosen such that c² + (c − 1)²θ²/Var[f(Y)] is less than 1, the biased estimator is more efficient than the unbiased estimator and should be preferred.
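For example, if θ²/Var[f(Y)] = 1 and c = 0.9, then RE = 0.9² + 0.1² × 1 = 0.82 < 1, and the shrinkage estimator g(Y) = 0.9f(Y) has the smaller mean square error.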
How does this relate to least squares estimation in the classical linear model? The optimality of the least squares estimator of β, since it is restricted to the class of unbiased estimators, implies that no other unbiased estimator will have a smaller mean square error. If the regressor matrix X exhibits near-linear dependencies (collinearity), the variance of β̂_j may be inflated, and sign and magnitude of β̂_j may be distorted. The least squares estimator, albeit unbiased, has become unstable and imprecise. A slightly biased estimator with greater stability can be more efficient because it has a smaller mean square error. The ill-conditioning of the X*′X* matrix due to collinearity causes the instability of the estimator. To improve the conditioning of X*′X*, add a small positive amount δ to its diagonal values,

β̂_R = (X*′X* + δI)⁻¹X*′Y.   [4.36]
This estimator is known as the ridge regression estimator of β (Hoerl and Kennard 1970a, 1970b). It is applied in the centered, not the original, model to remove the scale effects of the X columns. Whether a covariate is measured in inches or meters, the same correction should apply, since only one ridge factor δ is added to all diagonal elements. Centering the model removes scaling effects; every column of X* has a sample mean of 0 and a sample variance of 1. The value δ, some small positive number chosen by the user, is called the ridge factor.

The ridge regression estimator is not unbiased. Its bias (see §A4.8.4) is

E[β̂_R − β] = −δ(X*′X* + δI)⁻¹β.

If δ = 0 the unbiased ordinary least squares estimator results, and with increasing ridge factor δ the ridge estimator applies more shrinkage.
To see that the ridge estimator is a shrinkage estimator, consider a simple linear regression model in centered form,

Y_i = β₀ + x*_i β₁ + e_i,

where x*_i = (x_i − x̄)/s_x. The least squares estimate of the standardized slope is

β̂₁ = Σ_{i=1}^n x*_i y_i / Σ_{i=1}^n x*_i²

and the ridge estimate is

β̂₁R = Σ_{i=1}^n x*_i y_i / (Σ_{i=1}^n x*_i² + δ).

Since δ is positive, |β̂₁R| < |β̂₁|. The standardization of the coefficient is removed by dividing the coefficient by s_x, and the same shrinkage relation then holds for the unstandardized estimates.
The choice of δ is made by the user, which has led to some disapproval of ridge regression among statisticians. Although numerical methods exist to estimate the best value of the ridge factor from data, the most frequently used technique relies on graphs of the ridge trace. The ridge trace is a plot of β̂_jR against various values of δ. The ridge factor is chosen as that value at which the estimates stabilize (Figure 4.9).
Figure 4.9. Ridge traces (standardized coefficients for Sunlight, Soil Moisture, Air Temperature, and Soil Moisture squared plotted against the ridge factor δ) for the full model in the Turnip Greens example. Also shown (empty circles) is the variance inflation factor for the Soil Moisture coefficient.
For the full model in the Turnip Greens example the two coefficients affected by the collinearity are β̂₂ and β̂₄. Adding only a small amount δ = 0.01 to the diagonals of the X*′X* matrix already stabilizes the coefficients (Figure 4.9). The variance inflation factor of the Soil Moisture coefficient is also reduced dramatically. When studying a ridge trace such as Figure 4.9, we look for various indications of having remedied collinearity.
• The smallest value of δ at which the parameter estimates are stabilized (do not change rapidly). This is not necessarily the value where the ridge trace becomes a flat line. Inexperienced users tend to choose a δ value that is too large, taking the notion of stability too literally.
• Changes in sign. One of the effects of collinearity is that signs of the coefficients are at odds with theory. The ridge factor δ should be chosen in such a way that the signs of the ridge estimates are meaningful on subject matter grounds.
• Coefficients that change the most when the ridge factor is altered are associated with variables involved in a near-linear dependency. The more the ridge traces swoop, the higher the degree of collinearity.
• If the degree of collinearity is high, a small value of δ will suffice, although this result may seem counterintuitive.
Table 4.18 shows the standardized ordinary least squares coefficients and the ridge estimates for δ = 0.01 along with their estimated standard errors. Size and standard errors change considerably for the coefficients involved in the near-linear dependency.

Table 4.18. Ordinary least squares and ridge regression estimates (δ = 0.01) along with their estimated standard errors

                Ordinary Least Squares          Ridge Regression
Coefficient    Estimate    Est. Std. Error    Estimate    Est. Std. Error
β₁             0.11274        0.02438         0.03891        0.03240
β₂             5.98982        1.00526         0.47700        0.21146
β₃             0.20980        0.21094         0.01394        0.26045
β₄             6.87490        0.01956         1.34614        0.00411
The ridge regression estimates can be calculated in SAS® with the ridge= option of the reg procedure. The outest= option is required to identify a data set in which SAS® collects the ridge estimates. By adding the outstb and outvif options, SAS® will also save the standardized coefficients along with the variance inflation factors in the outest data set. In the Turnip Greens example, the following code calculates ridge regression estimates for ridge factors δ = 0, 0.01, 0.02, …, 0.5, 0.75, 1, 1.25, 1.5, 1.75, and 2. The subsequent proc print step displays the results.
proc reg data=turnip outest=regout outstb outvif
ridge=0 to 0.5 by 0.01 0.5 to 2 by 0.25;
model vitamin = sun moist airtemp x4 ;
run; quit;
proc print data=regout; run;
For more details on ridge regression and its applications the reader is referred to Myers
(1990, Ch. 8).
The analysis of variance F test is generally robust against mild and moderate departures of the error distribution from the Gaussian model. This result goes back to Pearson (1931), Geary (1947), and Gayen (1950). Box and Andersen (1955) note that robustness of the test increases with the sample size per group (number of replications per treatment). The reason for the robustness of the F test is that the group sample means are unbiased estimates of the group means and, for sufficient sample size, are Gaussian-distributed by the Central Limit Theorem regardless of the parent distribution from which the data are drawn. Finally, it must be noted that in designed experiments under randomization one can (and should) apply significance tests derived from randomization theory (§A4.8.2). These tests do not require Gaussian-distributed error terms. As shown by, e.g., Box and Andersen (1955), the analysis of variance F test is a good approximation to the randomization test.
A departure from the equal variance assumption is more troublesome. If the data are Gaussian-distributed and the sample sizes in the groups are (nearly) equal, the ANOVA F test retains a surprising robustness against moderate violations of this assumption (Box 1954a, 1954b, Welch 1937). This is not true for one-degree-of-freedom contrasts, which are very much affected by unequal variances even if the group sizes are the same. As group sizes become more unbalanced, small departures from the equal variance assumption can have very negative effects. Box (1954a) studied the significance level of the F test in group comparisons as a function of the ratio of the smallest group variance to the largest group variance. With equal replication and a variance ratio of 1:3, the nominal significance level of 0.05 was only slightly exceeded (0.056 to 0.074), but with unequal group sizes the actual significance level was as high as 0.12. The test becomes liberal, rejecting more often than it should.
Finally, the independence assumption is the most critical since it affects the F test most severely. The essential problem has been outlined before. If correlations among observations are positive, the p-values of the standard tests that treat the data as if they were uncorrelated are too small, as are the standard error estimates.
There is no remedy for lack of independence except observing proper experimental procedure, for example, randomizing treatments to experimental units and ensuring that experimental units not only receive treatments independently but also respond independently. Even then, data may be correlated, for example, when measurements are taken repeatedly over time. There is no univariate data transformation to uncorrelate observations, but there are of course methods for diagnosing and correcting departures from the Gaussian and the homoscedasticity assumptions. For the Levene test of variance homogeneity, let U_ij = |Y_ij − med_i| denote the absolute deviations of observations in group i from the group median. Then calculate an analysis of variance for the U_ij and reject the hypothesis of equal variances when significant group differences among the U_ij exist.
Originally, the Levene test was based on U_ij = (Y_ij − Ȳ_i.)², rather than absolute deviations from the median. Other variations of the test include analysis of variance based on

U_ij = |Y_ij − Ȳ_i.|
U_ij = ln|Y_ij − Ȳ_i.|
U_ij = √|Y_ij − Ȳ_i.|.

The first of these deviations as well as U_ij = (Y_ij − Ȳ_i.)² are implemented through the hovtest= option of the means statement in proc glm of The SAS® System.
For a one-way classification where Y_ij denotes the jth replicate value observed for the ith treatment, we obtain the Levene test with the statements
proc glm data=hetero;
class tx;
model y = tx /ss3;
means tx / hovtest=levene(type=abs);
run; quit;
Number of observations 50
Dependent Variable: y
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 4 1260.019647 315.004912 12.44 <.0001
Error 45 1139.207853 25.315730
Corrected Total 49 2399.227500
Sum of Mean
Source DF Squares Square F Value Pr > F
tx 4 173.6 43.4057 3.93 0.0081
Error 45 496.9 11.0431
Level of --------------y--------------
tx N Mean Std Dev
1 10 1.5135500 0.75679658
2 10 6.8449800 4.67585820
3 10 7.7172000 3.13257166
4 10 10.7295700 3.64932521
5 10 16.8137900 9.00064885
Figure 4.10. Raw residuals of the one-way classification model fit to the data in Table 4.19, plotted against the estimated treatment means.
U_ij = sin⁻¹√((Y_ij + 3/8)/(n_ij + 3/4))

appears to be superior.
For continuous data, the relationship between mean and variance is not known unless the distribution of the responses is known (see §6.2 on how mean-variance relationships for continuous data are used in formulating generalized linear models). If, however, it can be established that variances are proportional to some power of the mean, a transformation can be derived. Box and Cox (1964) assume that the standard deviation is proportional to a power of the mean, σ_i ∝ μ_i^β. If one takes a transformation U_ij = Y_ij^λ, then √Var[U_ij] ∝ μ_i^{β+λ−1}. Since variance homogeneity implies σ_i ∝ 1, the transformation with λ = 1 − β stabilizes the variance. If λ = 0, the transformation is not defined and the logarithmic transformation is chosen.
For the data in Table 4.19, sample means and sample standard deviations are available for each group. We can empirically determine which power will result in a variance-stabilizing transformation. For data in which standard deviations are proportional to a power of the mean, the relationship σ_i = αμ_i^β can be linearized by taking logarithms: ln{σ_i} = ln{α} + β ln{μ_i}. Substituting estimates s_i for σ_i and ȳ_i. for μ_i, this is a linear regression of the log standard deviations on the log sample means: ln{s_i} = β₀ + β ln{ȳ_i.} + e_i. Figure 4.11 shows a plot of the five sample standard deviation/sample mean points after taking logarithms. The trend is linear and the assumption that σ_i is proportional to a power of the mean is reasonable. The statements
proc means data=hetero noprint;
by tx;
var y;
output out=meanstd(drop=_type_) mean=mean std=std;
run;
data meanstd; set meanstd;
logstd = log(std);
logmn = log(mean);
run;
proc reg data=meanstd;
model logstd = logmn;
run; quit;
calculate the treatment-specific sample means and standard deviations, take their logarithms, and fit the simple linear regression model (Output 4.10). The estimate of the slope is β̂ = 0.958, which suggests that 1 − 0.958 = 0.042 is the power for the variance-stabilizing transform. Since 1 − β̂ is close to zero, one could also opt for a logarithmic transformation of the data.
Figure 4.11. Logarithm of sample standard deviations vs. logarithm of sample means for the data in Table 4.19 suggests a linear trend.
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 1 3.02949 3.02949 32.57 0.0107
Error 3 0.27901 0.09300
Corrected Total 4 3.30850
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 -0.65538 0.34923 -1.88 0.1572
logmn 1 0.95800 0.16785 5.71 0.0107
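The transformed analysis summarized in the next output can be reproduced with statements along these lines; the power 0.042 and the response name powery follow the text and output, the remainder is a sketch:

data hetero;  set hetero;
  powery = y**0.042;    /* variance-stabilizing power 1 - 0.958 */
run;
proc glm data=hetero;
  class tx;
  model powery = tx / ss3;
  means tx / hovtest=levene(type=abs);
run; quit;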
Number of observations 50
Sum of Mean
Source DF Squares Square F Value Pr > F
tx 4 0.00302 0.000755 2.19 0.0852
Error 45 0.0155 0.000345
Level of ------------powery-----------
tx N Mean Std Dev
1 10 1.00962741 0.03223304
2 10 1.07004861 0.04554636
3 10 1.08566961 0.02167296
4 10 1.10276544 0.01446950
5 10 1.12061483 0.02361509
The Levene test is (marginally) nonsignificant at the 5% level. Since we are interested in accepting the null hypothesis of equal variances in the five groups, this result is not too convincing. We would like the p-value for the Levene test to be larger. The Levene test based on squared deviations (Z_ij − Z̄_i.)² has a p-value of 0.189, which is more appropriate. The group-specific standard deviations are noticeably less heterogeneous than for the untransformed data. The ratio of the largest to the smallest group sample variance for the original data was 9.0²/0.7568² = 141.42, whereas this ratio is only 0.0322²/0.0145² = 4.93 for the transformed data. The 5% critical value for Hartley's F-Max test is 7.11, and we fail to reject the hypothesis of equal variances with this test, too.
where I and J denote the number of rows and columns, respectively. If there are no missing entries in the table, the least squares estimates of the effects are simple linear combinations of the various row and column averages:

μ̂ = (1/IJ) Σ_{i=1}^I Σ_{j=1}^J y_ij = ȳ..
α̂_i = ȳ_i. − ȳ..
β̂_j = ȳ_.j − ȳ..
ê_ij = y_ij − ȳ_i. − ȳ_.j + ȳ..

Notice that the ê_ij are the terms in the residual sum of squares [4.38]. The estimates have the desirable property that Σ_{i=1}^I Σ_{j=1}^J ê_ij = 0, mimicking a property of the model disturbances e_ij
provided the decomposition is correct. The sample means unfortunately are sensitive to extreme observations (outliers). In least squares analysis based on means, the residual in any cell of the two-way table affects the fitted values in all other cells, and the effect of an outlier is to leak its contribution across the estimates of the overall, row, and column effects (Emerson and Hoaglin 1983). Tukey (1977) suggested basing estimation of the effects on medians, which are less affected by extreme observations. Leakage of outliers into estimates of overall, row, and column effects is then minimized. The method — termed median polishing — is iterative and sweeps medians out of rows, then columns out of the row-adjusted residuals, then rows out of the column-adjusted residuals, and so forth. The process continues until further sweeps produce no or only negligible changes in the effect estimates. In [4.39] α̃_i denotes the median-polished effect of the ith row, β̃_j the median-polished effect of the jth column, and the residual is calculated as ẽ_ij = y_ij − μ̃ − α̃_i − β̃_j.
1 Birch, J.B. Unpublished manuscript: Contemporary Applied Statistics: An Exploratory and Robust Data Analysis Approach.
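The polishing iteration itself takes only a few lines of SAS/IML®. The sketch below performs the row/column sweeps for a small hypothetical 3 × 3 table; the %MedPol() macro on the CD-ROM is the full implementation, and all names here are ours.

proc iml;
  y  = {8 12 10, 14 9 11, 7 15 13};   /* hypothetical two-way table */
  nr = nrow(y);  nc = ncol(y);
  mu = 0;  a = j(nr,1,0);  b = j(1,nc,0);  r = y;
  do sweep = 1 to 10;                 /* effects usually stabilize quickly */
     rm = median(r`)`;                /* row medians; median() works columnwise */
     a  = a + rm;      r = r - rm*j(1,nc,1);
     bm = median(b`);  mu = mu + bm;  b = b - bm;
     cm = median(r);                  /* column medians */
     b  = b + cm;      r = r - j(nr,1,1)*cm;
     am = median(a);   mu = mu + am;  a = a - am;
  end;
  print mu, a, b, r;                  /* overall, row, column effects, residuals */
quit;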
If this type of patterned interaction is present, then θ should differ from zero. After median polishing the no-interaction model, it is recommended to plot the residuals ẽ_ij against the comparison values α̃_iβ̃_j/μ̃. A trend in this plot indicates the presence of row × column interaction. To decide whether the interaction is induced by outlying observations or is a more general phenomenon, fit a simple linear regression between ẽ_ij and the comparison values by an outlier-robust method such as M-Estimation (§4.6.2) and test the slope against zero. Also fit a simple regression by least squares. When the robust M-estimate of the slope is not different from zero but the least squares estimate is, the interaction between rows and columns is caused by outlying observations. Transforming the data or removing the outlier should then eliminate the interaction. If the M-estimate of the slope differs significantly from zero, the interaction is not caused just by outliers. The slope of the diagnostic plot is helpful in determining the transformation that can reduce the nonadditivity of row and column effects. Power transformations of approximately 1 − θ are in order (Emerson and Hoaglin 1983). In the interaction test a failure to reject H₀: θ = 0 is the result of interest since one can then proceed with a simpler analysis without interaction. Even if the data are balanced (all cells of the table are filled), the analysis of variance based on the interaction model [4.40] is nonorthogonal and treatment means are not estimated by arithmetic averages. Because of the possibility of a Type II error one should not accept H₀ when the p-value of the test is barely larger than 0.05 but require p-values in excess of 0.25 to 0.3.
Example 4.3. Dollar Spot Counts (continued). Recall the turfgrass experiment in a randomized complete block design with t = 14 treatments in b = 4 blocks (p. 93). The outcome of interest was the number of leaves infected with dollar spot in a reference area of each experimental unit. One count was obtained from each unit.

Here we perform median polishing of the dollar spot count data for the original counts and log-transformed counts. Particular attention will be paid to the observation for treatment 10 in block 4, which appears extreme compared to the remainder of the data (see Table 4.3, p. 93). Median polishing with The SAS® System is possible with the %MedPol() macro contained on the CD-ROM (\SASMacros\MedianPolish.sas). It requires an installation of the SAS/IML® module. The macro does not produce any output apart from the residual interaction plot if the plot=1 option is active (which is the default). The results of median polishing are stored in a SAS® data set termed _medpol. For the median polish of the original counts, a printout of the first 25 observations of the result data set follows.
1 43.3888 0 0 0 . .
2 -0.6667 1 0 1 . .
3 0.6667 2 0 1 . .
4 -6.0000 3 0 1 . .
5 3.4445 4 0 1 . .
6 -13.1111 0 1 2 . .
7 36.7778 0 2 2 . .
8 -16.7777 0 3 2 . .
9 -12.5555 0 4 2 . .
10 11.0556 0 5 2 . .
11 0.9445 0 6 2 . .
A graph of the column (treatment) effects against column index gives an indication of
treatment differences to be expected when a formal comparison procedure is invoked
(Figure 4.12).
Figure 4.12. Column (treatment) effects β̃_j for dollar spot counts, plotted against the treatment label.
Whether a formal analysis can proceed without accounting for block × treatment interaction cannot be gleaned from a plot of the column effects alone. To this end, plot the median-polished residuals against the interaction term (Figure 4.13) and calculate least squares and robust M-estimates (§4.6) of the simple linear regression after the call to %MedPol():
%include 'DriveLetterOfCDROM:\SASMacros\MEstimation.sas';
Figure 4.13. Median-polished residual plots vs. interaction terms for original counts (left panel) and log-transformed counts (right panel), with ordinary least squares (OLS) and M-estimated regression lines. The outlier (treatment 10 in block 4) is circled.
The outlying observation for treatment 10 in block 4 pulls the least squares regression line toward it (Figure 4.13). The interaction is significant (p = 0.0094, Table 4.20) and remains significant when the influence of the outlier is reduced by robust M-Estimation (p = 0.0081). The interaction is thus not induced by the outlier alone; a general block × treatment interaction exists for these data. For the log-transformed counts (Figure 4.13, right panel) the interaction is not significant (p = 0.2831 for OLS and p = 0.3898 for M-Estimation) and the least squares and robust regression lines are almost indistinguishable from each other. The observation for treatment 10 in block 4 is certainly not as extreme relative to the remainder of the data as is the case for the original counts.

Table 4.20. p-values for testing the block × treatment interaction, H₀: θ = 0

Response       Least Squares    M-Estimation
Count              0.0094          0.0081
ln{Count}          0.2831          0.3898
If the variability of the phenomenon being studied is the cause of the outliers, there is no justification for their deletion. Two alternatives to least squares estimation, L₁- and M-Estimation, are introduced in §4.6.1 and §4.6.2. The prediction efficiency data are revisited in §4.6.3, where the quadratic response model is fit by robust methods to reduce the negative influence of the outlying Mg observation without deleting it from the data set. In §4.6.4 we apply the M-Estimation principle to classification models.

• L₁-Regression minimizes the sum of absolute residuals and not the sum of squared residuals as does ordinary least squares estimation. An alternative name is hence Least Absolute Deviation (LAD) Regression.
• The L₁-norm of a (k × 1) vector a is defined as Σ_{i=1}^k |a_i| and the L₂-norm is √(Σ_{i=1}^k a_i²). Least squares is an L₂-norm method.
• The fitted trend of an L₁-Regression with k − 1 regressors and an intercept passes through k data points.
The Euclidean length of a vector is also called its L₂-norm, and the L₁-norm of a vector is the sum of the absolute values of its elements, Σ_{i=1}^k |a_i|. In L₁-Estimation the objective is not to find the estimates of the model parameters that minimize the sum of squared residuals (the L₂-norm) Σ_{i=1}^n e_i², but the sum of the absolute values of the residuals, Σ_{i=1}^n |e_i|. Hence the alternative name of Least Absolute Deviation (LAD) estimation. In terms of a linear model Y = Xβ + e, the L₁-estimates β̂_L are the values of β that minimize

Σ_{i=1}^n |y_i − x′_iβ|.   [4.41]

Squaring of the residuals in least squares estimation gives more weight to large residuals than taking their absolute value (see Figure 4.15 below).

A feature of L₁-Regression is that the model passes exactly through some of the data points. In the case of a simple linear regression, y_i = β₀ + β₁x_i + e_i, the estimated line β̂₀L + β̂₁L x passes through two data points, and in general, if X is an (n × k) matrix of full rank k, the model passes through k data points. This fact can be used to devise a brute-force method of finding the LAD estimates. Force the model through k data points and calculate its sum of absolute deviations SAD = Σ_{i=1}^n |ê_i|. Repeat this for all possible sets of k points and choose the set that has the smallest SAD.
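For a simple linear regression the brute-force search is easily coded; the following SAS/IML® sketch forces the line through every pair of points and retains the pair with the smallest SAD (the seven data pairs are hypothetical stand-ins, not Galton's values):

proc iml;
  x = {15, 16, 17, 18, 19, 20, 21};
  y = {13.2, 14.1, 13.9, 15.0, 15.8, 16.4, 17.5};
  n = nrow(x);  best = 1e12;  b0 = .;  b1 = .;
  do ii = 1 to n-1;
     do jj = ii+1 to n;
        if x[jj] ^= x[ii] then do;
           s   = (y[jj]-y[ii])/(x[jj]-x[ii]);   /* slope through points ii, jj */
           c   = y[ii] - s*x[ii];               /* corresponding intercept */
           sad = sum(abs(y - c - s*x));         /* sum of absolute deviations */
           if sad < best then do;  best = sad;  b0 = c;  b1 = s;  end;
        end;
     end;
  end;
  print b0 b1 best;
quit;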
Example 4.5. Galton (1886), who introduced the concept of regression in studies of inheritance, noted that the diameter of offspring peas increased linearly with the diameter of the parent peas. Seven of the data points from his 1886 publication are shown in Figure 4.14. When a simple linear regression model is fit to these data, there are 21 pairs of points through which a candidate LAD line can be forced.
Figure 4.14. Ten of the 21 possible least absolute deviation lines for seven observations from Galton's pea diameter data (diameter of parent vs. diameter of offspring), with sums of absolute deviations ranging from SAD = 1.80 to SAD = 7.50. The line that minimizes Σ_{i=1}^n |ê_i| passes through the first and last data points: ŷ_i = 9.8 + 0.366x.
This brute-force approach is computationally expensive since the total number of models being fit is large unless either n or k is small. The number of models that must be evaluated with this method is

(n choose k) = n!/(k!(n − k)!).

Fitting a four-regressor model with intercept to a data set with n = 30 observations requires evaluation of 142,506 models. In practice we rely on iterative algorithms to reduce the number of evaluations. One such algorithm is discussed in §A4.8.5 and implemented in the SAS® macro %LAD() (\SASMacro\L1_Regression.sas on CD-ROM). One can also fit least absolute deviation regression with the SAS/IML® function lav().
To test the hypothesis H₀: Aβ = d in L₁-Regression we use an approximate test analogous to the sum of squares reduction test in ordinary least squares (§4.2.3 and §A4.8.2). If SAD_f is the sum of absolute deviations for the full model and SAD_r is the sum of absolute deviations for the model reduced under the hypothesis, then

F_obs = ((SAD_r − SAD_f)/q) / (τ̂/2),   [4.42]

where q is the rank of A and τ̂ is an estimate of scale for the L₁ fit. The estimated standard error of β̂_j is the square root of the jth diagonal element of τ̂²(X′X)⁻¹. L₁-Regression is applied to the Prediction Efficiency data in §4.6.3.
4.6.2 M-Estimation
Box 4.9 M-Estimation
M-Estimation was introduced by Huber (1964, 1973) as a robust technique for estimating location parameters (means) in data sets containing outliers. It can also be applied to the estimation of parameters in the mean function of a regression model. The idea of M-Estimation is simple (additional details in §A4.8.6). In least squares the objective function can be written as

Q = Σ_{i=1}^n (y_i − x′_iβ)² = σ² Σ_{i=1}^n (e_i/σ)²,   [4.43]

a function of the squared residuals. If the contribution of large residuals e_i to the objective function Q can be reduced, the estimates should be more robust to extreme deviations. In M-Estimation we minimize the sum of a function ρ(•) of the residuals, the function being chosen so that large residuals are properly weighted. The objective function for M-Estimation can be written as

Q = Σ_{i=1}^n ρ((y_i − x′_iβ)/σ) = Σ_{i=1}^n ρ(e_i/σ) = Σ_{i=1}^n ρ(u).   [4.44]

Least squares estimation is a special case of M-Estimation with ρ(u) = u², and maximum likelihood estimators are obtained for ρ(u) = −ln{f(u)}, where f(u) is the probability density function of the model errors. The name M-Estimation is derived from this relationship with maximum likelihood estimation.
The estimating equations of the minimization problem can be written as

Σ_{i=1}^n ψ(û_i)x_i = 0,   [4.45]

where ψ(u) is the first derivative of ρ(u). An iterative algorithm for this problem is discussed in our §A4.8.6 and in Holland and Welsch (1977), Coleman et al. (1980), and Birch and Agard (1993). Briefly, the algorithm rests on rewriting the psi-function as a residual weight, ψ(u) = {ψ(u)/u}u, which leads to an iteratively reweighted least squares solution.
If a residual is less than some number k in absolute value it is retained; otherwise it is curtailed to 2k|e_i| − k². This truncation function is a compromise between using squared residuals if they are small and a value close to |e_i| if the residual is large. It is a compromise between least squares and L₁-Estimation (Figure 4.15), combining the efficiency of the former with the robustness of the latter. Large residuals are curtailed not to |e_i| but to 2k|e_i| − k² to ensure that ρ(e) is convex (Huber 1964). Huber (1981) suggested choosing k = 1.5σ̂, where σ̂ is an estimate of the standard deviation. In keeping with the robust/resistant theme one can choose the median absolute deviation σ̂ = 1.4826 median(|ê_i|) or its rescaled version (Birch and Agard 1993)

σ̂ = 1.4826 median(|ê_i − median(ê_i)|).

The value k = 1.345σ̂ was suggested by Holland and Welsch (1977). If the data are Gaussian-distributed this leads to M-estimators with relative efficiency of 95% compared to least squares estimators. For non-Gaussian data prone to outliers, 1.345σ̂ yields more efficient estimators than 1.5σ̂.
[Figure 4.15. Huber's ρ(e) for tuning constants k = 1.0, 1.5, and 2.0, plotted against the residual eᵢ between the least squares function ρ(e) = e² and the L₁ function ρ(e) = |e|.]
Since ψ(û) serves as a weight of a residual, it is sometimes termed the residual weighing function and presents an alternative way of defining the residual transformation in M-estimation. For Huber's transformation with tuning constant k = 1.345σ̂ the weighing function is
\[
w(e) = \begin{cases} 1 & |e| \le k \\ k/|e| & |e| > k. \end{cases}
\]
The user must set values for the tuning constants a, b, and c, similar to the choice of k in other weighing functions. Beaton and Tukey (1974) define the biweight M-estimate through the weight function
\[
\psi(e_i) = \begin{cases} e_i\left(1 - e_i^2/k^2\right)^2 & |e_i| \le k \\ 0 & |e_i| > k \end{cases}
\]
and suggest the tuning constant k = 4.685. Whereas Huber's weighing function is an example of a monotonic function, the Beaton and Tukey ψ function is a redescending function, which is preferred if the data are contaminated with gross outliers. A weight function that produces maximum likelihood estimates for Cauchy-distributed data (a t distribution with a single degree of freedom) is
\[
\psi(e_i) = 2e_i/(1 + e_i^2).
\]
The Cauchy distribution is symmetric but much heavier in the tails than the Gaussian distribution and permits more extreme observations than the Gaussian model. A host of other weighing functions has been proposed in the literature. See Hampel et al. (1986) and Barnett and Lewis (1994) for a more detailed description.
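For illustration, the three weighing functions just described take only a few lines of SAS/IML®. The module names and the trial residuals below are our own; the tuning constants are those quoted above:

proc iml;
   /* Huber weights psi(u)/u: 1 inside [-k,k], k/|u| outside */
   start w_huber(u, k);
      return( k/(abs(u) <> k) );
   finish;
   /* Beaton-Tukey biweight: redescending, zero beyond k */
   start w_biweight(u, k);
      return( ((1 - (u/k)##2)##2) # (abs(u) <= k) );
   finish;
   /* Cauchy ML weights: psi(u) = 2u/(1+u##2), hence w(u) = 2/(1+u##2) */
   start w_cauchy(u);
      return( 2/(1 + u##2) );
   finish;
   u = {-4, -1.5, -0.5, 0.5, 2, 6};               /* trial residuals */
   print (w_huber(u,1.345)) (w_biweight(u,4.685)) (w_cauchy(u));
quit;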
Hypothesis tests and confidence intervals for the elements of β in M-estimation rely on the asymptotic Gaussian distribution of β̂_M. For finite sample sizes the p-values of the tests are only approximate. Several tests have been suggested in the literature. To test the general linear hypothesis H₀: Aβ = d, an analog of the sum of squares reduction test can be used. The test rests on comparing the sum of transformed residuals (STR = Σᵢ₌₁ⁿ ρ(êᵢ)) in a full and a reduced model. If STR_r denotes this sum in the reduced and STR_f in the full model, the test statistic is
\[
F_{obs} = \frac{(STR_r - STR_f)/q}{\hat{\tau}}, \qquad [4.47]
\]
where q is the rank of A and τ̂ is an estimate of error variability playing a similar role in M-estimation as σ̂² does in least squares estimation. τ̂ is computed from the Winsorized residuals sᵢ = max{−k, min{eᵢ, k}}.
This variation of the sum of squares reduction test is due to Schrader and McKean (1977) and Schrader and Hettmansperger (1980). P-values are approximated by referring F_obs to an F distribution with q and n − k degrees of freedom. The test of Schrader and Hettmansperger (1980) does not adjust the variance of the M-estimates for the fact that they are weighted unequally, as one would in weighted least squares. This is justifiable since the weights are random, whereas they are considered fixed in weighted least squares. To test the general linear hypothesis H₀: Aβ = d, we prefer a test proposed by Birch and Agard (1993), which is a direct analog of the F test in Gaussian linear models with unequal variances. The test statistic is
\[
F_{obs} = (\mathbf{A}\hat{\boldsymbol{\beta}} - \mathbf{d})'\left[\mathbf{A}(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{A}'\right]^{-1}(\mathbf{A}\hat{\boldsymbol{\beta}} - \mathbf{d})\big/ (q\,s^2),
\]
where s² involves the first derivative ψ′(u) of ψ(u) and the diagonal weight matrix W has entries ψ(û)/û. Through simulation studies it was shown that this test has very appealing properties with respect to size (significance level) and power. Furthermore, if ψ(u) = u, which results in least squares estimates, s² is the traditional residual mean square error estimate.
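Because W depends on the residuals, the estimates themselves are computed by iteratively reweighted least squares (IRLS). The following SAS/IML® sketch illustrates the idea with Huber weights and the rescaled MAD; the data, names, and stopping rule are our assumptions, and the %MEstim() macro below should be used for serious work:

proc iml;
   x = {1 1, 1 2, 1 3, 1 4, 1 5, 1 6, 1 7};      /* intercept and regressor */
   y = {2.1, 3.9, 6.2, 8.0, 9.7, 12.2, 25.0};    /* last response outlying */
   k = 1.345;
   b = inv(x`*x)*x`*y;                           /* OLS starting values */
   do iter = 1 to 50;
      e     = y - x*b;
      sigma = 1.4826 * median(abs(e - median(e)));  /* rescaled MAD */
      u     = e/sigma;
      w     = k/(abs(u) <> k);                   /* Huber weights psi(u)/u */
      bnew  = inv((x#w)`*x) * ((x#w)`*y);        /* weighted LS update */
      if max(abs(bnew - b)) < 1e-8 then leave;
      b = bnew;
   end;
   print b;                                      /* robust estimates */
quit;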
%include 'DriveLetterOfCDROM:\SASMacros\MEstimation.sas';
/* M-Estimation */
%MEstim(data=range,
stmts=%str(model eff30 = range range2 /s;) )
proc print data=_predm;
var eff30 _wght Pred Resid StdErrPred Lower Upper;
run;
%include 'DriveLetterOfCDROM:\SASMacros\L1_Regression.sas';
title 'L1-Regression for Prediction Efficacy Data';
%LAD(data=range,y=eff30,x=range range2);
P" -estimates can also be calculated with the LAV() function call of the SAS/IML®
module which provides greater flexibility in choosing the method for determining standard
errors and tailoring of output than the %LAD() macro. The code segment
proc iml;
use range; read all var {eff30} into y;    /* response */
read all var {range range2} into x;        /* regressors */
close range;
x = J(nrow(x),1,1) || x;                   /* prepend intercept column */
opt = {. 3 0 . };                          /* option vector for the LAV call */
call lav(rc,xr,X,y,,opt);
quit;
produces the same analysis as the previous call to the %LAD() macro.
Output 4.12 reiterates the strong negative influence of the Mg observation. When it is
contained in the model, the residual sum of squares is SSR = 511.52, the error mean square
estimate is MSR = 127.88, and neither the linear nor the quadratic term is (partially)
significant at the 5% level. Removing the Mg observation changes things dramatically. The
mean square error estimate drops to MSR = 12.31 and the linear and quadratic terms are
(partially) significant.
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 2 854.45840 427.22920 3.34 0.1402
Error 4 511.52342 127.88086
Corrected Total 6 1365.98182
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 2 1280.23575 640.11788 52.02 0.0047
Error 3 36.91880 12.30627
Corrected Total 5 1317.15455
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
The M-estimates of the regression coefficients are β̂₀,M = 65.65, β̂₁,M = 2.126, and β̂₂,M = 0.0103 (Output 4.13), values similar to the OLS estimates after the Mg observation has been deleted. The estimate of the residual variability (14.256) is similar to the mean square error estimate based on the OLS fit in the absence of the outlying observation. Of particular interest is the printout of the data set _predm that is generated automatically by the macro.
Output 4.13.
Results - The MEstim Macro - Author: Oliver Schabenberger
Model Information
Dimensions
Covariance Parameters 1
Columns in X 3
Columns in Z 0
Subjects 7
Max Obs Per Subject 1
Observations Used 7
Observations Not Used 0
Total Observations 7
Fit Statistics
Covariance Parameters
Parameter Estimate
Residual 14.2566
Num Den
Effect DF DF F Value Pr > F
range 1 4 30.93 0.0051
range2 1 4 20.31 0.0108
Standard
Effect Estimate Error DF t Value Pr > |t|
StdErr
Obs eff30 _wght Pred Resid Pred Lower Upper
Notice that the raw residuals sum to zero in the ordinary least squares analysis but not in M-estimation, due to the fact that residuals are weighted unequally. The results of L₁ estimation of the model parameters are shown in Output 4.14. The estimates of the parameters are very similar to the M-estimates.
Output 4.14.
Least Absolute Deviation = L1-Norm Regression
Author: Oliver Schabenberger
Model: eff30 = intcpt range range2
Table 4.21. Estimated coefficients for quadratic polynomial yᵢ = β₀ + β₁xᵢ + β₂xᵢ² + eᵢ
for Prediction Efficacy data

              OLS(all)    OLS(−Mg)    L₁-Regression   M-Regression
  β̂₀           39.121      75.103        57.217          65.649
  β̂₁            1.398       2.414         1.917           2.126
  β̂₂            0.0063      0.0119        0.009           0.010
  n                 7           6             7               7
  SSR          511.523      36.919       624.275         643.667
  SAD           47.884           —        35.363          37.022
Among the fits based on all observations (n = 7), the residual sum of squares is minimized for ordinary least squares regression, as it should be. By the same token, L₁ estimation yields the smallest sum of absolute deviations (SAD). M-regression with Huber's weighing function has a residual sum of squares slightly larger than the least absolute deviation regression. Notice that M-estimates are obtained by weighted least squares, and SSR is not the criterion being minimized. The sum of absolute deviations of M-estimation is smaller than that of ordinary least squares and only slightly larger than that of the L₁ fit.
The similarity of the predicted trends for OLS(−Mg), L₁-, and M-regression is apparent in Figure 4.16. While the Mg observation pulls the least squares regression line toward it, L₁- and M-estimates are not greatly affected by it. The predictions for L₁- and M-regression are hard to distinguish; they are close to the least squares predictions obtained after outlier deletion but do not require removal of the offending data point. The L₁ regression passes through three data points since the model contains two regressors and an intercept. The data points are CEC, Prec, and P.
Figure 4.16. Fitted and predicted values in the prediction efficacy example for ordinary least squares with and without the outlying Mg observation, and for L₁- and M-regression.
Example 4.3 Dollar Spot Counts (continued). Median polishing of the dollar spot count data suggested analyzing the log-transformed counts. We apply M-estimation here with Huber's weighing function and tuning constant 1.345, the default settings of %MEstim(). The statements
%include 'DriveLetter:\SASMacros\MEstimation.sas';
%MEstim(data=dollarspot,
stmts=%str(
class block tmt;
model lgcnt = block tmt;
lsmeans tmt / diff;) );
ask for a robust analysis of the log-transformed dollar spot counts in the block design
and pairwise treatment comparisons (lsmeans tmt / diff).
Model Information
Data Set WORK.DOLLSPOT
Dependent Variable LGCNT
Weighing Function HUBER
Covariance Structure Variance Components
Estimation Method REML
Residual Variance Method Parameter
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Between-Within
Weighing constant 1.345
Fit Statistics
OLS Residual variance 0.161468
Rescaled MAD 0.38555
Birch and Agard estimate 0.162434
Observations used 56
Sum of weights (M) 55.01569
Sum of residuals (M) 0.704231
Sum of abs. residuals (M) 15.68694
Sum of squ. residuals (M) 6.336998
Sum of residuals (OLS) -862E-16
Sum of abs. residuals (OLS) 15.89041
Sum of squ. residuals (OLS) 6.297243
Covariance Parameters
Parameter Estimate
Residual 0.1624
The least squares estimates minimize the sum of squared residuals (6.297, Table 4.22). The corresponding value in the robust analysis is very close, as are the F statistics for the treatment effects (7.60 and 7.38).
The treatment estimates are also quite close. Because of the orthogonality and the constant weights in the least squares analysis, the standard errors of the treatment effects are identical. In the robust analysis, the standard errors depend on the weights, which differ from observation to observation depending on the size of their residuals. The subtle difference in precision is exhibited in the standard error of the estimate for treatment 6. Four observations are downweighed substantially in the robust analysis. In particular, treatment 6 in block 2 (y₂₆ = 99, Table 4.3) received weight 0.77 and treatment 12 in block 1 (y₁,₁₂ = 63) received weight 0.64. These observations had not been identified as potential outliers before. Combined with treatment 10 in block 4, these observations accounted for three of the four largest median-polished residuals.
Table 4.22. Analysis of variance for Dollar Spot log(counts) by least squares and
M-estimation (Huber's weighing function)

                                   Least Squares   M-Estimation
  Sum of squared residuals               6.29           6.34
  Sum of absolute residuals             15.89          15.69
  F_obs for treatment effect             7.60           7.38
  p-value for treatment effect          0.0001         0.0001
  Treatment Estimates (Std.Err)
    1                               3.27 (0.20)    3.27 (0.20)
    2                               4.21 (0.20)    4.21 (0.20)
    3                               3.31 (0.20)    3.31 (0.20)
    4                               3.59 (0.20)    3.56 (0.20)
    5                               3.90 (0.20)    3.90 (0.20)
    6                               3.93 (0.20)    3.88 (0.21)
    ⋮                                    ⋮              ⋮
Figure 4.17. Results of pairwise treatment comparisons in robust ANOVA and ordinary least squares estimation. Dots reflect significance of a treatment comparison at the 5% level in ANOVA, circles significance at the 5% level in M-estimation.
The robust analysis downweighs the influence of y₂₆ = 99 whereas the least squares analysis weighs all observations equally. Such differences, although small, can have a measurable impact on treatment comparisons. At the 5% significance level the least squares and robust analyses agree closely. However, treatments 3 and 6 as well as 6 and 13 are significantly different in the least squares analysis but not in the robust analysis (Figure 4.17).
The detrimental effect of outlier deletion in experimental designs with few degrees of freedom to spare can be demonstrated with the following split-plot design. The whole-plot design is a randomized complete block design with four treatments in two blocks. Each whole-plot is then subdivided into three sub-plots to which the levels of the sub-plot treatment factor are randomly assigned. The analysis of variance of this design contains separate error terms for whole- and sub-plot factors. Tests of whole-plot effects use the whole-plot error, which has only 3 degrees of freedom (Table 4.23).
The sub-plot error is associated with 8 degrees of freedom. Now assume that the observations from one of the eight whole-plots appear errant. It is surmised that either the experimental units or the measurements were compromised, and the whole-plot is removed from the analysis (Table 4.24).
Table 4.24. Analysis of variance for split-plot design after removal of one whole-plot

  Source      Degrees of freedom
  Block                1
  A                    3
  Error(A)             2
  B                    2
  AB                   6
  Error(B)             6
  Total               20
At the 5% significance level the critical value in the F-test for the whole-plot (A) main effect is F₀.₀₅,₃,₃ = 9.28 in the complete design and F₀.₀₅,₃,₂ = 19.16 in the design with a lost whole-plot. The test statistic F_obs = MS(A)/MS(Error(A)) must double in value to find a significant difference among the whole-plot treatments. An analysis which retains extreme observations and reduces their impact is to be preferred.
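The two critical values can be verified with the FINV quantile function of The SAS® System:

data _null_;
   f_full    = finv(0.95, 3, 3);   /*  9.28: whole-plot error with 3 df */
   f_reduced = finv(0.95, 3, 2);   /* 19.16: whole-plot error with 2 df */
   put f_full= f_reduced=;
run;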
Example 4.6. The data from a split-plot design with four whole-plot treatments arranged in a randomized complete block design with three blocks and three sub-plot treatments are given in the table below.
It is assumed that the levels of factor A are quantitative and equally spaced. Apart from tests for main effects and interactions, we are interested in testing for trends of the response with the levels of A. We present a least squares analysis of variance and a robust analysis based on the model
\[
Y_{ijk} = \mu + \rho_i + \alpha_j + d_{ij} + \beta_k + (\alpha\beta)_{jk} + e_{ijk},
\]
where μ is an overall mean, ρᵢ denotes the block effect (i = 1,…,3), αⱼ the main effects of factor A (j = 1,…,4), dᵢⱼ is the random whole-plot experimental error with mean 0 and variance σ_d², βₖ are the main effects of factor B (k = 1,2,3), (αβ)ⱼₖ are the interaction effects, and eᵢⱼₖ is the random sub-plot experimental error with mean 0 and variance σ_e².
The tabulated data do not suggest any data points as problematic. A graphical display adds more insight. The value 4.7 for the fourth level of factor A and the first level of B in the third block appears suspiciously small compared to the remainder of the data for that whole-plot factor level (Figure 4.18).
Figure 4.18. Data in split-plot design. Labels B1, …, B3 in the graph area are drawn at the value of the response for the particular combination of factors A (horizontal axis) and B. Values from block 1 are underlined, appear in regular type for block 2, and are italicized for block 3.
Split-plot designs are special cases of mixed models (§7) that are best fit with the mixed procedure of The SAS® System. The standard and robust analyses using Huber's weight function with tuning constant 1.345 are produced with the statements
/* Robust Analysis */
%include 'DriveLetter:\SASMacros\MEstimation.sas';
%MEstim(data=spd,
stmts=%str(class block a b;
model y = block a b a*b ;
random block*a;
parms / nobound;
lsmeans a b a*b / diff;
lsmeans a*b / slice=(a b);
contrast 'A cubic @ B1' a -1 3 -3 1
a*b -1 0 0 3 0 0 -3 0 0 1 0 0;
contrast 'A quadr.@ B1' a 1 -1 -1 1
a*b 1 0 0 -1 0 0 -1 0 0 1 0 0;
contrast 'A linear@ B1' a -3 -1 1 3
a*b -3 0 0 -1 0 0 1 0 0 3 0 0;
/* and so forth for contrasts @ B2 and @B3 */
),converge=1E-4,fcn=huber );
The interaction is not significant in the least squares analysis (p = 0.1488, Table 4.26) and both main effects are. The robust analysis reaches a different conclusion. It indicates a significant A × B interaction (p = 0.0355) and a masked A main effect (p = 0.3312). At the 5% significance level the least squares results suggest linear trends of the response in A for all levels of B. A marginal quadratic trend can be noted for B2 (p = 0.0766, Table 4.26). The robust analysis concludes a stronger linear effect at B1 and quadratic effects at B2 and B3.
The estimated treatment cell means are very much the same for both analyses with the exception of μ̂₁₃ and μ̂₄₁ (Figure 4.19). The least squares estimate of μ₁₃ is too large and the estimate of μ₄₁ is too small. For the other treatment combinations the estimates are identical (because of the unequal weights the precision of the treatment estimates is not the same in the two analyses, even if the estimates agree). Subtle differences in estimates of treatment effects contribute to the disagreement in conclusions from the two analyses in Table 4.26. A second source of disagreement is the estimate of experimental error variance. The sub-plot error variance σ_e² was estimated as σ̂² = 0.5584 in the least squares analysis and as 0.3761 in the robust analysis. Comparisons of the treatments whose estimates are not affected by the outliers will be more precise and powerful in the robust analysis.
Figure 4.19. Estimated treatment means μ̂ⱼₖ in split-plot design. The centers of the circles denote the estimates in the robust analysis, the labels the location of the estimates in the least squares analysis.
The disagreement does not stop here. Comparing the levels of B at each level of A via slicing (see §4.3.2), the least squares analysis fails to find significant differences among the levels of B at the 5% level for any level of A. The robust analysis detects B effects at A1 and A3 (Table 4.27). The marginally significant slice at level A4 in the least squares analysis …
4.7 Nonparametric Regression

Consider the case of a single response variable Y and a single covariate X. The regression of Y on X is the conditional expectation
\[
f(x) = \mathrm{E}[Y \mid X = x],
\]
and our analysis of the relationship between the two variables so far has revolved around a parametric model for f(x), for example a quadratic polynomial f(x) = β₀ + β₁x + β₂x². Inferences drawn from the analysis depend on the model for f(x) being correct. How are we to proceed in situations where the data do not suggest a particular class of parametric models? What can be gleaned about the conditional expectation E[Y|x] in an exploratory fashion that can aid in the development of a parametric model?
A starting point is to avoid any parametric specification of the mean function and to consider the general model
\[
Y_i = f(x_i) + e_i. \qquad [4.50]
\]
Example 4.7. Paclobutrazol Growth Response. During the 1995 growing season the growth regulator Paclobutrazol was applied May 1, May 29, June 29, and July 24 on turf plots. If turfgrass growth is expressed relative to the growth of untreated plots, we expect a decline of growth shortly after each application of the regulator and increasing growth as the regulator's effect wears off. Figure 4.20 shows the clipping percentages removed from Paclobutrazol-treated turf by regular mowing. The amount of clippings removed relative to the control is a surrogate measure of growth in this application.
The data points show the general trend that is expected: decreased growth shortly after application, with recovery before the next application. It is not obvious, however, how to model the clipping percentages over time parametrically. A single polynomial function would require trends of high order to pick up the fluctuations in growth response. One could also fit separate quadratic or cubic polynomials to the intervals [0,2], [2,3], and [3,4] months. Before examining complicated parametric models, a nonparametric smooth of the data can (i) highlight pertinent features of the data, (ii) provide guidance for the specification of possible parametric structures, and (iii) answer some of the questions of interest.
[Figure 4.20. Clippings (% of control) removed from Paclobutrazol-treated turf over months 0 through 4.]
Figure 4.21. Light transmittance expressed as photosynthetic photon flux density (PPFD, μmol·m⁻²·s⁻¹) in the understory of a longleaf pine stand. (a): raw data measured in 10-second intervals. (b) to (d): moving average smoothers with symmetric nearest neighborhoods of 1% (b), 5% (c), and 30% (d) of the n = 819 data points. Data kindly provided by Dr. Paul Mou, Department of Biology, University of North Carolina at Greensboro. Used with permission.
Figure 4.22. Loess fit of light transmittance data with smoothing parameters identical to
those in Figure 4.21.
Cleveland's proposal was to initially fit a dth-degree polynomial with weights Wᵢ(xᵢ; λ) and to obtain the residual yᵢ − f̂(xᵢ) at each point. Then a second set of weights δᵢ is defined …
… is a natural choice. Other frequently used kernels are the quadratic kernel
\[
K(t) = \begin{cases} 0.75\,(1 - t^2) & |t| \le 1 \\ 0 & \text{otherwise} \end{cases}
\]
due to Epanechnikov (1969), the triangular kernel K(t) = (1 − |t|)·I(|t| ≤ 1), and the minimum variance kernel
\[
K(t) = \begin{cases} \tfrac{3}{8}\,(3 - 5t^2) & |t| \le 1 \\ 0 & \text{otherwise.} \end{cases}
\]
For a discussion of these kernels see Hastie and Tibshirani (1990), Härdle (1990), and Eubank (1988). The weight assigned to the point xᵢ in the estimation of f̂(x₀) is
\[
W_0(x_i; \lambda) = \frac{1}{\lambda}\,K\!\left(\frac{|x_i - x_0|}{\lambda}\right).
\]
Using these weights to form a locally weighted average of the responses, one arrives at the popular Nadaraya-Watson kernel estimator (Nadaraya 1964, Watson 1964),
\[
\hat{f}(x_0) = \sum_{i=1}^{n} W_0(x_i;\lambda)\,y_i \Big/ \sum_{i=1}^{n} W_0(x_i;\lambda).
\]
Compared to the choice of bandwidth λ, the choice of kernel function is usually of lesser consequence for the resulting estimate f̂(x).
The Nadaraya-Watson kernel estimator is a weighted average of the observations where the weights depend on the kernel function, the bandwidth, and the placement of the design points x₁, …, xₙ. Rather than estimating an average locally, one can also estimate a local mean function that depends on X. This leads to kernel regression. A local linear kernel regression estimate models the mean at x₀ as E[Y] = β₀⁽⁰⁾ + β₁⁽⁰⁾x₀ by weighted least squares, where the weights for the sum of squares are given by the kernel weights. Once estimates are obtained, the mean at x₀ is estimated as
\[
\hat{f}(x_0) = \hat{\beta}_0^{(0)} + \hat{\beta}_1^{(0)} x_0 .
\]
As x₀ is changed to the next location at which a mean prediction is desired, the kernel weights are recomputed and the weighted least squares problem is solved again, yielding new estimates β̂₀⁽⁰⁾ and β̂₁⁽⁰⁾.
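Each prediction point thus requires only one weighted least squares solve. A minimal SAS/IML® sketch with the Epanechnikov kernel follows; the module name, toy data, and bandwidth are our own assumptions:

proc iml;
   start locallin(x0, x, y, lambda);
      t  = (x - x0)/lambda;
      w  = 0.75 # (1 - t##2) # (abs(t) <= 1);   /* Epanechnikov weights */
      xm = j(nrow(x), 1, 1) || x;               /* local intercept, slope */
      b  = solve((xm#w)`*xm, (xm#w)`*y);        /* weighted LS at x0 */
      return( b[1] + b[2]#x0 );                 /* fhat(x0) */
   finish;
   x = (1:20)`/2;                               /* toy design points */
   y = sin(x) + 0.1#x;                          /* toy responses */
   print (locallin(5, x, y, 1.5));
quit;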
For the moving average smoothers of Figure 4.21, f̂(x₀) is simply the unweighted average over a symmetric nearest neighborhood N(x₀) containing 2λ + 1 points,
\[
\hat{f}(x_0) = \frac{1}{2\lambda + 1}\sum_{i \in N(x_0)} y_i . \qquad [4.53]
\]
The prediction error focuses on the prediction of a new observation and thus has an additional term (σ²). The bandwidth which minimizes one criterion also minimizes the other. It can be shown that
\[
AMSE(\lambda) = \frac{1}{n}\sum_i \mathrm{Var}\!\left[\hat{f}(x_i)\right] + \frac{1}{n}\sum_i \left( f(x_i) - \mathrm{E}\!\left[\hat{f}(x_i)\right]\right)^2 .
\]
The term f(xᵢ) − E[f̂(xᵢ)] is the bias of the smooth at xᵢ, and the average mean square error is the sum of the average variance and the average squared bias. The cross-validation statistic is
" 8 #
GZ "C3 s0 3 aB3 b [4.54]
8 3"
where sC3ß3 is the predicted mean of the 3th observation if that observation is left out in the
estimation of the regression coefficients. Bandwidths that minimize the cross-validation
statistic [4.54] are often too small, creating fitted values with too much variability. Various
adjustments of the basic GZ statistic have been proposed. The generalized cross-validation
statistic of Craven and Wahba (1979),
#
" 8 C3 s0 aB3 b
KGZ " , [4.56]
8 3" 8/
simplifies the calculation of CV and penalizes it at the same time. If the vector of fitted values at the observed data points is written as ŷ = Hy, then the degrees of freedom n − ν are n − tr(H). Notice that the difference in the numerator term is no longer the leave-one-out residual, but the residual where f̂ is based on all n data points. If the penalty n − ν is applied directly to CV, a statistic results that Mays, Birch, and Starnes (2001) term
\[
PRESS^{*} = \frac{PRESS}{n - \nu}.
\]
Whereas bandwidths selected on the basis of CV are often too small, those selected based on PRESS* tend to be large. A penalized PRESS statistic that is a compromise between CV and PRESS* has also been proposed; in its penalty, SSR_max is the largest residual sum of squares over all possible values of λ and SSR(λ) is the residual sum of squares for the value of λ investigated.
CV and related statistics select the bandwidth based on the ability to predict a new observation. One can also concentrate on the ability to estimate the mean f(xᵢ), which leads to consideration of Q = Σᵢ₌₁ⁿ (f(xᵢ) − f̂(xᵢ))² as a selection criterion under squared error loss. Mallows' C_p statistic is
\[
C_p(\lambda) = n^{-1} SSR(\lambda) + 2\hat{\sigma}^2\,\mathrm{tr}(\mathbf{H})/n. \qquad [4.57]
\]
The estimate f̂(x) traces the reversal in the response trend after each treatment application. It appears that approximately two weeks after the third and fourth treatment applications the growth-regulating effect of Paclobutrazol has disappeared, and there appears to be a growth stimulation relative to the untreated plots.
Figure 4.23. AIC(λ), CV(λ), and GCV(λ) for the Paclobutrazol response data in Figure 4.20. The goodness-of-fit measures were rescaled to range from 0 to 1.
The quadratic loess fit was obtained in The SAS® System with proc loess. Starting with Release 8.1, the select= option of the model statement in that procedure enables automatic selection of the smoothing parameter by the AIC(λ) or GCV(λ) criteria. For the Paclobutrazol data, the following statements fit local quadratic polynomials and select the smoothing parameter based on the generalized cross-validation criterion (GCV(λ), Output 4.16).
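A minimal sketch of such a call, with the data set and variable names (paclobutrazol, clip, month) as our assumptions:

proc loess data=paclobutrazol;
   model clip = month / degree=2 select=GCV;
run;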
[Figure: loess fit of clippings (% of control) against months for the Paclobutrazol data.]

Output 4.16.
Fit Summary
Fit Method Direct
Number of Observations 30
Degree of Local Polynomials 2
Smoothing Parameter 0.35000
Points in Local Neighborhood 10
Residual Sum of Squares 6613.29600
Trace[L] 10.44893
GCV 17.30123
AICC 7.70028
AICC1 233.02349
Delta1 18.62863
Delta2 18.28875
Equivalent Number of Parameters 9.52649
Lookup Degrees of Freedom 18.97483
Residual Standard Error 18.84163
Nonlinear Models
“Given for one instant an intelligence which could comprehend all the
forces by which nature is animated and the respective situation of the
beings who compose it — an intelligence sufficiently vast to submit these
data to analysis — it would embrace in the same formula the movements
of the greatest bodies of the universe and those of the lightest atom; for it,
nothing would be uncertain and the future, as the past, would be present
to its eyes.” Pierre de LaPlace, Concerning Probability. In Newman, J.R.,
The World of Mathematics. New York: Simon and Schuster, 1965, p.
1325.
5.1 Introduction
5.2 Models as Laws or Tools
5.3 Linear Polynomials Approximate Nonlinear Models
5.4 Fitting a Nonlinear Model to Data
5.4.1 Estimating the Parameters
5.4.2 Tracking Convergence
5.4.3 Starting Values
5.4.4 Goodness-of-Fit
5.5 Hypothesis Tests and Confidence Intervals
5.5.1 Testing the Linear Hypothesis
5.5.2 Confidence and Prediction Intervals
5.6 Transformations
5.6.1 Transformation to Linearity
5.6.2 Transformation to Stabilize the Variance
5.7 Parameterization of Nonlinear Models
5.7.1 Intrinsic and Parameter-Effects Curvature
5.7.2 Reparameterization through Defining Relationships
5.8 Applications
5.8.1 Basic Nonlinear Analysis with The SAS® System — Mitscherlich's
Yield Equation
Recall from §1.7.2 that nonlinear statistical models are defined as models in which the deriva-
tives of the mean function with respect to the parameters depend on one or more of the pa-
rameters. A growing number of researchers in the biological sciences share our sentiment that
relationships among biological variables are best described by nonlinear functions. Processes
such as growth, decay, birth, mortality, abundance, and yield rarely relate linearly to
explanatory variables. Even the most basic relationships between plant yield and nutrient
supply, for example, are nonlinear. Liebig's famous law of the minimum or law of constant
returns has been interpreted to imply that, for a single deficient nutrient, crop yield Y is proportional to the addition of a fertilizer X until a point is reached where another nutrient is in the minimum and yield is limited. At this point further additions of the fertilizer show no effect and the yield stays constant unless the deficiency of the limiting nutrient is removed. The proportionality between Y and X prior to reaching the yield limit implies a straight-line
relationship that can be modeled with a linear model. As soon as the linear increase is
combined with a plateau, the corresponding model is nonlinear. Such models are termed
linear-plateau models (Anderson and Nelson 1975), linear response-and-plateau models
(Waugh et al. 1973, Black 1993), or broken-stick models (Colwell et al. 1988). The data in
Figure 5.1 show relative corn (Zea mays L.) yield percentages as a function of late-spring test
nitrate concentrations in the top 30 cm of the soil. The data are a portion of a larger data set discussed and analyzed by Binford et al. (1992). A linear-plateau model has been fitted to these data and is shown as a solid line. Let Y denote the yield percent and x the soil nitrogen concentration. The linear-plateau model can be written as
\[
\mathrm{E}[Y] = \begin{cases} \beta_0 + \beta_1 x & x \le \alpha \\ \beta_0 + \beta_1 \alpha & x > \alpha, \end{cases} \qquad [5.1]
\]
where α is the nitrogen concentration at which the two linear segments join. An alternative expression for model [5.1] is
\[
\mathrm{E}[Y] = \beta_0 + \beta_1\left\{ x\,I(x \le \alpha) + \alpha\,I(x > \alpha)\right\}.
\]
Here, I(x ≤ α) is the indicator function that returns 1 if x ≤ α and 0 otherwise. Similarly, I(x > α) returns 1 if x > α and 0 otherwise. If the concentration α at which the lines intersect is known, the term z = x·I(x ≤ α) + α·I(x > α) is known, and one can set up an appropriate regressor variable by replacing the concentrations in excess of α with the value of α. The resulting model is a linear regression model E[Y] = β₀ + β₁z with parameters β₀ and β₁. If α is not known and must be estimated from the data (as will usually be the case), this is a nonlinear model, since the derivatives
\[
\begin{aligned}
\partial \mathrm{E}[Y]/\partial \beta_0 &= 1\\
\partial \mathrm{E}[Y]/\partial \beta_1 &= x\,I(x \le \alpha) + \alpha\,I(x > \alpha)\\
\partial \mathrm{E}[Y]/\partial \alpha &= \beta_1\, I(x > \alpha)
\end{aligned}
\]
depend on the unknown parameters β₁ and α.
Figure 5.1. Relative corn yield percent as a function of late-spring test soil nitrogen concentration in the top 30 cm of soil. Solid line is the fitted linear-plateau model. Dashed line is the fitted quadratic polynomial model. Data kindly made available by Dr. A. Blackmer, Department of Agronomy, Iowa State University. Used with permission. See also Binford, Blackmer, and Cerrato (1992) and the application in §5.8.4.
Should we guess a value for α from a graph of the data, assume it is the true value (without variability), and fit a simple linear regression model, or should we let the data guide us to a best possible estimate of α and fit the model as a nonlinear regression model? As an alternative we can abandon the linear-plateau philosophy and fit a quadratic polynomial to the data, since a polynomial E[Y] = β₀ + β₁x + β₂x² has curvature. That this polynomial fails to fit the data is easily seen from Figure 5.1. It breaks down in numerous places. The initial increase of yield with soil NO₃ is steeper than what a quadratic polynomial can accommodate. The maximum yield for the polynomial model occurs at a nitrate concentration that is upwardly biased. Anderson and Nelson (1975) have noticed that these two model breakdowns are rather typical when polynomials are fit to data for which a linear-plateau model is appropriate.
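Estimating α together with β₀ and β₁ is straightforward in proc nlin, which accepts conditional programming statements. A sketch with assumed data set and variable names (cornyield, ryp, no3) and starting values read off a plot such as Figure 5.1:

proc nlin data=cornyield;
   parameters b0=40 b1=2 alpha=25;           /* eyeballed starting values */
   if no3 <= alpha then mean = b0 + b1*no3;  /* rising segment */
   else mean = b0 + b1*alpha;                /* plateau beyond alpha */
   model ryp = mean;
run;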
This chapter is concerned with nonlinear statistical models with a single covariate. In §5.2 we investigate growth models as a particularly important family of nonlinear models, to demonstrate how theoretical considerations give rise to nonlinear models through deterministic generating equations, but also to examine how nonlinear models evolved from mathematical equivalents of laws of nature to empirical tools for data summary and analysis. In §5.3 a relationship between linear polynomial and nonlinear models is drawn with the help of Taylor series expansions, and the basic process of fitting a nonlinear model to data is discussed in §5.4. Tests of hypotheses and inference about the parameters are covered in §5.5. Even if models can be transformed to a linear scale, we prefer to fit them in their nonlinear form to retain interpretability of the parameters and to avoid transformation bias. Transformations to stabilize the variance have already been discussed in §4.5.2 for linear models. Transformations to linearity (§5.6) are of concern if the modeler does not want to resort to nonlinear fitting methods. Parameterization, the process of changing the mathematical form of a nonlinear model by re-expressing the model in terms of different parameters, greatly impacts the statistical properties of the parameter estimates and the convergence properties of the fitting algorithms. Problems in fitting a particular model can often be overcome by changing its parameterization (§5.7). Through reparameterization one can also make the model depend on parameters it did not contain originally, thereby facilitating statistical inference about these quantities. In §5.8 we discuss various analyses of nonlinear models, from a standard textbook example to complex factorial treatment structures involving a nonlinear response. Since the selection of an appropriate model family is key to successful nonlinear modeling, we present numerous concave, convex, and sigmoidal nonlinear models in §A5.9 (on CD-ROM). Additional mathematical details extending the discussion in the text can be found in Appendix A on the CD-ROM (§A5.10).
which the product y is also a catalyst (an autocatalytic reaction). If α is the initial rate of growth and β the upper limit of growth, this relationship can be expressed in form of the differential equation
\[
\partial \ln\{y\}/\partial t = (\partial y/\partial t)/y = \alpha(1 - y/\beta), \qquad [5.2]
\]
which is termed the generating equation of the process. Robertson viewed this relationship as fundamental to describe the increase in size (y) over time (t) for (all) biological entities. The solution to this differential equation is known as the logistic or autocatalytic model:
\[
y(t) = \frac{\beta}{1 + \exp\{-\alpha(t - \gamma)\}}. \qquad [5.3]
\]
Pearl and Reed (1924) promoted the autocatalytic concept not only for individual but also for
population growth.
The term ∂ln{y}/∂t = (∂y/∂t)/y in [5.2] is known as the specific growth rate, a measure of the rate of change relative to size. Minot (1908) called it the power of growth and defined senescence as a loss in specific growth rate. He argued that ∂ln{y}/∂t is a concave, decreasing function of time, since the rate of senescence decreases from birth. A mathematical example of a relationship satisfying Minot's assumptions about aging and death is the differential equation
\[
\partial \ln\{y\}/\partial t = \alpha\left\{\ln\{\beta\} - \ln\{y\}\right\}, \qquad [5.4]
\]
where α is the intrinsic growth rate and β is a rate of decay. This model is due to Gompertz (1825), who posited it as a law of human mortality. It assumes that specific growth declines linearly with the logarithm of size. Gompertz (1825) reasoned that
“the average exhaustions of a man's power to avoid death were such that at the end of equal infinitely small intervals of time, he lost equal portions of his remaining power to oppose destruction.”
The Gompertz model, one of the more common growth models and named after him, is the solution to this differential equation:
\[
y(t) = \beta \exp\{-\exp\{-\alpha(t - \gamma)\}\}. \qquad [5.5]
\]
Like the logistic model it has upper and lower asymptotes and is sigmoidal in shape. Whereas the logistic model is symmetric about the inflection point t = γ, the Gompertz model is asymmetric.
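That [5.5] indeed solves the generating equation [5.4] is verified in one line:
\[
\ln\{y(t)\} = \ln\{\beta\} - \exp\{-\alpha(t-\gamma)\} \;\Longrightarrow\; \frac{\partial \ln\{y\}}{\partial t} = \alpha\exp\{-\alpha(t-\gamma)\} = \alpha\left\{\ln\{\beta\} - \ln\{y\}\right\}.
\]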
Whether growth is autocatalytic or captured by the Gompertz model has been the focus of much debate. The key was whether one believed that the specific growth rate is a linear or concave function of size. Courtis (1937) felt so strongly about the adequacy of the Gompertz model that he argued any biological growth, whether of an individual organism, its parts, or populations, can be described by the Gompertz model provided that for the duration of the study conditions (environments) remained constant.
The Gompertz and logistic models were developed for size-vs.-time relationships. A second developmental track focused on models where the size of one part (y₁) is related to the size of another (y₂) (so-called size-vs.-size models). Huxley (1932) proposed that the specific growth rates of y₁ and y₂ should be proportional:
\[
\frac{\partial \ln\{y_1\}}{\partial t} = \beta\,\frac{\partial \ln\{y_2\}}{\partial t}. \qquad [5.6]
\]
The parameter β measures the ratio of the specific growth rates of y₁ and y₂. The isometric case β = 1 was of special interest because it implies independence of size and shape. Quiring (1941) felt strongly that allometry, the proportionality of sizes, was a fundamental biological law. The study of its regularities, in his words,
“should lead to a knowledge of the fundamental laws of organic growth and explain the scale of being.”
Integrating the differential equation, one obtains the basic allometric equation
\[
\ln\{y_1\} = \alpha + \beta \ln\{y_2\}, \quad \text{or, in exponentiated form,} \quad y_1 = \exp\{\alpha\}\,y_2^{\beta}. \qquad [5.7]
\]
We notice at this point that the models [5.3] through [5.7] are of course nonlinear. Nonlinearity is a result of integrating the underlying differential equations. The allometric model, however, can be linearized by taking logarithms on both sides of [5.7]. Pázman (1993, p. 36) refers to such models as intrinsically linear. Models which cannot be transformed to linearity are then intrinsically nonlinear.
Allometric relationships can be embedded in more complicated models. Von Bertalanffy (1957) postulated that growth is the sum of positive (anabolic) forces that synthesize material and negative (metabolic) forces that reduce material in an organism. Studying the weight of animals, he found that the power 2/3 for the metabolic rate describes the anabolic forces well. The model derived from the differential equation
\[
\partial y/\partial t = \alpha y^{2/3} - \beta y \qquad [5.8]
\]
is known as the Von Bertalanffy model. Notice that the first term on the right-hand side is of the allometric form [5.7].
The paradigm shift from nonlinear models as mathematical expressions of laws to non-
linear models as empirical tools for data summary had numerous reasons. Cases that did not
seem to fit any of the classical models could only be explained as aberrations in measurement
protocol or environment or as new processes for which laws needed to be found. At the same
time evidence mounted that the various laws could not necessarily coexist. Zeger and Harlow
(1987) elaborate how Lumer (1937) showed that sigmoidal growth (Logistic or Gompertz) in
different parts of an organism can disable allometry by permitting only certain parameter
values in the allometry equation. The laws could not hold simultaneously. Laird (1965)
argued that allometric analyses were consistent with sigmoidal growth provided certain con-
ditions about specific growth rates are met. Despite the inconsistencies between sigmoidal
and allometric growth, Laird highlighted the utility in both types of models. Finally, advances
in computing technology made fitting of nonlinear models less time demanding and allowed
examination of competing models for the same data set. Rather than adopting a single model
family as the law to which a set of data must comply, the empirical nature of the data could be
emphasized and different model families could be tested against a set of data to determine
which described the observations best. Whether one adopts an underlying biological or
chemical relationship as true, there is much to be learned from a model that fits the data well.
Today, we select nonlinear models because they exhibit the patterns a data set calls for. If the data suggest a sigmoidal trend with limiting values, we turn to the logistic, Gompertz, and other families of models that exhibit the desired behavior. If empirical data suggest monotone increasing or decreasing relationships, families of concave or convex models are to be considered.
One can argue whether the empiricism in modeling has been carried too far. Study of
certain disciplines shows a prevalence of narrow classes of models. In (herbicide) dose-
response experiments the logistic model (or log-logistic model if the regressor is log-trans-
formed) is undoubtedly the most frequently used model. This is not the case because the
underlying linearity of specific growth (decay) rates is widely adopted as the mechanism of
herbicide response, but because in numerous works it was found that logistic functions fit
herbicide dose-response data well (e.g., Streibig 1980, Streibig 1981, Lærke and Streibig
1995, Seefeldt et al. 1995, Hsiao et al. 1996, Sandral et al. 1997). As a result, analysts may
resist the urge to thoroughly investigate alternative model families. Empirical models are not
panaceas and examples where the logistic family does not describe herbicide dose-response
behavior well can be found easily (see for example, Brain and Cousens 1989, Schabenberger
et al. 1999). Sandland and McGilchrist (1979) and Sandland (1983) criticize the widespread
application of the Von Bertalanffy model in the fisheries literature. The model's status, according to Sandland (1983), goes “far beyond that accorded to purely empirical models.”
Cousens (1985) criticizes the categorical assumption of many weed-crop competition studies
that crop yield is related to weed density in sigmoidal fashion (Zimdahl 1980, Utomo 1981,
Roberts et al. 1982, Radosevich and Holt 1984). Models for yield loss as a function of weed
density are more reasonably related to hyperbolic shapes according to Cousens (1985). The
appropriateness of sigmoidal vs. hyperbolic models for yield loss depends on biological
assumptions. If it is assumed that there is no competition between weeds and crop at low
densities, a sigmoidal model suggests itself. On the other hand, if one assumes that at low
weed densities weed plants interact with the crop but not each other and that a weed's
influence increases with its size, hyperbolic models with a linear increase of yield loss at low
weed densities of the type advocated by Cousens (1985) arise rather naturally. Because the
biological explanations for the two model types are different, Cousens concludes that one
must be rejected. We believe that much is to be gained from using nonlinear models that
differ in their physical, biological, and chemical underpinnings. If a sigmoidal model fits a set of yield data better than the hyperbolic, contrary to the experimenter's expectation, one is led to rethink the nature of the biological process, a most healthy exercise in any circumstance. If one adopts the attitude that models are selected because they describe the data well, not because they comply with a narrow set of biological assumptions, any one of which may be violated in a particular case, the modeler gains considerable freedom. Swinton and Lyford (1996), for example, entertain a reparameterized form of Cousens' rectangular hyperbola to model yield loss as a function of weed density. Their model permits a test of whether the yield loss function is indeed hyperbolic or sigmoidal, and the question can be resolved via a statistical test if one is not willing to choose between the two model families on biological grounds alone. Cousens
(1985) advocates semi-empirical model building. A biological process is divided into stages
and likely properties of each stage are combined to formulate a resulting model “based on
biologically sound premises.” His rectangular hyperbola mentioned above is derived on these
grounds: (i) yield loss percentage ranges between 0% and 100% as weed density tends towards 0 or infinity, respectively; (ii) effects of individual weed plants on the crop at low density are additive; (iii) the rate at which yield loss increases with increasing density is proportional to the squared yield loss per weed plant. Developing mathematical models in this manner is the essence of the semi-empirical approach.
Equation [5.9] is the Taylor series expansion of f(x) around x₀. Here, f′(x₀) denotes the first derivative of f(x) with respect to x, evaluated at the point x₀, and z = (x − x₀). R is the remainder term of the expansion and measures the accuracy of the approximation of f(x) by the series of order r. Replace f(x₀) with β₀, f′(x₀) with β₁, f″(x₀)/2! with β₂, and so forth, and [5.9] reveals itself as a polynomial in z:
\[
y = \beta_0 + \beta_1 z + \beta_2 z^2 + \cdots + \beta_r z^r + R = \mathbf{z}'\boldsymbol{\beta} + R. \qquad [5.10]
\]
The term β₀ + β₁z + β₂z² + ⋯ + β_r z^r is a linear approximation to f(x) and, depending on the number of terms, can be made arbitrarily close to f(x). If there are n distinct data points, a polynomial of degree n − 1 can fit them exactly.
Example 5.1. The data plotted in Figure 5.2 suggest a curved trend between y and x with an inflection point. To incorporate the inflection, a model must be found for which the second derivative of the mean function depends on x. A linear polynomial in x must be carried at least to the third order, leading to the four-parameter linear model
\[
Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + e_i .
\]
[Figure 5.2 shows the data together with the nonlinear fit E[Yᵢ] = 1 − exp{−x^2.2} and the cubic fit E[Yᵢ] = −0.0107 + 0.1001x + 0.807x² − 0.3071x³.]
Figure 5.2. Data suggesting a mean function with inflection, summarized by nonlinear and linear polynomial models.
The nonlinear model is more parsimonious and also restricts E[Yᵢ] between zero and one. If the response is a true proportion, the linear model does not guarantee predicted values inside the permissible range, whereas the nonlinear model does. The nonlinear function approaches the upper limit of 1.0 asymptotically as x grows. The fitted polynomial, because of its curvature, does not have an asymptote but achieves extrema at x ≈ −0.06 and x ≈ 1.81.
Linear polynomials are flexible modeling tools that do not appeal to a generating equation and, for short data series, may be the only possible modeling choice. They are less parsimonious, poor at fitting asymptotic approaches to limiting values, and do not provide a biologically meaningful parameter interpretation. From the scientist's point of view, nonlinear models are certainly superior to polynomials. As an exploratory tool that points the modeler in the direction of appropriate nonlinear models, polynomials are valuable. For complex processes with changes of phase and temporal fluctuations, they may be the only models offering sufficient flexibility unless one resorts to nonparametric methods (see §4.7).
• Stop criteria for the iterative process should be true convergence, not
termination, criteria to distinguish convergence to a global minimum from
lack of progress of the iterative algorithm.
If X is of full rank, this problem has a closed-form unique solution, the OLS estimator β̂ = (X′X)⁻¹X′y (see §4.2.1). If the mean function is nonlinear, the basic model equation is
\[
Y_i = f(\mathbf{x}_i; \boldsymbol{\theta}) + e_i, \quad e_i \sim iid\,(0, \sigma^2), \quad i = 1, \ldots, n, \qquad [5.11]
\]
where θ is the (p × 1) vector of parameters to be estimated and f(xᵢ; θ) is the mean of Yᵢ. The residual sum of squares to be minimized now can be written as
\[
S(\boldsymbol{\theta}) = \sum_{i=1}^{n}\left( y_i - f(\mathbf{x}_i;\boldsymbol{\theta})\right)^2 = \left(\mathbf{y} - \mathbf{f}(\mathbf{x};\boldsymbol{\theta})\right)'\left(\mathbf{y} - \mathbf{f}(\mathbf{x};\boldsymbol{\theta})\right), \qquad [5.12]
\]
with
\[
\mathbf{f}(\mathbf{x};\boldsymbol{\theta}) = \left[ f(\mathbf{x}_1;\boldsymbol{\theta}),\, f(\mathbf{x}_2;\boldsymbol{\theta}),\, \ldots,\, f(\mathbf{x}_n;\boldsymbol{\theta})\right]'.
\]
This minimization problem is not as straightforward as in the linear case since f(x; θ) is a nonlinear function of θ. The derivatives of S(θ) depend on the particular structure of the model, whereas in the linear case with f(x; θ) = Xβ finding derivatives is easy. One method of minimizing [5.12] is to replace f(x; θ) with a linear model that approximates f(x; θ). In §5.3, a nonlinear function f(x) was expanded into a Taylor series of order r. Since f(x; θ) has p unknowns in the parameter vector, we expand it into a first-order Taylor series about each element of θ. Denote by θ⁰ a vector of initial guesses of the parameters (a vector of starting values). The first-order Taylor series (see §A5.10.1 for details) of f(x; θ) around θ⁰ is
\[
\mathbf{f}(\mathbf{x};\boldsymbol{\theta}) \approx \mathbf{f}(\mathbf{x};\boldsymbol{\theta}^0) + \left.\frac{\partial \mathbf{f}(\mathbf{x};\boldsymbol{\theta})}{\partial \boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}^0}\left(\boldsymbol{\theta} - \boldsymbol{\theta}^0\right) = \mathbf{f}(\mathbf{x};\boldsymbol{\theta}^0) + \mathbf{F}^0\left(\boldsymbol{\theta} - \boldsymbol{\theta}^0\right), \qquad [5.13]
\]
where F⁰ is the (n × p) matrix of first derivatives of f(x; θ) with respect to the parameters, evaluated at the initial guess θ⁰. The residual y − f(x; θ) in [5.12] is then approximated by the residual
\[
\mathbf{y} - \mathbf{f}(\mathbf{x};\boldsymbol{\theta}^0) - \mathbf{F}^0(\boldsymbol{\theta} - \boldsymbol{\theta}^0) = \mathbf{y} - \mathbf{f}(\mathbf{x};\boldsymbol{\theta}^0) + \mathbf{F}^0\boldsymbol{\theta}^0 - \mathbf{F}^0\boldsymbol{\theta},
\]
which is linear in θ, and minimizing [5.12] can be accomplished by standard linear least squares where the response y is replaced by the pseudo-response y − f(x; θ⁰) + F⁰θ⁰ and the regressor matrix is given by F⁰. Since the estimates we obtain from this approximated linear least squares problem depend on our choice of starting values θ⁰, the process cannot stop after just one update of the estimates. Call the estimates of this first fit θ¹. Then we recalculate the new pseudo-response as y − f(x; θ¹) + F¹θ¹ and the new regressor matrix is F¹. This process continues until some convergence criterion is met, for example, until the relative change in residual sums of squares between two updates is minor. This approach to least squares estimation is the Gauss-Newton (GN) method. Writing r(θ^u) = y − f(x; θ^u) for the residual vector at the uth iterate, the updates take the form
\[
\boldsymbol{\theta}^{u+1} = \boldsymbol{\theta}^{u} + \boldsymbol{\delta}^{u}, \qquad \boldsymbol{\delta}^{u} = \left(\mathbf{F}^{u\prime}\mathbf{F}^{u}\right)^{-1}\mathbf{F}^{u\prime}\,\mathbf{r}(\boldsymbol{\theta}^{u}). \qquad [5.14]
\]
In a modified GN step the update is shortened to θ^{u+1} = θ^u + 2^{−k}δ^u,
and k is chosen to ensure that the residual sum of squares decreases between iterations. This is known as step-halving or step-shrinking. The GN method is also not a stable estimation method if the columns of F are highly collinear, for the same reasons that ordinary least squares estimates are unstable if the columns of the regressor matrix X are collinear (see §4.4.4 on collinearity) and hence X′X is ill-conditioned. Nonlinear models are notorious for ill-conditioning of the F′F matrix, which plays the role of the X′X matrix in the approximate linear model of the GN algorithm. In particular, when parameters appear in exponents, derivatives with respect to different parameters contain similar functions. Consider the simple two-parameter nonlinear model
\[
\mathrm{E}[Y] = 1 - \beta \exp\{-x^{\theta}\}. \qquad [5.16]
\]
Figure 5.3. Nonlinear response function E[Y] = 1 − β exp{−x^θ} with β = 0.5, θ = 0.9.
Assume the covariate vector is x = [0.2, 0.5, 0.7, 1.8]′; the F matrix then becomes
\[
\mathbf{F} = \begin{bmatrix} -0.7906 & -0.1495 \\ -0.5852 & -0.1087 \\ -0.4841 & -0.0626 \\ -0.1832 & 0.0914 \end{bmatrix}.
\]
The correlation coefficient between the two columns of F is 0.9785. Ridging (§4.4.5) the F′F matrix is one approach to modifying the basic GN method to obtain more stable estimates. This modification is known as the Levenberg-Marquardt method (Levenberg 1944, Marquardt 1963).
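In proc nlin the Marquardt variant is requested through the method= option; a sketch for model [5.16], with the data set name as our placeholder:

proc nlin data=ex516 method=marquardt;
   parameters beta=0.5 theta=0.9;           /* starting values */
   model y = 1 - beta*exp(-x**theta);       /* model [5.16] */
run;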
Calculating nonlinear parameter estimates by hand is a tedious exercise. Black (1993, p.
65) refers to it as the “drudgery connected with the actual fitting.” Fortunately, we can rely on
statistical computing packages to perform the necessary calculations and manipulations. We
do caution the user, however, that simply because a software package claims to be able to fit
nonlinear models does not imply that it can fit the models well. Among the features of a good
package we expect suitable modifications of several basic algorithms, grid searches over sets
of starting values, efficient step-halving procedures, the ability to apply ridge estimation
when the F matrix is poorly conditioned, explicit control over the type and strictness of the
convergence criterion, and automatic differentiation to free the user from having to specify
derivatives. These are just some of the features found in the nlin procedure of The SAS®
System. We now go through the "drudgery" of fitting a very simple, one-parameter nonlinear
model by hand, and then show how to apply the nlin procedure.
The data set consists of the response vector y = [0.1, 0.4, 0.6, 0.9]′ and the covariate vector x = [0.2, 0.5, 0.7, 1.8]′. The mean function and the matrix (vector) of derivatives are
\[
f(x;\theta) = 1 - \exp\{-x^{\theta}\}, \qquad F_i = \partial f/\partial\theta = \exp\{-x_i^{\theta}\}\, x_i^{\theta} \ln\{x_i\}.
\]
As a starting value we select θ⁰ = 1.3. From [5.14] the first evaluation of the derivative matrix and the residual vector gives the correction term
\[
\delta^0 = (\mathbf{F}^{0\prime}\mathbf{F}^{0})^{-1}\mathbf{F}^{0\prime}\,\mathbf{r}(\theta^0) = 9.8005 \times (-0.0230) \approx -0.2258,
\]
and the next iterate is θ¹ = θ⁰ + δ⁰ = 1.0742. Table 5.1 shows the results of successive iterations with the GN method.
" ?
Iteration ? )? F? r? aF? w F? b F? w r ? $? W s)
Ô !Þ"(&' × Ô !Þ!"'" ×
Ö !Þ")(& Ù Ö !Þ!''# Ù
! "Þ$ Ö Ù Ö Ù *Þ)!!& !Þ!#$! !Þ##&) !Þ!##'
!Þ""*' !Þ"$$"
Õ !Þ"%(% Ø Õ !Þ!"') Ø
Ô !Þ#$*# × Ô !Þ!'#' ×
Ö !Þ#!%( Ù Ö !Þ!#"* Ù
" "Þ!(%# Ö Ù Ö Ù (Þ!!)' !Þ!!'% !Þ!%%& !Þ!")%
!Þ"#$! !Þ"!&(
Õ !Þ"')' Ø Õ !Þ!&#' Ø
Ô !Þ##&% × Ô !Þ!&#$ ×
Ö !Þ#!"% Ù Ö !Þ!$!* Ù
# "Þ"")( Ö Ù Ö Ù (Þ%*$# !Þ!!!' !Þ!!%( !Þ!")"
!Þ"##$ !Þ"""#
Õ !Þ"'%( Ø Õ !Þ!%&" Ø
Ô !Þ##') × Ô !Þ!&$% ×
Ö !Þ#!") Ù Ö !Þ!$!! Ù
$ "Þ""%! Ö Ù Ö Ù (Þ%%!' !Þ!!!" !Þ!!!' !Þ!")"
!Þ"##% !Þ""!'
Õ !Þ"'&" Ø Õ !Þ!%&* Ø
% "Þ""%'
To fit this model using The SAS® System, we employ proc nlin. Prior to Release 6.12
of SAS® , proc nlin required the user to supply first derivatives for the Gauss-Newton
method and first and second derivatives for the Newton-Raphson method. Since
Release 6.12 of The SAS® System, an automatic differentiator is provided by the
procedure. The user supplies only the starting values and the model expression. The
following statements read the data set and fit the model using the default Gauss-Newton
algorithm. More sophisticated applications of the nlin procedure can be found in the
example applications (§5.8) and a more in-depth discussion of its capabilities and
options in §5.8.1. The statements
data Ex_51;
   input y x @@;
   datalines;
0.1 0.2 0.4 0.5 0.6 0.7 0.9 1.8
;
run;
proc nlin data=Ex_51;
parameters theta=1.3;
model y = 1 - exp(-x**theta);
run;
produce Output 5.1. The Gauss-Newton method converged in six iterations to a residual sum of squares of S(θ̂) = 0.018095 from the starting value θ⁰ = 1.3. The converged iterate is θ̂ = 1.1146 with an estimated asymptotic standard error ese(θ̂) = 0.2119.
Estimation Summary
Method Gauss-Newton
Iterations 6
R 2.887E-6
PPC(theta) 9.507E-7
RPC(theta) 7.761E-6
Object 4.87E-10
Objective 0.018095
Observations Read 4
Observations Used 4
Observations Missing 0
Asymptotic
Standard Asymptotic 95% Confidence
Parameter Estimate Error Limits
theta 1.1146 0.2119 0.4401 1.7890
The parameters statement defines which quantities are parameters to be estimated and assigns starting values. The model statement defines the mean function f(xᵢ; θ) to be fitted to the response variable (y in this example). All quantities not defined in the parameters statement must be either constants defined through SAS® programming statements or variables to be found in the data set. Since x is neither defined as a parameter nor a constant, SAS® will look for a variable by that name in the data set. If, for example, one wants to fit the same model with x square-root transformed, one can simply put
model y = 1 - exp(-sqrt(x)**theta);
As is the case for a linear model, the method of least squares provides estimates for the parameters of the mean function but not for the residual variability. In the model Y = f(x; θ) + e with e ~ (0, σ²I), an estimate of σ² is required for evaluating confidence intervals and test statistics. Appealing to linear model theory, it is reasonable to utilize the residual sum of squares obtained at convergence. Specifically,
\[
\hat{\sigma}^2 = \frac{1}{n-p}\, S(\hat{\boldsymbol{\theta}}) = \frac{1}{n-p}\left(\mathbf{y} - \mathbf{f}(\mathbf{x};\hat{\boldsymbol{\theta}})\right)'\left(\mathbf{y} - \mathbf{f}(\mathbf{x};\hat{\boldsymbol{\theta}})\right) = \frac{1}{n-p}\,\mathbf{r}(\hat{\boldsymbol{\theta}})'\mathbf{r}(\hat{\boldsymbol{\theta}}). \qquad [5.17]
\]
Here, p is the number of parameters and θ̂ is the converged iterate of θ. If the model errors are Gaussian, (n − p)σ̂²/σ² is approximately Chi-square distributed with n − p degrees of freedom. The approximation improves with sample size n and is critical in the formulation of test statistics and confidence intervals. In Output 5.1 this estimate is shown in the analysis of variance table as the Mean Square of the Residual source, σ̂² = 0.00603.
and iterations are halted when this measure is less than some number c. The nlin procedure of The SAS® System implements the Bates and Watts criterion as the default convergence criterion with c = 10⁻⁵.
The sum of squares surface in linear models with a full-rank X matrix has a unique minimum, the values at the minimum being the least squares estimates (Figure 5.4). In nonlinear models the surface can be considerably more complicated, with long, elongated valleys (Figure 5.5) or multiple local minima (Figure 5.6). When the sum of squares surface has multiple extrema, the iterative algorithm may be trapped in a region from which it cannot escape. If the surface has long, elongated valleys, it may require a large number of iterations to locate the minimum. The sum of squares surface is a function of the model and the data, and reparameterization of the model (§5.7) can have tremendous impact on its shape. Well-chosen starting values (§5.4.3) help in the resolution of convergence problems.
To protect against the possibility that a local rather than a global minimum has been found, we recommend starting the nonlinear algorithm with sufficiently different sets of starting values. If they converge to the same estimates, it is reasonable to assume that the sum of squares surface has a global minimum at these values. Good implementations of modified algorithms, such as Hartley's modified Gauss-Newton method (Hartley 1961), can improve the convergence behavior if starting values are chosen far from the solution but cannot guarantee convergence to a global minimum.
Ô" "Þ"%%( ×
X " !Þ! .
Õ" "Þ"$$)) Ø
#
A surface contour of the least squares objective function W a" b !$3 aC3 "! "" B$3 b
shows elliptical contours with a single minimum WÐ" s Ñ "(Þ#*& achieved by
s c"Þ'$%!)ß !Þ##%()d (Figure 5.4). If a Gauss-Newton or Newton-Raphson
" w
algorithm is used to estimate the parameters of this linear model, either method will find
the least squares estimates with a single update, regardless of the choice of starting
values.
[Figure 5.4: contour plot of the sum of squares surface over (b₀, b₁); elliptical contours (levels 17.3 to 20.6) around the minimum S(β̂).]
Figure 5.5. Sum of squares contour of model 1/(α + βx) with elongated valley.
Figure 5.6. Residual sum of squares contour for model α exp{βx} with two local minima.
Adapted from Figure 3.1 in Seber, G.A.F. and Wild, C.J. (1989) Nonlinear Regression. Wiley
and Sons, New York. Copyright © 1989 John Wiley and Sons, Inc. Reprinted by permission
of John Wiley and Sons, Inc.
Graphing Data
One of the simplest methods to determine starting values is to discern reasonable values for θ
from a scatterplot of the data. A popular candidate for fitting growth data, for example, is the
four-parameter logistic model. It can be parameterized in the following form,
$$\mathrm{E}[Y_i] = \delta + \frac{\alpha}{1 + \exp\{\beta - \gamma x_i\}},$$
where δ and (α + δ) are the lower and upper asymptotes, respectively, and the inflection point is
located at x = β/γ (Figure 5.7). Furthermore, the slope of the logistic function at the
inflection point is a function of α and γ, ∂f/∂x|ₓ₌β/γ = αγ/2.
[Figure 5.7: logistic curve with upper asymptote δ + α = 15, lower asymptote δ = 5, inflection point at β/γ = 5.818; E[Yᵢ] = 5 + 10/(1 + exp{−6.4 + 1.1xᵢ}).]
Figure 5.7. Four-parameter logistic model with parameters δ = 5, α = 10, β = −6.4,
γ = −1.1.
Consider having to determine starting values for a logistic model with the data shown in
Figure 5.8. The starting values for the lower and upper asymptote could be δ⁰ = 5, α⁰ = 9.
The inflection point occurs approximately at x = 6, hence β⁰/γ⁰ = 6, and the slope at the
inflection point is about −3. Solving the equations α⁰γ⁰/2 = −3 and β⁰/γ⁰ = 6 for γ⁰ and
β⁰ yields γ⁰ = −0.66 and β⁰ = −4.0.
[Figure 5.8: scatterplot of the data used to determine logistic starting values; Y ranges from about 4 to 15 for X between 0 and 10.]
With proc nlin of The SAS® System starting values are assigned in the parameters
statement. The code
proc nlin data=Fig5_8;
parameters delta=5 alpha=9 beta=-4.0 gamma=-0.66;
model y = delta + alpha/(1+exp(beta-gamma*x));
run;
invokes the modified Gauss-Newton algorithm (the default) and convergence is achieved
after seven iterations. Although the converged estimates
$$\hat\theta = [5.3746,\; 9.1688,\; -6.8014,\; -1.1621]'$$
are not too far from the starting values θ⁰ = [5.0, 9.0, −4.0, −0.66]′, the initial residual sum
of squares S(θ⁰) = 51.6116 is more than twice the final sum of squares S(θ̂) = 24.3406
(Output 5.2). The nlin procedure uses the Bates and Watts (1981) criterion [5.18] to track
convergence with a default benchmark of 10⁻⁵. The criterion value achieved when the
iterations halted is shown as R in the Estimation Summary.
Output 5.2.
Estimation Summary
Method Gauss-Newton
Iterations 7
R 6.353E-6
PPC(beta) 6.722E-6
RPC(beta) 0.000038
Object 1.04E-9
Objective 24.34029
Observations Read 30
Observations Used 30
Observations Missing 0
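A grid of starting values can be supplied in the parameters statement, in which case proc nlin evaluates the residual sum of squares for every grid combination before starting the iterations. A minimal sketch, in which the particular grid values are assumptions chosen around the graphical guesses:

proc nlin data=Fig5_8;
   parameters delta=4 to 7 by 1             /* 4 grid values */
              alpha=8 to 10 by 1            /* 3 grid values */
              beta=-6 to -3 by 1            /* 4 grid values */
              gamma=-1 to -0.5 by 0.25;     /* 3 grid values */
   model y = delta + alpha/(1+exp(beta-gamma*x));
run;

Statements of this form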
fit the four-parameter logistic model to the data in Figure 5.8. The residual sum of squares is
evaluated at 4 × 3 × 4 × 3 = 144 parameter combinations. The best initial combination is the
set of values that produces the smallest residual sum of squares. This turns out to be
θ⁰ = [5, 9, −6, −1]′ (Output 5.3). The algorithm converges to the same estimates as above
but this time requires only five iterations.
Once θ is fixed, the model is linear in β = [β₀, β₁, β₂]′. A common model to relate yield per
plant Y to plant density x is due to Bleasdale and Nelder (1960),
$$\mathrm{E}[Y_i] = (\alpha + \beta x_i)^{-1/\theta}.$$
A special case of this model is the Shinozaki-Kira model with θ = 1 (Shinozaki and Kira
1956). Starting values for the Bleasdale-Nelder model can be found by setting θ = 1 and obtaining
initial values for α and β from a simple linear regression 1/Yᵢ = α + βxᵢ. Once starting
values for all parameters have been found, the model is fit in nonlinear form.
The Mitscherlich model is popular in agronomy to express crop yield Y as a function of
the availability of a nutrient x. One of the many parameterizations of the Mitscherlich model
(see §5.7, §A5.9.1 on parameterizations of the Mitscherlich model, and §5.8.1 for an application)
is
$$\mathrm{E}[Y] = \alpha(1 - \exp\{-\kappa(x - x_0)\}),$$
where α is the upper yield asymptote, κ is related to the rate of change, and x₀ is the nutrient
concentration at which mean yield is 0. A starting value α⁰ can be found from a graph of the
data as the plateau yield. Then the relationship can be re-expressed by taking logarithms as
$$\ln\{\alpha^0 - Y\} = \ln\{\alpha\} + \kappa x_0 - \kappa x = \beta_0 + \beta_1 x,$$
which is a simple linear regression with response ln{α⁰ − Y}, intercept β₀ = ln{α} + κx₀,
and slope β₁ = −κ. The ordinary least squares estimate of β₁ supplies the starting value κ⁰ = −β̂₁.
Finally, if the yield without any addition of nutrient (e.g., the zero fertilizer control) is ȳ, a
starting value for x₀ is (1/κ⁰)ln{1 − ȳ/α⁰}.
Reparameterization
It can be difficult to find starting values for parameters that have an unrestricted range (−∞,
∞). On occasion, only the sign of the parameter value can be discerned. In these cases it
helps to modify the parameterization of the model. Instead of the unrestricted parameter θ
one can fit, for example, α = 1/(1 + exp{−θ}), which is constrained to range from zero to
one. Specifying a parameter in this range may be simpler. Once a reasonable estimate for α
has been obtained, one can change the parameterization back to the original state and use
θ⁰ = ln{α⁰/(1 − α⁰)} as the initial value.
A reparameterization technique advocated by Ratkowsky (1990, Sec. 2.3.1) makes
finding starting values particularly simple. The idea is to rewrite a given model in terms of its
expected value parameters (see §5.7.1). They correspond to predicted values at selected
values x* of the regressor. From a scatterplot of the data one can then estimate E[Y|x*] by
visual inspection. Denote this expected value by μ*. Set the expectation equal to f(x*, θ) and
replace one of the elements of θ. We illustrate with an example.
This is a two-parameter model (μ*, K), as was the original model. The process can be repeated by
choosing another value x**, its expected value parameter μ**, and replacing the parameter K.
Changing the parameterization of a nonlinear model to expected value parameters not
only simplifies finding starting values, but also improves the statistical properties of the esti-
mators (see §5.7.1 and the monographs by Ratkowsky 1983, 1990). A drawback of working
with expected value parameters is that not all parameters can be replaced with their expected
value equivalents, since the resulting system of equations may not have analytic solutions.
" B# ÎB"
Z C"
" C" B# ÎC#
This technique requires that the system of nonlinear equations can be solved and is a special
case of a more general method proposed by Hartley and Booker (1965). They divide the 8
observations into :7 sets, where : is the number of parameters and B25
a2 "ß âß :à 5 "ß âß 7b are the covariate values. Then the system of nonlinear equations
" 7
C2 " 0 aB25 ß )b
7 5"
is solved where C 2 7" !7 5" C25 . Gallant's method is a special case where 7 " and one
selects : representative points from the data.
In practical applications, the various techniques for finding starting values are often
combined. Initial values for some parameters are determined graphically, others are derived
from expected value parameterization or subject-matter considerations, yet others are entirely
guessed.
Example 5.4. Gregoire and Schabenberger (1996b) model stem volume in 336 yellow
poplar (Liriodendron tulipifera L.) trees as a function of the relative diameter
$$t_{ij} = \frac{d_{ij}}{D_j},$$
where D_j is the stump diameter of the jth tree and d_{ij} is the diameter of tree j measured
at the ith location along the bole (these data are visited in §8.4). At the tip of the tree
t_{ij} = 0 and directly above ground t_{ij} = 1. The measurements along the bole were
spaced 1.2 meters apart. The authors selected a volume-ratio model to describe the
accumulation of volume with decreasing diameter (increasing height above ground):
$$V_{ij} = (\beta_0 + \beta_1 x_j)\exp\Big\{-\beta_2\,\frac{t_{ij}}{1000}\,e^{\beta_3 t_{ij}}\Big\} + e_{ij}. \qquad [5.22]$$
Here x_j is diameter at breast height squared times total tree height for the jth tree. This
model consists of a linear part (β₀ + β₁x_j) representing the total volume of a tree and a
multiplicative reduction term
$$R(\beta_2,\beta_3,t_{ij}) = \exp\Big\{-\beta_2\,\frac{t_{ij}}{1000}\,e^{\beta_3 t_{ij}}\Big\}.$$
To find starting values for β = [β₀, β₁, β₂, β₃]′, the linear component was fit to the total
tree volumes of the 336 trees,
$$V_j = \beta_0 + \beta_1 x_j + e_j \quad (j = 1, \ldots, 336).$$
5.4.4 Goodness-of-Fit
The most frequently used goodness-of-fit (g-o-f) measure in the classical linear model
$$Y_i = \beta_0 + \sum_{j=1}^{k-1}\beta_j x_{ji} + e_i$$
is the R² statistic
$$R^2 = \frac{\sum_{i=1}^n(\hat y_i - \bar y)^2}{\sum_{i=1}^n(y_i - \bar y)^2}. \qquad [5.23]$$
The appeal of the R² statistic is that it ranges between 0 and 1 and has an immediate interpretation
as the proportion of variability in Y jointly explained by the regressor variables. Notice
that this is a proportion of variability in Y about its mean ȳ, since in the absence of any
regressor information one would naturally predict E[Yᵢ] with the sample mean. Kvålseth
(1985) lists the alternative statistic
$$R^2 = 1 - \frac{\sum_{i=1}^n(y_i - \hat y_i)^2}{\sum_{i=1}^n(y_i - \bar y)^2}. \qquad [5.24]$$
In the linear model with intercept term, the two R² statistics [5.23] and [5.24] are identical.
[5.23] contains mean-adjusted quantities, and for models not containing an intercept other R²-
type measures have been proposed. Kvålseth (1985) mentions
$$R^2_{noint} = 1 - \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{\sum_{i=1}^n y_i^2} \quad\text{and}\quad R^{*2}_{noint} = \frac{\sum_{i=1}^n \hat y_i^2}{\sum_{i=1}^n y_i^2}.$$
The R²_noint statistics are appropriate only if the model does not contain an intercept and ȳ is
zero. Otherwise one may obtain misleading results. Nonlinear models do not contain an intercept
in the typical sense and care must be exercised to select an appropriate goodness-of-fit
measure. Also, [5.23] and [5.24] do not give identical results in the nonlinear case. [5.23] can
easily exceed 1 and [5.24] can conceivably be negative. The key difficulty is that the decomposition
$$\sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n (\hat y_i - \bar y)^2 + \sum_{i=1}^n (y_i - \hat y_i)^2$$
no longer holds in nonlinear models. Ratkowsky (1990, p. 44) feels strongly that the danger
of misinterpreting R² in nonlinear models is too great to rely on such measures and recommends
basing goodness-of-fit decisions in a nonlinear model with p parameters on the mean
square error
$$s^2 = \frac{1}{n-p}\sum_{i=1}^n (y_i - \hat y_i)^2 = \frac{1}{n-p}\,SSR. \qquad [5.25]$$
The mean square error is a useful goodness-of-fit statistic because it combines a measure of
closeness between data and fit (SSR) with a penalty term (n − p) to prevent overfitting.
Including additional parameters decreases SSR as well as the denominator n − p, so s² may
increase if the added parameter does not improve the model fit. But a decrease of s² is not
necessarily indicative of a statistically significant improvement of the model. If the models with and
without the additional parameter(s) are nested, the sum of squares reduction test addresses the
level of significance of the improvement. The usefulness of s² notwithstanding, we feel that a
reasonable R²-type statistic can be applied and recommend the statistic [5.24] as a goodness-
of-fit measure in linear and nonlinear models. In the former, it yields the coefficient of determination
provided the model contains an intercept term, and is also meaningful in the no-
intercept model. In nonlinear models it cannot exceed 1 and avoids a serious pitfall of [5.23].
Although this statistic can take on negative values in nonlinear models, this has never
happened in our experience and usually the statistic is bounded between 0 and 1. A negative
value would indicate a serious problem with the considered model. Since the possibility of
negative values exists theoretically, we do not term it an R² statistic in nonlinear models but
refer to it as Pseudo-R². Because the additive sum of squares decomposition does not hold in nonlinear models,
Pseudo-R² should not be interpreted as the proportion of variability explained by the model.
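For a proc nlin fit, Pseudo-R² can be computed from the predicted values saved with the output statement. A minimal sketch for the model of Example 5.1 (the output data set and variable names are our own):

proc nlin data=Ex_51;
   parameters theta=1.3;
   model y = 1 - exp(-x**theta);
   output out=nlout pred=pred;         /* save predicted values */
run;
proc means data=nlout noprint;
   var y;
   output out=ybar_ds mean=ybar;       /* overall mean of the response */
run;
data pseudo_r2;
   if _n_ = 1 then set ybar_ds(keep=ybar);
   set nlout end=last;
   ssr + (y - pred)**2;                /* residual sum of squares */
   sst + (y - ybar)**2;                /* corrected total sum of squares */
   if last then do;
      pseudo_r2 = 1 - ssr/sst;         /* statistic [5.24] */
      put pseudo_r2=;
   end;
run;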
even if the errors e are Gaussian, the least squares estimator is not. Contrast this with the
linear model Y = Xβ + e, e ~ G(0, σ²I), where the ordinary least squares estimator
$$\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$
is Gaussian with mean β and variance-covariance matrix σ²(X′X)⁻¹. The derivative matrix F
in the nonlinear model plays a role akin to the X matrix in the linear model, but in contrast to
the linear model, the F matrix is not known, unless θ is known. The derivatives of a nonlinear
model depend, by definition, on the unknown parameters. To estimate the standard error of
the nonlinear parameter θⱼ, we extract c^{jj}, the jth diagonal element of the (F′F)⁻¹ matrix.
Similarly, in the linear model we extract d^{jj}, the jth diagonal element of (X′X)⁻¹. The
estimated standard errors for β̂ⱼ and θ̂ⱼ, respectively, are
Linear model: se(β̂ⱼ) = √(σ²d^{jj}), ese(β̂ⱼ) = √(σ̂²d^{jj}), d^{jj} known
Nonlinear model: ase(θ̂ⱼ) = √(σ²c^{jj}), ease(θ̂ⱼ) = √(σ̂²ĉ^{jj}), c^{jj} unknown.
The differences between the two cases are subtle. √(σ²d^{jj}) is the standard error of β̂ⱼ in the
linear model, but √(σ²c^{jj}) is only the asymptotic standard error (ase) of θ̂ⱼ in the nonlinear
case. Calculating estimates of these quantities requires substituting an estimate of σ² in the
linear model, whereas in the nonlinear case we also need to estimate the unknown c^{jj}. This
estimate is found by evaluating F at the converged iterate and extracting ĉ^{jj} as the jth diagonal
element of the (F̂′F̂)⁻¹ matrix. We use ease(θ̂ⱼ) to denote the estimated asymptotic standard
error of the parameter estimator.
Nonlinearity also affects the distributional properties of σ̂², not only those of the θ̂ⱼ. In
the linear model with Gaussian errors where X has rank p, (n − p)σ̂²/σ² is a Chi-squared
random variable with n − p degrees of freedom. In a nonlinear model with p parameters,
(n − p)σ̂²/σ² is only approximately a Chi-squared random variable.
If sample size is sufficiently large we can rely on the asymptotic results, and an approximate
(1 − α)100% confidence interval for θⱼ can be calculated as
$$\hat\theta_j \pm z_{\alpha/2}\,\mathrm{ease}(\hat\theta_j) = \hat\theta_j \pm z_{\alpha/2}\,\hat\sigma\sqrt{\hat c^{jj}}.$$
Because we do not use ase(θ̂ⱼ), but the estimated ase(θ̂ⱼ), it is reasonable to use instead confidence
intervals based on a t-distribution with n − p degrees of freedom, rather than intervals
based on the Gaussian distribution. If (θ̂ⱼ − θⱼ)/ase(θ̂ⱼ) is treated as a standard Gaussian variable, then
$$(\hat\theta_j - \theta_j)/\mathrm{ease}(\hat\theta_j)$$
can be treated as a t_{n−p} variable. As a consequence, an α-level test for H₀: θⱼ = d vs.
H₁: θⱼ ≠ d compares
$$t_{obs} = \frac{\hat\theta_j - d}{\mathrm{ease}(\hat\theta_j)} \qquad [5.27]$$
against the α/2 cutoff of a t_{n−p} distribution. H₀ is rejected if |t_obs| ≥ t_{α/2,n−p} or,
equivalently, if the (1 − α)100% confidence interval
$$\hat\theta_j \pm t_{\alpha/2,n-p}\,\mathrm{ease}(\hat\theta_j) = \hat\theta_j \pm t_{\alpha/2,n-p}\,\hat\sigma\sqrt{\hat c^{jj}} \qquad [5.28]$$
does not contain d. These intervals are calculated by the nlin procedure with α = 0.05 for
each parameter of the model by default.
The simple hypothesis H₀: θⱼ = d is a special case of a linear hypothesis H₀: Aθ = d (see also
§A4.8.2). Consider we wish to test whether such a hypothesis holds; the sum of squares
reduction test compares the residual sums of squares of the reduced (r) and the full (f) model,
$$F^{(2)}_{obs} = \frac{\{S(\hat\theta)_r - S(\hat\theta)_f\}/q}{S(\hat\theta)_f/(n-p)}, \qquad [5.29]$$
where q is the rank of A. For the simple hypothesis H₀: θⱼ = 0 the Wald statistic becomes
$$F^{(1)}_{obs} = \frac{\hat\theta_j^2}{\hat\sigma^2\hat c^{jj}} = t^2_{obs}.$$
Any demerits of the Wald statistic are also demerits of the t-test and t-based confidence
intervals shown earlier.
Example 5.5. Velvetleaf Growth Response. Two herbicides (H₁, H₂) are applied at
seven different rates and the dry weight percentages (relative to a no-herbicide control
treatment) of velvetleaf (Abutilon theophrasti Medikus) plants are recorded (Table 5.2).
A graph of the data shows no clear inflection point in the dose response (Figure 5.9).
The logistic model, although popular for modeling dose-response data, is not
appropriate in this instance. Instead, we select a hyperbolic function, the three-param-
eter extended Langmuir model (Ratkowsky 1990).
Figure 5.9. Velvetleaf dry weight percentages as a function of the amount of active
ingredient applied. Closed circles correspond to herbicide 1, open circles to herbicide 2.
Y_{ij} denotes the observation made at rate x_{ij} for herbicide j = 1, 2. α₁ is the asymptote
for herbicide 1 and α₂ the asymptote for herbicide 2. The model states that the two
herbicides differ in the parameters; hence θ is a (6 × 1) vector, θ =
[α₁, α₂, β₁, β₂, γ₁, γ₂]′. To test the hypothesis that the herbicides have the same mean
response function, we let
$$\mathbf{A} = \begin{bmatrix} 1 & -1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & -1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & -1 \end{bmatrix}.$$
To test whether the herbicides share the same β parameter, H₀: β₁ = β₂, we let
A = [0, 0, 1, −1, 0, 0]′. The reduced model corresponding to this hypothesis is
$$\mathrm{E}[Y_{ij}] = \alpha_j\,\frac{\beta x_{ij}^{\gamma_j}}{1 + \beta x_{ij}^{\gamma_j}}.$$
Since the dry weights are expressed relative to a no-treatment control we consider as
the full model for analysis
$$Y_{ij} = 100\,\frac{\beta_j x_{ij}^{\gamma_j}}{1 + \beta_j x_{ij}^{\gamma_j}} + e_{ij} \qquad [5.32]$$
instead of [5.31]. To fit this model with the nlin procedure, it is helpful to rewrite it as
$$Y_{ij} = 100\,\frac{\beta_1 x_{i1}^{\gamma_1}}{1 + \beta_1 x_{i1}^{\gamma_1}}\,I\{j=1\} + 100\,\frac{\beta_2 x_{i2}^{\gamma_2}}{1 + \beta_2 x_{i2}^{\gamma_2}}\,I\{j=2\} + e_{ij}. \qquad [5.33]$$
I{} is the indicator function that returns the value 1 if the condition inside the curly
braces is true, and 0 otherwise. I{j = 1}, for example, takes on value 1 for observations
receiving herbicide 1 and 0 for observations receiving herbicide 2. The SAS®
statements to accomplish the fit (Output 5.4) are
proc nlin data=herbicide noitprint;
parameters beta1=0.049 gamma1=-1.570 /* starting values for j=1 */
           beta2=0.049 gamma2=-1.570;/* starting values for j=2 */
alpha = 100; /* constrain alpha to 100 */
term1 = beta1*(rate**gamma1);
term2 = beta2*(rate**gamma2);
model drypct = alpha * (term1 / (1 + term1))*(herb=1) +
alpha * (term2 / (1 + term2))*(herb=2);
run;
Output 5.4.
Estimation Summary
Method Gauss-Newton
Iterations 10
Subiterations 5
Average Subiterations 0.5
R 6.518E-6
Objective 71.39743
Observations Read 14
Observations Used 14
Observations Missing 0
NOTE: An intercept was not specified for this model.
To test the hypothesis that the two herbicides coincide in growth response, we fit the constrained model
$$Y_{ij} = 100\,\frac{\beta x_{ij}^{\gamma}}{1 + \beta x_{ij}^{\gamma}} + e_{ij}. \qquad [5.34]$$
Output 5.5.
The NLIN Procedure
Asymptotic
Standard
Parameter Estimate Error Asymptotic 95% Confidence Limits
beta 0.0419 0.0139 0.0117 0.0722
gamma -1.5706 0.1564 -1.9114 -1.2297
Figure 5.10. Observed dry weight percentages for velvetleaf weed plants treated with
two herbicides at various rates. Rates are measured in pounds of acid equivalent per
acre. Solid line is model [5.34] fit to the combined data in Table 5.2.
$$F^{(2)}_{obs} = \frac{\{S(\hat\theta)_r - S(\hat\theta)_f\}/q}{S(\hat\theta)_f/(n-p)} = \frac{\{407.6 - 71.3974\}/2}{71.3974/10} = 5.0 \quad (p = 0.03125).$$
There is sufficient evidence at the &% level to conclude that the growth responses of the
two herbicides are different. At this point it is interesting to find out whether the herbi-
cides differ in the parameter β, the parameter γ, or both. It is here, in the comparison of individual
parameters across groups, that the Wald-type test can be implemented easily. To test
H₀: β₁ = β₂, for example, we can fit the reduced model
$$Y_{ij} = 100\,\frac{\beta x_{ij}^{\gamma_j}}{1 + \beta x_{ij}^{\gamma_j}} + e_{ij} \qquad [5.35]$$
and use the sum of squares reduction test to compare with the full model [5.32]. Using
the statements (output not shown)
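A sketch of such statements, assuming the reduced model is coded in the style of the full-model fit (starting values are assumptions):

proc nlin data=herbicide noitprint;
   parameters beta=0.049 gamma1=-1.570 gamma2=-1.570;  /* common beta, separate gammas */
   term1 = beta*(rate**gamma1);
   term2 = beta*(rate**gamma2);
   model drypct = 100*(term1/(1+term1))*(herb=1) +
                  100*(term2/(1+term2))*(herb=2);
run;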
one obtains S(θ̂_r) = 159.8 and F^(2)_obs = (159.8 − 71.3974)/7.13974 = 12.38 with p-
value 0.0055. At the 5% level the hypothesis of equal β parameters is rejected.
s" "
" s# !Þ!#"$ !Þ!')& !Þ!%(#!
>9,= $Þ%('##
s s
ease" " " # È !Þ!!&*'# !Þ!"### !Þ!"$&)
Since the square of a t random variable with ν degrees of freedom is identical to an F
random variable with 1, ν degrees of freedom, we can compare t²_obs = 3.47622² =
12.08 to the F statistic of the sum of squares reduction test. The two tests are asymptotically
equivalent, but differ for any finite sample size.
An alternative implementation of the t-test, one that estimates the difference between β₁ and
β₂ directly, is as follows. Let β₂ = β₁ + δ and write [5.32] as
$$Y_{ij} = 100\,\frac{\beta_1 x_{i1}^{\gamma_1}}{1 + \beta_1 x_{i1}^{\gamma_1}}\,I\{j=1\} + 100\,\frac{(\beta_1+\delta)x_{i2}^{\gamma_2}}{1 + (\beta_1+\delta)x_{i2}^{\gamma_2}}\,I\{j=2\} + e_{ij},$$
so that δ measures the difference between β₁ and β₂. The SAS® statements fitting this
model are
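sketched below (the starting value for delta is an assumption; the other starting values mirror the earlier fits):

proc nlin data=herbicide noitprint;
   parameters beta1=0.049 gamma1=-1.570 gamma2=-1.570 delta=0;
   term1 = beta1*(rate**gamma1);
   term2 = (beta1+delta)*(rate**gamma2);   /* beta2 = beta1 + delta */
   model drypct = 100*(term1/(1+term1))*(herb=1) +
                  100*(term2/(1+term2))*(herb=2);
run;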
The estimate δ̂ = 0.0472 (Output 5.6, abridged) agrees with the numerator of t_obs above
(apart from sign) and ease(δ̂) = 0.0136 is identical to the denominator of t_obs.
Output 5.6.
Asymptotic
Standard
Parameter Estimate Error Asymptotic 95% Confidence Limits
beta1 0.0213 0.00596 0.00799 0.0346
gamma1 -2.0394 0.1422 -2.3563 -1.7225
gamma2 -1.2218 0.0821 -1.4047 -1.0389
delta 0.0472 0.0136 0.0169 0.0776
The asymptotic standard error of the predicted mean value f(x_i, θ̂) is
$$\mathrm{ase}\,f(\mathbf{x}_i,\hat\theta) = \sqrt{\sigma^2\,\mathbf{F}_i(\mathbf{F}'\mathbf{F})^{-1}\mathbf{F}_i'},$$
where F_i denotes the ith row of F. To estimate this quantity, σ² is replaced by its estimate σ̂²
and F, F_i are evaluated at the converged iterate. The t-based confidence interval for E[Y] at x
is
$$f(\mathbf{x}_i,\hat\theta) \pm t_{\alpha/2,n-p}\,\mathrm{ease}\,f(\mathbf{x}_i,\hat\theta).$$
These confidence intervals can be calculated with proc nlin of The SAS® System as we now
illustrate with the small example visited earlier.
Example 5.2 Continued. Recall the simple one-parameter model Y_i = 1 −
exp{−x_i^θ} + e_i, i = 1, …, 4, that was fit to a small data set on page 199. The
converged iterate was θ̂ = 1.1146 and the estimate of residual variability was σ̂² = 0.006.
The gradient matrix at θ̂ evaluates to
Asymptotic *&% confidence intervals are obtained with the l95m and u95m options of
the output statement in proc nlin. The output statement in the code segment that
follows creates a new data set (here termed nlinout) that contains the predicted values
(variable pred), the estimated asymptotic standard errors of the predicted values
(easeÐ0 ÐB3 ß s)ÑÑ, variable stdp), and the lower and upper *&% confidence limits for
0 aB3 ß )b (variables l95m and u95m).
ods listing close;
proc nlin data=Ex_52;
parameters theta=1.3;
model y = 1 - exp(-x**theta);
output out=nlinout pred=pred stdp=stdp l95m=l95m u95m=u95m;
run;
ods listing;
proc print data=nlinout; run;
Output 5.7.
Obs y x PRED STDP L95M U95M
Confidence intervals are intervals for the mean response f(x_i, θ) at x_i. Prediction intervals,
on the contrary, have (1 − α)100% coverage probability for Y_i, which is a random variable.
Prediction intervals are wider than confidence intervals for the mean because the estimated
standard error is that of the difference Y_i − f(x_i, θ̂) rather than that of f(x_i, θ̂):
$$f(\mathbf{x}_i,\hat\theta) \pm t_{\alpha/2,n-p}\,\mathrm{ease}\{Y_i - f(\mathbf{x}_i,\hat\theta)\}. \qquad [5.36]$$
In SAS® , prediction intervals are also obtained with the output statement of the nlin
procedure, but instead of l95m= and u95m= use l95= and u95= to save prediction limits to the
output data set.
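For the small example above, the output statement could be modified as follows (a sketch reusing the earlier data set names):

output out=nlinout pred=pred l95=l95 u95=u95;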
• Some nonlinear models can be transformed into linear models. However, de-
transforming parameter estimates and predictions leads to bias
(transformation bias).
• The transformation that linearizes the model may be different from the
transformation that stabilizes the variance or makes the errors more
Gaussian-like.
where Y is one size measurement (e.g., length of fibula) and x is the other size measure
(e.g., length of sternum). β₁ is the ratio of the relative growth rates, β₁ = (x/y)∂y/∂x.
This model can be transformed to linearity by taking logarithms on both sides,
$$\ln\{\mathrm{E}[Y_i]\} = U_i = \ln\{\beta_0\} + \beta_1\ln\{x_i\}.$$
To obtain an observational model we need to add stochastic error terms. Residuals proportional
to the expected value of the response give rise to
$$Y = \alpha e^{\beta x}(1 + e) = \alpha e^{\beta x} + \alpha e^{\beta x}e = \alpha e^{\beta x} + e^*. \qquad [5.39]$$
This is called a constant relative error model (Seber and Wild 1989, p. 15). Additive
residuals, on the other hand, lead to a constant absolute error model
$$Y = \alpha e^{\beta x} + \epsilon. \qquad [5.40]$$
A linearizing transform is not applied to the mean values but to the observables Y, so that
the two error assumptions will lead to different properties for the transformed residuals. In the
case of [5.39] the transformed observational model becomes
$$\ln\{Y\} = \gamma + \beta x + \ln\{1 + e\} = \gamma + \beta x + e^*.$$
If e has zero mean and constant variance, then ln{1 + e} will have approximately zero
mean and constant variance (which can be seen easily from a Taylor series of ln{1 + e}).
For constant absolute error the logarithm of the observational model leads to
$$\ln\{Y\} = \ln\big(\alpha e^{\beta x} + \epsilon\big) = \ln\Big(\alpha e^{\beta x}\Big(1 + \frac{\epsilon}{\alpha e^{\beta x}}\Big)\Big) = \gamma + \beta x + \ln\Big(1 + \frac{\epsilon}{\alpha e^{\beta x}}\Big) \approx \gamma + \beta x + \frac{\epsilon}{\alpha e^{\beta x}} = \gamma + \beta x + \epsilon^*.$$
The error term ε* of this model should have expectation close to zero if ε is small
on average compared to E[Y] (Seber and Wild, 1989, p. 15). Var[ε*] depends on the mean of
the model, however. In the case of [5.39] the linearizing transform is also a variance-stabilizing
transform, but in the constant absolute error model linearization has created heterogeneous
error variances.
Nonlinear models are parameterized so that the parameters represent important physical
and biological measures (see §5.7 on parameterizations) such as rates of change, survival and
mortality, upper and lower yield and growth asymptotes, densities, and so forth. Linearization
destroys the natural interpretation of the parameters. Sums of squares and variability esti-
mates are not reckoned on the original scale of measurement, but on the transformed scale.
The variability of plant yield is best understood in yield units (bushels/acre, e.g.), not in the units of a transformed scale.
where H_i is the height of the ith plant (or stand of plants) and x_i is the reciprocal of
plant (stand) age. As before, the mean function is linearized by taking logarithms on
both sides:
$$\ln\{\mathrm{E}[H_i]\} = \ln\{\alpha\} + \beta x_i = \mathbf{x}_i'\boldsymbol\beta.$$
Here x_i′ = [1, x_i], β = [ln{α}, β]′, and exp{x_i′β} = E[H_i]. We can fit the linearized
model
$$U_i = \mathbf{x}_i'\boldsymbol\beta + e_i, \qquad [5.41]$$
where the e_i are assumed independent Gaussian with mean 0 and variance σ². A
predicted value in this model on the linear scale is û = x′β̂, which has expectation
$$\mathrm{E}[\hat u] = \mathbf{x}'\boldsymbol\beta = \ln\{\mathrm{E}[H]\}.$$
If the linearized model is correct, the predictions are unbiased for the logarithm of
height. Are they also unbiased for height? To this end we detransform the linear predictions
and evaluate E[exp{û}]. In general, if ln{Y} is Gaussian with mean μ and
variance φ, then Y has a Log-Gaussian distribution with
$$\mathrm{E}[Y] = \exp\{\mu + \phi/2\}$$
$$\mathrm{Var}[Y] = \exp\{2\mu + \phi\}\big(\exp\{\phi\} - 1\big).$$
We note that E[Y] is exp{φ/2} times larger than exp{μ}, since φ is a positive quantity.
Under the assumption about the distribution of the e_i made earlier, the predicted value û is
also Gaussian, with mean x′β and variance σ²x′(X′X)⁻¹x. The mean of a predicted
height value is then
$$\mathrm{E}[\exp\{\hat u\}] = \exp\Big\{\mathbf{x}'\boldsymbol\beta + \tfrac{1}{2}\sigma^2\mathbf{x}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}\Big\} = \mathrm{E}[H]\exp\Big\{\tfrac{1}{2}\sigma^2\mathbf{x}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}\Big\}.$$
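To gauge the size of this transformation bias, consider an assumed (purely illustrative) value σ²x′(X′X)⁻¹x = 0.1:
$$\mathrm{E}[\exp\{\hat u\}] = \mathrm{E}[H]\exp\{0.1/2\} \approx 1.051\,\mathrm{E}[H],$$
that is, the detransformed prediction would overestimate mean height by about 5%.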
Transformation to achieve linearity can negatively affect other desirable properties of the
model. The most desirable transformation is one which linearizes the model, stabilizes the
variance, and makes the residuals Gaussian-distributed. The class of Box-Cox transforma-
tions (after Box and Cox 1964, see §5.6.2 below for Box-Cox transformations in the context
of variance heterogeneity) is heralded to accomplish these multiple goals but is not without
controversy. Transformations are typically more suited to rectify a particular problematic
aspect of the model, such as variance heterogeneity or residuals which are far from a
Gaussian distribution. The transformation that stabilizes the variance may be different from
the transformation that linearizes the model.
Example 5.8. Yield density models are used in agronomy to describe agricultural
output (yield) as a function of plant density (see §5.8.7 for an application). Two
particularly simple representatives of yield-density models are due to Shinozaki and
Kira (1956),
$$Y_i = \frac{1}{\beta_0 + \beta_1 x_i} + e_i,$$
and Holliday (1960),
$$Y_i = \frac{1}{\beta_0 + \beta_1 x_i + \beta_2 x_i^2} + e_i.$$
Obviously the linearizing transform is the reciprocal 1/Y. It turns out, however, that for
many data sets to which these models are applied the appropriate transform to stabilize
the variance is the logarithmic transform.
Nonlinearity of the model, given the reliability and speed of today's computer algorithms,
is not considered a shortcoming of a statistical model. If estimation and inferences can be
carried out on the original scale, there is no need to transform a model to linearity simply to
invoke a linear regression routine and then to incur transformation bias upon detransfor-
mation of parameter estimates and predictions. The lack of Gaussianity of the model residuals
is less critical for nonlinear models than it is for linear ones, since less is lost if the errors are
non-Gaussian. Statistical inference in nonlinear models requires asymptotic results and the
nonlinear least squares estimates are asymptotically Gaussian-distributed regardless of the
distribution of the model errors. In linear models the difference between Gaussian and non-
Gaussian errors is the difference between exact and approximate inference. In nonlinear
models, inference is approximate anyway.
but a diagonal matrix whose entries are of different magnitude, is quite common in biological
data. It is related to the intuitive observation that large entities vary more than small entities.
If the error variance is a function of the regressor x, two approaches can be used to remedy the
heteroscedasticity. One can apply a power transformation of the model or fit the model by
weighted nonlinear least squares. A natural family of power transformations is that of Box and Cox (1964),
$$U = \begin{cases}(Y^\lambda - 1)/\lambda & \lambda \neq 0\\ \ln\{Y\} & \lambda = 0,\end{cases} \qquad [5.42]$$
which apply when the response is non-negative (Y ≥ 0). Expanding U into a first-order
Taylor series around E[Y] we find the variance of the Box-Cox transformed variable to be
approximately
$$\mathrm{Var}[U] \approx \mathrm{Var}[Y]\,\mathrm{E}[Y]^{2(\lambda - 1)}. \qquad [5.43]$$
By choosing λ properly the variance of U can be made constant, which was the motivation
behind finding a variance-stabilizing transform in §4.5.2 for linear models. There we were
concerned with the comparison of groups where replicate values were available for each
group. As a consequence we could estimate the mean and variance in each group, linearize
the relationship between variances and means, and find a numerical estimate for the parameter
λ. If replicate values are not available, λ can be determined by trial and error, choosing
a value for λ, fitting the model, and examining the fitted residuals ê_i until a suitable value of λ
has been found. With some additional programming effort, λ can be estimated from the data.
Seber and Wild (1989) discuss maximum likelihood estimation of the parameters of the mean
function and λ jointly. To combat variance heterogeneity [5.43] suggests two approaches:
• transform both sides of the model according to [5.42] and fit a nonlinear model with
response U_i;
• leave the response Y_i unchanged and allow for variance heterogeneity. The model is fit
by nonlinear weighted least squares where the variance of the response is proportional
to E[Y_i]^{−2(λ−1)}.
Carroll and Ruppert (1984) call these approaches the power-transform-both-sides and the power-
transformed weighted least squares models. If the original, untransformed model is
$$Y_i = f(\mathbf{x}_i,\theta) + e_i,$$
the transform-both-sides model becomes
$$\frac{Y_i^\lambda - 1}{\lambda} = \frac{f(\mathbf{x}_i,\theta)^\lambda - 1}{\lambda} + e_i,$$
where the transformation [5.42] is applied to the response and the mean function alike
and e_i ~ iid G(0, σ²). The second approach uses the original response and accounts for
the variance heterogeneity through weighted nonlinear least squares.
For extensions of the Box-Cox method such as power transformations for negative responses
see Box and Cox (1964), Carroll and Ruppert (1984), Seber and Wild (1989, Ch. 2.8) and
references therein. Weighted nonlinear least squares is applied in §5.8.8.
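In proc nlin, such weights are supplied through the special variable _weight_. A minimal sketch (the data set, variable names, mean function, and the choice λ = 0 are assumptions for illustration):

proc nlin data=growth;
   parameters alpha=100 kappa=0.05;      /* assumed starting values */
   pred = alpha*(1 - exp(-kappa*x));     /* assumed mean function f(x,theta) */
   _weight_ = 1/pred**2;                 /* Var[Y] proportional to E[Y]**2, i.e., lambda = 0 */
   model y = pred;
run;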
• The smaller the curvature of a model, the more estimators behave like the
efficient estimators in a linear model and the more reliable are inferential
and diagnostic procedures.
Parameterization is the process of expressing the mean function of a statistical model in terms
of unknown constants (parameters) to be estimated. With nonlinear models the same model
can usually be expressed (parameterized) in a number of ways.
Consider the basic differential equation ∂y/∂t = κ(α − y), where y(t) is the size of an
organism at time t. According to this model, growth is proportional to the remaining size of
the organism, α is the final size (total growth), and κ is the proportionality constant. Upon
integration of this generating equation one obtains
$$y(t) = \alpha + (\xi - \alpha)\exp\{-\kappa t\}, \qquad [5.44]$$
where ξ = y(0) is the initial size. In another of its parameterizations the equation
is known as the asymptotic regression model. Finally, one can put ξ = α(1 − exp{κt₀}),
where t₀ is the time at which the size (yield) is zero (y(t₀) = 0), to obtain the equation
$$y(t) = \alpha(1 - \exp\{-\kappa(t - t_0)\}). \qquad [5.47]$$
In this form the equation is known as the Mitscherlich law (or Mitscherlich equation),
popular in agronomy to model crop yield as a function of fertilizer input (t) (see §5.8.1 for a
basic application of the Mitscherlich model). In fisheries and wildlife research [5.46] is
known as the Von Bertalanffy model and it finds application as Newton's law of cooling of a
body over time in physics. Ratkowsky (1990) discusses it as Mitscherlich's law and the Von
Bertalanffy model but Seber and Wild (1989) argue that the Von Bertalanffy model is derived
from a different generating equation; see §5.2. Equations [5.44] through [5.47] are four parameterizations
of the same basic relationship. When fitting either parameterization to data,
meterizations of the same basic relationship. When fitting either parameterization to data,
they yield the same goodness of fit, the same residual error sum of squares, and the same
vector of fitted residuals. The interpretation of their parameters is different, however, as are
the statistical properties of the parameter estimators. For example, the correlations between
the parameter estimates can be quite different. In §4.4.4 it was discussed how correlations and
dependencies can negatively affect the least squares estimate. This problem is compounded in
nonlinear models which usually exhibit considerable correlations among the columns in the
regressor matrix. A parameterization that reduces these correlations will lead to more reliable
convergence of the iterative algorithm. To understand the effects of the parameterization on
statistical inference, we need to consider the concept of the curvature of a nonlinear model.
Example 5.9. Consider the linear mean function E[Y_i] = θx_i² and design points at
x₁ = 1 and x₂ = 2. The expectation surface is obtained by varying θ in
$$\mathrm{E}[\mathbf{Y}] = \begin{bmatrix}\theta\\ 4\theta\end{bmatrix} = \begin{bmatrix}f_1(\theta)\\ f_2(\theta)\end{bmatrix}.$$
[Figure 5.11: straight-line expectation surface in the (f₁(θ), f₂(θ)) plane.]
Figure 5.11. Expectation surface of E[Y_i] = θx_i². Asterisks mark points on the surface
for equally spaced values of θ = 0.2, 0.6, 1.0, 1.4, and 1.8.
Since the expectation surface does not bend as θ is changed, the model has no intrinsic
curvature. Near θ̂ the model can be approximated by the tangent-plane (Taylor) expansion
$$f(\mathbf{x},\theta) - f(\mathbf{x},\hat\theta) \approx \hat{\mathbf{F}}(\theta - \hat\theta).$$
Example 5.10. The nonlinear mean function E[Y_i] = exp{θx_i} is investigated, where
the only design points are x₁ = 1 and x₂ = 2. The expectation surface is then
$$\mathrm{E}[\mathbf{Y}] = \begin{bmatrix}\exp\{\theta\}\\ \exp\{2\theta\}\end{bmatrix} = \begin{bmatrix}f_1(\theta)\\ f_2(\theta)\end{bmatrix}.$$
This surface is graphed in Figure 5.12 along with points on the surface corresponding to
equally spaced sets of θ and ψ. Asterisks mark points on the surface for equally
spaced values of θ = 0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, and 3.5, and circles mark points for
equally spaced values of ψ = 2, 4, …, 32.
[Figure 5.12: curved expectation surface in the (f₁(θ), f₂(θ)) plane, with markers at θ = 2.0, 2.5, 3.0, and 3.5.]
Figure 5.12. Expectation surface of E[Y_i] = exp{θx_i} = ψ^{x_i} for x₁ = 1, x₂ = 2.
Adapted from Figures 3.5a) and b) in Seber, G.A.F. and Wild, C.J. (1989) Nonlinear
Regression. Wiley and Sons, New York. Copyright © 1989 John Wiley and Sons, Inc.
Reprinted by permission of John Wiley and Sons, Inc.
The two parameterizations of the model produce identical expectation surfaces; hence
their intrinsic curvatures are the same. The parameter-effects curvatures are quite
different, however. Equally spaced values of θ are no longer equally spaced on the surface
as in the linear case (Figure 5.11). Mapping onto the expectation surface results in
considerable distortion, and parameter-effects curvature is large. But equally spaced
values of ψ create almost equally spaced points on the surface. The parameter-effects
curvature of the model E[Y_i] = ψ^{x_i} is small.
What are the effects of strong curvature in a nonlinear model? Consider using the fitted
residuals
$$\hat e_i = y_i - f(\mathbf{x}_i,\hat\theta)$$
as a diagnostic tool similar to the diagnostics applicable in linear models (§4.4.1). There we can
calculate studentized residuals (see §4.8.3),
$$r_i = \hat e_i/\{\hat\sigma\sqrt{1 - h_{ii}}\},$$
for example, that behave similarly to the unobservable model errors. They have mean zero and
constant variance. Here, h_ii is the ith diagonal element of the projection matrix X(X′X)⁻¹X′.
If the intrinsic curvature component is large, a similar residual obtained from a nonlinear
model fit will not behave as expected. Let
$$r_i = \frac{y_i - f(\mathbf{x}_i,\hat\theta)}{\sqrt{\hat\sigma^2(1 - \hat h_{ii})}},$$
where ĥ_ii is the ith diagonal element of F̂(F̂′F̂)⁻¹F̂′. If intrinsic curvature is pronounced, the
r_i will not have mean 0, will not have constant variance, and, as shown by Seber and Wild
(1989, p. 178), r_i is negatively correlated with the predicted values ŷ_i. The diagnostic plot of
r_i versus ŷ_i, a tool commonly borrowed from linear model analysis, will show a negative
slope rather than a band of random scatter about 0. Seber and Wild (1989, p. 179) give ex-
pressions of what these authors call projected residuals that do behave as their counterparts
in the linear model (see also Cook and Tsai, 1985), but none of the statistical packages we are
aware of can calculate these. The upshot is that if intrinsic curvature is large, one may
diagnose a nonlinear model as deficient based on a residual plot when in fact the model is
adequate, or find a model to be adequate when in fact it is not. It is for this reason that we shy
away from standard residual plots in nonlinear regression models in this chapter.
The residuals are not affected by the parameter-effects curvature, only by intrinsic curva-
ture. Fitting a model in different parameterizations will produce the same set of fitted
residuals (see §5.8.2 for a demonstration). The parameter estimates, on the other hand, are
very much affected by parameter-effects curvature. The key properties of the nonlinear least
squares estimator θ̂, namely
• asymptotic unbiasedness
• asymptotic minimum variance
• asymptotic Gaussianity
[Figure 5.13: residual sum of squares profiles, S plotted against θ (solid) and against ψ (dashed).]
Figure 5.13. Residual sum of squares surfaces for E[Y_i] = exp{θx_i} = ψ^{x_i}, y₁ = 12,
y₂ = 221.5. Other values as shown in Figure 5.12. Solid line: sum of squares surface for
E[Y_i] = exp{θx_i}. Dashed line: sum of squares surface for E[Y_i] = ψ^{x_i}.
The log-logistic model
$$\mathrm{E}[Y|x] = \delta + \frac{\alpha - \delta}{1 + \psi\exp\{\beta\ln(x)\}} \qquad [5.48]$$
is popular in modeling the relationship between a dosage x and a response Y. The response
changes in a sigmoidal fashion between the asymptotes α and δ. The rate of change and
whether the response increases or decreases in x depend on the parameters ψ and β. Assume we
wish to replace a parameter, ψ say, by its expected value equivalent. The resulting
model should have less parameter-effects curvature than [5.48], since ψ was replaced by
its expected value parameter. It has not become more interpretable, however. Expected value
parameterization rests on estimating the unknown μ* for a known value of x*. This process
can be reversed if we want to express the model as a function of an unknown value x* for a
known value μ*. A common task in dose-response studies is to find that dosage which reduc-
es/increases the response by a certain amount or percentage. In a study of insect mortality as a
function of insecticide concentration, for example, we might be interested in the dosage at
which &!% of the treated insects die (the so-called PH&! value). In biomedical studies, one is
often interested in the dosage that cures *!% of the subjects. The idea is as follows. Consider
the model Ec] lBd 0 aBß )b with parameter vector ) c)" ß )# ß âß ): dw . Find a value B which
is of interest to the investigation, for example PH&! or KV&! , the dosage that reduc-
es/increases growth by &!%. Our goal is to estimate B . Now set
Ec] lB d 0 aB ß )b,
termed the defining relationship. Solve for one of the original parameters, )" say. Substitute
the expression obtained for )" into the original model 0 aBß )b and one obtains a model where
)" has been replaced by B . Schabenberger et al. (1999) apply these ideas in the study of
herbicide dose response where the log-logistic function has negative slope and represents the
growth of treated plants relative to an untreated control. Let -O be the value which reduces
growth by aO"!!b%. In a model with lower and upper asymptotes such as model [5.48] one
needs to carefully define whether this is a reduction of the maximum response ! or of the dif-
ference between the maximum and minimum response. Here, we define -O as the value for
which
"!! O
Ec] l-O d $ a! $ b.
"!!
Schabenberger et al. (1999) chose to solve for <. Upon substituting the result back into [5.48]
one obtains the reparameterized log-logistic equation
!$
Ec] lBd $ . [5.49]
" OÎa"!! O bexpe" lnaBÎ-O bf
In the special case where O &!, the term OÎa"!! O b in the denominator vanishes.
Two popular ways of expressing the Mitscherlich equation can also be developed by this
method. Our [5.44] shows the Mitscherlich equation for crop yield y at nutrient level x as
$$\mathrm{E}[Y|x] = \alpha + (\xi - \alpha)e^{-\kappa x},$$
where α is the yield asymptote and ξ is the yield at nutrient concentration x = 0. Suppose we
are interested in estimating what Black (1993, p. 273) calls the availability index, the nutrient
level x₀ at which the average yield is zero, and want to replace ξ in the process. The defining
relationship is
$$\mathrm{E}[Y|x_0] = 0 = \alpha + (\xi - \alpha)e^{-\kappa x_0}.$$
Solving for ξ gives ξ = α(1 − exp{κx₀}). Substituting back into the original equation and
simplifying one obtains the other popular parameterization of the Mitscherlich equation
(compare to [5.47]):
$$\mathrm{E}[Y] = \alpha + \big(\alpha(1 - e^{\kappa x_0}) - \alpha\big)e^{-\kappa x} = \alpha - \alpha e^{\kappa x_0}e^{-\kappa x} = \alpha\big(1 - e^{-\kappa(x - x_0)}\big).$$
5.8 Applications
In this section we present applications involving nonlinear statistical models and discuss their
implementation with The SAS® System. A good computer program for nonlinear modeling
should provide simple commands to generate standard results of any nonlinear analysis such
as parameter estimates, their standard errors, hypothesis tests or confidence intervals for the
parameters, confidence and prediction intervals for the response, and residuals. It should also
allow different fitting methods such as the Gauss-Newton, Newton-Raphson, and Levenberg-
Marquardt methods in their appropriately modified forms. An automatic differentiator helps
the user avoid having to code first (Gauss-Newton) and second (Newton-Raphson)
derivatives of the mean function with respect to the parameters. The nlin procedure of The
SAS® System fits these requirements.
In §5.8.1 we analyze a simple data set on sugar cane yields with Mitscherlich's law of
physiological relationships. The primary purpose is to illustrate a standard nonlinear
regression analysis with The SAS® System. But we also provide some additional details on
the genesis of this model that is key in agricultural investigations of crop yields as a function
of nutrient availability and fit the model in different parameterizations. As discussed in
§5.7.1, changing the parameterization of a model changes the statistical properties of the esti-
mators. Ratkowsky's simulation method (Ratkowsky 1983) is implemented for the sugar cane
yield data in §5.8.2 to compare the sampling distributions, bias, and excess variance of
estimators in different parameterizations of the Mitscherlich equation.
in a matrix programming language. Only when the A matrix has a simple form can the nlin
procedure be tricked into calculating the Wald test directly. The SAS® System provides proc
iml, an interactive matrix language, to perform these tasks as part of the SAS/IML® module.
The estimated asymptotic variance-covariance matrix σ̂²(F′F)⁻¹ can be output by proc nlin
and read into proc iml. Fortunately, the nlmixed procedure, which was added in Release 8.0
of The SAS® System, has the ability to perform tests of linear and nonlinear combinations of
the model parameters, eliminating the need for additional matrix programming. We
demonstrate the Wald test for treatment comparisons in §5.8.5, where a nonlinear response is
analyzed in a 2 × 3 factorial design (six treatment combinations). We analyze the factorial, testing for main effects and
interactions, and perform pairwise treatment comparisons based on the nonlinear parameters
akin to multiple comparisons in a linear analysis of variance model. The nlmixed procedure
can be used there to formulate contrasts efficiently.
Dose-response models such as the logistic or log-logistic models are among the most
frequently used nonlinear equations. Although they offer a great deal of flexibility, they are
no panacea for every data set of dose responses. One limitation of the logistic-type models,
for example, is that the response monotonically increases or decreases. Hormetic effects,
where small dosages of an otherwise toxic substance can have beneficial effects, can throw
off a dose-response investigation with a logistic model considerably. In §5.8.6 we provide
details on how to construct hormetic models and examine a data set used by Schabenberger et
al. (1999) to compare effective dosages between two herbicides, where for a certain weed
species one herbicide induces a hormetic response while the other does not.
Yield-density models are a special class of models closely related to linear models. Most
yield-density models can be linearized. In §5.8.7 we fit yield-density models to a data set by
Mead (1970) on the yield of different onion varieties. Tests of hypotheses comparing the
varieties as well as estimation of the genetic and environmental potentials are key in this
investigation, which we carry out using the nlmixed procedure.
The homogeneous variance assumption is not always tenable in nonlinear regression
analyses just as it is not tenable for many linear models. Transformations that stabilize the
variance may destroy other desirable properties of the model and transformations that
linearize the model do not necessarily stabilize the variance (§5.6). In the case of hetero-
geneous error variances we prefer to use weighted nonlinear least squares instead of transfor-
mations. In §5.8.8 we apply weighted nonlinear least squares to a growth modeling problem
and employ a grouping approach to determine appropriate weights.
Table 5.4. Sugar cane yield in randomized complete block design with five blocks
and six levels of nitrogen fertilization

Nitrogen                        Block                       Treatment
(kg/ha)       1        2        3        4        5       sample mean
0         89.49    54.56    74.33    78.20    61.51        71.62
25       108.78   102.01   105.04   105.23   106.52       105.51
50       136.28   129.51   132.54   132.73   134.02       133.01
100      157.63   167.39   155.39   146.85   155.81       156.57
150      185.96   176.66   178.53   195.34   185.56       184.41
200      195.09   190.43   183.52   180.99   205.69       191.14
Figure 5.14. Treatment average sugar cane yield vs. nitrogen level applied.
The averages for the nitrogen levels calculated across blocks monotonically increase in
the amount of N applied (Figure 5.14). The data do not indicate a decline or a maximum yield
at some N level within the range of fertilizer applied nor do they indicate a linear-plateau
relationship, since the yield does not appear constant for any level of nitrogen fertilization.
Rather, the maximum yield is approached asymptotically. At the control level (0 N), the
average yield is of course not 0; it corresponds to the natural fertility of the soil.
The parameter κ is the proportionality constant that Mitscherlich called the effect-factor
(Wirkungsfaktor in the original publication, which appeared in German). The larger κ, the
faster yield approaches its asymptote. Solving this generating equation leads to various
mathematical forms (parameterizations) of the Mitscherlich equation that are known under
different names. Four of them are given in §5.7; a total of eight parameterizations are
presented in §A5.9.1. We prefer to call simply the Mitscherlich equation what has been
termed Mitscherlich's law of physiological relationships. Two common forms of the equation
are
$$y(x) = \alpha(1 - \exp\{-\kappa(x - x_0)\}) \qquad [5.51]$$
$$y(x) = \alpha + (\xi - \alpha)\exp\{-\kappa x\}.$$
[Figure 5.15: three Mitscherlich curves with asymptote α = 100. Model 1: κ = 0.03, x₀ = −20, ξ = 45.12; Model 2: κ = 0.04, x₀ = −20, ξ = 55.07; Model 3: κ = 0.05, x₀ = −10, ξ = 39.35.]
Both equations are three-parameter models (Figure 5.15). In the first, the parameters are
α, κ, and x₀; in the second equation the parameters are α, ξ, and κ. α represents the
asymptotic yield and x₀ is the fertilizer concentration at which the yield is 0, i.e., y(x₀) = 0.
Black (1993, p. 273) calls x₀ the availability index of the nutrient in the soil (and seed)
when none is added in the fertilizer or, as Mead et al. (1993, p. 264) put it, “the amount of
fertilizer already in the soil.” Since x = 0 fertilizer is the minimum that can be applied, x₀ is
obtained by extrapolating the yield-nutrient relationship below the lowest rate to the point
where yield is exactly zero (Figure 5.15). This assumes that the Mitscherlich equation extends
past the lowest level applied, which may not be the case. We caution therefore against
attaching too much validity to the parameter x₀. The second parameterization replaces the
parameter x₀ with the yield that is obtained if no fertilizer is added, ξ = y(0). The relationship
between the two parameterizations is
$$\xi = \alpha(1 - \exp\{\kappa x_0\}).$$
In both model formulas, the parameter κ is a scale parameter that governs the rate of
change. It is not itself the rate of change, as is sometimes stated. Figure 5.15 shows three
Mitscherlich equations with asymptote α = 100 that vary in κ and x₀. With increasing κ, the
asymptote is reached more quickly (compare Models 1 and 2 in Figure 5.15). It is also clear
from the figure that x₀ is an extrapolated value.
One of the methods for finding starting values in a nonlinear model that was outlined in
§5.4.3 relies on the expected value parameterization of the model (Ratkowsky 1990, Ch.
2.3.1). Here we choose values of the regressor variables and rewrite the model in terms of the
mean response at those values. We call these expected value parameters since they corres-
pond to the means at the particular values of the regressors that were chosen. For each
regressor value for which an expected value parameter is obtained, one parameter of the
original model is replaced. An expected value parameterization for the Mitscherlich model
due to Schnute and Fournier (1980) is
$$y(x) = \mu^* + (\mu^{**} - \mu^*)\frac{1 - \theta^{m-1}}{1 - \theta^{n-1}},\quad m = 1 + (n-1)\frac{x - x^*}{x^{**} - x^*},\quad n = \text{number of observations}. \qquad [5.52]$$
Here, μ* and μ** are the expected value parameters for the yield at nutrient levels x* and x**,
respectively. Expected value parameterizations have advantages and disadvantages. Finding
starting values is particularly simple if a model is written in terms of expected value parameters.
If, for example, x* = 25 and x** = 150 are chosen for the sugar cane data, reasonable
starting values (from Fig. 5.14) are identified as μ*⁰ = 120 and μ**⁰ = 175. Models in expected
value parameterization are also closer to linear models in terms of the statistical properties
of the estimators because of low parameter-effects curvature. Ratkowsky (1990) notes that the
Mitscherlich model is notorious for high parameter-effects curvature, which gives particular
relevance to [5.52]. A disadvantage is that the interpretation of parameters in terms of
physical or biological quantities is lost compared to other parameterizations.
Starting values for the standard forms of the Mitscherlich model can also be found
relatively easily by using the various devices described in §5.4.3. Consider the model
$$y(x) = \alpha(1 - \exp\{-\kappa(x - x_0)\}).$$
Since α is the upper asymptote, Figure 5.14 would suggest a starting value of α⁰ = 200.
Once α is fixed we can rewrite the model as
$$\ln\{\alpha^0 - y\} = \ln\{\alpha\} + \kappa x_0 - \kappa x.$$
This is a linear regression with response ln{α⁰ − y}, intercept ln{α} + κx₀, and slope −κ. For
the averages of the sugar cane yield data listed in Table 5.4 and graphed in Figure 5.14 we
obtain the results shown in Output 5.8. From Output 5.8 we gather that a reasonable starting value is κ⁰ = 0.01356. We deliberately
ignore all other results from this linear regression since the value α⁰ = 200 that was
substituted to enable a linearization by taking logarithms was only a guess. Finally, we need a
starting value for x₀, the nutrient concentration at which the yield is 0. Visually extrapolating
the response trend in Figure 5.14, a value of x₀⁰ = −25 seems not unreasonable as a first
guess.
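The auxiliary regression can be computed along the following lines (a sketch; the response name y2 matches Output 5.8, the remaining names mirror the nonlinear fits below):

data linearized;
   set CaneMeans;              /* treatment means, all below alpha0 = 200 */
   y2 = log(200 - yield);      /* response ln(alpha0 - y) */
run;
proc reg data=linearized;
   model y2 = nitro;           /* slope estimate equals -kappa0 */
run;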
Output 5.8.
The REG Procedure
Model: MODEL1
Dependent Variable: y2
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Now that starting values have been assembled we fit the nonlinear regression model with
proc nlin. The statements that accomplish this in the parameterization for which the starting
values were obtained are
proc nlin data=CaneMeans method=newton;
parameters alpha=200 kappa=0.0136 nitro0=-25;
Mitscherlich = alpha*(1-exp(-kappa*(nitro-nitro0)));
model yield = Mitscherlich;
run;
The method=newton option of the proc nlin statement selects the Newton-Raphson algorithm.
If the method= option is omitted, the procedure defaults to the Gauss-Newton method.
In either case proc nlin does not implement the unmodified algorithms but adds internally
the necessary modifications, such as step-halving, that stabilize the iterations. Among the other fitting
methods that can be chosen is the Marquardt-Levenberg algorithm (method=marquardt), which
is appropriate if the columns of the derivative matrix are highly correlated (Marquardt 1963).
Prior to Release 6.12 of The SAS® System the user had to specify first derivatives of the
mean function with respect to all parameters for the Gauss-Newton method and first and
second derivatives for the Newton-Raphson method. To circumvent the specification of
derivatives, one could use method=dud which invoked a derivative-free method (Ralston and
Jennrich 1978). This acronym stands for Does not Use Derivatives and the method enjoyed
popularity because of this feature. The numerical properties of this algorithm are typically
poor and the algorithm is not efficient in terms of computing time. Since The SAS® System
calculates derivatives automatically starting with Release 6.12, the DUD method should no
longer be used. There is no justification in our opinion for using a method that approximates
derivatives over one that determines the actual derivatives. Even in newer releases the user
can still enter derivatives through der. statements of proc nlin. For the Mitscherlich model
above one would code, for example,
proc nlin data=CaneMeans method=gauss;
parameters alpha=200 kappa=0.0136 nitro0=-25;
Mitscherlich = alpha*(1-exp(-kappa*(nitro-nitro0)));
model yield = Mitscherlich;
der.alpha = 1 - exp(-kappa * (nitro - nitro0));
der.kappa = alpha * ((nitro - nitro0) * exp(-kappa * (nitro - nitro0)));
der.nitro0 = alpha * (-kappa * exp(-kappa * (nitro - nitro0)));
run;
to obtain a Gauss-Newton fit of the model. The added programming is rarely worth the
trouble, mistakes in coding the derivatives are costly, and the built-in differentiator of proc
nlin is of such high quality that we recommend allowing The SAS® System to determine the
derivatives. If the user wants to examine the derivatives used by SAS®, add the options
list and listder to the proc nlin statement.
Finally, the results of fitting the Mitscherlich model by the Newton-Raphson method in
the parameterization
$$y_i = \alpha(1 - \exp\{-\kappa(x_i - x_0)\}) + e_i, \quad i = 1, \ldots, 6,$$
where the e_i are uncorrelated random errors with mean 0 and variance σ², with the statements
proc nlin data=CaneMeans method=newton noitprint;
parameters alpha=200 kappa=0.0136 nitro0=-25;
Mitscherlich = alpha*(1-exp(-kappa*(nitro-nitro0)));
model yield = Mitscherlich;
run;
are shown as Output 5.9. The procedure converges after six iterations with a residual sum of
squares of S(θ̂) = 57.2631. The model fits the data very well as measured by
$$\text{Pseudo-}R^2 = 1 - \frac{57.2631}{10{,}775.8} = 0.9947.$$
The converged iterates (the parameter estimates) are
$$\hat\theta = [\hat\alpha,\,\hat\kappa,\,\hat x_0]' = [205.8,\; 0.0112,\; -38.7728]',$$
from which a prediction of the mean yield at fertilizer level 60 kg ha⁻¹, for example, can
be obtained as
$$\hat y(60) = 205.8\big(1 - \exp\{-0.0112(60 + 38.7728)\}\big) = 137.722.$$
Output 5.9.
Estimation Summary
Method Newton
Iterations 6
R 8.93E-10
PPC 6.4E-11
RPC(kappa) 7.031E-6
Object 6.06E-10
Objective 57.26315
Observations Read 6
Observations Used 6
Observations Missing 0
For each parameter in the parameters statement proc nlin lists its estimate, (asymptotic)
estimated standard error, and (asymptotic) 95% confidence interval. For example, α̂ = 205.8
with ease(α̂) = 8.9415. The asymptotic 95% confidence interval for α is calculated as
$$\hat\alpha \pm t_{0.025,3}\,\mathrm{ease}(\hat\alpha) = 205.8 \pm 3.182 \times 8.9415 = [177.3,\ 234.2].$$
Based on this interval one would, for example, reject the hypothesis that the upper yield
asymptote is 250 and fail to reject the hypothesis that the asymptote is 200.
The printout of the Approximate Correlation Matrix lists the estimated correlation
coefficients between the parameter estimates,
$$\mathrm{Corr}[\hat\theta_j,\hat\theta_k] = \frac{\mathrm{Cov}[\hat\theta_j,\hat\theta_k]}{\sqrt{\mathrm{Var}[\hat\theta_j]\,\mathrm{Var}[\hat\theta_k]}}.$$
The estimated correlations are substantial, with magnitudes such as 0.75 and
Corr[κ̂, x̂₀] = 0.912. Studying the derivatives of the Mitscherlich model in this
parameterization, this is not surprising. They all involve the same term
$$\exp\{-\kappa(x - x_0)\}.$$
Highly correlated parameter estimators are indicative of poor conditioning of the F′F matrix
and can cause instabilities during iterations. In the presence of large correlations one should
switch to the Marquardt-Levenberg algorithm or change the parameterization. Below we will
see how the expected value parameterization leads to considerably smaller correlations.
If the availability index x₀ is of lesser interest than the control yield, one can obtain an
estimate of ŷ(0) = ξ̂ from the parameter estimates in Output 5.9. Since
ξ = α(1 − exp{κx₀}), we simply substitute estimates for the unknowns and obtain

   ξ̂ = α̂(1 − exp{κ̂x̂₀}) = 205.8(1 − exp{0.0112 × (−38.7728)}) = 72.5.
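The substitution is easily verified with a short data step; a minimal sketch (the data set name is ours):

data control;
   alpha = 205.8; kappa = 0.0112; nitro0 = -38.7728;  /* estimates, Output 5.9 */
   xi = alpha*(1 - exp(kappa*nitro0));                /* control yield = 72.5  */
run;
proc print data=control; run;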
Although it is easy to obtain the point estimate of ξ, it is not a simple task to calculate the
standard error of this estimate, needed for a confidence interval, for example. Two possibilities
exist to accomplish that. One can refit the model in the parameterization

   y_i = α + (ξ − α)exp{−κx_i} + e_i,

which explicitly involves ξ. proc nlin calculates approximate standard errors and 95%
confidence intervals for each parameter. The second method uses the capabilities of proc
nlmixed to estimate the standard error of nonlinear functions of the parameters by the delta
method. We demonstrate both approaches.
The statements
proc nlin data=CaneMeans method=newton noitprint;
parameters alpha=200 kappa=0.0136 ycontrol=72;
Mitscherlich = alpha + (ycontrol - alpha)*(exp(-kappa*nitro));
model yield = Mitscherlich;
run;
fit the model in the new parameterization (Output 5.10). The quality of the model fit has not
changed from the first parameterization in terms of x₀. The analysis of variance tables in
Outputs 5.9 and 5.10 are identical. Furthermore, the estimates of α and κ and their standard
errors have not changed. The parameter labeled ycontrol now replaces the term nitro0 and
its estimate agrees with the calculation based on the estimates of the model in the first param-
eterization. From Output 5.10 we are able to state that with (approximately) 95% confidence
the interval [59.625, 85.440] contains the control yield.
Using proc nlmixed, the parameterization of the model need not be changed in order to
obtain an estimate and a confidence interval of the control yield:
proc nlmixed data=CaneMeans df=3 technique=NewRap;
parameters alpha=200 kappa=0.0136 nitro0=-25;
s2 = 19.0877;
Mitscherlich = alpha*(1-exp(-kappa*(nitro-nitro0)));
model yield ~ normal(Mitscherlich,s2);
estimate 'ycontrol' alpha*(1-exp(kappa*nitro0));
run;
The variance of the error distribution is fixed at the residual mean square from the earlier
fits (see Output 5.10). Otherwise proc nlmixed will estimate the residual variance by maximum likelihood.
Output 5.10.
The NLIN Procedure
Iterative Phase
Dependent Variable yield
Method: Newton
Estimation Summary
Method Newton
Iterations 5
R 4.722E-8
PPC(kappa) 1.355E-8
RPC(alpha) 0.000014
Object 2.864E-7
Objective 57.26315
Observations Read 6
Observations Used 6
Observations Missing 0
Finally, we fit the Mitscherlich model in the expected value parameterization [5.52] and
choose x* = 25 and x** = 150.
proc nlin data=CaneMeans method=newton;
parameters mustar=125 mu2star=175 theta=0.75;
n = 6; xstar = 25; x2star = 150;
m = (n-1)*(nitro-xstar)/(x2star-xstar) + 1;
Mitscherlich = mustar + (mu2star-mustar)*(1-theta**(m-1))/
(1-theta**(n-1));
model yield = Mitscherlich;
run;
The model fit is identical to the preceding parameterizations (Output 5.12). Notice that
the correlations among the parameters are markedly reduced. The estimators of μ* and μ** are
almost orthogonal.
Output 5.11.
Specifications
Data Set WORK.CANEMEANS
Dependent Variable yield
Distribution for Dependent Variable Normal
Optimization Technique Newton-Raphson
Integration Method None
Dimensions
Observations Used 6
Observations Not Used 0
Total Observations 6
Parameters 3
Parameters
alpha kappa nitro0 NegLogLike
200 0.0136 -25 22.8642468
Iteration History
Iter Calls NegLogLike Diff MaxGrad Slope
1 10 16.7740728 6.090174 24.52412 -11.0643
2 15 16.1072166 0.666856 1767.335 -1.48029
3 20 15.8684503 0.238766 21.15303 -0.46224
4 25 15.8608402 0.00761 36.2901 -0.01514
5 30 15.8607649 0.000075 0.008537 -0.00015
6 35 15.8607649 6.02E-10 0.000046 -1.21E-9
Fit Statistics
-2 Log Likelihood 31.7
AIC (smaller is better) 37.7
AICC (smaller is better) 49.7
BIC (smaller is better) 37.1
Parameter Estimates
Standard
Parameter Estimate Error DF t Value Pr>|t| Lower Upper
alpha 205.78 8.9496 3 22.99 0.0002 177.30 234.26
kappa 0.01121 0.001863 3 6.02 0.0092 0.005281 0.01714
nitro0 -38.7728 6.5613 3 -5.91 0.0097 -59.6539 -17.8917
Additional Estimates
Standard
Label Estimate Error DF t Value Pr > |t| Lower Upper
ycontrol 72.5329 4.0571 3 17.88 0.0004 59.6215 85.4443
Output 5.12.
Estimation Summary
Method Newton
Iterations 3
R 4.211E-8
PPC 3.393E-9
RPC(mustar) 0.000032
Object 1.534E-6
Objective 57.26315
Observations Read 6
Observations Used 6
Observations Missing 0
The variable pred was set to one for observations in the filler data set to distinguish
observations from filler data. The output out= statement saves predicted values and 95%
confidence limits for the mean yield in the data set nlinout. The first fifteen observations of
the output data set are shown below and a graph of the predictions is illustrated in Figure
5.16. Observations for which pred=. are the observations to which the model is fitted;
observations with pred=1 are the filler data.
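The statements that created nlinout are not reproduced above; a minimal sketch consistent with this description follows. The grid of filler rates and the data set names are assumptions on our part.

data filler;                              /* dense grid of N rates, yield missing */
   pred = 1;
   do nitro = 0 to 250 by 2; yield = .; output; end;
run;
data CaneAll; set CaneMeans filler; run;  /* pred = . for the observed data */
proc nlin data=CaneAll method=newton noitprint;
   parameters alpha=200 kappa=0.0136 nitro0=-25;
   model yield = alpha*(1-exp(-kappa*(nitro-nitro0)));
   output out=nlinout predicted=predicted l95m=lowerm u95m=upperm;
run;

Observations with missing yield do not contribute to the fit but still receive predicted values and confidence limits.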
Output 5.13.
Obs nitro yield pred PREDICTED LOWERM UPPERM
Figure 5.16. Predicted sugar cane yields and approximate 95% confidence limits.
where θ̂_i· is the average of the estimates θ̂_ij across the K simulations, i.e.,

   θ̂_i· = (1/K) Σ_{j=1}^{K} θ̂_ij.
If the curvature is strong, the estimated variance of the parameter estimators is an under-
estimate. Similar to the relative bias we calculate the relative excess variance as

   RelativeExcessVariance%_i = 100 × (s_i² − Var[θ̂_i]) / Var[θ̂_i].          [5.54]
Here, s_i² is the sample variance of the estimates θ̂_ij in the simulations,

   s_i² = (1/(K − 1)) Σ_{j=1}^{K} (θ̂_ij − θ̂_i·)²,

and Var[θ̂_i] is the estimated asymptotic variance of θ̂_i from the original fit. Whether the
relative bias and the excess variance are significant can be tested by calculating test statistics

   Z⁽¹⁾ = √K (θ̂_i· − θ̂_i) / Var[θ̂_i]^(1/2),

with an analogous statistic Z⁽²⁾ for the excess variance.
Z⁽¹⁾ and Z⁽²⁾ are compared against cutoffs of a standard Gaussian distribution to determine
the significance of the bias (Z⁽¹⁾) or the variance excess (Z⁽²⁾). This seems like a lot of
trouble to determine whether model curvature induces bias and excess variability of the
coefficients, but it is fairly straightforward to implement the process with The SAS® System.
The complete code including tests for Gaussianity, histograms of the parameter estimates in
the simulations, and statistical tests for excess variance and bias can be found on the CD-
ROM.
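The core of the simulation can be sketched in a few statements. The following is a minimal version under our own choices of seed, number of simulations, and design points; the code on the CD-ROM is more elaborate.

%let K = 1000;                               /* number of simulated data sets */
data sim;
   do sample = 1 to &K;
      do nitro = 0, 50, 100, 150, 200, 250;  /* assumed design points */
         /* generate from the fitted model with residual variance 19.0877 */
         yield = 205.8*(1 - exp(-0.0112*(nitro + 38.7728)))
                 + sqrt(19.0877)*rannor(20387);
         output;
      end;
   end;
run;
proc nlin data=sim method=newton noitprint;
   parameters alpha=200 kappa=0.0136 nitro0=-25;
   model yield = alpha*(1-exp(-kappa*(nitro-nitro0)));
   by sample;
   ods output ParameterEstimates=est;        /* the K sets of estimates */
run;

proc means applied to the data set est then yields θ̂_i· and s_i², from which the relative bias, the relative excess variance [5.54], and the Z statistics are computed in a short data step.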
Figure 5.17. Sampling distribution of parameter estimates when fitting the Mitscherlich equation
in parameterization E[Y_i] = α(1 − exp{−κ(x_i − x₀)}) to data in Table 5.4.
Figure 5.18. Sampling distribution of parameter estimates when fitting the Mitscherlich equation
in expected value parameterization to data in Table 5.4.
Table 5.5. P-values of tests of Gaussianity for standard and expected value parameterization
of the Mitscherlich equation fitted to data in Table 5.4

Parameter Estimate      α̂        κ̂        x̂₀       μ̂*       μ̂**      θ̂
P-value              <0.0001   0.027   <0.0001   0.432    0.320    0.713
the wrong values. Particularly concerning is the large excess variance in the estimate of the
asymptotic yield α. Nonlinear least squares estimation underestimates the variance of this
parameter by 15.1%. Significance tests about the yield asymptote should be interpreted with
the utmost caution. The expected value parameterization fares much better. Its relative biases
are an order of magnitude smaller than the biases in the standard parameterization and not
significant. The variance estimates of μ̂*, μ̂**, and θ̂ are reliable, as shown by the small excess
variances and the large p-values.
Table 5.6. Bias [5.53] and variance [5.54] excesses for standard and expected value
parameterization of the Mitscherlich equation fitted to data in Table 5.4

                                Statistic and P-Value
        θ̂_i·     RelativeBias%      P       RelativeExcessVariance%      P
α̂      207.4        0.80        <0.0001            15.10             0.0009
κ̂      0.011        0.02          0.98              0.48             0.89
x̂₀    −39.5        1.96         0.0002              5.12             0.24
μ̂*     105.2        0.09          0.23              0.10             0.95
μ̂**    180.9        0.002         0.97              2.00             0.66
θ̂      0.756        0.113         0.44              0.16             0.99
Here, α is the change-point parameter and γ determines the radius of curvature at the change
point. The two functions connect smoothly.
Figure 5.19. Sediment settling data based on Table 2 in Watts and Bacon (1974). Reprinted
with permission from Technometrics. Copyright © 1974 by the American Statistical Associa-
tion. All rights reserved.
Figure 5.20. Segmented linear trend for Kirby's wheat shoot apex data (Kirby 1974 and
Lerman 1980). Adapted with permission from estimates reported by Lerman (1980).
Copyright © 1980 by the Royal Statistical Society.
Linear-plateau models are special cases of these segmented models in which the transition
between the segments is not smooth; there is a kink at the join-point. They are in fact special cases
of the linear-slope models that connect two linear segments. Kirby (1974) examined
shoot-apex development in wheat, studying the natural logarithm of the number of
primordia as a function of days since sowing (Figure 5.20). Arguing on biological grounds, it
was believed that the increase in ln(# primordia) slows down sharply (abruptly) at the end of
spikelet initiation, which can be estimated from mature plants. The kink this creates in the
response is obvious in the model graphed in Figure 5.20, which was considered by Lerman
(1980) for Kirby's data.
If the linear segment on the right-hand side has zero slope, we obtain the linear-plateau
model. Anderson and Nelson (1975) studied various segmented models for crop yield as a
function of fertilizer, the linear-plateau model being a special case. We show some of these
models in Figure 5.21 along with the terminology used in the sequel.
Figure 5.21. Some members of the family of linear segmented models. The linear-slope
model (LS) joins two line segments with non-zero slopes, the linear-plateau model (LP) has
two line segments, the second of which has zero slope, and the linear-slope-plateau model
(LSP) has two line segments that connect to a plateau.
Anderson and Nelson (1975) consider the fit of these models to two data sets of corn
yields from twenty-two locations in North Carolina and ten site-years in Tennessee. We
repeat part of the Tennessee data in Table 5.7 (see also Figure 5.22).
Table 5.7. Tennessee average corn yields for two locations and three years as a function of
nitrogen (kg ha⁻¹) based on experiments of Engelstad and Parks (1971)

                      Knoxville                  Jackson
N (kg ha⁻¹)     1962    1963    1964      1962    1963    1964
0               44.6    45.1    60.9      46.5    29.3    28.8
67              73.0    73.2    75.9      59.0    55.2    37.6
134             75.2    89.3    83.7      71.9    77.3    55.2
201             83.3    91.2    84.3      73.1    88.0    66.8
268             78.4    91.4    81.8      74.5    89.4    67.0
335             80.9    88.0    84.5      75.5    87.0    67.8

Data appeared in Anderson and Nelson (1975). Used with permission of the International
Biometric Society.
Figure 5.22. Corn yields from two locations and three years, according to Anderson and
Nelson (1975).
Table 5.8 shows the models, their residual sums of squares, and the join-points that
Anderson and Nelson (1975) determined to best fit the particular subsets of the data. Because
we can fit these models as nonlinear models, we can estimate the join-points from the data in
most cases. As we will see, convergence difficulties can be encountered if, for example, a
linear-slope-plateau model is fit in nonlinear form to a data set with only six points. The non-
linear version of this model has five parameters and there may not be sufficient information in
the data to estimate the parameters.
Table 5.8. Results of fitting the models selected by Anderson and Nelson (1975) for the
Tennessee corn yield data of Table 5.7 (the join-points were fixed and the
resulting models fit by linear regression)

Location    Year   Model Type    SSR     Join-Point 1   Join-Point 2   k†
Knoxville   1962   LSP‡         14.26        67             201        3
Knoxville   1963   LP           10.59       100.5                      2
Knoxville   1964   LP            4.56       100.5                      2
Jackson     1962   LP           11.06       167.5                      2
Jackson     1963   LP            7.12       167.5                      2
Jackson     1964   LP           13.49       201                        2

† k = number of parameters estimated in the linear regression model with fixed join-points
‡ LSP: Linear-Slope-Plateau model; LP: Linear-Plateau model; LS: Linear-Slope model
Before fitting the models in nonlinear form, we give the mathematical expressions for the
LP, LS, and LSP models from which the model statements in proc nlin will be built. For
completeness we include the simple linear regression model (SLR) too. Let x denote N con-
centration and α₁, α₂ the two join-points. Recall that I(x ≤ α₁), for example, is the indicator
function that takes on value 1 if x ≤ α₁ and 0 otherwise. Furthermore, define the following
four quantities:

   θ₁ = β₀ + β₁x
   θ₂ = β₀ + β₁α₁
   θ₃ = β₀ + β₁α₁ + β₂(x − α₁)                                    [5.55]
   θ₄ = β₀ + β₁α₁ + β₂(α₂ − α₁).

θ₁ is the linear trend of the first segment, θ₂ the yield achieved when the first segment reaches
concentration x = α₁, and so forth. The three models now can be written as
   SLR: E[Y] = θ₁
   LP:  E[Y] = θ₁I(x ≤ α₁) + θ₂I(x > α₁)
   LS:  E[Y] = θ₁I(x ≤ α₁) + θ₃I(x > α₁)                          [5.56]
   LSP: E[Y] = θ₁I(x ≤ α₁) + θ₃I(α₁ < x ≤ α₂) + θ₄I(x > α₂).

We find this representation of the linear-plateau family of models useful because it suggests
how to test certain hypotheses. Take the LS model, for example. Comparing θ₃ and θ₂ we see
that the linear-slope model reduces to a linear-plateau model if β₂ = 0, since then θ₂ = θ₃.
Furthermore, if β₁ = β₂, an LS model reduces to the simple linear regression model (SLR).
The proc nlin statements to fit the various models follow.
proc sort data=tennessee; by location year; run;
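For the linear-plateau model, for example, statements along the following lines can be used. This is a sketch; the starting values are borrowed from the proc nlmixed run below.

proc nlin data=tennessee method=newton;
   parameters b0=45 b1=0.43 a1=67;
   firstterm = b0 + b1*n;            /* rising segment      */
   plateau   = b0 + b1*a1;           /* plateau beyond a1   */
   model yield = firstterm*(n <= a1) + plateau*(n > a1);
   by location year;
run;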
With proc nlmixed we can fit the various models and perform the necessary hypothesis
tests through the contrast or estimate statements of the procedure. To fit linear-slope
models and compare them to the LP and SLR models use the following statements.
proc nlmixed data=tennessee df=3;
parameters b0=45 b1=0.43 b2=0 a1=67 s=2;
firstterm = b0+b1*n;
secondterm = b0+b1*a1+b2*(n-a1);
model yield ~ normal(firstterm*(n <= a1) + secondterm*(n > a1),s*s);
estimate 'Test against SLR' b1-b2;
estimate 'Test against LP ' b2;
contrast 'Test against SLR' b1-b2;
contrast 'Test against LP ' b2;
by location year;
run;
Table 5.9. Results of fitting the linear-plateau type models to the Tennessee corn yield data
of Table 5.7 (the join-points were estimated from the data)

Location    Year   Model Type    SSR     Join-Point 1   k†
Knoxville   1962   LP‡          36.10        82.2       3
Knoxville   1963   LP            7.89       107.0       3
Knoxville   1964   LP            4.54       101.3       3
Jackson     1962   LS            0.05       134.8       4
Jackson     1963   LP            5.31       162.5       3
Jackson     1964   LP           13.23       203.9       3

† k = number of parameters estimated in the nonlinear regression model with estimated join-points
‡ LP: Linear-Plateau model; LS: Linear-Slope model
In Table 5.9 we show the results of the nonlinear models that best fit the six site-years.
The linear-slope-plateau model for Knoxville in 1962 did not converge with proc nlin. This
is not too surprising since this model has 5 parameters (β₀, β₁, β₂, α₁, α₂) and only six obser-
vations. Not enough information is provided by the data to determine all nonlinear param-
eters. Instead we determined that a linear-plateau model best fits these data if the join-point is
estimated.
Comparing Tables 5.8 and 5.9, several interesting facts emerge. The models selected as
best by Anderson and Nelson (1975), based on fitting linear regression models with fixed
join-points, are not necessarily the best models when the join-points are estimated.
For data from Knoxville 1963 and 1964 as well as Jackson 1963 and 1964, both approaches
arrive at the same basic model, a linear-plateau relationship. The residual sums of squares
between the two approaches then must be close if the join-point in Anderson and Nelson's
approach was fixed at a value close to the nonlinear least squares iterate of α₁. This is the
case for Knoxville 1964 and Jackson 1964. As the estimated join-point is further removed
from the fixed join-point (e.g., Knoxville 1963), the residual sum of squares in the nonlinear
model is considerably lower than that of the linear model fit.
For the data from the Jackson location in 1962, the nonlinear method selected a different
model. Whereas Anderson and Nelson (1975) select an LP model with join-point at 167.5,
fitting a series of nonlinear models leads one to a linear-slope (LS) model with join-point at
134.8 kg ha⁻¹. The residual sum of squares of the nonlinear model is more than 200 times
smaller than that of the linear model. Although the slope (β₂) estimate of the LS model is
close to zero (Output 5.14), so is its standard error, and the approximate 95% confidence inter-
val for β₂ does not include zero ([0.0105, 0.0253]). Not restricting the second segment of
the model to be a flat line significantly improves the model fit (not only over a model with
fixed join-point, but also over a model with estimated join-point).
Output 5.14.
------------------- location=Jackson year=1962 -----------------------
Estimation Summary
Method Newton
Iterations 7
R 1.476E-6
PPC(b2) 1.799E-8
RPC(b2) 0.000853
Object 0.000074
Objective 0.053333
Observations Read 6
Observations Used 6
Observations Missing 0
The data from Knoxville in 1962 are a somewhat troubling case. The linear-slope-plateau
model that Anderson and Nelson (1975) selected does not converge when the join-points are
estimated from the data. Between the LS and LP models, the former does not provide a signi-
ficant improvement in fit, and we settle on the linear-plateau model for these data (Table 5.9).
Figure 5.23. Predicted corn yields for Tennessee corn yield data (Model for Jackson 1962 is
a linear-slope model; all others are linear-plateau models).
or

   E[Y_ij] = (β₀ⱼ + β₁ⱼNO₃ᵢⱼ)I{NO₃ᵢⱼ ≤ αⱼ} + (β₀ⱼ + β₁ⱼαⱼ)I{NO₃ᵢⱼ > αⱼ}.
Figure 5.24. Relative yields as a function of soil NO₃ for 30 and 60 cm sampling depths.
Data from Binford, Blackmer, and Cerrato (1992, Figures 2c, 3c) containing only N-respon-
sive site-years on sites that received 0, 112, 224, or 336 kg ha⁻¹ N. Data kindly made
available by Dr. A. Blackmer, Department of Agronomy, Iowa State University. Used with
permission.
If soil samples from the top 30 cm (j = 1) are indicative of the amount of N in the root-
ing zone and movement through macropores causes marked dispersion as suggested by
Blackmer et al. (1989), one would expect the 0 to 30 cm data to yield a larger intercept
(β₀₁ > β₀₂), smaller slope (β₁₁ < β₁₂), and larger critical concentration (α₁ > α₂). The
plateaus, however, should not be significantly different. Before testing

   H₀: β₀₁ = β₀₂   vs.   H₁: β₀₁ ≠ β₀₂
   H₀: β₁₁ = β₁₂   vs.   H₁: β₁₁ ≠ β₁₂                            [5.58]
   H₀: α₁ = α₂     vs.   H₁: α₁ ≠ α₂,
we examine whether there are any differences in the plateau models between the two sampl-
ing depths. To this end we fit the full model [5.57] and compare it to the reduced model

   E[Y_ij] = (β₀ + β₁NO₃ᵢⱼ)I{NO₃ᵢⱼ ≤ α} + (β₀ + β₁α)I{NO₃ᵢⱼ > α},     [5.59]

which does not vary the parameters by sampling depth, with a sum of squares reduction test.
%include 'DriveLetterOfCDROM:\Data\SAS\BlackmerData.txt';
The full model parameterizes the responses for the 60 cm sampling depth (β₀₂, β₁₂, α₂) as
β₀₂ = β₀₁ + Δβ₀, β₁₂ = β₁₁ + Δβ₁, α₂ = α₁ + Δα, so that differences between the sampling
depths in the parameters can be assessed immediately on the output. The reduced model has a
residual sum of squares of SS(θ̂)_r = 39,761.6 on 477 degrees of freedom (Output 5.15).
Output 5.15.
The NLIN Procedure
Estimation Summary
Method Gauss-Newton
Iterations 8
R 0
PPC 0
RPC(alpha) 0.000087
Object 4.854E-7
Objective 39761.57
Observations Read 480
Observations Used 480
Observations Missing 0
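The full-model statements are not shown above. A sketch consistent with the difference parameterization and the parameter names of Outputs 5.16 and 5.18 follows; the starting values are our own.

proc nlin data=blackmer method=marquardt noitprint;
   parms b01=15 b11=3.5 alp1=23 del_b0=0 del_b1=0 del_alp=0;
   b02 = b01 + del_b0;  b12 = b11 + del_b1;  alp2 = alp1 + del_alp;
   model30 = (b01 + b11*no3)*(no3 <= alp1) + (b01 + b11*alp1)*(no3 > alp1);
   model60 = (b02 + b12*no3)*(no3 <= alp2) + (b02 + b12*alp2)*(no3 > alp2);
   model ryp = model30*(depth=30) + model60*(depth=60);
run;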
The full model's residual sum of squares is SS(θ̂)_f = 29,236.5 on 474 degrees of
freedom (Output 5.16). The test statistic for the three degree of freedom hypothesis is

   F_obs = ((39,761.6 − 29,236.5)/3) / (29,236.5/474) = 56.879.
Output 5.16.
Estimation Summary
Method Marquardt
Iterations 6
R 0
PPC 0
RPC(del_alp) 3.92E-6
Object 1.23E-10
Objective 29236.55
Observations Read 480
Observations Used 480
Observations Missing 0
Even if a Bonferroni adjustment is made to protect the experimentwise Type-I error rate
in this series of three tests, at the experimentwise 5% error level all three tests lead to rejec-
tion of their respective null hypotheses. Table 5.11 shows the estimates for the full model.
The two plateau values of 97.92 and 97.99 are very close and probably do not warrant a
statistical comparison. To demonstrate how a statistical test for β₀₁ + β₁₁α₁ = β₀₂ + β₁₂α₂
can be performed, we carry it out.
The first method relies on reparameterizing the model. Let β₀ⱼ + β₁ⱼαⱼ = Tⱼ denote the
plateau for sampling depth j and notice that the model becomes

   E[Y_ij] = (Tⱼ + β₁ⱼ(NO₃ᵢⱼ − αⱼ))I{NO₃ᵢⱼ ≤ αⱼ} + TⱼI{NO₃ᵢⱼ > αⱼ}
           = β₁ⱼ(NO₃ᵢⱼ − αⱼ)I{NO₃ᵢⱼ ≤ αⱼ} + Tⱼ.

The intercepts β₀₁ and β₀₂ were eliminated from the model, which now contains T₁ and
T₂ = T₁ + ΔT as parameters. The SAS® statements
proc nlin data=blackmer method=marquardt noitprint;
parms b11=3.56 alp1=23.13 T1=97.91 b12=5.682 alp2=16.28 del_T=0;
T2 = T1 + del_T;
model30 = b11*(no3-alp1)*(no3 <= alp1) + T1;
model60 = b12*(no3-alp2)*(no3 <= alp2) + T2;
model ryp = model30*(depth=30) + model60*(depth=60);
run;
yield Output 5.17. The approximate 95% confidence interval for ΔT ([−1.683, 1.834]) con-
tains zero, and there is insufficient evidence at the 5% significance level to conclude that the
relative yield plateaus differ between the sampling depths. The second method of comparing
the plateau values relies on the capabilities of proc nlmixed to estimate linear and nonlinear
functions of the parameters. Statements along the following lines
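(a sketch, since the original statements are not reproduced here; the parameter names and fixed quantities follow Output 5.18)

proc nlmixed data=blackmer df=474;
   parameters b01=15 b11=3.5 alp1=23 del_b0=0 del_b1=0 del_alp=0;
   s2 = 61.6805;                    /* fixed residual variance from proc nlin */
   b02 = b01 + del_b0;  b12 = b11 + del_b1;  alp2 = alp1 + del_alp;
   mean30 = (b01 + b11*no3)*(no3 <= alp1) + (b01 + b11*alp1)*(no3 > alp1);
   mean60 = (b02 + b12*no3)*(no3 <= alp2) + (b02 + b12*alp2)*(no3 > alp2);
   model ryp ~ normal(mean30*(depth=30) + mean60*(depth=60), s2);
   estimate 'Difference in Plateaus' (b01 + b11*alp1) - (b02 + b12*alp2);
run;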
will do the trick (Output 5.18). Since the nlmixed procedure approximates a likelihood and
estimates all distribution parameters, it would otherwise iteratively determine the variance of
the Gaussian error distribution. To prevent this we fix the variance with the s2 = 61.6805;
statement. This is the residual mean square estimate obtained from fitting the full model in
proc nlin (see Output 5.16 or Output 5.17). Also, because proc nlmixed determines residual
degrees of freedom by a method different from proc nlin, we fix the residual degrees of
freedom with the df= option of the proc nlmixed statement.
Output 5.17.
The NLIN Procedure
Estimation Summary
Method Marquardt
Iterations 1
R 2.088E-6
PPC(alp1) 4.655E-7
RPC(del_T) 75324.68
Object 0.000106
Objective 29236.55
Observations Read 480
Observations Used 480
Observations Missing 0
Output 5.18.
Specifications
Data Set WORK.BLACKMER
Dependent Variable ryp
Distribution for Dependent Variable Normal
Optimization Technique Dual Quasi-Newton
Integration Method None
Dimensions
Observations Used 480
Observations Not Used 0
Total Observations 480
Parameters 6
Fit Statistics
-2 Log Likelihood 3334.7
AIC (smaller is better) 3346.7
AICC (smaller is better) 3346.9
BIC (smaller is better) 3371.8
Parameter Estimates
Standard
Parameter Estimate Error DF t Value Pr > |t| Lower Upper
b01 15.1943 2.8322 474 5.36 <.0001 9.6290 20.7595
b11 3.5760 0.1762 474 20.29 <.0001 3.2297 3.9223
alp1 23.1324 0.4848 474 47.71 <.0001 22.1797 24.0851
del_b0 -9.7424 4.2357 474 -2.30 0.0219 -18.0656 -1.4192
del_b1 2.1060 0.3203 474 6.57 <.0001 1.4766 2.7354
del_alp -6.8461 0.5691 474 -12.03 <.0001 -7.9643 -5.7278
Additional Estimates
Standard
Label Estimate Error DF t Value Pr > |t|
Difference in Plateaus -0.07532 0.8950 474 -0.08 0.9330
Figure 5.25 shows the predicted response functions for the 0 to 30 cm and 0 to 60 cm
sampling depths.
Figure 5.25. Predicted relative yield percent as a function of soil NO₃ (mg kg⁻¹) for the 30 cm
and 60 cm sampling depths.
where Y_ijklm denotes the dry weight percentage, b_i denotes the ith run (i = 1, 2), e_ij the jth
replicate within run i (j = 1, …, 4), α_k the effect of the kth herbicide (k = 1, 2), β_l the effect
of the lth size class (l = 1, …, 3), and γ_m the effect of the mth rate (m = 1, …, 6). One can
include additional interaction terms in model [5.60], but for expository purposes we will not
pursue this issue here. The analysis of variance table (Table 5.12, SAS® output not shown) is
produced in SAS® with the statements
proc glm data=VelvetFactorial;
class run rep herb size rate;
model drywtpct = run rep(run) herb size rate herb*size herb*rate
size*rate herb*size*rate;
run; quit;
The analysis of variance table shows significant Herbicide × Rate and Size × Rate inter-
actions (at the 5% level) and significant Rate and Size main effects.
By declaring the rate of application a factor in model [5.60], rates are essentially discret-
ized and the continuity of the rates of application is lost. Some information can be recovered by
testing for linear, quadratic, up to quintic trends of the dry weight percentages. Because of the
interactions of rate with the size and herbicide factors, great care should be exercised, since
these trends may differ for the two herbicides or the three size classes. Unequal spacing of the
rates of application is a further hindrance, since published tables of contrast coefficients re-
quire a balanced design with equal spacing of the levels of the quantitative factor. From
Figure 5.26 it is seen that the dose-response curves cannot be described by simple linear or
quadratic trends.
Figure 5.26. Herbicide × Size class (leaf stage) sample means as a function of application
rate (kg ai/ha) in the factorial velvetleaf dose-response experiment. Data kindly made available
by Dr. James J. Kells, Department of Crop and Soil Sciences, Michigan State University. Used
with permission.
Models [5.61] through [5.64] are fit in SAS® proc nlin with the following series of
statements. The full model [5.61] for the 2 × 3 factorial is fit first (Output 5.19). Size classes
are identified by the second subscript, corresponding to the 3-, 4-, and 5-leaf stages. For
example, beta_14 is the parameter for herbicide 1 (glufosinate) and size class 2 (4-leaf
stage).
title 'Full Model [5.61]';
proc nlin data=velvet noitprint;
   parameters beta_13=0.05 beta_14=0.05 beta_15=0.05
              beta_23=0.05 beta_24=0.05 beta_25=0.05
              gamma=-1.5;
   alpha = 100;
   /* model statement restored from the nlmixed code below; herb/size codings assumed */
   b = beta_13*(herb=1)*(size=3) + beta_14*(herb=1)*(size=4)
     + beta_15*(herb=1)*(size=5) + beta_23*(herb=2)*(size=3)
     + beta_24*(herb=2)*(size=4) + beta_25*(herb=2)*(size=5);
   model drywtpct = alpha*b*(rate**gamma)/(1 + b*(rate**gamma));
run;
Output 5.19.
The NLIN Procedure
Estimation Summary
Method Gauss-Newton
Iterations 15
Subiterations 1
Average Subiterations 0.066667
R 5.673E-6
PPC(beta_25) 0.000012
RPC(beta_25) 0.000027
Object 2.35E-10
Objective 2463.247
Observations Read 36
Observations Used 36
Observations Missing 0
The proc nlin statements to fit the completely reduced model [5.62] and the models
without herbicide ([5.63]) and size class effects ([5.64]) are as follows.
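The completely reduced model [5.62], for example, can be sketched as follows; the reduced models [5.63] and [5.64] are obtained analogously by letting β vary only by size class or only by herbicide.

title 'Reduced Model [5.62]';
proc nlin data=velvet noitprint;
   parameters beta=0.05 gamma=-1.5;
   alpha = 100;
   /* one common dose-response curve for all herbicide-size combinations */
   model drywtpct = alpha*beta*(rate**gamma)/(1 + beta*(rate**gamma));
run;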
Once the residual sums of squares and degrees of freedom for the models are obtained
(Table 5.13, output not shown), the sum of squares reduction tests can be carried out:

   H₀: no treatment effects:  F_obs = (3,176.0/5)/(2,463.2/29) = 7.478,  p = Pr(F₅,₂₉ ≥ 7.478) < 0.0001
   H₀: no herbicide effects:  F_obs = (247.9/3)/(2,463.2/29) = 0.973,  p = Pr(F₃,₂₉ ≥ 0.973) = 0.4188
   H₀: no size class effects: F_obs = (2,940.7/4)/(2,463.2/29) = 8.655,  p = Pr(F₄,₂₉ ≥ 8.655) < 0.0001
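These tests are conveniently computed in a short data step; the sums of squares are those of Table 5.13.

data sstests;
   msres  = 2463.2/29;                       /* full-model residual mean square */
   F_trt  = (3176.0/5)/msres;  p_trt  = 1 - probf(F_trt ,5,29);
   F_herb = ( 247.9/3)/msres;  p_herb = 1 - probf(F_herb,3,29);
   F_size = (2940.7/4)/msres;  p_size = 1 - probf(F_size,4,29);
run;
proc print data=sstests; run;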
Table 5.13. Residual sums of squares for full and various reduced models
The significant treatment effects appear to be due to a size effect alone, but comparing
models [5.61] through [5.64] is somewhat unsatisfactory. For example, model [5.63] of no
herbicide effects reduces the model degrees of freedom by three although there are only two
herbicides. Model [5.63] not only eliminates a Herbicide main effect, but also the
Herbicide × Size interaction, and the resulting model contains a Size Class main effect only.
A similar phenomenon can be observed with model [5.64]. The Herbicide × Size interaction,
for example, has two degrees of freedom. Write the cell means as

   μ_kl = μ + α_k + β_l + (αβ)_kl,

where the four terms in the sum correspond to the grand mean, factor A main effects, factor B
main effects, and A × B interactions, respectively. The absence of the main effects and
interactions in a linear model can be represented by complete sets of contrasts among the cell
means, as discussed in §4.3.3 (see also Schabenberger, Gregoire, and Kong 2000). With two
herbicide levels H₁ and H₂ and three size classes S₃, S₄, and S₅, the cell mean contrasts for
the respective effects are given in Table 5.14. Notice that the number of contrasts for each
effect equals the number of degrees of freedom for that effect.
Table 5.14. Contrasts for main effects and interactions in the unfolded 2 × 3 factorial design

Effect            Contrast    H₁S₃   H₁S₄   H₁S₅   H₂S₃   H₂S₄   H₂S₅
Herbicide Main    H             1      1      1     −1     −1     −1
Size Main         S1            1     −1      0      1     −1      0
                  S2            1      1     −2      1      1     −2
Herb × Size       (H × S1)      1     −1      0     −1      1      0
                  (H × S2)      1      1     −2     −1     −1      2
To test whether the Herbicide main effect is significant, we fit the full model and test
whether the linear combination

   μ₁₃ + μ₁₄ + μ₁₅ − μ₂₃ − μ₂₄ − μ₂₅

differs significantly from zero. This can be accomplished with the contrast statement of the
nlmixed procedure. As in the previous application we fix the residual degrees of freedom and
the error variance estimate to equal those for the full model obtained with proc nlin.
proc nlmixed data=velvet df=29;
   parameters beta_13=0.05 beta_14=0.05 beta_15=0.05
              beta_23=0.05 beta_24=0.05 beta_25=0.05
              gamma=-1.5;
   s2 = 84.9396;
   alpha = 100;
   mu_13 = alpha*beta_13*(rate**gamma)/(1+beta_13*(rate**gamma));
   mu_14 = alpha*beta_14*(rate**gamma)/(1+beta_14*(rate**gamma));
   mu_15 = alpha*beta_15*(rate**gamma)/(1+beta_15*(rate**gamma));
   mu_23 = alpha*beta_23*(rate**gamma)/(1+beta_23*(rate**gamma));
   mu_24 = alpha*beta_24*(rate**gamma)/(1+beta_24*(rate**gamma));
   mu_25 = alpha*beta_25*(rate**gamma)/(1+beta_25*(rate**gamma));
   /* model and contrast statements restored; herb/size codings assumed */
   mean = mu_13*(herb=1)*(size=3) + mu_14*(herb=1)*(size=4)
        + mu_15*(herb=1)*(size=5) + mu_23*(herb=2)*(size=3)
        + mu_24*(herb=2)*(size=4) + mu_25*(herb=2)*(size=5);
   model drywtpct ~ normal(mean,s2);
   contrast 'No Herbicide Effect'  mu_13+mu_14+mu_15-mu_23-mu_24-mu_25;
   contrast 'No Size Class Effect' mu_13-mu_14+mu_23-mu_24,
                                   mu_13+mu_14-2*mu_15+mu_23+mu_24-2*mu_25;
   contrast 'No Interaction'       mu_13-mu_14-mu_23+mu_24,
                                   mu_13+mu_14-2*mu_15-mu_23-mu_24+2*mu_25;
run; quit;
Output 5.20.
The NLMIXED Procedure
Specifications
Data Set WORK.VELVET
Dependent Variable drywtpct
Distribution for Dependent Variable Normal
Optimization Technique Dual Quasi-Newton
Integration Method None
Dimensions
Observations Used 36
Observations Not Used 0
Total Observations 36
Parameters 7
Fit Statistics
-2 Log Likelihood 255.1
AIC (smaller is better) 269.1
AICC (smaller is better) 273.1
BIC (smaller is better) 280.2
Parameter Estimates
Standard
Parameter Estimate Error DF t Value Pr > |t| Lower Upper
beta_13 0.01792 0.01066 29 1.68 0.1036 -0.00389 0.0397
beta_14 0.03290 0.01769 29 1.86 0.0731 -0.00329 0.0690
beta_15 0.09266 0.04007 29 2.31 0.0281 0.01071 0.1746
beta_23 0.01287 0.008162 29 1.58 0.1257 -0.00382 0.0295
beta_24 0.01947 0.01174 29 1.66 0.1079 -0.00453 0.0435
beta_25 0.06651 0.03578 29 1.86 0.0732 -0.00667 0.1397
gamma -1.2007 0.1586 29 -7.57 <.0001 -1.5251 -0.8763
Contrasts
Num Den
Label DF DF F Value Pr > F
No Herbicide Effect 1 29 1.93 0.1758
No Size Class Effect 2 29 4.24 0.0242
No Interaction 2 29 0.52 0.6010
Based on the Contrasts table in Output 5.20, we reject the hypothesis of no Size Class
effect but fail to reject the hypotheses of no interaction and of no Herbicide main effect at the 5%
level. Based on these results we could fit a model in which the β parameters vary only by
size class.
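A sketch of such a fit, with pairwise contrasts among the size classes added, follows; the size coding is an assumption on our part.

proc nlmixed data=velvet;
   parameters beta_3=0.05 beta_4=0.05 beta_5=0.05 gamma=-1.5 s2=85;
   alpha = 100;
   mu_3 = alpha*beta_3*(rate**gamma)/(1+beta_3*(rate**gamma));
   mu_4 = alpha*beta_4*(rate**gamma)/(1+beta_4*(rate**gamma));
   mu_5 = alpha*beta_5*(rate**gamma)/(1+beta_5*(rate**gamma));
   model drywtpct ~ normal(mu_3*(size=3) + mu_4*(size=4) + mu_5*(size=5), s2);
   contrast '3 vs 4 leaf stage' beta_3 - beta_4;
   contrast '3 vs 5 leaf stage' beta_3 - beta_5;
   contrast '4 vs 5 leaf stage' beta_4 - beta_5;
run;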
The Contrasts table added to the procedure output reveals that the differences between size
classes are significant at the 5% level, except for the difference between the 3- and 4-leaf stages.
Output 5.21.
Contrasts
Num Den
Label DF DF F Value Pr > F
where Y is the response to dosage x, in terms of λ_K, the dosage at which the response is K%
between the asymptotes δ and α. For example, λ₅₀ would be the dosage that achieves a resp-
onse halfway between the lower and upper asymptotes. The general formula for this reparam-
eterization is

   E[Y|x] = δ + (α − δ) / (1 + K/(100 − K) × exp{β ln(x/λ_K)}).
Although the model is popular in dose-response studies, it does not necessarily fit all data sets
of this type. It assumes, for example, that the trend between δ and α is sigmoidal and mono-
tonically increases or decreases. The frequent application of log-logistic models in herbicide
dose-response studies (see, e.g., Streibig 1980, Streibig 1981, Lærke and Streibig 1995,
Seefeldt et al. 1995, Hsiao et al. 1996, Sandral et al. 1997) tends to elevate the model in the
eyes of some to a law of nature to which all data must comply. Whether the relationship
between dose and response is best described by a linear, log-logistic, or other model must be
re-assessed for every application and every set of empirical data. Figure 5.27 shows the
sample mean relative growth percentages of barnyardgrass (Echinochloa crus-galli (L.) P.
Beauv.) treated with glufosinate [2-amino-4-(hydroxymethylphosphinyl) butanoic
acid] plus (NH₄)₂SO₄ (open circles) and glyphosate [isopropylamine salt of N-
(phosphonomethyl)glycine] plus (NH₄)₂SO₄ (closed circles). Growth is expressed relative to an
untreated control and the data points shown are sample means calculated across eight
replicate values at each concentration. A log-logistic model appears appropriate for the
glufosinate response but not for the glyphosate response, which exhibits an effect known as
hormesis. The term hormesis originates from the Greek for “setting into motion”; the
notion that every toxicant is a stimulant at low levels (Schulz 1888, Thimann 1956) is also
known as the Arndt-Schulz law.
Figure 5.27. Mean relative growth percentages for barnyard grass as a function of log
dosage. Open circles represent active ingredient glufosinate, closed circles glyphosate. Data
kindly provided by Dr. James J. Kells, Department of Crop and Soil Sciences, Michigan State
University. Used with permission.
where γ measures the initial rate of increase at low dosages. The Brain-Cousens model [5.65]
is a simple modification of the log-logistic model, and it is perhaps somewhat surprising that
adding a term γx in the numerator should do the trick. In this parameterization it is straightforward to
test the hypothesis of hormetic effects statistically. Fit the model and observe whether the
asymptotic confidence interval for γ includes 0. If the confidence interval fails to include 0, the
hypothesis of the absence of a hormetic effect is rejected.
But how can a dose-response model other than the log-logistic be modified if the re-
searcher anticipates hormetic effects or wishes to test for their presence? To construct hormetic
models we rely on the idea of combining mathematical switching functions. Schabenberger
and Birch (2001) proposed hormetic models constructed by this device, and the Brain-Cous-
ens model is a special case thereof. In process models for plant growth, switching mechanisms
are widely used (e.g., Thornley and Johnson 1990), for example, to switch on or off a
mathematical function or constant or to switch from one function to another. The switching
functions from which we build dose-response models are mathematical functions S(x) that
take values between 0 and 1 as dosage x varies. In the log-logistic model

   Y = δ + (α − δ) / (1 + exp{β ln(x/λ₅₀)}),

the term [1 + exp{β ln(x/λ₅₀)}]⁻¹ is a switch-off function for β > 0 (Figure 5.28) and a
switch-on function for β < 0.
With β > 0, δ is the lower and α the upper asymptote of the dose response. The role of the
switching function is to determine how the transition between the two extrema takes place.
This suggests the following technique to develop dose-response models. Let S(x, θ) be a
switch-off function and notice that R(x, θ) = 1 − S(x, θ) is a switch-on function. Denote the
minimum and maximum mean dose-response in the absence of any hormetic effects as μ_min
and μ_max. A nonhormetic dose-response model can then be written as
Y = μ_min + (μ_max − μ_min)S(x, θ).
Figure 5.28. Switch-off behavior of the log-logistic term [1 + exp{β ln(x/λ₅₀)}]⁻¹ for
λ₅₀ = 0.4, β = 4.
By choosing the switching function from a flexible family of mathematical models, the
nonhormetic dose response can take on many shapes, not necessarily sigmoidal and sym-
metric as implied by the log-logistic model. To identify possible switch-on functions one can
choose R(x, θ) as the cumulative distribution function (cdf) of a continuous random variable
with unimodal density. Choosing the cdf of a random variable uniformly distributed on the
interval (a, b) leads to the switch-off function

   S(x, θ) = b/(b − a) − x/(b − a)

and a linear interpolation between μ_min and μ_max. A probably more useful cdf that permits a
sigmoidal transition between μ_min and μ_max is that of a two-parameter Weibull random
variable, which, in a common parameterization, leads to the switch-off function (Figure 5.29)

   S(x, θ) = exp{−(x/λ)^κ}.
Gregoire and Schabenberger (1996b) use a switching function derived from an extreme value
distribution in the context of modeling the merchantable volume in a tree bole (see
application §8.4.1), namely

   S(x, θ) = exp{−αx e^(βx)}.

This model was derived by considering a switch-off function derived from the Gompertz
growth model (Seber and Wild 1989, p. 330),

   S(x, θ) = exp{−exp{−α(x − β)}},

which is sigmoidal with inflection point at x = β but not symmetric about the inflection
point. To model the transition in tree profile from neiloid to parabolic to cone-shaped seg-
ments, one can appeal to the switch-off function

   S(x, θ) = [1 + (x/α)^β]⁻¹,

which is derived from the family of growth models developed for modeling nutritional intake
by Morgan et al. (1975). For β > 1 this switching function has a point of inflection at
x = α{(β − 1)/(β + 1)}^(1/β) and is hyperbolic for β ≤ 1. Swinton and Lyford (1996) use the
model by Morgan et al. (1975) to test whether crop yield as a function of weed density takes
on a hyperbolic or sigmoidal shape. Figure 5.29 displays some of the sigmoidal switch-off
functions.
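The four functions in Figure 5.29 are easily evaluated in a data step; a sketch with parameter values of our own choosing, selected so that the inflection points fall at or near x = 0.5:

data switch;
   do x = 0.01 to 1 by 0.01;
      s_gompertz = exp(-exp(10*(x - 0.5)));     /* inflection at x = 0.5   */
      s_weibull  = exp(-(x/0.5723)**3);         /* inflection near x = 0.5 */
      s_morgan   = 1/(1 + (x/0.5681)**4);       /* inflection near x = 0.5 */
      s_logistic = 1/(1 + exp(10*(x - 0.5)));   /* inflection at x = 0.5   */
      output;
   end;
run;

Plotting these columns against x reproduces the general shapes shown in Figure 5.29.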
Figure 5.29. Some switch-off functions S(x, θ) discussed in the text (Gompertz, Weibull,
Morgan et al. 1975, and logistic). The functions were selected to have inflection points at
x = 0.5.
Figure 5.30. Hormetic and nonhormetic dose response, generated with switch-off functions
S₁(x, θ₁) and S₂(x, θ₂); the hormetic zone is indicated. LDS is the limiting dosage for
stimulation. DMS is the dosage of maximum stimulation.
Dose-response models without hormetic effect suggest monotonic changes in the response
with increasing or decreasing dosage. A hormetic effect is the deviation from this general
pattern and in the case of reduced response with increasing dose a beneficial effect is usually
observed at low dosages (Figure 5.30).
The method proposed by Schabenberger and Birch (2001) to incorporate hormetic be-
havior consists of combining a standard model without hormetic effect and a model for the
hormetic component. If S(x, θ) is a switch-off function and f(x, φ) is a monotonically
increasing function of dosage, then

   Y = μ_min + (μ_max − μ_min)S₁(x, θ₁) + f(x, φ)S₂(x, θ₂)             [5.67]

is a hormetic model. The switching functions S₁(·) and S₂(·) will often be of the same func-
tional form, but this is not necessary. One might, for example, combine a Weibull switching
function to model the dose-response trend without hormesis with a hormetic component
f(x, φ)S₂(x, θ₂) where S₂(x, θ₂) is a logistic switching function. The Brain-Cousens model
(Brain and Cousens 1989)

   Y = δ + (α − δ + γx) / (1 + θ exp{β ln(x)})

is a special case of [5.67] where S₁(x, θ) = S₂(x, θ) is a log-logistic switch-off function and
f(x, φ) = γx. When constructing hormetic models, f(x, φ) should be chosen so that
f(x, φ) = 0 for a known set of parameters. The absence of a hormetic effect can then be
tested. To prevent a beneficial effect at zero dose we would furthermore require that
f(0, φ) = 0. The hormetic model will exhibit a maximum for some dosage (the dosage of
maximum stimulation, Figure 5.30) if the equation

   ∂f(x, φ)/∂x × S₂(x, θ₂) + f(x, φ) × ∂S₂(x, θ₂)/∂x = 0

has a solution in x.
The limiting dose for stimulation and the dosage of maximum stimulation are only defined
for models with a hormetic zone (Figure 5.30). Dosages beyond this zone are interpreted in
the same fashion as for a nonhormetic model. This does not imply that the researcher can ig-
nore the presence of hormetic effects when only dosages beyond the hormetic zone (such as
LD₅₀) are of importance. Through simulation, Schabenberger and Birch (2001) demonstrate
the effects of ignoring hormesis. Biases of up to 13% in estimating λ₂₀, λ₅₀, and λ₇₅ were ob-
served when hormesis was not taken into account in modeling the growth response. The esti-
mate of the response at the limiting dose for stimulation also had a severe negative bias. Once
the model accounted for hormesis through the switching function mechanism, these biases
were drastically reduced.
In the remainder of this application we fit a log-logistic model to the barnyardgrass data
from which Figure 5.27 was created. Table 5.15 shows the dosages of the two active ingre-
dients and the growth percentages averaged across eight independent replications at each
dosage. The data not averaged across replications are analyzed in detail in Schabenberger,
Tharp, Kells, and Penner (1999).
To test whether either of the two responses is hormetic, we can fit the Brain-Cousens model

   E[Y_ij] = δ_j + (α_j − δ_j + γ_j x_ij) / (1 + θ_j exp{β_j ln(x_ij)})

to the combined data, where Y_ij denotes the response for ingredient j at dosage x_ij. The main
interest in this application is to compare the dosages that lead to 50% reduction in growth re-
sponse. The Brain-Cousens model is appealing in this regard since it allows fitting a hormetic
model to the glyphosate response and a standard log-logistic model to the glufosinate res-
ponse. It does not incorporate an effective dosage as a parameter in the parameterization
[5.65], however. Schabenberger et al. (1999) changed the parameterization of the Brain-
Cousens model to enable estimation of λ_K using the method of defining relationships dis-
cussed in §5.7.2. The model in which λ₅₀, for example, can be estimated whether the
response is hormetic (γ > 0) or not (γ = 0) is

   E[Y_ij] = δ_j + (α_j − δ_j + γ_j x_ij) / (1 + ω_j exp{β_j ln{x_ij/λ₅₀_j}}),
   ω_j = 1 + 2γ_j λ₅₀_j / (α_j − δ_j).                                  [5.68]

See Table 1 in Schabenberger et al. (1999) for other parameterizations that allow estimation
of general λ_K, LDS, and DMS in the Brain-Cousens model. The proc nlin statements to fit
model [5.68] are
proc nlin data=hormesis method=newton noitprint;
parameters alpha_glu=100 delta_glu=4 beta_glu=2.0 RD50_glu=0.2
alpha_gly=100 delta_gly=4 beta_gly=2.0 RD50_gly=0.2
gamma_glu=300 gamma_gly=300;
bounds gamma_glu > 0, gamma_gly > 0;
omega_glu = 1 + 2*gamma_glu*RD50_glu / (alpha_glu-delta_glu);
omega_gly = 1 + 2*gamma_gly*RD50_gly / (alpha_gly-delta_gly);
term_glu = 1 + omega_glu * exp(beta_glu*log(rate/RD50_glu));
term_gly = 1 + omega_gly * exp(beta_gly*log(rate/RD50_gly));
model barnyard =
(delta_glu + (alpha_glu - delta_glu + gamma_glu*rate) / term_glu ) *
(Tx = 'glufosinate') +
(delta_gly + (alpha_gly - delta_gly + gamma_gly*rate) / term_gly ) *
(Tx = 'glyphosate') ;
run;
The parameters are identified as *_glu for the glufosinate and *_gly for the glyphosate
response. The bounds statement ensures that proc nlin constrains the estimates of the hor-
mesis parameters to be positive. The omega_* statements calculate ω₁ and ω₂ and the term_*
statements the denominator of model [5.68]. The model statement uses logical variables to
choose between the mean functions for glufosinate and glyphosate.
At this stage we are focusing on the parameter estimates for γ₁ and γ₂, as they represent the
hormetic component. The parameters are coded as gamma_glu and gamma_gly (Output 5.22).
The asymptotic 95% confidence interval for γ₁ includes 0 ([−1776.5, 3909.6]), whereas that
for γ₂ does not ([306.0, 1086.8]). This confirms the hormetic effect for the glyphosate re-
sponse. The p-value for the test of H₀: γ₁ = 0 vs. H₁: γ₁ > 0 can be calculated with

data pvalue; p=1-ProbT(1066.6/1024.0,4); run; proc print data=pvalue; run;

and turns out to be p = 0.1782, sufficiently large to dismiss the notion of hormesis for the
glufosinate response. The p-value for H₀: γ₂ = 0 versus H₁: γ₂ > 0 is p = 0.0038 and ob-
tained similarly with the statements

data pvalue; p=1-ProbT(696.4/140.6,4); run; proc print data=pvalue; run;
Output 5.22.
The NLIN Procedure
Estimation Summary
Method Newton
Iterations 11
Subiterations 7
Average Subiterations 0.636364
R 6.988E-7
PPC(beta_glu) 4.248E-8
RPC(gamma_glu) 0.00064
Object 1.281E-7
Objective 81.21472
Observations Read 14
Observations Used 14
Observations Missing 0
The model we focus on to compare λ₅₀ values between the two herbicides thus has a hor-
metic component for glyphosate (j = 2), but not for glufosinate (j = 1). Fitting these two
response functions jointly, we express each glyphosate parameter as the corresponding
glufosinate parameter plus a difference, e.g., alpha_gly = alpha_glu + alpha_dif.
The advantage of this coding method is that parameters for which proc nlin gives estimates,
estimated asymptotic standard errors, and asymptotic confidence intervals express differen-
ces between the two herbicides. Output 5.23 was generated from the following statements.
proc nlin data=hormesis method=newton noitprint;
   parameters alpha_glu=100 delta_glu=4 beta_glu=2.0 RD50_glu=0.2
              alpha_dif=0 delta_dif=0 beta_dif=0 RD50_dif=0
              gamma_gly=300;
   bounds gamma_gly > 0;
   /* glyphosate parameters reconstructed from the differences */
   alpha_gly = alpha_glu + alpha_dif;  delta_gly = delta_glu + delta_dif;
   beta_gly  = beta_glu  + beta_dif;   RD50_gly  = RD50_glu  + RD50_dif;
   omega_gly = 1 + 2*gamma_gly*RD50_gly / (alpha_gly-delta_gly);
   term_glu  = 1 + exp(beta_glu*log(rate/RD50_glu));
   term_gly  = 1 + omega_gly * exp(beta_gly*log(rate/RD50_gly));
   model barnyard =
   (delta_glu + (alpha_glu - delta_glu) / term_glu) * (Tx = 'glufosinate') +
   (delta_gly + (alpha_gly - delta_gly + gamma_gly*rate) / term_gly) *
                                          (Tx = 'glyphosate') ;
run;
Notice that the line

alpha_gly=100 delta_gly=4 beta_gly=2.0 RD50_gly=0.2

no longer appears in the parameters statement. The glyphosate parameters are instead
reconstructed below the bounds statement. A side effect of this coding method is the ability
to choose zeros as starting values for the difference parameters, assuming initially that the
two treatments produce the same response.
Output 5.23.
The NLIN Procedure
Estimation Summary
Method Newton
Iterations 11
Subiterations 10
Average Subiterations 0.909091
R 1.452E-8
PPC 7.24E-9
RPC(alpha_dif) 0.000055
Object 9.574E-9
Objective 125.6643
Observations Read 14
Observations Used 14
Observations Missing 0
The difference in λ₅₀ values between the two herbicides is positive (Output 5.23). The
λ₅₀ estimate for glufosinate is 0.1229 kg ae/ha and that for glyphosate is 0.1229 + 0.0174 =
0.1403 kg ae/ha. The difference is not statistically significant at the 5% level, since the
asymptotic 95% confidence interval for the difference includes zero ([−0.0216, 0.0564]).
Predicted values for the two herbicides are shown in Figure 5.31. The hormetic effect for
glyphosate is very pronounced. The negative estimate δ̂₁ = −4.0083 suggests that the lower
asymptote of relative growth percentages is negative, which is not very meaningful. Fortu-
nately, the growth responses do not achieve that lower asymptote across the range of dosages
observed, which defuses this issue.
Figure 5.31. Predicted responses for glufosinate and glyphosate for the barnyardgrass data. Esti-
mated λ₅₀ values of 0.1229 kg ae/ha and 0.1403 kg ae/ha are also shown.
What will happen if the hormetic effect for the glyphosate response is ignored, that
is, if we fit the log-logistic model

   E[Y_ij] = δ_j + (α_j − δ_j) / (1 + exp{β_j ln{x_ij/λ₅₀_j}})?

The solid line in Figure 5.31 will be forced to decrease monotonically like the dashed line.
Because of the solid circles in excess of 100%, the glyphosate model will attempt to stay ele-
vated for as long as possible and then decline sharply toward the solid circles on the right
(Figure 5.32). The estimate of β₂ will be large and have very low precision. Also, the residual
sum of squares should increase dramatically.
All of these effects are apparent in Output 5.24, which was generated by the statements

proc nlin data=hormesis method=newton noitprint;
   parameters alpha_glu=100 delta_glu=4 beta_glu=2.0 RD50_glu=0.122
              alpha_dif=0 delta_dif=0 beta_dif=0 RD50_dif=0;
   /* glyphosate parameters and model statement restored as above */
   alpha_gly = alpha_glu + alpha_dif;  delta_gly = delta_glu + delta_dif;
   beta_gly  = beta_glu  + beta_dif;   RD50_gly  = RD50_glu  + RD50_dif;
   term_glu = 1 + exp(beta_glu*log(rate/RD50_glu));
   term_gly = 1 + exp(beta_gly*log(rate/RD50_gly));
   model barnyard =
   (delta_glu + (alpha_glu - delta_glu)/term_glu)*(Tx = 'glufosinate') +
   (delta_gly + (alpha_gly - delta_gly)/term_gly)*(Tx = 'glyphosate');
run;
Output 5.24.
Estimation Summary
Method Newton
Iterations 77
Subiterations 16
Average Subiterations 0.207792
R 9.269E-6
PPC(beta_dif) 0.002645
RPC(beta_dif) 0.005088
Object 8.27E-11
Objective 836.5044
Observations Read 14
Observations Used 14
Observations Missing 0
Figure 5.32. Log-logistic model fits, ignoring the hormetic effect, for glufosinate (open
circles) and glyphosate (closed circles); relative growth % against ln(dose) (ln kg ae/ha).
As anticipated, the resulting estimate of β₂ is nonsensical. Notice, however, that the fit of
the model to the glufosinate data has not changed (compare to Output 5.23, and compare
Figures 5.31 and 5.32).
In either case the yield per plant is a convex function of plant density. Several parameters
are of particular interest in the study of yield-density models. If Y(x) tends to a constant λ_s as
density tends toward 0, λ_s reflects the species' potential in the absence of competition from
other plants. Similarly, if U(x) tends to a constant λ_e as density increases toward infinity, λ_e
measures the species' potential under increased competition for environmental resources.
Ratkowsky (1983, p. 50) terms λ_s the genetic potential and λ_e the environmental potential of
the species. For asymptotic relationships agronomists are often interested not only in λ_s and λ_e
but also in the density that produces a certain percentage of the asymptotic yield. In Figure
5.33(a), for example, x₀.₈ marks the density at which 80% of the asymptotic yield per unit
area is attained.
Figure 5.33. Asymptotic (a) and parabolic (b) yield-density relationships. U(x) denotes yield
per unit area at density x, Y(x) denotes yield per plant at density x. Both models are based on
the Bleasdale-Nelder model U(x) = x(α + βx)^(−1/θ) discussed in the text and in §A5.9.1.
The most basic nonlinear yield-density model is the reciprocal simple linear regression

   E[Y] = (α + βx)⁻¹

due to Shinozaki and Kira (1956). Its area yield function E[U] = x(α + βx)⁻¹ is strictly
asymptotic with genetic potential λ_s = 1/α and asymptotic yield per unit area U(∞) = 1/β.
In applications one may not want to restrict modeling efforts from the outset to asymptotic
relationships and instead employ a model which allows both asymptotic and parabolic relationships,
depending on parameter values. A simple extension of the Shinozaki-Kira model is known as
the Bleasdale-Nelder model (Bleasdale and Nelder 1960),

   E[Y] = (α + βx)^(−1/θ).
This model is extensively discussed in Mead (1970), Gillis and Ratkowsky (1978), Mead
(1979), and Ratkowsky (1983). A more general form with four parameters that was originally
proposed is discussed in §A5.9.1. For θ = 1 the model is asymptotic and for θ < 1 parabolic
(Figure 5.33). A one-sided statistical test of H₀: θ = 1 vs. H₁: θ < 1 allows testing for
asymptotic vs. parabolic structure of the relationship between U and x. Because of its
biological relevance, such a test should always be performed when the Bleasdale-Nelder
model is fit. In the parabolic case (θ < 1) the model attains a maximum yield per unit area of

   U_max = (θ/β)((1 − θ)/α)^((1−θ)/θ).

An alternative is the reciprocal quadratic model

   E[Y] = (α + βx + γx²)⁻¹

due to Holliday (1960), which also allows parabolic and asymptotic relationships and whose
parameter estimators show close-to-linear behavior.
When fitting yield-density models, care should be exercised because the variance of the
plant yield Y typically increases with the yield. Mead (1970) states that the assumption

   Var[Y] = σ²E[Y]²

is often tenable. Under these circumstances the logarithm of Y has approximately constant
variance. If f(x, θ) is the yield-density model, one approach to estimation of the parameters θ
is to fit

   E[ln{Y}] = ln{f(x, θ)},

assuming that the errors of this model are zero-mean Gaussian random variables with constant
variance σ². For the Bleasdale-Nelder model this leads to

   E[ln{Y}] = ln{(α + βx)^(−1/θ)} = −(1/θ)ln{α + βx}.

Alternatively, one can fit the model Y = f(x, θ) assuming that Y follows a distribution with
variance proportional to E[Y]². The family of Gamma distributions has this property, for
example. Here we will use the logarithmic transformation and revisit the fitting of yield-
density models in §6.7.3 under the assumption of Gamma-distributed yields.
The data in Table 5.16 and Figure 5.34 represent yields per plant of three onion varieties
grown at varying densities. There were three replicates of each density; the data values repre-
sent their averages. An exploratory graph of 1/Y versus density shows that a linear relation-
ship is not unreasonable, confirmed by a loess smooth of 1/Y vs. x (Figure 5.34).
Figure 5.34. Relationships between inverse plant yield and plant densities for three onion
varieties. Dashed line is a nonparametric loess fit. Data from Mead (1970).
The relationship between U and density is likely of an asymptotic nature for any of the
three varieties. The analysis commences with a fit of the full model

   ln{Y_ij} = −(1/θ_i)ln{α_i + β_i x_ij} + e_ij,

where the subscript i = 1, …, 3 denotes the varieties and x_ij is the jth density (j = 1, …, 10)
at which the yield of variety i was observed. The following hypotheses are to be addressed
subsequently: whether the relationships are asymptotic (θ₁ = θ₂ = θ₃ = 1), whether the
varieties share a common genetic potential (α₁ = α₂ = α₃), and whether the β_i are invariant
across varieties (β₁ = β₂ = β₃).
Prior to tackling these hypotheses the full model should be tested against the most
reduced model

   ln{Y_ij} = −(1/θ)ln{α + βx_ij} + e_ij,

or even

   ln{Y_ij} = −ln{α + βx_ij} + e_ij,

to determine whether there are any differences among the three varieties. In addition we are
also interested in obtaining confidence intervals for the genetic potential λ_s, for the density
x_max should the relationship be parabolic, and for the density x₀.₉ and the asymptote U(∞)
should the relationship be asymptotic. In order to obtain these intervals one could proceed by
reparameterizing the model such that λ_s, x_max, x₀.₉, and U(∞) are parameters and refit the
resulting model(s). Should the relationship be parabolic, this proves difficult since λ_s, for
example, is a function of both α and θ. Instead we fit the final model with the nlmixed
procedure of The SAS® System, which permits the estimation of arbitrary functions of the model
parameters and calculates the standard errors of the estimated functions by the delta method.
The full model is fit with the SAS® statements
proc nlin data=onions method=marquardt;
parameters a1=5.4 a2=5.4 a3=5.4
b1=1.7 b2=1.7 b3=1.7
t1=1 t2=1 t3=1;
term1 = (-1/t1)*log(a1 + b1*density);
term2 = (-1/t2)*log(a2 + b2*density);
term3 = (-1/t3)*log(a3 + b3*density);
model logyield = term1*(variety=1)+term2*(variety=2)+term3*(variety=3);
run;
Starting values for the α_i and β_i were obtained by assuming θ₁ = θ₂ = θ₃ = 1 and fitting
the inverse relationship 1/(Y/1000) = α + βx with a simple linear regression package. For
scaling purposes, the plant yield was expressed in kilograms rather than in grams as in Table
5.16. The full model achieves a residual sum of squares of S(θ̂) = 0.1004 on 21 degrees of
freedom. The asymptotic model ln{Y_ij} = −ln{α + βx_ij} + e_ij with common potentials
has a residual sum of squares of 0.2140 on 28 degrees of freedom. The initial test for determ-
ining whether there are any differences in yield-density response among the three varieties
rejects its null hypothesis:

   F_obs = ((0.2140 − 0.1004)/7) / (0.1004/21) = 3.394,   p = 0.0139.
The model restricted under H₀: θ₁ = θ₂ = θ₃ = 1 is fit with the statements
proc nlin data=onions method=marquardt;
parameters a1=5.4 a2=5.4 a3=5.4 b1=1.7 b2=1.7 b3=1.7;
term1 = -log(a1 + b1*density); term2 = -log(a2 + b2*density);
term3 = -log(a3 + b3*density);
model logyield = term1*(variety=1)+term2*(variety=2)+term3*(variety=3);
run;
and achieves S(θ̂)_H₀ = 0.1350 on 24 degrees of freedom. The F-test has test statistic
F_obs = (0.1350 − 0.1004)/(3 × 0.1004/21) = 2.412 and p-value p = Pr(F₃,₂₁ ≥ 2.412) = 0.095. At
the 5% significance level H₀ cannot be rejected and the model

   ln{Y_ij} = −ln{α_i + β_i x_ij} + e_ij

is used as the full model henceforth. Varying β and fixing α at a common value for the varie-
ties is accomplished with the statements
proc nlin data=onions method=marquardt;
parameters a=4.5 b1=1.65 b2=1.77 b3=1.90;
term1 = -log(a + b1*density); term2 = -log(a + b2*density);
term3 = -log(a + b3*density);
model logyield = term1*(variety=1)+term2*(variety=2)+term3*(variety=3);
run;
This model has S(θ̂)_H₀ = 0.1519 with 26 residual degrees of freedom. The test for H₀:
α₁ = α₂ = α₃ leads to F_obs = (0.1519 − 0.1350)/(2 × 0.1350/24) = 1.50 (p = 0.243). It is
reasonable to assume that the varieties share a common genetic potential. Similarly, the invar-
iance of the β_i can be tested with
proc nlin data=onions method=marquardt;
parameters a1=5.4 a2=5.4 a3=5.4 b=1.7;
term1 = -log(a1 + b*density); term2 = -log(a2 + b*density);
term3 = -log(a3 + b*density);
model logyield = term1*(variety=1)+term2*(variety=2)+term3*(variety=3);
run;
This model achieves S(θ̂)_H₀ = 0.1843 with 26 residual degrees of freedom. The test for
H₀: β₁ = β₂ = β₃ leads to F_obs = (0.1843 − 0.1350)/(2 × 0.1350/24) = 4.38 (p = 0.024).
The notion of invariant β parameters is rejected.
We are now in a position to settle on a final model for the onion yield-density data,

   ln{Y_ij} = −ln{α + β_i x_ij} + e_ij.                              [5.70]

Had we started with the Holliday model instead of the Bleasdale-Nelder model, the initial test
for an asymptotic relationship would have yielded F_obs = 1.575 (p = 0.225) and the final
model would have been the same.
We are now interested in calculating confidence intervals for the common genetic
potential (1/α), the variety-specific environmental potentials (1/β_i), and the density x₀.₉
that produces 90% of the asymptotic yield.
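Since U(x) = x/(α + β_i x) and U(∞) = 1/β_i, the density producing 90% of the asymptote solves x/(α + β_i x) = 0.9/β_i, i.e., x₀.₉ = 9α/β_i. A sketch of the proc nlmixed statements follows; the starting values and labels are our own, and the x_09 expressions reproduce the Additional Estimates of Output 5.25.

proc nlmixed data=onions df=26;
   parameters a=4.5 b1=1.65 b2=1.77 b3=1.90 s2=0.005;
   term1 = -log(a + b1*density);  term2 = -log(a + b2*density);
   term3 = -log(a + b3*density);
   mean  = term1*(variety=1) + term2*(variety=2) + term3*(variety=3);
   model logyield ~ normal(mean, s2);
   estimate 'genetic potential' 1000/a;        /* in grams               */
   estimate 'U(inf) (1)' 1000/b1;              /* asymptotes, grams ft-2 */
   estimate 'U(inf) (2)' 1000/b2;
   estimate 'U(inf) (3)' 1000/b3;
   estimate 'x_09 (1)' 9*a/b1;                 /* densities for 90% of asymptote */
   estimate 'x_09 (2)' 9*a/b2;
   estimate 'x_09 (3)' 9*a/b3;
run;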
The final parameter estimates are α̂ = 4.5364, β̂₁ = 1.6611, β̂₂ = 1.7866, and
β̂₃ = 1.9175 (Output 5.25). The predicted yield per area at density x (Figure 5.35) is thus
calculated (in grams ft⁻²) as

   Variety 1: Û = 1000 × x(4.5364 + 1.6611x)⁻¹
   Variety 2: Û = 1000 × x(4.5364 + 1.7866x)⁻¹
   Variety 3: Û = 1000 × x(4.5364 + 1.9175x)⁻¹.
Specifications
Data Set WORK.ONIONS
Dependent Variable logyield
Distribution for Dependent Variable Normal
Optimization Technique Dual Quasi-Newton
Integration Method None
Iteration History
Parameter Estimates
Standard
Parameter Estimate Error DF t Value Pr > |t| Lower Upper
a 4.5364 0.3467 26 13.08 <.0001 3.8237 5.2492
b1 1.6611 0.06212 26 26.74 <.0001 1.5334 1.7888
b2 1.7866 0.07278 26 24.55 <.0001 1.6370 1.9362
b3 1.9175 0.07115 26 26.95 <.0001 1.7713 2.0637
Additional Estimates
Standard t
Label Estimate Error DF Value Pr > |t| Lower Upper
x_09 (1) 24.57 2.5076 26 9.80 <.0001 19.4246 29.733
x_09 (2) 22.85 2.4217 26 9.44 <.0001 17.8741 27.829
x_09 (3) 21.29 2.1638 26 9.84 <.0001 16.8447 25.740
The genetic potential is estimated as 220.44 grams ft^{-2} with asymptotic 95% confi-
dence interval [185.8, 255.1]. The estimated yield-per-unit-area asymptotes for varieties 1, 2,
and 3 are 602.01, 559.71, and 521.51 grams ft^{-2}, respectively. Only varieties 1 and 3 differ
significantly in this parameter (p = 0.0032) and in the density that produces 90% of the
asymptotic yield (p = 0.0047).
[Figure: yield per area (g ft^{-2}) plotted against density x; the densities x_0.9 are indicated]
Figure 5.35. Observed and predicted values for onion yield-density data.
the errors e_i are uncorrelated with variance proportional to a power of the mean,

Var[e_i] = σ² f(x_i, θ)^λ = σ² E[Y_i]^λ.

Consequently, e_i* = e_i f(x_i, θ)^{-λ/2} will have constant variance σ², and a weighted least
squares approach would obtain updates of the parameter estimates as

θ̂_{u+1} = θ̂_u + (F̂'W^{-1}F̂)^{-1} F̂'W^{-1}(y − f(x, θ̂)),

where W is a diagonal matrix of the weights f(x_i, θ)^λ. Since the weights depend on θ they
should be updated whenever the parameter vector is updated, i.e., at every iteration. This
problem can be circumvented when the variance of the model errors is not proportional to a
power of the mean, but proportional to some other function g(x_i) that does not depend on the
parameters of the model. In this case e_i* = e_i/√g(x_i) will be the variance stabilizing trans-
form for the errors. But how can we find this function g(x_i)? One approach is trial and
error: try different weight functions, examine the weighted nonlinear residuals, and settle on
the weight function that stabilizes the residual variation. The approach we demonstrate here is
also an ad hoc procedure, but it makes fuller use of the data.
Before going into further details, we take a look at the data and the model we are trying
to fit. The Richards curve (Richards 1959) is a popular model for depicting plant growth
owing to its flexibility and the simple interpretation of its parameters. It is not known,
however, for the excellent statistical properties of its parameter estimates. The Richards
model — also known as the Chapman-Richards model (Chapman 1961) — exists in a variety
of parameterizations, for example,

Y_i = α(1 − e^{βx_i})^γ + e_i.  [5.71]

Here, α is the maximum growth achievable (upper asymptote), β is the rate of growth, and
the parameter γ determines the shape of the curve near the origin. For γ > 1 the shape is sig-
moidal. The covariate x in the Richards model is often the age of an organism, or a measure
of its size. Values for γ in the neighborhood of 1.0 are common, as are values for β of
approximately −0.5.
[Figure: tree height (m) plotted against age (years)]
Figure 5.36. Height in meters of 100 Sitka spruce trees as a function of tree age in years.
Data generated according to discussion in Rennolls (1993).
Inspired by results in Rennolls (1993), the simulated heights in meters of 100 Sitka
spruces (Picea sitchensis (Bong.) Carr.) are plotted as a function of their age in years in Figure
5.36.
[Figure: sample standard deviations for groups of six observations plotted against √age]
Figure 5.37. Sample standard deviations for groups of six observations against the square
root of tree age.
To obtain a weighted nonlinear least squares analysis in SAS® , we call upon the
_weight_ variable in proc nlin. The _weight_ variable can refer to a variable in the data set
or to a valid SAS® expression. SAS® calculates it for each observation and assigns the recip-
rocal values as diagonal elements of the W matrix. The statements (Output 5.26) to model the
variances of an observation as a multiple of the root ages are
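A minimal sketch of such a program follows; the data set name (spruce), the variable names
height and age, and the starting values are our assumptions. Output 5.26 confirms a
Gauss-Newton fit with a parameter named gamma.

proc nlin data=spruce method=gauss;
   parameters alpha=50 beta=-0.5 gamma=1;
   /* Richards growth model [5.71] for tree height as a function of age */
   model height = alpha*(1 - exp(beta*age))**gamma;
   /* variance of an observation modeled as a multiple of sqrt(age);
      _weight_ holds the reciprocal of that variance function */
   _weight_ = 1/sqrt(age);
run;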
Output 5.26.
The NLIN Procedure
Estimation Summary
Method Gauss-Newton
Iterations 5
R 6.582E-7
PPC(gamma) 5.639E-7
RPC(gamma) 0.000015
Object 6.91E-10
Objective 211.0326
Observations Read 100
Observations Used 100
Observations Missing 0
Figure 5.38 displays the residuals from the ordinary nonlinear least squares analyses [(a)
and (c)] and the weighted residuals from the weighted analysis (b). The heterogeneity of the
variances in the plot of unweighted residuals is apparent, whereas the weighted residuals show
a homogeneous band, as residuals of a proper model should. The squared ordinary residuals
plotted against the regressor (panel c) do not make it easy to discern how the error
variance relates to the regressor. Because of the tightness of the observations for young trees
(Figure 5.36) many residuals are close to zero, and a few large residuals for older trees over-
power the plot. It is because we find this plot of squared residuals hard to interpret for these
data that we prefer the grouping approach in this application.
What has been gained by applying weighted nonlinear least squares instead of ordinary
nonlinear least squares? It turns out that the parameter estimates differ little between the two
analyses and the predicted trends will be almost indistinguishable from each other. The
culprit, as so often is the case, is the estimation of the precision of the coefficients and the
precision of the predictions. The variance of an observation around the mean trend is assumed
to be constant in ordinary nonlinear least squares. From Figure 5.36 it is obvious that this
variability is small for young trees and grows with tree age. The estimate of the common var-
iance will then be too small for older trees and too large for younger trees. This effect be-
comes apparent when we calculate prediction (or confidence) intervals for the weighted and
unweighted analyses (Figure 5.39). The 95% prediction intervals are narrower for young trees
than the intervals from the unweighted analysis, since they take into account the actual,
smaller variability among the young trees.
Figure 5.38. Residuals from an ordinary unweighted nonlinear least squares analysis (a), a
weighted nonlinear least squares analysis (b), and the square of the ordinary residuals plotted
against the regressor (c).
Figure 5.39. Predictions of tree height in weighted nonlinear least squares analysis (solid line
in center of point cloud). Upper and lower solid lines are 95% prediction intervals in the
weighted analysis; dashed lines are 95% prediction intervals from the ordinary nonlinear least
squares analysis.
“The objection is primarily that the theory of errors assumes that errors
are purely random, i.e., 1. errors of all magnitudes are possible; 2.
smaller errors are more likely to occur than larger ones; 3. positive and
negative errors of equal absolute value are equally likely. The theory of
Rodewald makes insufficient accommodation of this theory; only two
errors are possible, in particular: if O is the germination percentage of
seeds, the errors are −O and 1 − O.” J. C. Kapteyn, objecting to
Rodewald's discussion of seed germination counts in terms of binomial
probabilities. In Rodewald, H., Zur Methodik der Keimprüfungen, Die
Landwirtschaftlichen Versuchs-Stationen, vol. 49, p. 260. (Translated
from the German by the first author.)
6 Generalized Linear Models
6.1 Introduction
6.2 Components of a Generalized Linear Model
6.2.1 Random Component
6.2.2 Systematic Component and Link Function
6.2.3 Generalized Linear Models in The SAS® System
6.3 Grouped and Ungrouped Data
6.4 Parameter Estimation and Inference
6.4.1 Solving the Likelihood Problem
6.4.2 Testing Hypotheses about Parameters and Their Functions
6.4.3 Deviance and Pearson's X² Statistic
6.4.4 Testing Hypotheses through Deviance Partitioning
6.4.5 Generalized R² Measures of Goodness-of-Fit
6.5 Modeling an Ordinal Response
6.5.1 Cumulative Link Models
6.5.2 Software Implementation and Example
• Each GLM has three components: the link function, the linear predictor,
and the random component.
• The Gaussian linear regression and analysis of variance models are special
cases of generalized linear models.
In the preceding chapters we explored statistical models where the response variable is
continuous. Although these models cover a wide range of situations they do not suffice for
many data in the plant and soil sciences. For example, the response may not be a continuous
variable, but a count or a frequency. The distribution of the errors may have a mean of zero,
but may be far from a Gaussian distribution. These data/model breakdowns can be addressed
by relying on asymptotic or approximate results, by transforming the data, or by using models
specifically designed for the particular response distribution. Poisson-distributed counts, for
example, can be approximated by Gaussian random variables if the average count is
sufficiently large. In binomial experiments consisting of independent and identical binary ran-
dom variables the Gaussian approximation can be invoked provided the product of sample
size and the smaller of the success or failure probability is sufficiently large (≥ 5). When such
approximations allow discrete responses to be treated as Gaussian, the temptation to invoke
standard analysis of variance or regression analysis is understandable. The analyst must keep
in mind, however, that other assumptions may still be violated. Since for Poisson random
variables the mean equals the variance, treatments where counts are large on average will also
have large variability compared to treatments where counts are small on average. The homo-
scedasticity assumption in an experiment with count responses is likely to be violated even if
a Gaussian approximation to the response distribution holds. When Gaussian approximations
fail, transformations of the data can achieve greater symmetry, remove variance heteroge-
neity, and create a scale on which effects can be modeled additively (Table 6.1). Transform-
ing the data is not without problems, however. The transformation that establishes symmetry
may not be the one that homogenizes the variances. Results of statistical analyses are to be
interpreted on the transformed scale, which may not be the most meaningful. The square root
of weed counts or the arcsine of the proportion of infected plants is not a natural metric for
interpretation.
If the probability distribution of the response is known, one should not attempt to force
the statistical analysis into a Gaussian framework if tools are available that were specifically
designed for that distribution. Generalized Linear Models (GLMs) extend linear statistical
modeling to response distributions that belong to a broad family of distributions, known as the
exponential family. It contains the Bernoulli, Binomial, Poisson, Negative Binomial, Gamma,
Gaussian, Beta, Weibull, and other distributions. GLM theory is based on work by Nelder and
Wedderburn (1972) and Wedderburn (1974). It was subsequently popularized in the
monograph by McCullagh and Nelder (1989). GLMs combine elements from linear and
nonlinear models and we caution the reader at the outset not to confuse the acronym GLM with the glm
procedure of The SAS® System. The glm procedure fits linear models and conducts inference
assuming Gaussian errors, a very special case of a generalized linear model. The SAS®
acronym stands for General Linear Model, the generality being that it can fit regression,
analysis of variance, and analysis of covariance models by unweighted or weighted least
squares, not that it can fit generalized linear models. Some of the procedures in The SAS®
System that can fit generalized linear models are proc genmod, proc logistic, proc probit,
proc nlmixed, and proc catmod (see §6.2.3).
Linear models are not well-suited for modeling the effects of experimental factors and
covariates on discrete outcomes for a number of reasons. The following example highlights
some of the problems encountered when a linear model is applied to a binary response.
where π_i denotes the probability that the ith well is contaminated above the threshold level.
The expected value of Y_i is easily found as

E[Y_i] = Σ_y y·Pr(Y_i = y) = 1·Pr(Y_i = 1) + 0·Pr(Y_i = 0) = Pr(Y_i = 1) = π_i.
Figure 6.1. Simulated data for well contamination example. The straight line is obtained by
fitting the linear regression model Y_i = β_0 + β_1 x_i; the sigmoidal line stems from a logistic
regression model, a generalized linear model.
Generalized linear models inherit from linear models a linear combination of covariates
and parameters, x_i'β, termed the linear predictor. This additive systematic part of the model
is not an expression for the mean response E[Y_i], but for a transformation of E[Y_i]. This trans-
formation, g(E[Y_i]), is called the link function of the generalized linear model. It maps the
average response onto a scale where covariate effects are additive, and it ensures range restric-
tions.
Every distribution in the exponential family suggests a particular link function known as
the canonical link function, but the user is at liberty to pair any suitable link function with any
distribution in the exponential family. The three components of a GLM, linear predictor, link
function, and random component are discussed in §6.2. This section also introduces various
procedures of The SAS® System that are capable of fitting generalized linear models to data.
How to estimate the parameters of a generalized linear model and how to perform statistical
inference is the focus in §6.4. Because of the importance of multinomial, in particular ordinal,
responses in agronomy, a separate section is devoted to cumulative link models for ordered
outcomes (§6.5).
With few exceptions, distributions in the exponential family have functionally related
moments. For example, the mean and variance of a Binomial random variable are E[Y] = nπ
and Var[Y] = nπ(1 − π) = E[Y](1 − π), where n is the binomial sample size and π is the
success probability. If Y has a Poisson distribution then E[Y] = Var[Y]. Means and variances
cannot be determined independently as is the case for Gaussian data. The modeler of discrete
data often encounters situations where the data appear more dispersed than is permissible for
a particular distribution. This overdispersion problem is addressed in §6.6.
Applications of generalized linear models to problems that arise in the plant and soil
sciences follow in §6.7. Mathematical details can be found in Appendix A on the CD-ROM
as §A6.8.
The models covered in this chapter extend the previously discussed statistical models to
non-Gaussian distributions. We assume, however, that data are uncorrelated. The case of
correlated, non-Gaussian data in the clustered data setting is discussed in §8 and in the spatial
setting in §9.
• The random component is the distribution of the response chosen from the
exponential family.
for some functions b(•) and c(•). The parameter θ is called the natural parameter and φ is a
dispersion (scale) parameter. Some important members of the exponential family are shown
in Table 6.2.
Table 6.2. Important distributions in the exponential family (The Bernoulli, Binomial,
Negative Binomial, and Poisson are discrete distributions)
Binomial, B(n, π):  b(θ) = n ln(1 + e^θ);  μ = b'(θ) = n e^θ/(1 + e^θ) = nπ;
h(μ) = nπ(1 − π);  θ(μ) = ln{π/(1 − π)}
The function b(θ) is important because it relates the natural parameter to the mean and
variance of Y. We have E[Y] = μ = b'(θ) and Var[Y] = b''(θ)φ, where b'(θ) and b''(θ) are
the first and second derivatives of b(θ), respectively. The second derivative b''(θ) is also
termed the variance function of the distribution. When the variance function is expressed in
terms of the mean μ, instead of the natural parameter θ, it is denoted h(μ). Hence,
Var[Y] = h(μ)φ (Table 6.2). The variance function h(μ) depends on the mean for all distri-
butions in Table 6.2, except the Gaussian. An estimate π̂ of the success probability π for
Bernoulli data thus lends itself directly to a moment estimator of the variance, π̂(1 − π̂). For
Gaussian data an estimate of the mean does not provide any information about the variability
in the data.
The natural parameter θ can be expressed as a function of the mean μ = E[Y] by
inverting the relationship b'(θ) = μ. For example, in the Bernoulli case,

E[Y] = μ = exp{θ}/(1 + exp{θ})  ⟺  θ = ln{μ/(1 − μ)}.
Denoted θ(μ) in Table 6.2, this function is called the natural or canonical link function. It
is frequently the link function of choice, but it is not a requirement to retain the canonical link
(§6.2.2). We now discuss the important distributions shown in Table 6.2 in more detail.
The Bernoulli distribution is a discrete distribution for binary (success/failure) outcomes.
If the two possible outcomes are coded

Y = 1 if the outcome is a success
Y = 0 if the outcome is a failure,

a Binomial (B(n, π)) random variable can thus be thought of as the sum of n independent
Bernoulli (B(π)) random variables.
Example 6.2. Seeds are stored at four temperature regimes (T_1 to T_4) and under addi-
tion of chemicals at four different concentrations (0, 0.1, 1.0, 10). To study the effects
of temperature and chemical concentration a completely randomized experiment is
conducted with a 4 × 4 factorial treatment structure and four replications. For each of
the 64 experimental sets, 50 seeds were placed on a dish and the number of seeds that
germinated under standard conditions was recorded. The data, taken from Mead,
Curnow, and Hasted (1993, p. 325), are shown in Table 6.3.
Let Y_ijk denote the number of seeds germinating for the kth replicate of temperature i
and chemical concentration j. For example, y_121 = 13 and y_122 = 12 are the realized
values for the first and second replication of temperature T_1 and concentration 0.1.
These are realizations of B(50, π_12) random variables if the seeds germinated indepen-
dently in a dish and there are no differences that affect germination between the dishes.
Alternatively, one can think of each seed for this treatment combination as a Bernoulli
Table 6.3. Germination data from Mead, Curnow, and Hasted (1993, p. 325)†
(Values represent counts out of 50 seeds for four replicates)

                                Chemical Concentration
Temperature     0 (j = 1)       0.1 (j = 2)     1.0 (j = 3)     10 (j = 4)
T_1 (i = 1)     9, 9, 3, 7      13, 12, 14, 15  21, 23, 24, 27  40, 32, 43, 34
T_2 (i = 2)     19, 30, 21, 29  33, 32, 30, 26  43, 40, 37, 41  48, 48, 49, 48
T_3 (i = 3)     7, 7, 2, 5      1, 2, 4, 4      8, 10, 6, 7     3, 4, 8, 5
T_4 (i = 4)     4, 9, 3, 7      13, 6, 15, 7    16, 13, 18, 19  13, 18, 11, 16
† Used with permission.
As for an experiment with continuous response where interest lies in comparing treat-
ment means, we may be interested in similar comparisons, for example differences of the form

π_ij − π_i'j,

hypotheses such as

π_1j = π_2j = π_3j = π_4j,

or comparisons among the marginal probabilities

π_·1, π_·2, π_·3, π_·4.
The number of trials in a binomial experiment until the kth success occurs follows the
Negative Binomial law. One of the many ways in which the probability mass function of a
Negative Binomial random variable can be written is

p(y) = ( (k + y − 1) choose y ) (1 − π)^y π^k,   y = 0, 1, ….  [6.4]
See Johnson et al. (1992, Ch. 5) and §A6.8.1 for other parameterizations. A special case of
the Negative Binomial distribution is the Geometric distribution, for which k = 1. Notice that
the support of the Negative Binomial distribution has no upper bound and the distribution can
thus be used to model counts without a natural denominator, i.e., counts that cannot be con-
verted to proportions. Examples are the number of weeds per m², the number of aflatoxin-
contaminated peanuts per m³, and the number of earthworms per ft³. Traditionally, the Poisson
distribution is more frequently applied to model such counts than the Negative Binomial
distribution. The probability mass function of a Poisson(λ) random variable is

p(y) = (λ^y / y!) e^{−λ},   y = 0, 1, ….  [6.5]
A special feature of the Poisson random variable is the identity of mean and variance,
E[Y] = Var[Y] = λ. Many count data exhibit variation that exceeds the mean count. The
Negative Binomial distribution has the same support (y = 0, 1, …) as the Poisson distribution
but allows greater variability. It is a good alternative model for count data that exhibit excess
variation compared to the Poisson model. This connection between the Poisson and Negative
Binomial distributions can be made more precise if one considers the parameter λ a
Gamma-distributed random variable (see §A6.8.6 for details and §6.7.8 for an application).
The family of Gamma distributions encompasses continuous, non-negative, right-
skewed probability densities (Figure 6.2). The Gamma distribution has two non-negative
parameters, α and β, and density function

f(y) = 1/(Γ(α) β^α) · y^{α−1} exp{−y/β},   y > 0,  [6.6]

where Γ(α) = ∫₀^∞ t^{α−1} e^{−t} dt is known as the gamma function. The mean and variance
of a Gamma random variable are E[Y] = αβ = μ and Var[Y] = αβ² = μ²/α. The density
function can be rewritten in terms of the mean μ and the scale parameter α as

f(y) = 1/(Γ(α) y) · (yα/μ)^α exp{−yα/μ},   y > 0,  [6.7]

from which the exponential family terms in Table 6.2 were derived. We refer to the param-
eterization [6.7] when we denote a Gamma(μ, α) random variable. The Exponential distri-
bution, for which α = 1, and the Chi-squared distribution with ν degrees of freedom, for
which α = ν/2 and β = 2, are special cases of Gamma distributions (Figure 6.2).
[Figure 6.2: Gamma density functions f(y) for α = 0.5, 1.0, 2.0, 3.0, and 4.0]
Because of their skewness, Gamma distributions are useful to model continuous, non-nega-
tive, right-skewed outcomes with heterogeneous variances, and they play an important role in
analyzing time-to-event data. If events occur independently, one per μ/α time units on
average, the time that elapses until the αth event occurs is a Gamma(μ, α) random variable.
The variance of a Gamma random variable is proportional to the square of its mean.
Hence, the coefficient of variation of a Gamma-distributed random variable remains constant
as its mean changes. This suggests a Gamma model for data in which the standard deviation
of the outcome increases linearly with the mean, which can be assessed in experiments with
replication by calculating standard deviations across replicates.
Example 6.3. McCullagh and Nelder (1989, pp. 317-320) discuss an experiment in
which various seed densities of barley and the weed Sinapis alba were grown in
competition with three replications (blocks). We focus here on a subset of their data,
the monoculture barley dry weight yields (Table 6.4).
Table 6.4. Monoculture barley yields and seeding densities in three blocks
(Experimental units were individual pots)

                        Dry weights
Seeds sown   Block 1   Block 2   Block 3      ȳ        s     CV (%)
     3         2.07      5.32      3.14      3.51     1.65    47.19
     5        10.57     13.59     14.69     12.95     2.13    16.47
     7        20.87      9.97      5.45     12.09     7.93    65.53
    10         6.59     21.40     23.12     17.04     9.09    53.34
    15         8.08     11.07      8.28      9.14     1.67    18.28
    23        16.70      6.66     19.48     14.28     6.74    47.22
    34        21.22     14.25     38.11     24.53    12.27    50.02
    51        26.57     39.37     25.53     30.49     7.71    25.28
    77        23.71     21.44     19.72     21.62     2.00     9.26
   115        20.46     30.92     41.02     30.80    10.28    33.37
Reproduced from McCullagh and Nelder (1989, pp. 317-320) with permission.
[Figure: sample standard deviation (0 to 12) against sample mean dry weight (0 to 35)]
Figure 6.3. Sample standard deviation as a function of sample mean in barley yield
monoculture.
Sample standard deviations calculated from only three observations are not reliable.
When plotting s against ȳ, however, a trend between standard deviation and sample
mean is obvious, and a linear relationship √Var[Y] ∝ E[Y] does not seem unreasonable.
In §5.8.7 it was stated that the variability of the yield per plant Y generally increases with
Y in yield-density studies and competition experiments. Mead (1970), in a study of the
Bleasdale-Nelder yield-density model (Bleasdale and Nelder 1960), concludes that it is
reasonable for yield data to assume that Var[Y] ∝ E[Y]², precisely the mean-variance
relationship implied by the Gamma distribution. Many research workers would choose a
logarithmically transformed model

E[ln{Y}] = ln{f(x, β)},

where x denotes plant density, assuming that Var[ln{Y}] is constant and ln{Y} is Gaussian.
As an alternative, one could model the relationship

E[Y] = f(x, β)
When modeling sample variances, one should not resort to a Gaussian distribution, but
instead draw on a properly scaled Gamma distribution.
Example 6.4. Hart and Schabenberger (1998) studied the variability of the mycotoxin
deoxynivalenol (DON) on truckloads of wheat kernels. DON is a toxic secondary
metabolite produced by the fungus Gibberella zeae during the infection process. Data
were gathered in 1996 by selecting at random ten trucks arriving at mill elevators. For
each truck ten double-tubed probes were inserted at random from the top into the truck-
load, and the kernels trapped in the probe were extracted, milled, and submitted to
enzyme-linked immunosorbent assay (ELISA). Figure 6.4 shows the probe-to-probe
sample variance as a function of the sample mean toxin concentration per truck.
[Figure 6.4: probe-to-probe sample variance against sample mean DON concentration (ppm)]
where g(•) is a properly chosen link function. These data are analyzed in §6.7.7.
The Log-Gaussian distribution is also right-skewed and has variance proportional to the
squared expectation. It is popular in modeling bioassay data or growth data where a logarith-
mic transformation establishes Gaussianity. If ln{Y} is Gaussian with mean μ and variance
σ², then Y is a Log-Gaussian random variable with mean E[Y] = exp{μ + σ²/2} and
variance Var[Y] = exp{2μ + σ²}(exp{σ²} − 1). It is also a reasonable model for right-
skewed data that can be transformed to symmetry by taking logarithms. Unfortunately, it is
not a member of the exponential family and does not permit a generalized linear model. In the
GLM framework, the Gamma distributions provide an excellent alternative in our opinion.
Amemiya (1973) and Firth (1988) discuss testing Log-Gaussian vs. Gamma distributions and
vice versa.
The Inverse Gaussian distribution is also skewed with a long right tail and is a
member of the exponential family. It is not really related to the Gaussian distribution,
although there are some parallel developments for the two families. For example, the
independence of sample mean and sample variance that holds for the Gaussian is also true for
the Inverse Gaussian distribution. The name stems from the inverse relationship between the
cumulant generating functions of the two distributions. Also, the formulas of the Gaussian and
Inverse Gaussian probability density functions bear some resemblance to each other. Folks
and Chhikara (1978) suggested naming it the Tweedie distribution in recognition of
fundamental work on this distribution by Tweedie (1945, 1957a, 1957b). A random variable
Y is said to have an Inverse Gaussian distribution if its probability density function (Figure
6.5) is given by

f(y) = (2πy³σ²)^{−1/2} exp{−(y − μ)²/(2yμ²σ²)},   y > 0.
[Figure 6.5: Inverse Gaussian density functions f(y) for μ = 1.0, 2.0, 3.0, and 5.0]
The distribution finds application in stochastic processes as the distribution of the first
passage time in a Brownian motion, and in the analysis of lifetime and reliability data. The
relationship with a passage time in Brownian motion suggests its use as the time a tracer
remains in an organism. Folks and Chhikara (1978) pointed out that the skewness of the
Inverse Gaussian distribution makes it an attractive candidate for right-skewed, non-negative,
continuous outcomes whether or not the particular application relates to passage time in a
stochastic process. The mean and variance of the Inverse Gaussian are given by E[Y] = μ
and Var[Y] = μ³σ². Notice that σ² is not an independent scale parameter. The variability of
Y is determined by μ and σ² jointly, whereas for a Gaussian distribution Var[Y] = σ². The
variance of the Inverse Gaussian increases more sharply with the mean than that of the
Gamma distribution (for σ² = α^{−1}).
and term it the linear predictor η_i. The linear predictor is chosen in generalized linear
models in much the same way as the mean function is built in classical linear regression or
classification models. For example, if the binomial counts or proportions are analyzed in the
completely randomized design of Example 6.2 (p. 306), the linear predictor contains an inter-
cept, temperature and concentration effects, and their interactions. In contrast to the classical
linear model, where η_i is the mean of an observation, the linear predictor in a generalized
linear model is set equal to a transformation of the mean,

g(E[Y_i]) = η_i = x_i'β.  [6.9]
This transformation g(•) is called the link function and serves several purposes. It is a
transformation of the mean μ_i onto a scale where the covariate effects are additive. In the
terminology of §5.6.1, the link function is a linearizing transform and the generalized linear
model is intrinsically linear. If one studies a model with mean function

μ_i = exp{β_0 + β_1 x_i} = exp{β_0}exp{β_1 x_i},

then ln{μ_i} is a linearizing transformation of the nonlinear mean function. A second purpose
of the link function is to confine predictions under the model to a suitable range. If Y_i is a
Bernoulli outcome then E[Y_i] = π_i is a success probability which must lie between 0 and 1.
Since no restrictions are placed on the parameters in the linear predictor x_i'β, the linear
predictor can range from −∞ to ∞. To ensure that the predictions are in the proper range
one chooses a link function that maps from (0, 1) to (−∞, ∞). One such possibility is the
logit transformation

logit(π) = ln{π/(1 − π)},  [6.10]

which is also the canonical link for the Bernoulli and Binomial distributions (see Table 6.2).
Models with logit link are termed logistic models and can be expressed as

ln{π/(1 − π)} = x'β.  [6.11]
Once parameter estimates in the generalized linear model have been obtained (§6.4), the
mean of the outcome at any value of x is predicted as

μ̂ = g^{−1}(x'β̂).
Several link functions can properly restrict the expectation but provide different scales on
which the covariate effects are additive. The representation of the probability (mass) density
functions in Table 6.2 suggests the canonical link functions shown there as θ(μ). The
canonical link for Binomial data is thus the logit link, for Poisson counts the log link, and for
Gaussian data the identity link (no transformation). Although relying on the canonical link
leads to some simplifications in parameter estimation, these are not of concern to the user of
generalized linear models in practice. Functions other than the canonical link may be of
interest. We now review popular link functions for binary data and proportions, counts, and
continuous variables.
with mean μ and variance (απ)²/3. If μ = 0, α = 1, the distribution is called the Standard
Logistic with cdf π = exp{y}/(1 + exp{y}). Inverting the standard logistic cdf yields the
logit function

F^{−1}(π) = ln{π/(1 − π)}.

In terms of a generalized linear model for Y_i with E[Y_i] = π_i and linear predictor η_i = x_i'β
we obtain

F^{−1}(π_i) = logit(π_i) = ln{π_i/(1 − π_i)} = x_i'β.  [6.14]
In the logistic model the parameters have a simple interpretation in terms of log odds
ratios. Consider a two-group comparison of successes and failures. Define a dummy variable
as

x_ij = 1 if the ith observation is in the treated group (j = 1)
x_ij = 0 if the ith observation is in the control group (j = 0),

so that the model reduces to

logit(π_j) = β_0 + β_1 for j = 1 (treated group), and
logit(π_j) = β_0 for j = 0 (control group).
The gradient "" measures the change in the logit between the control and the treated group. In
terms of the success and failure probabilities one can construct a # # table.
The odds S are defined as the ratio of the success and failure probabilities in a particular
group,
1!
Scontrol / "!
" 1!
1"
Streated /"! "" .
" 1"
Successes are expe"! f times more likely in the control group than failures and expe"! "" f
times more likely than failures in the treated group. How much the odds have changed by
applying the treatment is expressed by the odds ratio
Streated expe"! "" f
SV / "" ,
Scontrol expe"! f
or the log odds ratio lnaSV b "" . If the log odds ratio is zero, the success/failure ratio is the
same in both groups. Successes are then no more likely relative to failures under the treatment
than under the control. A test of L! :"" ! thus tests for equal success/failure odds in the two
groups. From Table 6.5 it is seen that this implies equal success probabilities in the groups.
These ideas generalize to comparisons of more than two groups.
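A quick numeric illustration with made-up probabilities (ours, not from the original): if
π_0 = 0.2 and π_1 = 0.5, then

\[
O_{\text{control}} = \frac{0.2}{0.8} = 0.25, \qquad
O_{\text{treated}} = \frac{0.5}{0.5} = 1, \qquad
OR = \frac{1}{0.25} = 4,
\]

so that β_1 = ln(4) ≈ 1.386; applying the treatment quadruples the odds of success.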
Why is the logistic distribution our first choice to develop a link function and not the
omnipotent Gaussian distribution? Assume we choose the standard Gaussian cdf

π = Φ(η) = ∫_{−∞}^{η} (2π)^{−1/2} exp{−t²/2} dt  [6.15]

as the inverse link function. The link function is then given by Φ^{−1}(π) = η, but neither
Φ^{−1}(π) nor Φ(η) exists in closed form and both must be evaluated numerically. Models
using the inverse Gaussian cdf are termed probit models (please note that the inverse
Gaussian function here refers to the inverse cdf of a Gaussian random variable and is not the
same as the Inverse Gaussian random variable, which is a variable with a right-skewed
density). Often, there is little to be gained in practice from using a probit over a logistic
model. The cumulative distribution functions of the standard Logistic and standard Gaussian
are very similar (Figure 6.6). Both are sigmoidal and symmetric about η = 0. The main
difference is that the Gaussian tails are less heavy than the Logistic tails and thus approach
probabilities 0 and 1 more quickly. If the distributions are scaled to have the same mean and
variance, the cdfs agree even more than is evident in Figure 6.6. Since logit(π) is less
cumbersome numerically than Φ^{−1}(π), it is often preferred.
Figure 6.6. Cumulative distribution functions as inverse link functions. The corresponding
link functions are the logit for the Standard Logistic, the probit for the Standard Gaussian, the
log-log link for the Type-1 Extreme (max) value and complementary log-log link for the
Type-1 Extreme (min) value distribution.
The symmetry of the Standard Gaussian and Logistic distributions implies that π = 0 is
approached at the same rate as π = 1. If π departs from 0 slowly and approaches 1 quickly,
or vice versa, the probit or logit links are not appropriate. Asymmetric link functions can be
derived from appropriate cumulative distribution functions. A Type-I Extreme value distri-
bution (Johnson et al. 1995, Ch. 23) is given by

F(y) = exp{−exp{−(y − α)/β}}  [6.16]

and has a standardized form with α = 0, β = 1. This distribution, also known as the Gumbel
or double-exponential distribution, arises as the distribution of the largest value in a random
sample of size n. By putting x = −y one can obtain the distribution of the smallest value.
Consider the standardized form π = F(η) = exp{−exp{−η}}. Inverting this cdf yields the
link function

−ln{−ln{π}} = x_i'β,  [6.17]

known as the log-log link. Its complement ln{−ln{1 − π}}, obtained by changing successes
to failures and vice versa, is known as the complementary log-log link, derived from
F(η) = 1 − exp{−exp{η}}. The complementary log-log link behaves like the logit for π near
0 and has smaller values as π increases. The log-log link behaves like the logit for π near 1
and yields larger values for small π (Figure 6.7).
[Figure 6.7: the link functions η = logit(π), probit, log-log, and complementary log-log as
functions of π]
Generalized linear models with log link are often called log-linear models. They play an
important role in regression analysis of counts and in the analysis of contingency tables.
Consider the generic layout of a two-way contingency table in Table 6.6. The count n_ij in
row i, column j of the table represents the number of times variable X was observed at level j
while variable Y simultaneously took on level i.
If the row and column variables (Y and X) are independent, the cell counts n_ij are
determined by the marginal row and column totals alone. Under a Poisson sampling model,
where the count in each cell is the realization of a Poisson(λ_ij) random variable, the row and
column totals are Poisson(λ_i· = Σ_{j=1}^{J} λ_ij) and Poisson(λ_·j = Σ_{i=1}^{I} λ_ij) variables,
respectively, and the total sample size is a Poisson(λ_·· = Σ_{i,j} λ_ij) random variable. The
expected count λ_ij under independence is then related to the marginal expected counts by

λ_ij = λ_i· λ_·j / λ_··.

Taking logarithms leads to

ln{λ_ij} = −ln(λ_··) + ln{λ_i·} + ln{λ_·j} = μ + α_i + β_j,  [6.18]

a generalized linear model with log link for Poisson-distributed random variables and a linear
predictor consisting of a grand mean μ, row effects α_i, and column effects β_j. The linear pre-
dictor is akin to that in a two-way layout without interactions, such as a randomized block
design, which is precisely the layout of Table 6.6. We can think of α_i and β_j as main effects
of the row and column variables. The Poisson sampling scheme applies if the total number of
observations (n_··) is itself a random variable, i.e., prior to data collection the total number of
observations being cross-classified is unknown. If the total sample size is known, one is
fortuitously led to the same general decomposition for the expected cell counts as in [6.18].
Conditional on n_·· the I × J counts in the table are realizations of a multinomial distribution
with cell probabilities π_ij and marginal probabilities π_i· and π_·j. The expected count in cell
(i, j), if X and Y are independent, is

λ_ij = n_·· π_i· π_·j.
In §6.7.6 the agreement between two raters of the same experimental material is analyzed
by comparing a series of log-linear models that structure the interaction between the ratings.
applies. This is a generalized linear model with inverse link and linear predictor α + βx. Its
yield per unit area equation is

E[U] = x/(α + βx).  [6.21]

As for yield per plant, we can model yield per unit area as a generalized linear model with
inverse link and linear predictor β + α/x. This is a hyperbolic function of plant density and
the reciprocal link is adequate. In terms of the linear predictor η this hyperbolic model gives
rise to

η = E[U]^{−1} = β + α/x.  [6.22]

η = β + α_1/x + α_2 x,
These transformations include as special cases the logarithmic transform, reciprocal trans-
form, and square root transform. If used as link functions, the corresponding inverse link
functions are

μ = (λη + 1)^{1/λ}   (λ ≠ 0)
μ = exp{η}          (λ = 0).

Figure 6.8 shows the inverse functions for λ = 0.3, 0.5, 1.0, 1.5, and 2.0.
[Figure 6.8: inverse link functions μ = (λη + 1)^{1/λ} for λ = 0.3, 0.5, 1.0, 1.5, and 2.0,
plotted against η]
Binomial responses are coded in the events/trials syntax. This syntax requires two data set
variables. The number of Bernoulli trials (= the size of the binomial experiment) is coded as
the variable trials and the number of successes as events. Consider the seed germination data
in Example 6.2 (Table 6.3, p. 307). The linear predictor of the full model fitted to these data
contains temperature and concentration main effects and their interaction. If the data set is
entered as shown in Table 6.3, the events/trials syntax would be used.
data germination;
input temp $ conc germnumber;
trials = 50;
datalines;
T1 0 9
T1 0 9
T1 0 3
T1 0 7
T1 0.1 13
T1 0.1 12
and so forth
;;
run;
proc logistic data=germination;
class temp conc / param=glm;
model germnumber/trials = temp conc temp*conc;
run;
The logistic procedure in Release 8.0 of The SAS® System inherits from the
experimental tlogistic procedure in Release 7.0 the ability to use different coding methods
for classification variables. The coding method is selected with the param= option of the
class statement. We prefer the coding scheme for classification variables that corresponds to
the coding method in the glm procedure. This is not the default of proc logistic and hence
we use the param=glm option in the example code above. We prefer glm-type coding because
the specification of contrast coefficients in the contrast statement of proc logistic is then
identical to the specification of contrast coefficients in proc glm (§4.3.4).
To analyze ordinal responses or Bernoulli variables, the single-trial syntax is used. The
next example shows rating data from a factorial experiment with a 4 × 2 treatment structure
arranged in a completely randomized design. The ordered response variable has three levels,
Poor, Medium, and Good. Also, a Bernoulli response variable (medresp) is created, taking the
value 1 if the response was Medium and 0 otherwise.
data ratings;
input REP A B RESP $;
medresp = (resp='Medium');
datalines;
1 1 1 Medium
1 1 2 Medium
1 2 1 Medium
1 2 2 Medium
1 3 1 Good
1 3 2 Good
1 4 1 Good
1 4 2 Good
2 1 1 Poor
2 1 2 Medium
2 2 1 Poor
2 2 2 Medium
2 3 1 Good
2 3 2 Good
2 4 1 Medium
2 4 2 Good
3 1 1 Medium
3 1 2 Medium
3 2 1 Poor
3 2 2 Medium
3 3 1 Good
3 3 2 Good
3 4 1 Good
3 4 2 Good
4 1 1 Poor
4 1 2 Medium
4 2 1 Poor
4 2 2 Medium
4 3 1 Good
4 3 2 Good
4 4 1 Medium
4 4 2 Good
run;
The proportional odds model (§6.5) for ordered data is fit with the statements
proc logistic data=ratings;
class A B / param=glm;
model resp = A B A*B;
run;
By default, the values of the response categories are sorted according to their internal
format. For a character variable such as RESP, the sort order is alphabetical. This results in the
correct order here, since the alphabetical order corresponds to Good-Medium-Poor. If, for
example, the Medium category were renamed Average, the internal order of the categories
would be Average-Good-Poor. To ensure proper category arrangement in this case, one can
use the order= option of the proc logistic statement. For example, one can arrange the data
such that all responses rated Good appear first followed by the Average and the Poor
responses. Then, the correct order is established with the statements
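A minimal sketch (assuming the data set has been rearranged as just described):

proc logistic data=ratings order=data;
   class A B / param=glm;
   model resp = A B A*B;
run;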
For Bernoulli responses coded 0 and 1, proc logistic will also arrange the categories
according to internal formatting. For numeric variables this is an ascending order. Conse-
quently, proc logistic will model the probability that the variable takes on the value 0.
Modeling the probability that the variable takes on the value 1 is usually preferred, since this
is the mean of the response. This can be achieved with the descending option of the proc
logistic statement:
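For example (a sketch):

proc logistic data=ratings descending;
   class A B / param=glm;
   model medresp = A B A*B;
run;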
By default, proc logistic will use a logit link. Different link functions are selected with
the link= option of the model statement. To model the Bernoulli response medresp with a
complementary log-log link, for example, the statements are
proc logistic data=ratings descending;
class A B / param=glm;
model medresp = A B A*B / link=cloglog;
run;
The model statement provides numerous other options, for example the selection=
option to perform automated covariate selection with backward, forward, and stepwise
methods. The ctable option produces a classification table for Bernoulli responses which
classifies observed responses depending on whether the predicted responses are above or
below some probability threshold; this is useful for establishing the sensitivity and specificity
of a logistic model for purposes of classification. The online manuals, help files, and
documentation available from SAS Institute discuss additional options and features of the
procedure.
Like proc logistic the genmod procedure accepts responses coded in single-trial or
events/trial syntax. The latter is reserved for grouped Binomial data (see §6.3 on grouped vs.
ungrouped data). The order= option of the proc genmod statement affects the ordering of
classification variables as in proc logistic, but not the ordering of the response variable. A
separate option (rorder=) of the proc genmod statement is used to determine the ordering of
the response.
The code to produce a logistic analysis in the seed germination Example 6.2 (Table 6.3,
p. 307) with proc genmod is
data germination;
input temp $ conc germnumber;
trials = 50;
datalines;
T1 0 9
T1 0 9
T1 0 3
T1 0 7
T1 0.1 13
T1 0.1 12
and so forth
;;
run;
proc genmod data=germination;
class temp conc;
model germnumber/trials = temp conc temp*conc link=logit dist=binomial;
run;
The link function and distribution are selected with the link= and dist= options of the
model statement. The next statements perform a Poisson regression with linear predictor
( "! "" B" "# B# and log link;
proc genmod data=yourdata;
model count = x1 x2 / link=log dist=poisson;
run;
For each distribution, proc genmod will apply a default link function if the link= option
is omitted. These are the canonical links for the distributions in Table 6.2 and the cumulative
logit for the multinomial distribution. This does not work the other way around. By specify-
ing a link function but not a distribution function proc genmod does not select a distribution
for which this is the canonical link. That would be impossible since the canonical link does
not identify the distribution. The Negative Binomial and Binomial distributions both have a
canonical log link, for example. Since the default distribution of proc genmod is the Gaussian
distribution (if the response is in single-trial syntax) statements such as
proc genmod data=ratings;
class A B;
model medresp = A B A*B / link=logit;
run;
do not fit a Bernoulli response with a logit link, but a Gaussian response (which is not
sensible if medresp takes on only values ! and ") with a logit link. For the analysis of a
Bernoulli random variable in proc genmod use instead
proc genmod data=ratings;
class A B;
model medresp = A B A*B / link=logit dist=binomial;
run;
If the events/trials syntax is used the distribution of the response will default to the
Binomial and the link to the logit.
The genmod procedure has a contrast statement and an estimate statement akin to the
statements of the same name in proc glm. An lsmeans statement is also available except for
ordinal responses. The statements
proc genmod data=ratings;
class A B;
model medresp = A B A*B / dist=binomial link=logit;
lsmeans A A*B / diff;
run;
perform pairwise comparisons for the levels of factor A and the A × B cell means.
Because there is only one method of coding classification variables in genmod (in contrast
to logistic, see above), and this method is identical to the one used in proc glm, contrast
coefficients are entered in exactly the same way as in glm. Consider the ratings example
above with a 4 × 2 factorial treatment structure.
proc genmod data=ratings;
class A B;
model resp = A B A*B / dist=multinomial link=cumlogit;
contrast 'A1+A2-2A3=0' A 1 1 -2 0;
run;
The contrast statement tests a hypothesis of the form Aβ = 0 based on the asymptotic
distribution of the linear combination Aβ̂, and the estimate statement estimates the linear
combination a'β. Here, β are the parameters in the linear predictor. In other words, the
hypothesis is tested on the scale of the linear predictor, not the scale of the mean response. In
some instances hypotheses about the mean values can be expressed as linear functions of the
βs and the contrast or estimate statements are sufficient. Consider, for example, a logistic
regression model for Bernoulli data,

logit(π_i) = ln{π_i/(1 − π_i)} = β_0 + β_1 x_i.
The hypothesis H_0: π_control = π_treated can be tested as the simple linear hypothesis
H_0: β_1 = 0. In other instances hypotheses or quantities of interest do not reduce or are not
equivalent to simple linear functions of the parameters. An estimate of the ratio

π_control/π_treated = (1 + exp{−β_0 − β_1})/(1 + exp{−β_0}),

for example, is a nonlinear function of the parameters. The variance of the estimated ratio
π̂_control/π̂_treated must be approximated from a Taylor series expansion, although exact
methods for this particular ratio exist (Fieller 1940, see our §A6.8.5). The nlmixed
procedure, although not specifically designed for generalized linear models, can be used to
our advantage in this case.
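A sketch of this approach for the two-group model above; the data set name (twogroups),
the variable names y and x, and the starting values are our assumptions:

proc nlmixed data=twogroups;
   parameters b0=0 b1=0;
   /* logistic model; x = 0 codes the control group, x = 1 the treated group */
   pi = 1/(1 + exp(-b0 - b1*x));
   model y ~ binary(pi);
   /* delta-method standard error for the nonlinear ratio */
   estimate 'pi(control)/pi(treated)' (1 + exp(-b0 - b1))/(1 + exp(-b0));
run;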
for the LD_50. For the probit or logit link we have g(0.5) = 0 and thus

log_10(LD_50) = −β_0/β_1
LD_50 = 10^{−β_0/β_1}.

Once the parameters of the probit or logistic regression models have been estimated, the
obvious estimates for the lethal dosages on the logarithmic and original scale of insecticide
concentration are

−β̂_0/β̂_1  and  10^{−β̂_0/β̂_1}.
These are nonlinear functions of the parameter estimates and standard errors must be
approximated from Taylor series expansions unless one is satisfied with fiducial limits for the
LD_50 (see §A6.8.5). The data set and the proc nlmixed code to fit the logistic regression
model and to estimate the lethal dosages are as follows.
data kills;
input concentration kills;
trials = 20;
logc = log10(concentration);
datalines;
0.375 0
0.75 1
1.5 8
3.0 11
6.0 16
12.0 18
24.0 20
;;
run;
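A minimal sketch of these statements, consistent with the description that follows (the
starting values mirror those of the probit analysis further below):

proc nlmixed data=kills;
   parameters intcpt=-1.7 b=4.0;
   /* inverse logit link: probability that an insect is killed */
   pi = 1/(1+exp(-intcpt - b*logc));
   model kills ~ binomial(trials,pi);
   estimate 'LD50' -intcpt/b;
   estimate 'LD50 original' 10**(-intcpt/b);
run;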
The parameters statement defines the parameters to be estimated and assigns starting
values, in the same fashion as the parameters statement of proc nlin. The statement pi =
1/(1+exp(-intcpt - b*logc)) calculates the probability of an insect being killed through
the inverse link function. This is a regular SAS® programming statement for π = g^{−1}(η).
The model statement specifies that observations of the data set variable kills are realizations
of Binomial random variables with binomial sample size given by the data set variable trials
and success probability according to pi. The estimate statements calculate estimates of the
LD_50 values on the logarithmic and original scale of insecticide concentrations.
If a probit analysis is desired, as in Mead et al. (1993), the code changes only slightly.
Only the statement for π = g^{−1}(η) must be altered. The inverse of the probit link is the
cumulative Standard Gaussian distribution function

π = Φ(η) = ∫_{−∞}^{η} (2π)^{−1/2} exp{−t²/2} dt,

which can be calculated with the probnorm() function of The SAS® System (the SAS®
function calculating the linked value is, not surprisingly, called the probit() function; a call
to probnorm(1.96) returns the result 0.975, and probit(0.975) returns 1.96). Because of the
similarities of the logit and probit link functions the same set of starting values can be used
for both analyses.
proc nlmixed data=kills;
   parameters intcpt=-1.7 b=4.0;
   pi = probnorm(intcpt + b*logc);
   model kills ~ binomial(trials,pi);
run;
Like the genmod procedure, nlmixed allows the user to perform inference for distributions
other than the built-in distributions. In genmod this is accomplished by programming the
deviance (§6.4.3) and variance function (deviance and variance statements). In nlmixed the
model statement is altered to
model response ~ general(logl);
where logl is the log-likelihood function (see §6.4.1 and §A6.8.2) of the data constructed
with SAS® programming statements. For the Bernoulli distribution the log-likelihood for an
individual (0, 1) observation is simply ℓ(π; y) = y ln{π} + (1 − y) ln{1 − π}, and for the
Binomial the log-likelihood kernel for the binomial count y is ℓ(π; y) = y ln{π} +
(n − y) ln{1 − π}. The following nlmixed code also fits the probit model above.
proc nlmixed data=kills;
parameters intcpt=-1.7 b=4.0;
p = probnorm(intcpt + b*logc);
logl = kills*log(p) + (trials-kills)*log(1-p);
model kills ~ general(logl);
estimate 'LD50' -intcpt/b;
estimate 'LD50 original' 10**(-intcpt/b);
run;
• Grouping data changes weights in the exponential family models and im-
pacts the asymptotic behavior of parameter estimates.
So far we have implicitly assumed that each data point represents a single observation. This is
not necessarily the case. Consider an agronomic field trial in which four varieties of wheat are
to be compared with respect to their resistance to infestation with the Hessian fly (Mayetiola
destructor). The varieties are arranged in a randomized block design, and each experimental
unit is a 3.7 × 3.7 m field plot. n_ij plants are sampled in the jth block for variety i, z_ij of
which show damage. If plants on a plot are infected independently of each other, the data
from each plot can also be considered a set of independent and identically distributed
Bernoulli variables.
A hypothetical data set for the outcomes of such an experiment with two blocks is shown in
Table 6.7. The data set contains 56 total observations, 23 of which correspond to damaged
plants.
Table 6.7. Hypothetical data for Hessian fly experiment (four varieties in two blocks)

                  Block j = 1                         Block j = 2
 k     Entry 1  Entry 2  Entry 3  Entry 4   Entry 1  Entry 2  Entry 3  Entry 4
 1        1        0        0        1         1        0        0        0
 2        0        0        1        0         0        0        0        0
 3        0        1        0        0         1        1        1        0
 4        1        0        0        0         1        0        0        1
 5        0        1        0        1         1        0        1        0
 6        1        0        1        1         0                 0        0
 7        1                 1        0         0                 1        0
 8        0                          1                                    1
n_ij      8        6        7        8         7        5        7        8
z_ij      4        2        3        4         4        1        3        2
One could model the Y_ijk with a generalized linear model for 56 binary outcomes.
Alternatively, one could model the number of damaged plants (z_ij) per plot or the proportion
of damaged plants per plot (z_ij/n_ij). The number of damaged plants corresponds to the sum
of the Bernoulli variables,

z_ij = Σ_{k=1}^{n_ij} y_ijk.

The sums z_ij and averages ȳ_ij are grouped versions of the original data. The sum of indepen-
dent and identical Bernoulli variables is a Binomial random variable and of course in the
exponential family (see Table 6.2 on p. 305). It turns out that the distribution of the average
of random variables in the exponential family is also a member of the exponential family. But
the number of grouped observations is smaller than the size of the original data set. In Table
6.7 there are 56 ungrouped and 8 grouped observations. A generalized linear model needs to
be properly adjusted to reflect this grouping. This adjustment is made either through the va-
riance function or by introducing weights into the analysis. In the Hessian fly example, the
variance function h(π_ij) of a Bernoulli observation for variety i in block j is π_ij(1 − π_ij),
but h(π_ij) = n_ij π_ij(1 − π_ij) for the counts z_ij and h(π_ij) = n_ij^{−1} π_ij(1 − π_ij) for the
proportions ȳ_ij. The introduction of weights into the exponential family density or mass
function accomplishes the same. Instead of [6.1] we consider the weighted version

f(y) = exp{ (yθ − b(θ)) / (φ/w) + c(y, φ, w) }.  [6.24]
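For the Hessian fly data, for instance, the grouped (Binomial) analysis could be set up along
these lines (a sketch; the data set and variable names are ours):

proc genmod data=hessian;
   class block entry;
   /* grouped Binomial data: z damaged plants out of n sampled per plot */
   model z/n = block entry / dist=binomial link=logit;
run;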
Example 6.5 Earthworm Counts. In 1995 earthworms (Lumbricus terrestris L.) were
counted in four replications of a 2⁴ factorial experiment at the W.K. Kellogg Biological
Station in Battle Creek, Michigan. The treatment factors and levels were Tillage (chisel-
plow and no-till), Input Level (conventional and low), Manure application (yes/no), and
Crop (corn and soybean). Of interest was whether the L. terrestris density varies under
these management protocols and how the various factors act and interact. Table 6.8
displays the total worm counts for the 64 (= 2⁴ treatments × 4 replicates) experimental
units (juvenile and adult worms).
Table 6.8. Ungrouped worm count data (counts per ft²) in 2⁴ factorial design (numbers
in each cell of table correspond to counts on replicates)

                               Tillage
                     Chisel-Plow                   No Tillage
                     Input Level                   Input Level
Crop     Manure   Low           Conventional   Low           Conventional
Corn     Yes      5, 5, 4, 2    5, 1, 5, 0     8, 4, 6, 4    14, 9, 9, 6
         No       3, 11, 0, 0   2, 0, 6, 1     2, 2, 11, 4   15, 9, 6, 4
Soybean  Yes      8, 6, 0, 3    8, 4, 2, 2     2, 2, 13, 7   5, 3, 6, 0
         No       8, 5, 3, 11   2, 6, 9, 4     7, 5, 18, 3   23, 12, 17, 9
Unless the replication effects are block effects, the four observations per cell share the
same linear predictor consisting of Tillage, Input Level, Crop, and Manure effects and their
interactions. Grouping to model the averages reduces the 64 observations to n^{(g)} = 16
observations.
                               Tillage
                     Chisel-Plow                   No Tillage
                     Input Level                   Input Level
Crop     Manure   Low     Conventional         Low     Conventional
Corn     Yes      4.00       2.75              5.50       9.50
         No       3.50       2.25              4.75       8.50
Soybean  Yes      4.25       4.00              6.00       3.50
         No       6.75       5.25              8.25      15.25
When grouping data, observations that have the same set of covariates or design effects,
i.e., share the same linear predictor x'β, are summed or averaged. In the Hessian fly example
each block × entry combination is unique, but n_ij observations were collected on each ex-
perimental unit. In experiments where treatments are replicated, grouping is often possible
even if only a single observation is gathered on each unit. If covariates are continuous and
their values unique, grouping is not possible.
In the previous two examples it appears a matter of convenience whether data are
grouped or not. But this choice has subtle implications. Diagnosing the model-data disagree-
ment in generalized linear models based on residuals or goodness-of-fit measures such as
Pearson's X² statistic or the deviance (§6.4.3) is only meaningful if data are grouped. Group-
ing is a special case of clustering where the elements of a cluster are reduced to a single
observation (the cluster total or average). Asymptotic results for grouped data can be obtained
by increasing the number of groups while holding the size of each group constant or by
assuming that the number of groups is fixed and the group size grows. The respective asymp-
totic results are not identical. For ungrouped data, it is only reasonable to consider asymptotic
results under the assumption that the sample size n grows to infinity. No distinction between
group size and group number is made. Finally, if data are grouped, computations are less
time-consuming. Since generalized linear models are fit by iterative procedures, grouping
large data sets as much as possible is recommended.
For random variables with distribution in the exponential family the specification of the joint distribution of the data is made simple by the relationship between mean and variance (§6.2.1). If the i = 1, ..., n observations are independent, the likelihood for the complete response vector y = [y₁, ..., y_n]′ becomes

    L(θ, φ; y) = ∏_{i=1}^n L(θ, φ; y_i) = ∏_{i=1}^n exp{ (y_i θ_i − b(θ_i))/φ + c(y_i, φ) }.   [6.25]
Since μ_i = g⁻¹(η_i) = g⁻¹(x′_i β), where g⁻¹(·) is the inverse link function, the log-likelihood is a function of the parameter vector β and estimates are found as the solutions of

    ∂ℓ(μ, φ; y)/∂β = 0.   [6.28]

Details of this maximization problem are found in §A6.8.2.
Since generalized linear models are nonlinear, the estimating equations resemble those for nonlinear models. For a general link function these equations are

    F′V⁻¹(y − μ) = 0,   [6.29]

and

    X′(y − μ) = 0   [6.30]

if the link is canonical. Here, V is a diagonal matrix containing the variances of the responses on its diagonal (V = Var[Y]) and F contains derivatives of μ with respect to β. Furthermore, if the link is the identity, it follows that a solution to [6.30] is

    X′V⁻¹Xβ̂ = X′V⁻¹y.   [6.31]

This would suggest a generalized least squares estimator β̂ = (X′V⁻¹X)⁻¹X′V⁻¹y. The difficulty with generalized linear models is that the variances in V are functionally dependent on the means. In order to calculate β̂, which determines the estimate of the mean, μ̂, V must be evaluated at some estimate of μ. Once β̂ is calculated, V should be updated. The procedure to solve the maximum likelihood problem in generalized linear models is hence iterative, with variance estimates updated after each update of β̂, and is known as iteratively reweighted least squares (IRLS, §A6.8.3). Upon convergence of the IRLS algorithm, the variance of β̂ is estimated as

    Var[β̂] = (F̂′V̂⁻¹F̂)⁻¹,   [6.32]

where the variance and derivative matrices are evaluated at the converged iterate.
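To make the algorithm concrete, here is a minimal IRLS sketch in proc iml for a logistic regression with canonical link (it uses the insecticide data set kills created in §6.7.1; for the canonical logit link F′V⁻¹ reduces to X′, so each update is a Fisher scoring step; a sketch with a fixed number of iterations and no convergence test):

proc iml;
use kills; read all var {logc kills trials}; close kills;
X = j(nrow(logc), 1, 1) || logc;     /* design matrix: intercept and log10 concentration */
beta = j(ncol(X), 1, 0);             /* starting values */
do iter = 1 to 25;
  pi = 1 / (1 + exp(-X*beta));       /* inverse logit */
  W  = diag(trials # pi # (1-pi));   /* binomial variances on the diagonal */
  beta = beta + inv(X`*W*X) * X`*(kills - trials#pi);
end;
stderr = sqrt(vecdiag(inv(X`*W*X))); /* [6.32] evaluated at the converged iterate */
print beta stderr;
quit;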
The mean response is estimated by evaluating the linear predictor at β̂ and substituting the result into the inverse link function,

    Ê[Y] = g⁻¹(η̂) = g⁻¹(x′β̂) = μ̂.   [6.33]

Because of the nonlinearity of most link functions, Ê[Y] is not an unbiased estimator of E[Y], even if β̂ is unbiased for β, which usually it is not. An exception is the identity link function where E[x′β̂] = x′β if the estimator is unbiased and E[x′β̂] = E[Y] provided the model is correct. Estimated standard errors of the predicted mean values are usually derived from Taylor series expansions and are approximate in the following sense. The Taylor series of g⁻¹(x′β̂) around some value β is an approximate linearization of g⁻¹(x′β̂). The variance of this linearization is a function of the model parameters and is estimated by substituting the parameter estimates without taking into account the uncertainty in these estimates themselves. A Taylor series of μ̂ = g⁻¹(η̂) = g⁻¹(x′β̂) around η leads to the linearization

    g⁻¹(η̂) ≈ g⁻¹(η) + (η̂ − η) ∂g⁻¹(η)/∂η |_η = g⁻¹(η) + (x′β̂ − x′β) ∂g⁻¹(η)/∂η |_η,   [6.34]

and its variance is estimated as

    Var[μ̂] ≈ ( ∂g⁻¹(η̂)/∂η̂ )² x′(F̂′V̂⁻¹F̂)⁻¹x = ( ∂g⁻¹(η̂)/∂η̂ )² x′ Var[β̂] x.   [6.35]

If the link function is canonical, a simplification arises. In that case η = θ(μ), the derivative ∂g⁻¹(η)/∂η can be written as ∂μ/∂θ(μ), and the standard error of the predicted mean is estimated from

    Var[μ̂] ≈ h(μ̂)² x′ Var[β̂] x.   [6.36]
A special case of this linear hypothesis is a test of H₀: β_j = 0, where β_j is the jth element of β. The standard approach of dividing the estimate of β_j by its estimated standard error is useful for generalized linear models, too. The statistic

    W = β̂_j² / Var[β̂_j] = ( β̂_j / ese(β̂_j) )²   [6.37]

has an asymptotic Chi-squared distribution with one degree of freedom. [6.37] is a special case of a Wald test statistic. More generally, to test H₀: Aβ = d, compare the test statistic

    W = (Aβ̂ − d)′ { A(F̂′V̂⁻¹F̂)⁻¹A′ }⁻¹ (Aβ̂ − d)   [6.38]

against cutoffs from a Chi-squared distribution with q degrees of freedom (where q is the rank of the matrix A). Such tests are simple to carry out because they only require fitting a single model. The contrast statement in proc genmod of The SAS® System implements such linear hypotheses (with d = 0) but does not produce Wald tests by default. Instead it calculates a likelihood ratio test statistic, which is computationally more involved but also has better statistical properties (see §1.3). Assume you fit a generalized linear model and obtain the parameter estimates β̂_f from the IRLS algorithm. The subscript f denotes the full model. A reduced model is obtained by invoking the constraint Aβ = d. Call the estimates obtained under this constraint β̂_r. If ℓ(μ̂_f, φ; y) is the log-likelihood attained in the full model and ℓ(μ̂_r, φ; y) is the log-likelihood in the reduced model, twice their difference,

    Λ = 2{ ℓ(μ̂_f, φ; y) − ℓ(μ̂_r, φ; y) },   [6.39]

is the likelihood ratio test statistic.
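For example, a single-degree-of-freedom hypothesis H₀: τ₁ − τ₂ = 0 in the germination model of §6.4.4 could be tested both ways (a sketch; by default the contrast statement of proc genmod reports the likelihood ratio statistic, while the wald option requests the Wald form):

proc genmod data=germrate;
class temp;
model germ/trials = temp / dist=binomial link=logit;
contrast 'T1 vs T2, LR'   temp 1 -1 0 0;
contrast 'T1 vs T2, Wald' temp 1 -1 0 0 / wald;
run;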
In a two-group comparison, for example, the linear predictor contains a dummy regressor

    x_i = 1 if observation i is from group 1,  x_i = 0 if observation i is from group 2.

The ratio π₁/π₂ is a nonlinear function of β₀ and β₁, but the hypothesis H₀: π₁/π₂ = 1 is equivalent to the linear hypothesis H₀: β₁ = 0. In other cases it may not be possible to find an equivalent linear hypothesis. In the logistic dose-response model

    logit(π) = β₀ + β₁x,

for example, the dosage at which the kill rate is one-half, LD₅₀ = −β₀/β₁, is such a nonlinear function of the parameters. To test H₀: LD₅₀ = x₀, one can estimate −β̂₀/β̂₁
and obtain its estimated standard error. Calculate an approximate 95% confidence interval for the LD₅₀, relying on the asymptotic Gaussian distribution of β̂, as

    −β̂₀/β̂₁ ± z_{0.025} · ese(−β̂₀/β̂₁).

If the confidence interval does not cover the concentration x₀, reject the hypothesis that LD₅₀ = x₀; otherwise fail to reject. A slightly conservative approach is to replace the standard Gaussian cutoff with a cutoff from a t distribution (which is what proc nlmixed does). The key is to derive a good estimate of the standard error of the nonlinear function of the parameters. For general nonlinear functions of the parameters we prefer approximate standard errors calculated from Taylor series expansions. This method is very general and typically produces good approximations. The estimate statement of the nlmixed procedure in The SAS® System implements the calculation of standard errors by a first-order Taylor series for nonlinear functions of the parameters. For certain functions, such as the LD₅₀ above, exact formulas have been developed. Finney (1978, pp. 80-82) gives formulas for fiducial intervals for the LD₅₀ based on a theorem by Fieller (1940) and applies the result to test the identity of equipotent dosages for two assay formulations. Fiducial intervals are akin to confidence intervals; the difference between the two approaches is largely philosophical. The interpretation of confidence limits appeals to conceptual, repeated sampling such that the repeatedly calculated intervals include the true parameter value with a specified frequency. Fiducial limits are values of the parameter that would produce an observed statistic such as L̂D₅₀ with a given probability. See Schwertman (1996) and Wang (2000) for further details on the comparison between fiducial and frequentist inference. In §A6.8.5 Fieller's derivation is examined and compared to the expression for the standard error of a ratio of two random variables derived from a first-order Taylor series expansion. As it turns out, the Taylor series method results in a very good approximation provided that the slope β₁ in the dose-response model is considerably different from zero. For the approximation to be satisfactory, the standard t statistic for testing H₀: β₁ = 0,

    t_obs = β̂₁ / ese(β̂₁),

should be large relative to the cutoff t_{ν, α/2}. Here, ν are the degrees of freedom associated with the model deviance and α denotes the significance level. For a 5% significance level this translates into a t_obs of about 9 or more in absolute value (Finney 1978, p. 82). It should be noted, however, that the fiducial limits of Fieller (1940) are derived under the assumption of Gaussianity and unbiasedness of the estimators.
where n^(g) is the size of the grouped data set and h(μ̂_i) is the variance function evaluated at the estimated mean (see §6.2.1 for the definition of the variance function). X² thus takes the form of a weighted residual sum of squares. The deviance of a generalized linear model is derived from the likelihood principle. It is proportional to twice the difference between the maximized log-likelihood evaluated at the estimated means μ̂_i and the largest achievable log-likelihood, obtained by setting μ̂_i = y_i. Two versions of the deviance are distinguished, depending on whether the distribution of the response involves a scale parameter φ or not. Recall from §6.3 the weighted exponential family density

    f(y_i) = exp{ (y_i θ_i − b(θ_i))/(φω_i) + c(y_i, φ) }.

If θ̂_i = θ(μ̂_i) is the estimate of the natural parameter in the model under consideration and θ̃_i = θ(y_i) is the canonical link evaluated at the observations, the scaled deviance is defined as

    D*(y; μ̂) = 2 ∑_{i=1}^{n^(g)} ℓ(θ̃_i, φ; y_i) − 2 ∑_{i=1}^{n^(g)} ℓ(θ̂_i, φ; y_i)
             = (2/φ) ∑_{i=1}^{n^(g)} { y_i(θ̃_i − θ̂_i) − b(θ̃_i) + b(θ̂_i) }
             = 2{ ℓ(y, φ; y) − ℓ(μ̂, φ; y) }.   [6.41]

When D*(y; μ̂) is multiplied by the scale parameter φ, we simply refer to the deviance D(y; μ̂) = φ D*(y; μ̂) = 2φ{ ℓ(y, φ; y) − ℓ(μ̂, φ; y) }. In [6.41], ℓ(y, φ; y) refers to the log-likelihood evaluated at μ = y, and ℓ(μ̂, φ; y) to the log-likelihood obtained from fitting a particular model. If the fitted model is saturated, i.e., fits the data perfectly, the (scaled) deviance and X² statistics are identically zero.
The utility of X² and D*(y; μ̂) lies in the fact that, under certain conditions, both have a well-known asymptotic distribution. In particular,

    X²/φ →d χ²_{n^(g) − p},   D*(y; μ̂) →d χ²_{n^(g) − p},   [6.42]

where p is the number of estimated model parameters (the rank of the model matrix X). Table 6.10 shows deviance functions for various distributions in the exponential family. The deviance in the Gaussian case is simply the residual sum of squares. The deviance-based scale estimate φ̂ = D(y; μ̂)/(n^(g) − p) in this case is the customary residual mean square error. Furthermore, the deviance and Pearson X² statistics are identical then and their scaled versions have exact (rather than approximate) Chi-squared distributions.
Table 6.10. Deviances for some exponential family distributions (in the Binomial case n_i denotes the Binomial sample size and y_i the number of successes)

Distribution         D(y; μ̂)
Bernoulli            −2 ∑_{i=1}^n [ y_i ln{ μ̂_i/(1−μ̂_i) } + ln{ 1−μ̂_i } ]
Binomial             2 ∑_{i=1}^n [ y_i ln( y_i/μ̂_i ) + (n_i − y_i) ln{ (n_i − y_i)/(n_i − μ̂_i) } ]
Negative Binomial    2 ∑_{i=1}^n [ y_i ln( y_i/μ̂_i ) − (y_i + 1/k) ln{ (y_i + 1/k)/(μ̂_i + 1/k) } ]
Poisson              2 ∑_{i=1}^n [ y_i ln( y_i/μ̂_i ) − (y_i − μ̂_i) ]
Gaussian             ∑_{i=1}^n (y_i − μ̂_i)²
Gamma                2 ∑_{i=1}^n [ −ln( y_i/μ̂_i ) + (y_i − μ̂_i)/μ̂_i ]
Inverse Gaussian     ∑_{i=1}^n (y_i − μ̂_i)² / (μ̂_i² y_i)
As the agreement between the data (y_i) and the model fit (μ̂_i) improves, X² decreases in value. On the contrary, a model not fitting the data well will result in a large value of X² (and a large value of D(y; μ̂)). If the conditions for the asymptotic result hold, one can calculate the p-value for H₀: the model fits the data as Pr(χ²_{n^(g) − p} ≥ X²). If the p-value is sufficiently large, the model is acceptable as a description of the data-generating mechanism. Before one can rely on this goodness-of-fit test, the conditions under which the asymptotic result holds must be met and understood (McCullagh and Nelder 1989, p. 118). The first requirement for the result to hold is independence of the observations. If overdispersion arises from autocorrelation or randomly varying parameters, both of which induce correlations, X² and D(y; μ̂) do not have asymptotic Chi-squared distributions. More importantly, it is assumed that data are grouped, the number of groups (n^(g)) remains fixed, and the sample size in each group tends to infinity, thereby driving the within-group variance to zero. If these conditions are not met, large values of X² or D(y; μ̂) do not necessarily indicate poor fit and should be interpreted with caution. Therefore, if data are ungrouped (group size is 1, n^(g) = n), one should not rely on X² or D(y; μ̂) as goodness-of-fit measures.

The asymptotic distributions of X²/φ and D(y; μ̂)/φ suggest a simple method to estimate the extra scale parameter φ in a generalized linear model. Equating X²/φ with its asymptotic expectation,

    (1/φ) E[X²] ≈ n^(g) − p,

leads to the estimate φ̂ = X²/(n^(g) − p).
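In proc genmod this moment-based estimate is requested with the pscale (Pearson) or dscale (deviance) option of the model statement; for instance, for the Hessian fly data of §6.7.2 (a sketch):

proc genmod data=HessFly;
class block entry;
model z/n = block entry / dist=binomial link=logit dscale;
run;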
The difference of scaled deviances between a reduced model M_r and a full model M_f, [6.43], has an asymptotic Chi-squared distribution with q degrees of freedom. If [6.43] is significantly large, reject model M_r in favor of M_f. Since D*(y; μ̂) = 2{ ℓ(y, φ; y) − ℓ(μ̂, φ; y) } is twice the difference between the maximal and the maximized log-likelihood, [6.43] can be rewritten as

    D*(y; μ̂_r) − D*(y; μ̂_f) = 2{ ℓ(μ̂_f, φ; y) − ℓ(μ̂_r, φ; y) } = Λ,   [6.44]

and it is thereby established that this procedure is also the likelihood ratio test (§A6.8.4) for testing M_f versus M_r.
To demonstrate the test of hypotheses through deviance partitioning, we use Example 6.2 (p. 306, data appear in Table 6.3). Recall that the experiment involves a 4 × 4 factorial treatment structure with factors Temperature (T₁, T₂, T₃, T₄) and Concentration of a chemical (0, 0.1, 1.0, 10) and their effect on the germination probability of seeds. For each of the 16 treatment combinations four dishes with 50 seeds each are prepared and the number of germinating seeds is counted in each dish. From an experimental design standpoint this is a completely randomized design with a 4 × 4 treatment structure. Hence, we are interested in determining the significance of the Temperature main effect, the main effect of the chemical Concentration, and the Temperature × Concentration interaction. If the seeds within a dish and between dishes germinate independently of each other, the germination count in each dish can be modeled as a Binomial(50, π_ij) random variable, where π_ij denotes the germination probability if temperature i and concentration j are applied. Table 6.11 lists the models successively fit to the data.
Applying a logit link, model ① is fit to the data with the genmod procedure statements:
proc genmod data=germrate;
model germ/trials = /link=logit dist=binomial;
run;
Model Information
Data Set WORK.GERMRATE
Distribution Binomial
Link Function Logit
Response Variable (Events) germ
Response Variable (Trials) trials
Observations Used 64
Number Of Events 1171
Number Of Trials 3200
The deviance for this model is 1193.80 on 63 degrees of freedom (Output 6.1). The ratio of deviance to degrees of freedom is almost 19 times larger than 1, and this clearly indicates that the model does not account for the variability in the data. This could be due to the seed counts being more dispersed than Binomial random variables and/or the absence of important effects in the model. The intercept estimate β̂₀ = −0.5497 translates into an estimated success probability of

    π̂ = 1/(1 + exp{0.5497}) = 0.366.

This is the overall proportion of germinating seeds. Tallying all successes (= germinations) in Table 6.3, one obtains 1,171 germinations on 64 dishes containing 50 seeds each and

    π̂ = 1,171/(64 × 50) = 0.366.

Notice that the degrees of freedom for the model denote the number of groups, n^(g) = 64, minus the number of estimated parameters.
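This tally can be verified with a short data step (a sketch; data set and variable names are ours):

data check;
pihat = 1171/3200;               /* overall germination proportion */
b0    = log(pihat/(1 - pihat));  /* its logit: -0.5497, the intercept of model 1 */
run;
proc print data=check; run;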
Models ② through ⑤ are fit similarly with the SAS® statements (output not shown)

/* model ② */
proc genmod data=germrate;
class temp;
model germ/trials = temp /link=logit dist=binomial;
run;
/* model ③ */
proc genmod data=germrate;
class conc;
model germ/trials = conc /link=logit dist=binomial;
run;
/* model ④ */
proc genmod data=germrate;
class temp conc;
model germ/trials = temp conc /link=logit dist=binomial;
run;
/* model ⑤ */
proc genmod data=germrate;
class temp conc;
model germ/trials = temp conc temp*conc /link=logit dist=binomial;
run;
For each of the five models, the deviance and X² statistics are very close to one another. To test hypotheses about the various treatment factors, differences of deviances between two models are compared to cutoffs from Chi-squared distributions with degrees of freedom equal to the difference in DF for the models. For example, comparing the deviances of models ① and ② yields 1,193.8 − 430.11 = 763.69 on 3 degrees of freedom. The p-value for this test can be calculated in SAS® with

data pvalue; p = 1-probchi(763.69,3); run; proc print; run;

The result is a p-value near zero. But what does this test mean? We are comparing a model with Temperature effects only (②) against a model without any effects (①), that is, in the absence of any concentration effects and/or interactions. We tested the hypothesis that a model with Temperature effects explains the variation in the data as well as a model containing only an intercept. Similarly, the question of a significant Concentration effect in the absence of temperature effects (and the interaction) is addressed by comparing the deviance difference 1,193.8 − 980.09 = 213.71 against a Chi-squared distribution with 3 degrees of freedom (p < 0.0001). The significance of the Concentration effect in a model already containing a Temperature main effect is assessed by the deviance difference 430.11 − 148.10 = 282.01. This value differs from the deviance reduction of 213.7 which was obtained by adding Concentration effects to the null model. Because of the nonlinearity of the logistic link function the effects in the generalized linear model are not orthogonal. The significance of a particular effect depends on which other effects are present in the model, a feature of sequential tests (§4.3.3) under nonorthogonality. Although either Chi-squared statistic would be significant in this example, it is easy to see that it does make a difference in which order the effects are tested. The most meaningful test that can be derived from Table 6.12 is that of the Temperature × Concentration interaction, obtained by comparing deviances of models ④ and ⑤. Here, the full model includes all possible effects (two main effects and the interaction) and the reduced model (④) excludes only the interaction. From this comparison, with a deviance difference of 148.10 − 55.64 = 92.46 on 57 − 48 = 9 degrees of freedom, a p-value of < 0.0001 is obtained, sufficient to declare a significant Temperature × Concentration interaction.
An approach to deviance testing that does not depend on the order in which terms enter
the model is to use partial deviances, where the contribution of an effect is evaluated as the
deviance decrement incurred by adding the effect to a model containing all other effects. In
proc genmod of The SAS® System this is accomplished by adding the type3 option to the
model statement. The type1 option of the model statement will conduct a sequential test of
model effects. The following statements request sequential (type1) and partial (type3) likeli-
hood ratio tests in the full model. The ods statement preceding the proc genmod code excludes
the lengthy table of parameter estimates from the output.
ods exclude parameterestimates;
proc genmod data=germrate;
class temp conc;
model germ/trials = temp conc temp*conc /link=logit
dist=binomial type1 type3;
run;
The sequential (Type1) and partial (Type3) deviance decrements are not identical (Output 6.2). Adding Temperature effects to a model containing no other effects yields a deviance reduction of 763.69 (as also calculated from the data in Table 6.12). Adding Temperature effects to a model containing Concentration effects (and the interaction) yields a deviance reduction of 804.24. The partial and sequential tests for the interaction are the same, because this term entered the model last.
Wald tests for the partial or sequential hypotheses instead of likelihood ratio tests are requested with the wald option of the model statement. In this case both the Chi-squared statistics and the p-values will change because the Wald test statistics do not correspond to a difference in deviances between a full and a reduced model. We prefer likelihood ratio over Wald tests unless data sets are so large that obtaining the computationally more involved likelihood ratio test statistic is prohibitive.
Output 6.2.
                      The GENMOD Procedure

                   Class Level Information
            Class    Levels    Values
            temp          4    1 2 3 4
            conc          4    0 0.1 1 10

            Algorithm converged.

              LR Statistics For Type 1 Analysis
   Source        Deviance     DF    Chi-Square    Pr > ChiSq
   Intercept    1193.8014
   temp          430.1139      3        763.69        <.0001
   conc          148.1055      3        282.01        <.0001
   temp*conc      55.6412      9         92.46        <.0001
where SST_m = ∑_{i=1}^n (y_i − ȳ)² is the total sum of squares corrected for the mean and SSR = ∑_{i=1}^n (y_i − ŷ_i)² is the residual (error) sum of squares. Even in the absence of an intercept in the model, SST_m is the correct denominator since it is the sample mean ȳ that would be used to predict y if the response were unrelated to the covariates in the model. The ratio SSR/SST_m can be interpreted as the proportion of variation unexplained by the model. Generalized linear models are also nonlinear models (unless the link function is the identity link) and a goodness-of-fit measure akin to [6.45] seems reasonable to measure model-data agreement. Instead of sums of squares, the measure should rest on deviances, however. Since for Gaussian data with identity link the deviance is an error sum of squares (Table 6.10), the measure should in this case also reduce to the standard R² measure in linear models. It thus seems natural to build a goodness-of-fit measure that involves the deviance of the model that is fit and compares it to the deviance of a null model not containing any explanatory variables. Since differences of scaled deviances are also differences in log likelihoods between full and reduced models we can use

    ℓ(μ̂_f, φ; y) − ℓ(μ̂₀, φ; y),

where ℓ(μ̂_f, φ; y) is the log likelihood in the fitted model and ℓ(μ̂₀, φ; y) is the log likelihood in the model containing only an intercept (the null model). For binary response models a generalized R² measure was suggested by Maddala (1983). Nagelkerke (1991) points out that it was also proposed for any model fit by the maximum likelihood principle by Cox and Snell (1989, pp. 208-209) and Magee (1990), apparently independently:
    ln(1 − R²) = −(2/n){ ℓ(μ̂_f, φ; y) − ℓ(μ̂₀, φ; y) }

    ⟺ R² = 1 − exp{ −(2/n)( ℓ(μ̂_f, φ; y) − ℓ(μ̂₀, φ; y) ) }.   [6.46]

Nagelkerke (1991) discusses that this generalized R² measure has several appealing properties. If the covariates in the fitted model have no explanatory power, the log likelihood of the fitted model, ℓ(μ̂_f, φ; y), will be close to the likelihood of the null model and R² approaches zero. R² has a direct interpretation in terms of explained variation in the sense that it partitions the contributions of covariates in nested models. But unlike the R² measure in linear models, [6.46] is not bounded by 1 from above. Its maximum value is

    max{R²} = 1 − exp{ (2/n) ℓ(μ̂₀, φ; y) }.   [6.47]

Nagelkerke (1991) thus recommends scaling R² and using the measure

    R̄² = R² / max{R²}   [6.48]

instead, which is bounded between 0 and 1 and referred to as the rescaled generalized R². The logistic procedure of The SAS® System calculates both generalized R² measures if requested by the rsquare option of the model statement.
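A short data step (a sketch using the log likelihoods reported for the insecticide example of §6.7.1) shows how [6.46] through [6.48] are evaluated:

data rsq;
n        = 140;                     /* total number of observations */
ll_full  = -50.0133;                /* log likelihood, fitted model */
ll_null  = -96.8119;                /* log likelihood, intercept-only model */
rsquare  = 1 - exp(-(2/n)*(ll_full - ll_null));  /* [6.46] = 0.4875 */
maxrsq   = 1 - exp((2/n)*ll_null);               /* [6.47] = 0.749  */
rescaled = rsquare/maxrsq;                       /* [6.48] = 0.6508 */
run;
proc print data=rsq; run;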
• The POM can be thought of as a series of logistic curves and in the two-
category case reduces to logistic regression.
• The POM can be fit to data with the genmod procedure of The SAS®
System that enables statistical inference very much akin to what
practitioners expect from an ANOVA-based package.
Ordinal responses arise frequently in the study of soil and plant data. An ordinal (or ordered)
response is a categorical variable whose values are related in a greater/lesser sense. The
assessment of turf quality in nine categories from best to worst results in an ordered response
variable as does the grouping of annual salaries in income categories. The difference between
the two types of ordered responses is that salary categories stem from categorizing an under-
lying (latent) continuous variable. Anderson (1984) terms this a grouped ordering. The
assignment to a category can be made without error and different interpreters will assign
salaries to the same income categories provided they use the same grouping. Assessed
orderings, on the contrary, involve a more complex process of determining the outcome of an
observation. A turf scientist rating the quality of a piece of turf combines information about
the time of day, the brightness of the sun, the expectation for the particular grass species, past
experience, and the disease and management history of the experimental area. The final
assessment of turf quality is a complex aggregate and compilation of these various factors. As
a result, there will be variability in the ratings among different interpreters that complicates
the analysis of such data. The development of clear-cut rules for category assignment helps to reduce the interrater variability in assessed orderings, but some room for interpretation of these rules invariably remains.
In this section we are concerned with fitting statistical models to ordinal responses in general and side-step the issue of rater agreement. Log-linear models for contingency tables can be used to describe and infer the degree to which interpreters of the same material rate it consistently (§6.7.6). The proportional odds model (POM) considered here is a member of the class of cumulative link models (McCullagh 1980, McCullagh 1984, McCullagh and Nelder 1989). It is not a bona fide generalized linear model but is very closely related to logistic regression models, which justifies its discussion in this chapter (the correspondence of these models to GLMs can be made more precise by using composite link functions; see, e.g., Thompson and Baker 1981). For only two ordered categories the POM reduces to a standard GLM for Bernoulli or Binomial outcomes, because the Binomial distribution is a special case of the multinomial distribution, which counts the number of outcomes out of N independent trials that fall into each of J categories (the Binomial corresponds to J = 2). Like other generalized linear models,
cumulative link models apply a link function to map the parameter of interest onto a scale
where effects are linear. Unlike the models for Bernoulli or Binomial data, the link function is
not applied to the probability that the response takes on a certain value, but to the cumulative
probability that the response occurs in a particular category or below. It is this ingenious
construction from which essential simplifications arise. Our focus on the proportional odds
model is not only motivated by its elegant formulation, convenient mathematics, and straight-
forward interpretation. It can furthermore be easily fitted with the logistic and genmod
procedures of The SAS® System and is readily available to those familiar with fitting
generalized linear models.
For a given distribution of the latent variable, the placement of the cutoff parameters determines the probabilities to observe the ordinal variable Y. When fitting a cumulative link model to data, these parameters are estimated along with the effects of covariates and experimental factors. Notice that the number of cutoff parameters that need to be estimated is one less than the number of ordered categories.

To motivate an application consider an experiment conducted in a completely randomized design. If τ_i denotes the effect of the ith treatment and k indexes the replications, we put

    X_ik = μ + τ_i + e_ik

as the model for the latent variable X observed for replicate k of treatment i. The probability that Y_ik, the ordinal outcome for replicate k of treatment i, is at most in category j is now determined by the distribution of the experimental errors e_ik as

    Pr(Y_ik ≤ j) = Pr(X_ik ≤ α_j) = Pr(e_ik ≤ α_j − μ − τ_i) = Pr(e_ik ≤ α*_j − τ_i).
[Figure 6.9 displays the density f(x) of the latent variable X ~ G(4, 1) together with the cutoffs α₁ = 2.3, α₂ = 3.7, α₃ = 5.8; e.g., Pr(Y = 2) = Pr(α₁ < X ≤ α₂) = 0.338 and Pr(Y = 3) = Pr(α₂ < X ≤ α₃) = 0.582.]
Figure 6.9. Relationship between latent variable X ~ G(4, 1) and an ordinal outcome Y. The probability to observe a particular ordered value depends on the distribution of the latent variable and the spacing of the cutoff parameters α_j; α₀ ≡ −∞, α₄ ≡ +∞.
The cutoff parameters α_j and the grand mean μ have been combined into a new cutoff α*_j = α_j − μ in the last equation. The probability that the ordinal outcome for replicate k of treatment i is at most in category j is a cumulative probability, denoted γ_ikj.

Choosing a probability distribution for the experimental errors is as easy or difficult as in a standard analysis. The most common choices are to assume that the errors follow a Logistic distribution, Pr(e ≤ t) = 1/(1 + e^{−t}), or a Gaussian distribution. The Logistic error model leads to a model with a logit link function; the Gaussian model results in a probit link function. With a logit link function we are led to

    logit(Pr(Y_ik ≤ j)) = logit(γ_ikj) = ln[ Pr(Y_ik ≤ j) / Pr(Y_ik > j) ] = α*_j − τ_i.   [6.49]
The term cumulative link model is now apparent, since the link is applied to the cumulative probabilities γ_ikj. This model was first described by McCullagh (1980, 1984) and termed the proportional odds model (see also McCullagh and Nelder 1989, §5.2.2). The name stems from the fact that γ_ikj is a measure of cumulative odds (Agresti 1990, p. 322) and hence the logarithm of the cumulative odds ratio for two treatments is (proportional to) the treatment difference

    logit(γ_ikj) − logit(γ_i′kj) = τ_i′ − τ_i.

In a regression example where logit(γ_ij) = α_j + βx_i, the cutoff parameters serve as separate intercepts on the logit scale. The slope β measures the change in the cumulative logit if the regressor x changes by one unit. The change in cumulative logits between x_i and x_i′ is

    logit(γ_i′j) − logit(γ_ij) = β(x_i′ − x_i)

and proportional to the difference in the regressors. Notice that this effect of the regressors or treatments on the logit scale does not depend on j; it is the same for all categories.
By inverting the logit transform, the probability to observe at most category j for replicate k of treatment i is easily calculated as

    Pr(Y_ik ≤ j) = 1 / (1 + exp{ −α*_j + τ_i }),

and the category probabilities follow by differencing, π_ikj = γ_ikj − γ_ik(j−1). Here, π_ikj is the probability that an outcome for replicate k of treatment i will fall into category j. Notice that the probability to fall into the last category (J) is obtained by subtracting the cumulative probability to fall into the previous category from 1. In fitting the proportional odds model to data this last probability is obtained automatically, which is the reason why only J − 1 cutoff parameters are needed to model an ordered response with J categories.
The proportional odds model has several important features. It allows one to model ordinal data independently of the scoring system in use. Whether categories are labeled as a, b, c, ... or 1, 2, 3, ... or 1, 20, 34, ... or mild, medium, heavy, ..., the analysis will be the same. Users are typically more interested in the probabilities that an outcome is in a certain category rather than cumulative probabilities. The former are easily obtained from the cumulative probabilities by taking differences. The proportional odds model is further invariant under category amalgamations. If the model applies to an ordered outcome with J categories, it also applies if neighboring categories are combined into a new response with J* < J categories (McCullagh 1980, Greenwood and Farewell 1988). This is an important property since ratings may be collected on a scale finer than that eventually used in the analysis and parameter interpretation should not depend on the number of categories. The development of the proportional odds model was motivated by the existence of a latent variable. It is not a requirement for the validity of this model that such a latent variable exists, and it can be used for grouped and assessed orderings alike (McCullagh and Nelder 1989, p. 154; Schabenberger 1995).
Other statistical models for ordinal data have been developed. Fienberg's continuation ratio model (Fienberg 1980) models logits of the conditional probabilities to observe category j, given that the observation was at least in category j, instead of cumulative probabilities (see also Cox 1988, Engel 1988). Continuation ratio models are based on factoring marginal probabilities into a series of conditional probabilities, and standard GLMs for binomial outcomes (Nelder and Wedderburn 1972, McCullagh and Nelder 1989) can be applied to the terms in the factorization separately. A disadvantage is that the factorization is not unique. Agresti (1990, p. 318) discusses adjacent category logits where probabilities are modeled relative to a baseline category.
relative to a baseline category. Läärä and Matthews (1985) establish an equivalence between
continuation ratio and cumulative models if a complementary log-log instead of a logit
transform is applied. Studying a biomedical example, Greenwood and Farewell (1988) found
that the proportional odds and the continuation ratio models led to the same conclusions
regarding the significance of effects.
Cumulative link models are fit by maximum likelihood and estimates are derived by
iteratively reweighted least squares as for generalized linear models. The test of hypotheses
proceeds along similar lines as discussed in §6.4.2.
Here, t_ik is the time point at which replicate k of treatment i was observed (i = 1, ..., 4; k = 1, ..., 4; j = 1, 2).

Table 6.13. Observed category frequencies for four treatments at four dates (shown are the total counts across four replicates at each occasion)

                                 Treatment
Category          A (i=1)    B (i=2)    C (i=3)    D (i=4)    Sum
poor (j=1)        4,2,4,4    4,3,4,4    0,0,0,0    1,0,0,0    9,5,8,8
average (j=2)     0,2,0,0    0,1,0,0    1,0,4,4    2,2,4,4    3,5,8,8
good              0,0,0,0    0,0,0,0    3,4,0,0    1,2,0,0    4,6,0,0
... and so forth ...
;;
run;
The order=data option of proc logistic ensures that the categories of the response variable rating are internally ordered as they appear in the data set, that is, poor before average before good. If the option were omitted, proc logistic would sort the levels alphabetically, implying the category order average, good, poor, which is not the correct ordination. The param=glm option of the class statement asks proc logistic to code the classification variable in the same way as proc glm. This means that a separate estimate for the last level of the treatment variable tx is not provided, which amounts to the constraint τ̂₄ = 0. τ₄ will be absorbed into the cutoff parameters, and the estimates for the other treatment effects reported by proc logistic represent differences with τ₄.
One should always study the Response Profile table on the procedure output to make
sure that the category ordering used by proc logistic (and influenced by the order= option)
agrees with the intended ordering. The Class Level Information Table shows the levels
of all variables listed in the class statement as well as their coding in the design matrix
(Output 6.3).
The Score Test for the Proportional Odds Assumption is a test of the assumption that changes in cumulative logits are proportional to changes in the explanatory variables. Two models are compared to calculate this test: a full model in which the slopes and gradients vary by category and a reduced model in which the slopes are the same across categories. Rather than actually fitting the two models, proc logistic performs a score test that requires only the reduced model to be fit (see §A6.8.4). The reduced model in this case is the proportional odds model, and rejecting the test leads to the conclusion that it is not appropriate for these data. In this example, the score test cannot be rejected; the p-value of 0.1538 is sufficiently large not to call into doubt the proportionality assumption (Output 6.3).
Model Information
Data Set WORK.ORDEXAMPLE
Response Variable rating
Number of Response Levels 3
Number of Observations 64
Link Function Logit
Optimization Technique Fisher's scoring
Response Profile
Ordered Total
Value rating Frequency
1 poor 30
2 average 24
3 good 10
The rsquare option of the model statement in proc logistic requests the two generalized R² measures discussed in §6.4.5. Denoted as R-Square is the generalized measure of Cox and Snell (1989, pp. 208-209) and Magee (1990), and Max-rescaled R-Square denotes the generalized measure by Nagelkerke (1991) that ranges between 0 and 1. The log likelihoods for the null and fitted models are ℓ(μ̂₀, φ; y) = −129.667/2 = −64.8335 and ℓ(μ̂_f, φ; y) = −28.562, respectively, so that

    R² = 1 − exp{ −(2/64)( −28.562 + 64.8335 ) } = 0.6781.
The Analysis of Maximum Likelihood Estimates table shows the parameter estimates and their standard errors as well as Chi-square tests testing each parameter against zero. Notice that the estimate for τ₄ is shown as 0 with no standard error since this effect is absorbed into the cutoffs. The cutoff parameters labeled Intercept and Intercept2 thus are estimates of

    α₁ + τ₄ and α₂ + τ₄.

For an experimental unit receiving treatment 2, the linear predictor for the cumulative probability of at most an average rating (j = 2) at the first time point is

    (α̂₂ + τ̂₄) + δ̂₂ + β̂(1) = −0.4865 + 6.519 + 0.87731 = 6.909,

where δ̂₂ = 6.519 is the reported estimate for treatment 2 and β̂ = 0.87731 the time slope, so that

    Pr(Y ≤ 2 at time 1) = 1/(1 + exp{−6.909}) = 0.999.

Similarly, for the probability of at most a poor rating at time t = 1,

    (α̂₁ + τ̂₄) + δ̂₂ + β̂(1) = −5.615 + 6.519 + 0.87731 = 1.7813

and

    Pr(Y ≤ 1 at time 1) = 1/(1 + exp{−1.7813}) = 0.856.

For an experimental unit receiving treatment 2, there is an 85.6% chance to receive a poor rating and only a 99.9% − 85.6% = 14.3% chance to receive an average rating at the first time point (Table 6.14). Each block of three numbers in Table 6.14 is an estimate of the multinomial distribution for a given treatment at a particular time point. A graph of the linear predictors shows the linearity of the model in treatment effects and time on the logit scale and the proportionality assumption, which results in parallel lines on that scale (Figure 6.10). Inverting the logit transform to calculate cumulative probabilities from the linear predictors shows the nonlinear dependence of probabilities on treatments and the time covariate (Figure 6.11).
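The entries of Table 6.14 for treatment 2 at the first time point, for example, can be reproduced in a short data step (a sketch; the parameter values are the converged estimates from Output 6.3):

data pomprob;
a1 = -5.6150;  a2 = -0.4865;   /* cutoff estimates (Intercept, Intercept2) */
d2 =  6.5190;                  /* reported effect of treatment 2 */
b  =  0.87731; t = 1;          /* time slope and time point */
g1 = 1/(1 + exp(-(a1 + d2 + b*t)));   /* Pr(Y <= poor)    = 0.856 */
g2 = 1/(1 + exp(-(a2 + d2 + b*t)));   /* Pr(Y <= average) = 0.999 */
p_poor = g1;  p_avg = g2 - g1;  p_good = 1 - g2;
run;
proc print data=pomprob; run;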
To compare the treatments at a given time point, we formulate linear combinations of the cumulative logits, which leads to linear combinations of the parameters. For example, comparing treatments A and B at time t = 1 the linear combination is

    logit{γ_1kj} − logit{γ_2kj} = (α_j + τ₁ + β·1) − (α_j + τ₂ + β·1) = τ₁ − τ₂.

The cutoff parameters have no effect on this comparison; the treatment difference has the same magnitude, regardless of the category (Figure 6.10). In terms of the quantities proc logistic estimates, the contrast is identical to δ₁ − δ₂. The variance-covariance matrix (obtained with the covb option of the model statement in proc logistic, output not given) is shown in Table 6.15. For example, the standard error for α̂₁ + τ̂₄ is √2.389 = 1.545, as appears on the output in the Analysis of Maximum Likelihood Estimates table.
Figure 6.10. Linear predictors for treatments A, B, and C. The vertical difference between the lines for categories j = 1 and j = 2 is constant for all time points and the same (5.6150 − 0.4865 = 5.1285) for all treatments.
Figure 6.11. Predicted cumulative probabilities for treatments A, B, and C. Because the line for A, j = 2 lies completely above the line for A, j = 1 in Figure 6.10, the cumulative probability to observe a response in at most category 2 is greater than the cumulative probability to observe a response in at most category 1. Since the cumulative probabilities are ordered, the category probabilities are guaranteed to be non-negative.
to the proc logistic code above. The test for the treatment main effect can be performed by using a set of orthogonal contrasts among the treatment effects. The Wald statistic for the treatment main effect (H₀: Lβ = 0) is then compared against a Chi-squared cutoff. This test is shown on the proc logistic output in the Type III Analysis of Effects table. The same analysis can be obtained in proc genmod. The statements, including all pairwise treatment comparisons, follow.
More complicated proportional odds models can be fit easily with these procedures. On
occasion one may encounter a warning message regarding the separability of data points and
a possibly questionable model fit. This can occur, for example, when one treatment's
responses are all in the same category, since then there is no variability among the replicates.
This phenomenon is more likely for small data sets and applications where many classifica-
tion variables are involved in particular interactions. There are several possibilities to correct
this: amalgamate adjacent categories to reduce the number of categories; fit main effects and
low-order interactions only and exclude high-order interactions; include effects as continuous
covariates rather than as classification variables when they relate to some underlying
continuous metric such as rates of application or times of measurements.
6.6 Overdispersion
Box 6.6 Overdispersion
If the variability of a set of data exceeds the variability expected under some reference model
we call it overdispersed (relative to that reference). Counts, for example, may exhibit more
variability than is permissible under a Binomial or Poisson probability model. Overdispersion
is a potential problem in statistical models where the first two moments of the response distri-
bution are linked and means and variances are functionally dependent. In Table 6.2 the scale
parameter φ is not present for the discrete distributions and data modeled under these distri-
butions are potentially overdispersed. McCullagh and Nelder (1989, p. 124, p. 193) suggest
that overdispersion may be the norm in practice, rather than the exception. In part this is due
to the fact that users resort to a small number of probability distributions to model their data.
Almost automatically one is led to the Binomial distribution for count variables with a natural
denominator and to the Poisson distribution for counts without a natural denominator. One
remedy of the overdispersion problem lies in choosing a proper distribution that permits more
variability than these standard models such as the Beta-Binomial in place of the Binomial
model and the Negative Binomial in place of the Poisson model. Overdispersion can also be
caused by an improper choice of covariates and effects to model the data. This effect was
obvious for the seed germination data modeled in §6.4.4. When temperature and/or
concentration effects were omitted the ratio of the deviance and its degrees of freedom
exceeded the benchmark value of one considerably (Table 6.12). Such cases of overdisper-
sion must be addressed by altering the set of effects and covariates, not by postulating a
different probability distribution for the data. In what follows we assume that the mean of the
responses has been modeled correctly, but that the data nevertheless exhibit variability in
excess of our expectation under a certain reference distribution.
Overdispersion is a problem foremost because it affects the estimated precision of the parameter estimates. In §A6.8.2 it is shown that the scale parameter φ is of no consequence in estimating β and can be dropped from the estimating equations. The (asymptotic) variance-covariance matrix of the maximum likelihood estimates is given by (F′V⁻¹F)⁻¹, where V is a diagonal matrix containing the variances Var[Y_i] = h(μ_i)φ. Extracting the scale parameter φ we can simplify:

    Var[β̂] = φ( F′Diag{1/h(μ_i)}F )⁻¹.

If under a given probability model the scale parameter φ is assumed to be 1 but overdispersion exists (Var[Y_i] > h(μ_i)), the variability of the estimates is larger than what is assumed under the model. The precision of the parameter estimates is overstated, standard error estimates are too small, and as a result test statistics are inflated and p-values are too small. Covariates and effects may be declared significant even when they are not. It is thus important to account for overdispersion present in the data, and numerous approaches have been developed to that end. The following four categories are sufficiently broad to cover many overdispersion mechanisms and remedies.
• Extra scale parameters are added to the variance function of the generalized linear model. For Binomial data one can assume Var[Y] = ωnπ(1 − π) instead of the nominal variability Var[Y] = nπ(1 − π). If ω > 1 the model is overdispersed relative to the Binomial and if ω < 1 it is underdispersed. Underdispersion is far less likely and a far less serious problem in data analysis. For count data one can model overdispersion relative to the Poisson(λ) distribution as Var[Y] = λω, Var[Y] = λ(1 + ω)/ω, or Var[Y] = λ + λ²/ω. These models have some stochastic foundation in certain mixing models (see below and §A6.8.6). Models with a multiplicative overdispersion parameter such as Var[Y] = ωnπ(1 − π) for Binomial and Var[Y] = ωλ for Poisson data can be handled easily with proc genmod of The SAS® System. The overdispersion parameter is then estimated from Pearson or deviance residuals by the method discussed in §6.4.3 (pscale and dscale options of the model statement).
• Positive autocorrelation among observations leads to overdispersion in sums and averages. Let Z_i be an arbitrary random variable with mean μ and variance σ² and assume that the Z_i are equicorrelated, Cov[Z_i, Z_j] = ρσ² (i ≠ j). Assume further that ρ > 0. We are interested in modeling Y = ∑_{i=1}^n Z_i, the sum of the Z_i. The mean and variance of this sum follow from first principles as

    E[Y] = nμ

and

    Var[Y] = nVar[Z_i] + 2 ∑_{i=1}^n ∑_{j>i} Cov[Z_i, Z_j] = nσ² + n(n − 1)σ²ρ
           = nσ²(1 + (n − 1)ρ) > nσ².

If the elements in the sum were uncorrelated, Var[Y] = nσ², but the positive autocorrelation thus leads to an overdispersed sum, relative to the model of stochastic independence.
and possible overdispersion. Adding a random effect e with mean 0 and variance σ² in the exponent turns the linear predictor into a mixed linear predictor. The resulting model can also be reckoned conditionally, E[Y|e] = exp{β₀ + β₁x + e}, e ~ (0, σ²). If we assume that Y|e is Poisson-distributed, the resulting unconditional distribution will be overdispersed relative to a Poisson(exp{β₀ + β₁x}) distribution. Unless the distribution of the random effects is chosen carefully, the marginal distribution may not be in the exponential family and maximum likelihood estimation of the parameters may be difficult. Numerical methods for maximizing the marginal (unconditional) log-likelihood function via linearization, quadrature integral approximation, importance sampling, and other devices exist, however. The nlmixed procedure of The SAS® System is designed to fit such models (§8); a minimal sketch follows.
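A conditionally Poisson model with a Gaussian random intercept could be specified as follows (a sketch; data set and variable names are hypothetical):

proc nlmixed data=counts;
parms b0=1 b1=0 s2u=0.5;                /* starting values */
eta = b0 + b1*x + u;                    /* mixed linear predictor */
lambda = exp(eta);                      /* conditional mean E[Y|u] */
model y ~ poisson(lambda);              /* Y|u ~ Poisson(lambda) */
random u ~ normal(0, s2u) subject=unit; /* random effect, integrated out by quadrature */
run;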
6.7 Applications
The first two applications in this chapter model Binomial outcomes. In §6.7.1 a simple
logistic regression model with a single covariate is fit to model the mortality rate of insect
larvae exposed to an insecticide. Of particular interest is the estimation of the LD₅₀, the insecticide dosage at which the probability that a randomly chosen larva succumbs to the insecticide exposure is 0.5. In §6.7.2 a field experiment with Binomial outcomes is examined.
Sixteen varieties are arranged in a randomized complete block design and the number of
plants infested with the Hessian fly is recorded. Interesting aspects of this experiment are
varying binomial sample sizes among the experimental units which invalidates the variance-
stabilizing arcsine transformation and possible overdispersion. Yield density models that were
examined earlier as nonlinear regression models in §5.8.7 are revisited in §6.7.3. Rather than
relying on inverse or logarithmic transformation we treat the yield responses as Gamma-
distributed random variables and apply generalized linear model techniques. §6.7.4 and
§6.7.5 are dedicated to the analysis of ordinal data. In both cases the treatment structure is a
simple two-way factorial. The analysis in §6.7.5 is further complicated by the fact that experi-
mental units were measured repeatedly over time. The analysis of contingency tables is a
particularly fertile area for the deployment of generalized linear models. A special class of
models, log-linear models for square contingency tables, are discussed in §6.7.6. These
models allow estimation of the agreement or disagreement between interpreters of the same
material. Generalized linear models can be successfully employed when the outcome of
interest is not a mean, but a dispersion parameter, for example a variance. In §6.7.7 we use
Gamma regression to model the variability between deoxynivalenol (vomitoxin) probe
samples from truckloads of wheat kernels as a function of the toxin load. The final applica-
tion (§6.7.8) considers count data and demonstrates that the Poisson distribution is not
necessarily a suitable model for such data. In the presence of overdispersion, the Negative
Binomial distribution is a more reasonable model. We show how to fit models with Negative
Binomial responses with the nlmixed procedure of The SAS® System.
Plots of the logits of the sample proportions against the concentrations and the log₁₀ concentrations are shown in Figure 6.12. The relationship between sample logits and concentrations is clearly not linear. It appears at least quadratic and would suggest a generalized linear model with a quadratic predictor. The quadratic trend does not ensure that the logits are monotonically increasing in x, however, and a more reasonable model posits a linear dependence of the logits on the log₁₀ concentration,

    logit(π) = β₀ + β₁ log₁₀{x}.   [6.51]

Model [6.51] is a classical logistic regression model with a single covariate (log₁₀{x}). The analysis by Mead et al. (1993) uses a probit link, and the results are very similar to those from a logistic analysis because of the similarity of the two link functions.
Figure 6.12. Logit of sample proportions against insecticide concentration and logarithm of concentration. A linear relationship between the logit and the log₁₀ concentration is reasonable.
The key relationship in this experiment is the dependence of the probability π that a larva is killed on the insecticide concentration. Once this relationship is modeled, other quantities of interest can be estimated. In bioassay and dose-response studies one is often interested in estimating dosages that produce a certain response, for example, the dosage lethal to a randomly selected larva with probability 0.5 (the so-called LD₅₀). In model [6.51] we can establish more generally that if x_α denotes the dosage with mortality rate α, 0 < α < 1, then from

    logit(α) = β₀ + β₁ log₁₀{x_α}

it follows that

    log₁₀{x_α} = ( logit(α) − β₀ ) / β₁.   [6.52]

In the case of the LD₅₀, for example, α = 0.5, logit(0.5) = 0, and log₁₀{x_0.5} = −β₀/β₁. For this particular ratio, fiducial intervals were developed by Finney (1978, pp. 80-82) based on work by Fieller (1940) under the assumption that β̂₀ and β̂₁ are Gaussian-distributed. These intervals are also developed in our §A6.8.5. For an estimate of the dosage x_0.5 on the original, rather than the log₁₀, scale these intervals do not directly apply. We prefer obtaining standard errors for the quantities log₁₀{x_α} and x_α based on Taylor series expansions. If the ratio β̂₁/ese(β̂₁) is sufficiently large, the Taylor series based standard errors are very accurate (§6.4.2).
The first step in our analysis is to fit model [6.51] and to determine whether the relation-
ship between mortality probability and insecticide concentration is sufficiently strong. The
data step and proc genmod statements for this logistic regression problem are as follows.
data kills;
input concentration kills;
trials = 20;
logc = log10(concentration);
datalines;
0.375 0
0.75 1
1.5 8
3.0 11
6.0 16
12.0 18
24.0 20
;;
run;
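The model [6.51] itself is then fit with a genmod call of the following form (a sketch; it mirrors the null-model statements shown later in this section):

proc genmod data=kills;
model kills/trials = logc / dist=binomial link=logit;
run;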
The proc genmod output (Output 6.4) indicates a deviance of 4.6206 based on 5 degrees of freedom [n^(g) = 7 groups minus two estimated parameters (β₀, β₁)]. The deviance/df ratio is close to one and we conclude that overdispersion is not a problem for these data. The parameter estimates are β̂₀ = −1.7305 and β̂₁ = 4.1651.
Model Information
Data Set WORK.KILLS
Distribution Binomial
Link Function Logit
Response Variable (Events) kills
Response Variable (Trials) trials
Observations Used 7
Number Of Events 74
Number Of Trials 140
The Wald test for H₀: β₁ = 0 has test statistic W = 40.81 and the hypothesis is clearly rejected. There is a significant relationship between larva mortality and the log₁₀ insecticide concentration. The positive slope estimate indicates that mortality probability increases with the log₁₀ concentration. For example, the probabilities that a randomly selected larva is killed at concentrations x = 1.5 and x = 6 are

    π̂ = 1 / (1 + exp{ −(β̂₀ + β̂₁ log₁₀{1.5}) }) = 0.269

and

    π̂ = 1 / (1 + exp{ −(β̂₀ + β̂₁ log₁₀{6}) }) = 0.819.
The note that The scale parameter was held fixed at the end of the proc genmod output
indicates that no extra scale parameters were estimated.
How well does this logistic regression model fit the data? To this end we calculate the generalized R² measures discussed in §6.4.5. The log likelihood for the full model containing a concentration effect is shown in the output above as ℓ(μ̂_f, φ; y) = ℓ(μ̂_f, 1; y) = −50.0133. The log likelihood for the null model is obtained as ℓ(μ̂₀, 1; y) = −96.8119 with the statements (output not shown)

proc genmod data=kills;
model kills/trials = / dist=binomial link=logit;
run;

The generalized R² measure [6.46] is then

    R² = 1 − exp{ −(2/140)( −50.0133 + 96.8119 ) } = 0.4875.
This value does not appear very large, but it should be kept in mind that this measure is not bounded by 1. Also notice that the denominator in the exponent is n = 140, the total number of observations, rather than n^(g) = 7, the number of groups. The rescaled measure R̄² is obtained by dividing R² by

    max{R²} = 1 − exp{ (2/140) ℓ(μ̂₀, 1; y) } = 1 − exp{ (2/140)( −96.8119 ) } = 0.749,

hence R̄² = 0.4875/0.749 = 0.6508. With almost 2/3 of the variability in mortality proportions explained by the log₁₀ concentration and a t_obs ratio for the slope parameter of t_obs = 4.1651/0.6520 = 6.388, we are reasonably satisfied with the model fit and proceed to an estimation of the dosages that are lethal to 50% or 80% of the larvae. Based on the estimates of β₀ and β₁ as well as [6.52] we obtain the point estimates log₁₀{x̂_0.5} = 0.4155, x̂_0.5 = 2.603 and log₁₀{x̂_0.8} = 0.7483, x̂_0.8 = 5.602.
To obtain standard errors and confidence intervals for these four quantities, proc nlmixed is used because of its ability to obtain standard errors for nonlinear functions of parameter estimates by first-order Taylor series. As starting values in the nlmixed procedure we use the converged iterates of proc genmod. The df=5 option was added to the proc nlmixed statement to make sure that proc nlmixed uses the same degrees of freedom for the determination of p-values as proc genmod. The complete SAS® code, including the estimation of the lethal dosages on the log₁₀ and the original scale, follows.
proc nlmixed data=kills df=5;
parameters intcpt=-1.7305 b=4.165;
p = 1/(1+exp(-intcpt - b*logc));
model kills ~ binomial(trials,p);
estimate 'LD50' -intcpt/b;
estimate 'LD50 original' 10**(-intcpt/b);
estimate 'LD80' (log(0.8/0.2)-intcpt)/b;
estimate 'LD80 original' 10**((log(0.8/0.2)-intcpt)/b);
run;
Output 6.5.
The NLMIXED Procedure
Specifications
Description Value
Data Set WORK.KILLS
Dependent Variable kills
Distribution for Dependent Variable Binomial
Optimization Technique Dual Quasi-Newton
Integration Method None
Dimensions
Description Value
Observations Used 7
Observations Not Used 0
Total Observations 7
Parameters 2
Parameters
intcpt b NegLogLike
-1.7305 4.165 9.50956257
Iteration History
Iter Calls NegLogLike Diff MaxGrad Slope
1 4 9.50956256 8.157E-9 0.00056 -0.00004
2 7 9.50956255 1.326E-8 0.000373 -6.01E-6
3 8 9.50956254 4.875E-9 5.963E-7 -9.76E-9
Parameter Estimates
Standard
Parameter Estimate Error DF t Value Pr > |t| Alpha
intcpt -1.7305 0.3741 5 -4.63 0.0057 0.05
b 4.1651 0.6520 5 6.39 0.0014 0.05
Additional Estimates
Standard
Label Estimate Error DF t Value Pr > |t| Lower Upper
LD50 0.4155 0.06085 5 6.83 0.0010 0.2716 0.5594
LD50 original 2.6030 0.3647 5 7.14 0.0008 1.7406 3.4655
LD80 0.7483 0.07944 5 9.42 0.0002 0.5605 0.9362
LD80 original 5.6016 1.0246 5 5.47 0.0028 3.1788 8.0245
The nlmixed procedure converges after three iterations and reports the same parameter estimates and standard errors as proc genmod (Output 6.5). The log likelihood value reported by proc nlmixed (−9.5) does not agree with that of the genmod procedure (−50.013). The procedures differ with respect to the inclusion/exclusion of constants in the likelihood calculations. Differences in log likelihoods between nested models will be the same for the two procedures. The null model log likelihood reported by proc nlmixed (code and output not shown) is −56.3. The log likelihood difference between the two models is thus 56.3 − 9.5 = 96.81 − 50.01 = 46.8 in either procedure.
Figure 6.13. Predicted probabilities and observed proportions (dots) in logistic regression model for insecticide kills. Estimated dosages lethal to 50% and 80% of the larvae are also shown.
The table of Additional Estimates shows the output for the four estimate statements. The point estimates for the lethal dosages agree with the manual calculation above, and the standard errors are obtained from a first-order Taylor series expansion. The values in the columns Lower and Upper are asymptotic 95% confidence intervals for the estimated quantities.
Figure 6.13 shows the predicted probabilities to kill a randomly selected larva as a function of the log10 concentration. The observed proportions are overlaid and the estimated log LD50 and log LD80 dosages are shown.
Figure 6.14. Design layout in the Hessian fly experiment. Field plots are 3.7 m × 3.7 m. The area of the squares is proportional to the sample proportion of damaged plants; numbers indicate the variety. Block boundaries are shown as solid lines. Data used with permission of the International Biometric Society.
A generalized linear model for this experiment can be set up with a linear predictor that represents the experimental design and a link function for the probability that a randomly selected plant is damaged by Hessian fly infestation. Choosing a logit link function this model becomes
$$\mathrm{logit}(\pi_{ij}) = \ln\left\{\frac{\pi_{ij}}{1-\pi_{ij}}\right\} = \eta_{ij} = \mu + \tau_i + \rho_j, \qquad [6.53]$$
where τ_i is the effect of the ith variety and ρ_j is the effect of the jth block.
Of interest are comparisons of the treatment effects adjusted for the block effects. For example, one may want to test the hypothesis that varieties i and i′ have equal probability to be damaged by infestations, i.e., H0: π_i· = π_i′·. These probabilities are not the same as the block-variety specific probabilities π_ij in model [6.53]. In a linear model Y_ij = μ + τ_i + ρ_j + e_ij these comparisons are based on the least squares means of the treatment effects. If μ̂, τ̂_i, and ρ̂_j denote the respective least squares estimates and there are j = 1, …, 4 blocks as in this example, the least squares mean for treatment i is calculated as
$$\hat\mu + \hat\tau_i + \frac{1}{4}\left(\hat\rho_1 + \hat\rho_2 + \hat\rho_3 + \hat\rho_4\right) = \hat\mu + \hat\tau_i + \bar{\hat\rho}_{\cdot}. \qquad [6.54]$$
A similar approach can be taken in the generalized linear model. If μ̂, τ̂_i, and ρ̂_j denote the converged IRLS estimates of the parameters in model [6.53], the treatment-specific linear predictor is calculated as the same estimable linear function as in the standard model:
$$\hat\eta_{i\cdot} = \hat\mu + \hat\tau_i + \bar{\hat\rho}_{\cdot}.$$
The estimate of the marginal probability that variety i is damaged by the Hessian fly is then obtained by inverting the link function, π̂_i· = 1/(1 + exp{−η̂_i·}). Hypothesis tests can be based on a comparison of the η̂_i·, which are linear functions of the parameter estimates, or the π̂_i·, which are nonlinear functions of the estimates.
The proc genmod code to fit the Binomial proportions with a logit link in a randomized complete block design follows. The ods exclude statement suppresses the printing of various default tables. The lsmeans entry / diff; statement requests the marginal linear predictors η̂_i· for the varieties (entries) as well as all pairwise tests of the form H0: η_i· = η_i′·. Because lsmeandiffs is included in the ods exclude statement, the lengthy table of differences η̂_i· − η̂_i′· is not included in the printed output. The ods output lsmeandiffs=diff; statement saves the 16·15/2 = 120 pairwise comparisons in the SAS® data set diff that is available for post-processing after proc genmod concludes.
ods exclude ParmInfo ParameterEstimates lsmeandiffs;
proc genmod data=HessFly;
class block entry;
model z/n = block entry / link=logit dist=binomial type3;
lsmeans entry /diff;
ods output lsmeandiffs=diff;
run;
From the LR Statistics For Type 3 Analysis table we glean a significant variety (entry) effect (p < 0.0001) (Output 6.6). It should not be too surprising that the sixteen varieties do not have the same tendency to be damaged by the Hessian fly. The Least Squares Means table lists the marginal linear predictors for the varieties, which can be converted into damage probabilities by inverting the logit link function. For variety 1, for example, this estimated marginal probability is π̂_1· = 1/(1 + exp{−1.4864}) = 0.815, and for variety 8 this probability is only π̂_8· = 1/(1 + exp{0.1639}) = 0.459.
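This conversion is easily scripted. A sketch of a small post-processing step (the output data set names lsm and damage_prob are assumptions; LSMeans is the ODS table proc genmod produces for the lsmeans statement):
ods output lsmeans=lsm;        /* add inside the proc genmod step shown below */

data damage_prob;
   set lsm;
   prob = 1/(1 + exp(-estimate));   /* invert the logit link of [6.53] */
run;
proc print data=damage_prob; run;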
Output 6.6.
The GENMOD Procedure
LR Statistics For Type 3 Analysis
Chi-
Source DF Square Pr > ChiSq
block 3 4.27 0.2337
entry 15 132.62 <.0001
Least Squares Means
Standard Chi-
Effect entry Estimate Error DF Square Pr > ChiSq
entry 1 1.4864 0.3921 1 14.37 0.0002
entry 2 1.3453 0.3585 1 14.08 0.0002
entry 3 0.9963 0.3278 1 9.24 0.0024
entry 4 0.0759 0.2643 1 0.08 0.7740
entry 5 1.3139 0.3775 1 12.12 0.0005
entry 6 0.5758 0.3180 1 3.28 0.0701
entry 7 0.8608 0.3302 1 6.80 0.0091
entry 8 -0.1639 0.2975 1 0.30 0.5816
entry 9 0.0960 0.2662 1 0.13 0.7183
entry 10 0.8413 0.3635 1 5.36 0.0206
entry 11 0.0313 0.2883 1 0.01 0.9136
entry 12 0.0423 0.2996 1 0.02 0.8876
entry 13 -2.0941 0.5330 1 15.44 <.0001
entry 14 -1.0185 0.3538 1 8.29 0.0040
entry 15 -0.6303 0.2883 1 4.78 0.0288
entry 16 -1.4645 0.3713 1 15.56 <.0001
The post-processing statements sketched below accomplish that (Output 6.7). The statements variety = entry+0; and _variety = _entry+0; convert the values for entry into numeric format, since they are stored as character variables by proc genmod. Entry 1, for example, differs significantly from entries 4, 8, 9, 11, 12, 13, 14, 15, and 16. Notice that the data set variable entry was renamed to variety to produce Output 6.7. The positive Estimate for the comparison of variety 1 and _variety 4, for example, indicates that entry 1 has a higher damage probability than entry 4. Similarly, the negative Estimate for the comparison of variety 4 and _variety 5 indicates a lower damage probability of variety 4 compared to variety 5.
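A post-processing sketch along these lines (the column names of the LSMeanDiffs table, such as probchisq, are assumptions):
data diff;
   set diff;
   variety  = entry  + 0;    /* character-to-numeric conversion */
   _variety = _entry + 0;
run;

proc print data=diff;
   where probchisq < 0.05;   /* list significant pairwise differences only */
   var variety _variety estimate stderr chisq probchisq;
run;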
Output 6.7.
Chi
Obs variety _variety Estimate Std Err DF Square Pr>Chi
This analysis of the Hessian fly experiment seems simple enough. A look at the Criteria For Assessing Goodness Of Fit table shows that not all is well, however. The deviance of the fitted model, 123.955, exceeds the degrees of freedom (45) 2.7-fold. In a proper model the deviance is expected to be about as large as its degrees of freedom. We do not advocate formal statistical tests of the deviance/df ratio unless data are grouped. Usually the modeler interprets the ratio subjectively to decide whether a deviation of the ratio from one is reason for concern. First we notice that ratios in excess of one indicate a potential overdispersion problem and then inquire how overdispersion could arise. Omitting important variables
from the model leads to excess variability since the linear predictor does not account for
important effects. In an experimental design the modeler usually builds the linear predictor
from the randomization, treatment, and blocking protocol. In a randomized complete block
design, η_ij = μ + τ_i + ρ_j is the appropriate linear predictor since all other systematic effects
should have been neutralized by randomization. If all necessary effects are included in the
model, overdispersion could arise from positive correlations among the observations. Two
levels of correlation must be considered here. First, it was assumed that the counts on each experimental unit follow the Binomial law, which implies that the n_ij Bernoulli(π_ij) variables are independent. In other words, the probability of a plant being damaged does not depend on whether neighboring plants on the same experimental unit are infested or not. This seems quite unlikely. We expect infestations to appear in clusters, and the Z_ij may then not be Binomial-distributed. Instead, a probability model that allows for overdispersion relative to the Binomial, for example the Beta-Binomial model, could be used. Second, there are some doubts whether the counts of neighboring units are independent, as assumed in the analysis. There may be spatial dependencies among grid cells in the sense that units near each other are more highly correlated than units further apart. Such dependence is plausible if, for example, the propensity for infestation is linked to a soil variable that varies spatially. In other data sets, spatial correlations induced by a spatially varying covariate have indeed been confirmed. Randomization of the varieties to experimental units neutralizes such spatial dependencies. On
average, each treatment is affected by these effects equally and the overall effect is balanced
out. However, the variability due to these spatial effects is not removed from the data. To that
end, blocks need to be arranged in such a way that experimental units within a block are
homogeneous. Stroup et al. (1994) note that combining adjacent experimental units into
blocks in agricultural variety trials can be at variance with an assumption of homogeneity
within blocks when more than eight to twelve experimental units are grouped. Spatial trends
will then be removed only incompletely and this source of overdispersion prompted Gotway
and Stroup (1997) to analyze the Hessian fly data with a model that takes into account the
spatial dependence among counts of different experimental units explicitly. We will return to
such models and the Hessian fly data in §9.
A quick fix for overdispersed data, one that does not address the real cause of the overdispersion problem, is to estimate a separate scale parameter φ in models that would not contain such a parameter otherwise. In the Hessian fly example, this is accomplished by adding the dscale or pscale option to the model statement in proc genmod. The former estimates the overdispersion parameter based on the deviance, the latter based on Pearson's statistic (see §6.4.3). The variance of a count Z_ij is then modeled as Var[Z_ij] = φ n_ij π_ij(1 − π_ij) rather than Var[Z_ij] = n_ij π_ij(1 − π_ij). From statements like the following (a reconstruction of code omitted here; the fit mirrors the earlier one with the dscale option added)
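ods exclude ParmInfo ParameterEstimates lsmeandiffs;
proc genmod data=HessFly;
   class block entry;
   /* dscale: deviance-based estimate of the overdispersion parameter */
   model z/n = block entry / link=logit dist=binomial type3 dscale;
   lsmeans entry / diff;
   ods output lsmeandiffs=diff;
run;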
a new data set of treatment differences is obtained. After post-processing of the diff data set one obtains Output 6.8. Fewer entries are now found significantly different from entry 1 than in the analysis that does not account for overdispersion. Also notice that the estimates of the treatment differences have not changed. The additional overdispersion parameter is a multiplicative parameter that has no effect on the parameter estimates, only on their standard errors. Using the dscale estimation method the overdispersion parameter is estimated as φ̂ = 123.955/45 = 2.7546, which is the ratio of deviance and degrees of freedom in the model fitted initially. All standard errors in the preceding partial output are √2.7546 times larger than the standard errors in the analysis without the overdispersion parameter.
Output 6.8.
Chi
Obs variety _variety Estimate Std Err DF Square Pr>Chi
are representatives of this class of models. Here, x denotes the plant (seeding) density. A standard nonlinear regression approach is then to model, for example, Y_i = 1/(β_0 + β_1 x_i) + e_i, where the e_i are independent random errors with mean 0 and variance σ². Figure 6.3 (p. 309) suggests that the variability is not homogeneous in these data, however. Whereas one could accommodate variance heterogeneity in the nonlinear model by using weighted nonlinear least squares, Figure 6.3 alerts us to a more subtle problem. The standard deviation of the barley yields seems to be related to the mean yield. Although Figure 6.3 is quite noisy, it is not unreasonable to assume that the standard deviations are proportional to the mean (a regression through the origin of s on ȳ). Figure 6.15 displays the reciprocal yields and the reciprocals of the sample means across the three blocks against the seeding density. An inverse quadratic relationship
$$\mathrm{E}[Y] = \frac{1}{\beta_0 + \beta_1 x + \beta_2 x^2}$$
as suggested by the Holliday model is reasonable. This model has linear predictor η = β_0 + β_1 x + β_2 x² and reciprocal link function. For the random component we choose Y not to
be a Gaussian, but a Gamma random variable. Gamma random variables are non-negative
(such as yields), and their standard deviation is proportional to their mean. The Gamma distri-
butions are furthermore not symmetric about the mean but right-skewed (see Figure 6.2, p.
308). The canonical link of a Gamma random variable is the reciprocal link, which provides
further support to use this model for yield density investigations where inverse polynomial
relationships are common. Unfortunately, the inverse link does not guarantee that the predic-
ted means are non-negative since the linear predictor is not constrained to be positive. As an
alternative link function for Gamma-distributed random variables, the log link can be used.
Figure 6.15. Inverse yields and inverse replication averages against seeding density. Disconnected symbols represent observations from blocks 1 to 3, the connected symbols the sample averages. An inverse quadratic relationship is reasonable.
Before fitting a generalized linear model with Gamma errors we must decide whether to fit the model to the 30 observations from the three blocks or to the 10 block averages. In the former case, we must include block effects in the full model and can then test whether it is reasonable that some effects do not vary by blocks. The full model we consider here has linear predictor
$$\eta_{ij} = \beta_{0j} + \beta_{1j} x_{ij} + \beta_{2j} x_{ij}^2, \qquad [6.55]$$
where the subscript j identifies the blocks. Combined with a reciprocal link function this model will fit a separate inverse quadratic to the data from each block. The proc genmod code to fit this model with Gamma errors follows. The noint option was added to the model statement to prevent the addition of an overall intercept term β_0. The link=power(-1) option invokes the reciprocal link.
ods exclude ParameterEstimates;
proc genmod data=barley;
class block;
model bardrwgt = block block*seed block*seed*seed /
noint link=power(-1) dist=gamma type3;
run;
Output 6.9.
The GENMOD Procedure
Model Information
Data Set WORK.BARLEY
Distribution Gamma
Link Function Power(-1)
Dependent Variable BARDRWGT
Observations Used 30
The full model has a log likelihood of ℓ(μ̂; y) = −102.649, and the LR Statistics For Type 3 Analysis table shows that the effect which captures separate quadratic effects for each block is not significant (p = 0.1294, Output 6.9). To see whether a common quadratic effect is sufficient, we fit the model as
ods exclude obstats;
proc genmod data=barley;
class block;
model bardrwgt = block block*seed seed*seed /
noint link=power(-1) dist=gamma type3 obstats;
ods output obstats=stats;
run;
and obtain a log likelihood of −102.8819 (Output 6.10). Twice the difference of the log likelihoods, Λ = 2{−102.649 − (−102.8819)} = 0.465, is not significant and we conclude that the quadratic effects need not be varied by blocks (Pr(χ²₂ ≥ 0.465) = 0.793). The common quadratic effect of seeding density is significant at the 5% level (p = 0.0227) and will be retained in the model (Output 6.10). The obstats option of the model statement requests a table of the linear predictors, predicted values, and various residuals calculated for the fitted model.
Output 6.10.
Model Information
Data Set WORK.BARLEY
Distribution Gamma
Link Function Power(-1)
Dependent Variable BARDRWGT
Observations Used 30
The ods exclude obstats; statement in conjunction with the ods output obstats=stats; statement prevents the printing of these statistics to the output window and saves the results in a SAS® data set (named stats here). The seeding densities were divided by 10 prior to fitting this model to allow sufficient significant digits to be displayed in the Analysis Of Parameter Estimates table.
The fitted barley yields for the three blocks are shown in Figure 6.16. If the model is correct, yields attain a maximum at a seeding density around 80. While the seeding density of highest yield depends little on the block, the maximum yield attained varies considerably.
Figure 6.16. Fitted barley yields in the three blocks based on a model with inverse linear
effects varied by blocks and a common inverse quadratic effect.
The stats data set with the output from the obstats option contains several residual diagnostics, such as raw residuals, deviance residuals, Pearson residuals, and their standardized versions. In Figure 6.17 we plotted (open circles) the Pearson residuals
$$\hat r_{ij} = \frac{y_{ij} - \hat\mu_{ij}}{\sqrt{h(\hat\mu_{ij})}},$$
where h(μ̂_ij) is the variance function evaluated at the fitted mean. From Table 6.2 (p. 305) the variance function of a Gamma random variable is simply the square of the mean, and the Pearson residuals take on the form
$$\hat r_{ij} = \frac{y_{ij} - \hat\mu_{ij}}{\hat\mu_{ij}}. \qquad [6.56]$$
For the observation y₁₁ = 2.07 from block 1 at seeding density 3, the Pearson residual is r̂₁₁ = (2.07 − 10.324)/10.324 = −0.799. Also shown as closed circles are the studentized residuals from fitting the model
$$Y_{ij} = \frac{1}{\beta_{0j} + \beta_{1j} x_{ij} + \beta_2 x_{ij}^2} + e_{ij}$$
as a nonlinear model with symmetric and homoscedastic errors (in proc nlin). If the mean in-
creases with seeding density and the variation of the data is proportional to the mean we
expect the variation in the nonlinear regression residuals to increase with seeding density.
This effect is obvious in Figure 6.17. The assumption of homoscedastic errors underpinning the nonlinear regression analysis is not tenable; therefore, the Gamma regression is preferred. The tightness of the Pearson residuals in the Gamma regression model at seeding density 77 is due to the fact that these values are close to the density producing the maximum yield (Figure 6.16). Since this critical density is very similar from block to block, but the maximum yields differ greatly, the denominators in [6.56] shrink the raw residuals y_ij − μ̂_ij most for those blocks with high yields.
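The Pearson residuals in Figure 6.17 can be recomputed from the stats data set; a minimal sketch, assuming pred is the predicted-value column of the obstats table and bardrwgt is the response:
data check;
   set stats;
   r_pearson = (bardrwgt - pred)/pred;   /* Gamma Pearson residual, [6.56] */
run;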
Figure 6.17. Pearson residuals (open circles) from the generalized linear Gamma regression model and studentized residuals from the nonlinear Gaussian regression model (full circles).
according to the procedures described by Walters et al. (1997). These included 62 navy, 65 black, 55 kidney, and 82 pinto bean-breeding lines plus checks and controls. The visual appearance of the processed beans was determined subjectively by a panel of 13 judges on a seven-point hedonic scale (1 = very undesirable, 4 = neither desirable nor undesirable, 7 = very desirable). The beans were presented to the panel of judges in a random order at the same time. Prior to evaluating the samples, all judges were shown examples of samples rated as satisfactory (4). Concern exists that certain judges, due to lack of experience, are unable to correctly rate canned samples. Inferences about the effects of experience on attribute-based product evaluations can be drawn from the psychology literature. Wallsten and Budescu (1981), for example, report that in the evaluation of a personality profile consisting of fourteen factors, experienced clinical psychologists utilized four to seven factors, whereas psychology graduate students tended to use only the two or three most salient factors. Prior to the bean canning quality rating experiment it was postulated that less experienced judges rate more severely than more experienced judges, but also that experience should have little or no effect for navy beans, for which the canning procedure was developed. For purposes of analysis, judges are stratified by experience (< 5 years, ≥ 5 years). The counts by canning quality, judges' experience, and bean-breeding line are listed in Table 6.17.
Table 6.17. Bean rating data. Kindly made available by Dr. Jim Kelly, Department of
Crop and Soil Sciences, Michigan State University. Used with permission.
            Black            Kidney           Navies           Pinto
 Score   <5 ys  ≥5 ys     <5 ys  ≥5 ys     <5 ys  ≥5 ys     <5 ys  ≥5 ys
   1       13     32        7     10        10     22        13      2
   2       91     78       32     31        56     51        29     17
   3      123    124      136     96        84    107        91     68
   4       72    122      101    104        84     98       109    124
   5       24     31       47     71        51     52        60    109
   6        2      3        6     18        24     37        25     78
   7        0      0        1      0         1      5         1     12
A proportional odds model for the ordered canning scores is fit with proc genmod below. The contrast statements test the effect of the judges' experience separately for the bean lines. These contrasts correspond to interaction slices by bean lines. The estimate statements calculate the linear predictors needed to derive the probabilities of rating each line in category 3 or less and in category 4 or less, depending on the judges' experience.
ods exclude ParameterEstimates ParmInfo;
proc genmod data=beans;
class class exper;
model score = class exper class*exper /
link=cumlogit dist=multinomial type3;
contrast 'Experience effect for Black'
exper 1 -1 class*exper 1 -1 0 0 0 0 0 0 ;
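 /* The remaining interaction slices follow the same pattern (a sketch;
    the class levels are assumed to be ordered Black, Kidney, Navy, Pinto): */
 contrast 'Experience effect for Kidney'
          exper 1 -1 class*exper 0 0 1 -1 0 0 0 0 ;
 contrast 'Experience effect for Navy'
          exper 1 -1 class*exper 0 0 0 0 1 -1 0 0 ;
 contrast 'Experience effect for Pinto'
          exper 1 -1 class*exper 0 0 0 0 0 0 1 -1 ;
run;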
There is a significant interaction between bean lines (class) and judge experience (Output 6.11). The results of comparing judges with more and less than 5 years of experience will depend on the bean line. The contrast slices address this interaction (Output 6.12). The ratings distributions for experienced and less experienced judges are clearly not significantly different for navy beans, and at the 5% level not significantly different for black beans (p = 0.0964). There are differences between the ratings for kidney and pinto beans, however (p = 0.0051 and p < 0.0001).
Output 6.11.
The GENMOD Procedure
Model Information
Data Set WORK.BEANS
Distribution Multinomial
Link Function Cumulative Logit
Dependent Variable SCORE
Observations Used 2795
Response Profile
Ordered Ordered
Level Value Count
1 1 109
2 2 385
3 3 829
4 4 814
5 5 445
6 6 193
7 7 20
Similar calculations for the other groups lead to the probability distributions shown in Figure 6.18.
Well-documented criteria and a canning procedure specifically designed for navy beans explain the absence of differences due to the judges' experience for navy beans (p = 0.7391). Black beans in general were of poor quality and low ratings dominated, creating very similar probability distributions for experienced and less experienced judges (Figure 6.18). For kidney and pinto beans, experienced and inexperienced judges classify control quality with similar odds, probably because they were shown such quality prior to judging. Experienced judges have a tendency to assign higher quality scores than less experienced judges for these two commercial classes.
Output 6.12.
Contrast Estimate Results
Standard
Label Estimate Error Confidence Limits
Contrast Results
Chi-
Contrast DF Square Pr > ChiSq Type
[Figure 6.18 appears here. Panels: Black (n.s.), Kidney (**), Navy (n.s.), Pinto (**); stacked bars show Pr(Score < 4), Pr(Score = 4), and Pr(Score > 4) for less and more experienced judges.]
Figure 6.18. Predicted probability distributions for the 4 × 2 interaction in the bean rating experiment. Categories 1 to 3 as well as categories 5 to 7 are amalgamated to emphasize deviations from satisfactory ratings (score = 4).
 Management 3                          Management 4
 Quality     N1   N2   Total           Quality     N1   N2   Total
 Poor         0    0      0            Poor         1    0      1
 Average      9    2     11            Average     12    4     16
 Good         7   14     21            Good         3   11     14
 Excellent    0    0      0            Excellent    0    1      1
 Total       16   16     32            Total       16   16     32
Of particular interest was the determination of the water injection effect, the subsurface effect, and the comparison of injection vs. surface applications. These are contrasts among the levels of the factor Management Practice, and it first needs to be determined whether the factor interacts with Application Rate.
We fit a proportional odds model to these data containing Management and nitrogen
Application Rate effects and their interaction as well as a continuous covariate to model the
temporal effects. This is not the most efficient method of accounting for repeated measures;
we address repeated measures data structures in more detail in §7. Inclusion of the time
variable significantly improves the model fit over a model containing only main effects and
interactions of the experimental factors, however. The basic proc genmod statements are
ods exclude ParameterEstimates ParmInfo;
proc genmod data=mgtN rorder=data;
class mgt nitro;
model resp = mgt nitro mgt*nitro date /
link=cumlogit dist=multinomial type3;
run;
Output 6.13.
The GENMOD Procedure
Model Information
Data Set WORK.MGTN
Distribution Multinomial
Link Function Cumulative Logit
Dependent Variable resp
Observations Used 128
Response Profile
Ordered Ordered
Level Value Count
1 Poor 43
2 Average 49
3 Good 35
4 Excellen 1
Supplementing nitrogen surface application with water injection does not alter the turf quality significantly. However, the rating distribution of the average of the nitrogen injection treatments is significantly different from the turf quality obtained with nitrogen application and supplemental water injection (LR = 109.51, p < 0.0001). Similarly, the average injection treatment leads to a significantly different rating distribution than the average surface application.
Output 6.14.
Contrast Results
Chi-
Contrast DF Square Pr > ChiSq Type
To determine these rating distributions for the marginal Management Practice effect and the marginal Rate effect, we obtain the least squares means and convert them into probabilities by inverting the link function. Unfortunately, proc genmod does not permit a lsmeans statement in combination with the multinomial distribution, i.e., for ordinal response. The marginal treatment means can be constructed with estimate statements, however. Adding to the genmod code estimate statements that average the linear predictor over the levels of the other factor produces linear predictors from which the marginal probability distributions for rate 2.5 g m⁻² and for surface application with no supplemental water injection can be obtained. Adding similar statements for the second application rate and the other three management practices we obtain the linear predictors in Output 6.15.
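The estimate statements return cumulative linear predictors η̂_j for each treatment; the three amalgamated probabilities displayed in Table 6.20 and Figure 6.18 follow by inverting the cumulative logit link:
$$\Pr(\text{Score} \le j) = \frac{1}{1 + \exp\{-\hat\eta_j\}},$$
so that
$$\Pr(\text{Score} < 4) = \frac{1}{1+\exp\{-\hat\eta_3\}}, \qquad \Pr(\text{Score} = 4) = \frac{1}{1+\exp\{-\hat\eta_4\}} - \frac{1}{1+\exp\{-\hat\eta_3\}}, \qquad \Pr(\text{Score} > 4) = 1 - \frac{1}{1+\exp\{-\hat\eta_4\}}.$$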
Output 6.15.
Contrast Estimate Results
Standard
Label Estimate Error Alpha Confidence Limits
Output 6.16.
Contrast Estimate Results
Standard Chi-
Label Estimate Error Alpha Square Pr > ChiSq
Table 6.20 shows the marginal probability distributions and indicates significant differences among the treatment levels. The surface applications lead to poor turf quality with high probability. Their ratings are at most average in over 90% of the cases. When nitrogen is injected into the soil, the rating distributions shift toward higher categories. The 256 nozzle (7.6 cm depth of injection) leads to good turf quality in over 2/3 of the cases. The nozzle that injects nitrogen up to 12.7 cm has a higher probability of average ratings, compared to the 256 nozzle, probably because the nitrogen is placed closer to the roots. Although there are no significant differences in the rating distributions between the two surface applications (p = 0.2634) and the rating distributions of the two injection treatments (p = 0.1224), the two groups of treatments clearly separate. Based on the results of this analysis one would recommend nitrogen injection of 5 g m⁻².
rating the same 236 experimental units by two different raters on an ordinal scale from 1 to 5. For example, 12 units were rated in category 2 by Rater 2 and in category 1 by Rater 1. An obvious question is whether the ratings of the two interpreters are independent. Should we tackle this by modeling the Rater 2 results as a function of the Rater 1 results or vice versa? There is no response variable or explanatory variable here, only two categorical variables (Rater 1 with five categories and Rater 2 with five categories) and a cross-tabulation of 236 outcomes.
Table 6.21 is a very special contingency table since the row and column variables have the same categories. We refer to such tables as matched-pairs tables. Some of the models discussed and fitted in this subsection are specifically designed for matched-pairs tables; others (such as the independence model) apply to any contingency table. The interested reader can find more details on the fitting of generalized linear models to contingency tables in the monographs by Agresti (1990) and Fienberg (1980).
A closer look at the data table suggests that the ratings are probably not independent. The highest counts appear on the diagonal of the table. If Rater 1 assigns an experimental unit to category i, then there seems to be a high likelihood that Rater 2 also assigns the unit to category i. If we reject the notion of independence, for which we need to develop a statistical test, our interest will shift to determining how the two rating schemes depend on each other. Is there more agreement between the ratings in the table than is expected by chance? Is there more disagreement in the table than expected by chance? Is there structure to the disagreement; for example, does Rater 1 systematically assign values to higher categories?
To develop a model for independence of the ratings let X denote the column variable, Y the row variable, and N_ij the count observed in row i, column j of the contingency table. Let I and J denote the number of rows and columns. In a square table such as Table 6.21 we necessarily have I = J. The independence model does not apply to square or matched-pairs tables alone and we discuss it more generally here. The generic layout of the two-way contingency table we are referring to is shown in Table 6.6 (p. 318). Recall that n_ij denotes the observed count in row i and column j of the table, n_·· denotes the total sample size, and n_i·, n_·j are the marginal totals. Under the Poisson sampling model, where the count in each cell is the realization of a Poisson(λ_ij) random variable, the row and column totals are Poisson(λ_i· = Σⱼ λ_ij) and Poisson(λ_·j = Σᵢ λ_ij) variables, and the total sample size is a Poisson(λ_·· = Σᵢⱼ λ_ij) random variable. The expected cell count λ_ij under independence is then related to the marginal expected counts by
$$\lambda_{ij} = \frac{\lambda_{i\cdot}\,\lambda_{\cdot j}}{\lambda_{\cdot\cdot}}.$$
Taking logarithms leads to a generalized linear model with log link for Poisson random variables and linear predictor
$$\ln\{\lambda_{ij}\} = \ln\{\lambda_{i\cdot}\} + \ln\{\lambda_{\cdot j}\} - \ln\{\lambda_{\cdot\cdot}\} = \mu + \alpha_i + \beta_j. \qquad [6.57]$$
We think of α_i and β_j as the (marginal, main) effects of the row and column variables; independence implies the absence of the (αβ)_ij interaction between the two variables.
There exists, of course, a well-known test for independence of categorical variables in contingency tables based on the Chi-square distribution. It is sometimes referred to as Pearson's Chi-square test. If n_ij is the observed count in cell i, j and e_ij = n_i· n_·j / n_·· is the expected count under independence, then
$$X^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(n_{ij} - e_{ij})^2}{e_{ij}} \qquad [6.58]$$
is compared against the quantiles of a Chi-square distribution with (I − 1)(J − 1) degrees of freedom.
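The proc freq statements that produced Output 6.17 were omitted from this copy; a reconstruction (the weight statement supplies the cell counts stored in the variable number, matching the genmod code below):
proc freq data=rating;
   tables rater1*rater2 / chisq expected nocol norow nopercent;
   weight number;
run;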
These statements perform the Chi-square analysis of independence (option chisq). The options nocol, norow, and nopercent suppress the printing of column, row, and cell percentages. The expected option requests a printout of the expected frequencies under independence. None of the expected frequencies is less than 1 and exactly 20 frequencies (80%) exceed 5 (Output 6.17).
The Chi-squared approximation holds and the Pearson test statistic is X² = 103.1089 (p < 0.0001). There is significant disagreement between the observed counts and the counts expected under an independence model. The hypothesis of independence is rejected. The likelihood ratio test with test statistic Λ = 95.3577 leads to the same conclusion. The calculation of Λ is discussed below. The expected cell counts on the diagonal of the contingency table reflect the degree of chance agreement and are interpreted as follows: if the counts are distributed completely at random to the cells, conditional on preserving the marginal row and column totals, one would expect this degree of agreement between the ratings.
The independence model is rarely the best-fitting model for a contingency table, and the modeler needs to consider other models that incorporate dependence between the row and column variables. This is certainly the case here, since the notion of independence has been clearly rejected. This requires statistical procedures that can fit models other than independence, such as proc genmod.
Output 6.17.
The FREQ Procedure
rater1 rater2
Frequency|
Expected | 1| 2| 3| 4| 5| Total
---------+--------+--------+--------+--------+--------+
1 | 10 | 6 | 4 | 2 | 2 | 24
| 2.8475 | 4.678 | 6.4068 | 6.3051 | 3.7627 |
---------+--------+--------+--------+--------+--------+
2 | 12 | 20 | 16 | 7 | 2 | 57
| 6.7627 | 11.11 | 15.216 | 14.975 | 8.9364 |
---------+--------+--------+--------+--------+--------+
3 | 1 | 12 | 30 | 20 | 6 | 69
| 8.1864 | 13.449 | 18.419 | 18.127 | 10.818 |
---------+--------+--------+--------+--------+--------+
4 | 4 | 5 | 10 | 25 | 12 | 56
| 6.6441 | 10.915 | 14.949 | 14.712 | 8.7797 |
---------+--------+--------+--------+--------+--------+
5 | 1 | 3 | 3 | 8 | 15 | 30
| 3.5593 | 5.8475 | 8.0085 | 7.8814 | 4.7034 |
---------+--------+--------+--------+--------+--------+
Total 28 46 63 62 37 236
We start by fitting the independence model in proc genmod to show the equivalence to
the Chi-square analysis in proc freq.
title1 'Independence model for ratings';
ods exclude ParameterEstimates obstats;
proc genmod data=rating;
class rater1 rater2;
model number = rater1 rater2 /link=log error=poisson obstats;
ods output obstats=stats;
run;
The ods output obstats=stats; statement saves the observation statistics table, which contains the predicted values (variable pred). A proc freq step following proc genmod tabulates the predicted values for the independence model (Output 6.19).
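That step might look as follows (a sketch; pred is the predicted-count column of the obstats table, and proc freq accepts non-integer weights):
proc freq data=stats;
   tables rater1*rater2 / nocol norow nopercent;
   weight pred;
run;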
In the Criteria For Assessing Goodness Of Fit table we find a model deviance of 95.3577 with 16 degrees of freedom (Output 6.18). This is twice the difference between the log likelihood of a full model containing Rater 1 and Rater 2 main effects and their interaction and that of the (reduced) independence model shown here.
The full model (code and output not shown) has a log likelihood of 368.4737, and the deviance of the independence model becomes 2(368.4737 − 320.7949) = 95.357, which is of course the likelihood ratio statistic for testing the absence of Rater 1 × Rater 2 interactions and identical to the likelihood ratio statistic reported by proc freq above. Similarly, the Pearson residual Chi-square statistic of 103.1089 in Output 6.18 is identical to the Chi-square statistic calculated by proc freq. That the independence model fits these data poorly is also conveyed by the "overdispersion" factor of 5.96 (or 6.44). Some important effects are unaccounted for; these are the interactions between the ratings.
Output 6.18.
Independence model for ratings
Model Information
Data Set WORK.RATING
Distribution Poisson
Link Function Log
Dependent Variable number
Observations Used 25
Algorithm converged.
From the Analysis of Parameter Estimates table the predicted values can be constructed (Output 6.19). The predicted count in cell 1, 1, for example, is obtained from the estimated linear predictor
$$\hat\eta_{11} = 1.5483 - 0.2231 - 0.2787 = 1.0465$$
and the inverse link function
$$\widehat{\mathrm{E}}[N_{11}] = e^{1.0465} = 2.8476.$$
Output 6.19.
Predicted cell counts under model of independence
Frequency|1 |2 |3 |4 |5 | Total
---------+--------+--------+--------+--------+--------+
1 | 2.8475 | 4.678 | 6.4068 | 6.3051 | 3.7627 | 24
---------+--------+--------+--------+--------+--------+
2 | 6.7627 | 11.11 | 15.216 | 14.975 | 8.9364 | 57
---------+--------+--------+--------+--------+--------+
3 | 8.1864 | 13.449 | 18.419 | 18.127 | 10.818 | 69
---------+--------+--------+--------+--------+--------+
4 | 6.6441 | 10.915 | 14.949 | 14.712 | 8.7797 | 56
---------+--------+--------+--------+--------+--------+
5 | 3.5593 | 5.8475 | 8.0085 | 7.8814 | 4.7034 | 30
---------+--------+--------+--------+--------+--------+
Total 28 46 63 62 37 236
Simply adding a general interaction term between row and column variables will not solve the problem because the model
$$\ln\{\lambda_{ij}\} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} \qquad [6.59]$$
is saturated; that is, it fits the observed data perfectly. Just as in the case of a general two-way layout (e.g., a randomized block design), adding interactions between the factors depletes the degrees of freedom. The saturated model has a deviance of exactly 0 and n_ij = exp{μ̂ + α̂_i + β̂_j + (α̂β̂)_ij}. The deviation from independence must be structured in some way to preserve degrees of freedom. We distinguish three forms of structured interactions:
• association, which focuses on structured patterns of dependence between X and Y;
• agreement, which focuses on the counts on the main diagonal;
• disagreement, which focuses on the counts in off-diagonal cells.
Modeling association requires that the categories of X and Y are ordered; agreement and disagreement can be modeled with nominal and/or ordered categories. It should be noted that cell counts can show strong association but weak agreement, for example, if one rater consistently assigns outcomes to higher categories than the other rater.
Table 6.22. Multiplicative factors exp{γ u_i v_j} for γ = 0.05 and centered scores (centered scores u_i and v_j shown in parentheses)

                               v_j
                   1        2        3        4
   u_i          (−1.5)   (−0.5)   (0.5)    (1.5)
   1  (−1.5)     1.12     1.04     0.96     0.89
   2  (−0.5)     1.04     1.01     0.98     0.96
   3  (0.5)      0.96     0.98     1.01     1.03
   4  (1.5)      0.89     0.96     1.03     1.12
The multiplicative terms are symmetric about the center of the table and increase along the diagonal toward the corners of the table. At the same time the expected counts are decremented relative to an independence model toward the upper right and lower left corners of the table. The linear-by-linear association model assumes that high (low) values of X pair more frequently with high (low) values of Y than is expected under independence. At the same time, high (low) values of X pair less frequently with low (high) values of Y. In cases where it is more difficult to assign outcomes to categories in the middle of the scale than to extreme categories, the linear-by-linear association model will tend to fit the data well. For the model fit and the predicted values it does not matter whether the scores are centered or not. Because of the convenient interpretation in terms of multiplicative effects as shown in Table 6.22 we prefer to work with centered scores.
The linear-by-linear association model with centered scores is fit to the rater agreement
data using proc genmod with the statements
title3 'Uniform association model for ratings';
data rating; set rating;
sc1_centered = rater1-3; sc2_centered = rater2-3;
run;
ods exclude obstats;
proc genmod data=rating;
class rater1 rater2;
model number = rater1 rater2 sc1_centered*sc2_centered /
link=log error=poisson type3 obstats;
The deviance of this model is much improved over that of the independence model (Output 6.20). The estimate of the association parameter is γ̂ = 0.4455, and the likelihood-ratio test for H0: γ = 0 shows that the addition of the linear-by-linear association significantly improves the model. The likelihood ratio Chi-square statistic of 67.45 equals the difference between the two model deviances (95.35 − 27.90).
Output 6.20.
Uniform association model for ratings
Model Information
Algorithm converged.
Chi-
Source DF Square Pr > ChiSq
rater1 4 57.53 <.0001
rater2 4 39.18 <.0001
sc1_cente*sc2_center 1 67.45 <.0001
rater1 rater2
Frequency|1 |2 |3 |4 |5 | Total
---------+--------+--------+--------+--------+--------+
1 | 8.5899 | 8.0587 | 5.1371 | 1.8786 | 0.3356 | 24
---------+--------+--------+--------+--------+--------+
2 | 11.431 | 16.742 | 16.662 | 9.5125 | 2.6533 | 57
---------+--------+--------+--------+--------+--------+
3 | 6.0607 | 13.858 | 21.532 | 19.192 | 8.3574 | 69
---------+--------+--------+--------+--------+--------+
4 | 1.6732 | 5.9728 | 14.488 | 20.16 | 13.706 | 56
---------+--------+--------+--------+--------+--------+
5 | 0.2455 | 1.3683 | 5.1816 | 11.257 | 11.948 | 30
---------+--------+--------+--------+--------+--------+
Total 28 46 63 62 37 236
Comparing the predicted cell counts to those of the independence model, it is seen how the counts increase in the upper left and lower right corners of the table and decrease toward the upper right and lower left corners.
The fit of the linear-by-linear association model is dramatically improved over the independence model at the cost of only one additional degree of freedom. The model fit is not satisfactory, however. The p-value for the model deviance of Pr(χ²₁₅ ≥ 27.90) = 0.022 indicates that a significant discrepancy between model and data remains. The interactions between Rater 1 and Rater 2 category assignments must be structured further.
Before proceeding with modeling structured interaction as agreement we need to point out that the linear-by-linear association model requires that scores be assigned to the ordered categories. This introduces a subjective element into the analysis; different modelers may assign different sets of scores. Log-multiplicative models with linear predictor of the form μ + α_i + β_j + γ·(row score)·(column score) have been developed in which the category scores are themselves parameters to be estimated. For more information about these log-multiplicative models see Becker (1989, 1990a, 1990b) and Goodman (1979b).
excess counts on the diagonal. Let z_ij be an indicator variable such that
$$z_{ij} = \begin{cases} 1 & i = j \\ 0 & \text{otherwise.} \end{cases}$$
The homogeneous agreement model adds this indicator to the linear predictor of the independence model,
$$\ln\{\lambda_{ij}\} = \mu + \alpha_i + \beta_j + \delta z_{ij},$$
so that interactions are modeled with a single degree of freedom term. A positive δ indicates that more counts fall on the main diagonal than would be expected under a random assignment of counts to the table (given the marginal totals). The agreement parameter δ can also be made to vary with the categories. This nonhomogeneous agreement model is defined as
$$\ln\{\lambda_{ij}\} = \mu + \alpha_i + \beta_j + \delta_i z_{ij}.$$
The separate agreement parameters δ_i will saturate the model on the main diagonal; that is, predicted and observed counts will agree perfectly for the diagonal cells. In proc genmod the homogeneous and nonhomogeneous agreement models are fitted easily by defining an indicator variable for the diagonal (output not shown).
data rating; set rating; diag = (rater1 = rater2); run;
Disagreement models place emphasis on cells off the main diagonal. For example, the model
$$\ln\{\lambda_{ij}\} = \mu + \alpha_i + \beta_j + \delta z_{ij}, \qquad z_{ij} = \begin{cases} 1 & |i - j| = 1 \\ 0 & \text{otherwise} \end{cases}$$
adds an additional parameter δ to all cells adjacent to the main diagonal, and the model
$$\ln\{\lambda_{ij}\} = \mu + \alpha_i + \beta_j + \delta^{+} z_{ij} + \delta^{-} c_{ij}, \qquad z_{ij} = \begin{cases} 1 & i = j - 1 \\ 0 & \text{otherwise} \end{cases}, \quad c_{ij} = \begin{cases} 1 & i = j + 1 \\ 0 & \text{otherwise} \end{cases}$$
adds two separate parameters, one for the first band above the main diagonal (δ⁺) and one for the first band below the main diagonal (δ⁻). For this and other disagreement structures see Tanner and Young (1985).
When categories are ordered, agreement and association parameters can be combined. A linear-by-linear association model with homogeneous agreement, for example, becomes
$$\ln\{\lambda_{ij}\} = \mu + \alpha_i + \beta_j + \gamma u_i v_j + \delta z_{ij}, \qquad z_{ij} = \begin{cases} 1 & i = j \\ 0 & \text{otherwise.} \end{cases}$$
Table 6.23 lists the deviances and p-values for various log-linear models fit to the data in Table 6.21.

Table 6.23. Model deviances for various log-linear models fit to data in Table 6.21

         Linear-by-linear   Homog.      Nonhomog.
 Model   association        agreement   agreement    df    Deviance D   D/df    p-value
   1                                                 16      95.36†      5.96   <0.0001
   2            ×                                    15      27.90       1.86    0.0222
   3                            ×                    15      43.99       2.93    0.0001
   4                                        ×        11      36.85       3.35    0.0001
   5            ×               ×                    14      16.61       1.18    0.2776
   6            ×                           ×        10      15.65       1.56    0.1101
 † The independence model.
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2.$$
[Figure appears here: density of S².]
where x_i is the mean DON concentration on the ith truck, and σ_i² = exp{β₀ + β₁ ln{x_i}} is the mean of a Gamma random variable with scale parameter α = (n − 1)/2 = 4.5 (since each sample variance is based on n = 10 probe samples per truck). Notice that the mean can also be expressed in the form σ_i² = exp{β₀}·x_i^{β₁}.
The proc genmod statements to fit this model (Output 6.21) are
proc genmod data=don;
model donvar = logmean / link=log dist=gamma scale=4.5 noscale;
run;
The scale=4.5 option sets the Gamma scale parameter α to (n − 1)/2 = 4.5 and the noscale option prevents the scale parameter from being fit iteratively by maximum likelihood. The combination of the two options fixes α = 4.5 throughout the estimation process so that only the parameters β₀ and β₁ are estimated.
Model Information
Data Set WORK.FITTHIS
Distribution Gamma
Link Function Log
Dependent Variable donvar
Observations Used 9
Missing Values 1
Fixing the scale parameter at a certain value imposes a constraint on the model, since in the regular Gamma regression model the scale parameter would be estimated (see the yield-density application in §6.7.3, for example). The genmod procedure calculates a test of whether this constraint is reasonable and lists it in the Lagrange Multiplier Statistics table (Output 6.21). Based on the p-value of 0.4752 we conclude that estimating the scale parameter rather than fixing it at 4.5 would not improve the model. Fixing the parameter is reasonable.
The estimates for the intercept and slope are β̂₀ = −1.2065 and β̂₁ = 0.5832, respectively, from which the variance at any DON concentration can be estimated. For example, we expect the probe-to-probe variation on a truck with a deoxynivalenol concentration of 5 parts per million to be
$$\hat\sigma^2 = \exp\{-1.2065 + 0.5832\,\ln\{5\}\} = 0.765\ \mathrm{ppm}^2.$$
Figure 6.21 displays the predicted variances and approximate 95% confidence bounds for the predicted values. The model does not fit the data perfectly. The DON variance on the truck with an average concentration of 6.4 ppm is considerably off the trend. The model fits rather well for concentrations up to 5 ppm, however. It is noteworthy how the confidence bands widen as the DON concentration increases. Two effects are causing this. First, the probe-to-probe variances are not homoscedastic; according to the Gamma regression model
$$\mathrm{Var}[S_i^2] = \frac{1}{4.5}\exp\{2\beta_0 + 2\beta_1\ln(x_i)\}$$
increases sharply with x. Second, the data are very sparse for larger concentrations.
Figure 6.21. Predicted values for probe-to-probe variance from the Gamma regression (solid line). Dashed lines are asymptotic 95% confidence intervals for the mean variance at a given DON concentration.
Table 6.24. Poppy count data from Mead, Curnow, and Hasted (1993, p. 144)†

 Treatment   Block 1   Block 2   Block 3   Block 4
    A          538       422       377       315
    B          438       442       319       380
    C           77        61       157        52
    D          115        57       100        45
    E           17        31        87        16
    F           18        26        77        20
 † Used with permission.
Since these are count data, one could assume that the responses are Poisson-distributed and notice that for mean counts greater than 15 to 20, the Poisson distribution is closely approximated by a Gaussian distribution (Figure 6.22). The temptation to analyze these data by standard analysis of variance assuming Gaussian errors is thus understandable. From Figure 6.22 it can be inferred, however, that the variance of the Gaussian distribution approximating the Poisson mass function is linked to the mean. The distribution approximating the Poisson when the average counts are small will have a smaller variance than the approximating distribution for large counts, thereby violating the homogeneous variance assumption of the standard analysis of variance.
Figure 6.22. Poisson(20) probability mass function (bars) overlaid with the probability density function of a Gaussian(μ = 20, σ² = 20) random variable.
Mead et al. (1993, p. 145) highlight this problem by examining the mean square error estimate from the analysis of variance, which is σ̂² = 2,653. An off-the-cuff confidence interval for the mean poppy count of treatment F leads to
$$\bar y_F \pm 2\,\mathrm{se}(\bar y_F) = 35.25 \pm 2\sqrt{\frac{2{,}653}{4}} = 35.25 \pm 51.50 = [-16.25,\ 83.75].$$
Based on this confidence interval one would not reject the idea that the mean poppy count for treatment F could be −10, say. This is a nonsensical result. The variability of the counts is smaller for treatments with small average counts and larger for treatments with large average counts. The analysis of variance mean square error is a pooled estimator of the residual variability, too high for treatments such as F and too small for treatments such as A.
An analysis of these data based on a generalized linear model with Poisson-distributed
outcomes and a linear predictor that incorporates block and treatment effects is more
reasonable. Using the canonical log link, the following proc genmod statements fit this GLM.
ods exclude ParameterEstimates;
proc genmod data=poppies;
class block treatment;
model count = block treatment / link=log dist=Poisson;
run;
Output 6.22.
The GENMOD Procedure
Model Information
Algorithm converged.
The troubling statistic in Output 6.22 is the model deviance of 256.29 compared to its degrees of freedom (15). The data are considerably overdispersed relative to a Poisson distribution. A deviance/df ratio of 17.08 should give the modeler pause (since the target ratio is 1). A possible reason for the considerable overdispersion could be an incorrect linear predictor. If these data stem from an experimental design with proper blocking and randomization procedures, no other effects apart from block and treatment effects should be necessary. Ruling out an incorrect linear predictor, (positive) correlations among the poppy counts could be the cause of the overdispersion. Mead et al. (1993) explain the overdispersion by the fact that whenever there is one poppy in an experimental unit, there are almost always several; hence poppies are clustered and not distributed completely at random within (and possibly across) experimental units.
The overdispersion problem can be fixed if one assumes that the counts have mean and variance
$$\mathrm{E}[Y] = \lambda, \qquad \mathrm{Var}[Y] = \phi\lambda,$$
where φ is an overdispersion parameter. Adding the dscale option to the model statement of proc genmod will, for example, fit the same model as above but estimate an extra overdispersion parameter as
$$\hat\phi = \frac{256.29}{15} = 17.08$$
and report its square root as Scale in the table of parameter estimates. This approach to
accommodating overdispersion has several disadvantages in our opinion.
• It adds a parameter to the variance of the response which is not part of the Poisson distribution. The variance Var[Y] = φλ is no longer that of a Poisson random variable with mean λ. The analysis is thus no longer a maximum likelihood analysis for a Poisson model. It is a quasi-likelihood analysis in the sense of McCullagh (1983) and McCullagh and Nelder (1989).
• The parameter estimates are not affected by the inclusion of the extra overdispersion parameter φ; only their standard errors are. For the data set and model considered here, the standard errors of all parameter estimates will be √17.08 = 4.133 times larger in the model with Var[Y] = φλ. The predicted mean counts will remain the same, however.
• The addition of an overdispersion parameter does not induce correlations among the outcomes. If data are overdispersed because positive correlations among the observations are ignored, a multiplicative overdispersion parameter is the wrong remedy.
random variable reckoned over the possible values of a randomly varying parameter is larger than the conditional variability one obtains if the parameter is fixed at a certain value.
This mixing approach has particular appeal if the marginal distribution of the response variable remains a member of the exponential family. In this application we can assume that the poppy counts for treatment i in block j are distributed as Poisson(λ_ij), but that λ_ij is a random variable. Since the mean counts are non-negative, we choose a probability distribution for λ_ij that has nonzero density on the positive real line. If the mean of a Poisson(λ) variable is distributed as Gamma(μ, α), it turns out that the marginal distribution of the count follows the Negative Binomial distribution, which is a member of the exponential family (Table 6.2, p. 305). The details of how the marginal Negative Binomial distribution is derived are provided in §A6.8.6.
Since Release 8.0 of The SAS® System, Negative Binomial outcomes can be modeled
with proc genmod using the dist=negbin option of the model statement. The canonical link of
this distribution is the log link. Starting with Release 8.1, the Negative Binomial distribution
is coded into proc nlmixed. Using an earlier version of The SAS® System (Release 8.0), we
can also fit count data with this distribution in proc nlmixed by coding the log likelihood
with SAS® programming statements and using the general() formulation of the model
statement. For the poppy count data these statements are as follows.
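(The opening statements of this run were omitted here; the following reconstruction is a sketch based on the description below. The starting values are assumptions, and block 4 and treatment F are treated as reference levels, consistent with the estimate statements shown later.)
proc nlmixed data=poppies df=14;
   parameters intcpt=6 bl1=0 bl2=0 bl3=0
              tA=0 tB=0 tC=0 tD=0 tE=0 k=1;
   /* linear predictor eta_ij = mu + tau_i + rho_j */
   linp = intcpt;
   if      block = 1 then linp = linp + bl1;
   else if block = 2 then linp = linp + bl2;
   else if block = 3 then linp = linp + bl3;
   if      treatment = 'A' then linp = linp + tA;
   else if treatment = 'B' then linp = linp + tB;
   else if treatment = 'C' then linp = linp + tC;
   else if treatment = 'D' then linp = linp + tD;
   else if treatment = 'E' then linp = linp + tE;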
b = exp(linp)/k;
ll = lgamma(count+k) - lgamma(k) - lgamma(count + 1) +
k*log(1/(b+1)) + count*log(b/(b+1));
model count ~ general(ll);
run;
The parameters statement assigns starting values to the model parameters. Although there are four block and six treatment effects, only three block and five treatment effects need to be coded due to the constraints on the block and treatment effects. Because proc nlmixed determines residual degrees of freedom by a different method than proc genmod, we added the df=14 option to the proc nlmixed statement. This ensures the same degrees of freedom as in the Poisson analysis minus one degree of freedom for the additional parameter of the Negative Binomial that determines the degree of overdispersion relative to the Poisson model. The several lines of if ... then ...; else ...; statements that follow the parameters statement set up the linear predictor as a function of block effects (bl1–bl3) and treatment effects (tA–tE). The statements
b = exp(linp)/k;
ll = lgamma(count+k) - lgamma(k) - lgamma(count + 1) +
k*log(1/(b+1)) + count*log(b/(b+1));
code the Negative Binomial log likelihood. We have chosen a log likelihood based on the parameterization in §A6.8.6. The model statement finally instructs proc nlmixed to perform a maximum likelihood analysis for the variable count where the log likelihood is determined by the variable ll. This analysis results in maximum likelihood estimates for Negative Binomial responses with linear predictor
$$\eta_{ij} = \mu + \tau_i + \rho_j.$$
The parameterization used in this proc nlmixed code expresses the mean response simply as exp{η_ij} for ease of comparison with the Poisson analysis. The abridged nlmixed output follows.
Specifications
Description Value
Data Set WORK.POPPIES
Dependent Variable count
Distribution for Dependent Variable General
Optimization Technique Dual Quasi-Newton
Integration Method None
Fit Statistics
Description Value
-2 Log Likelihood 233.1
AIC (smaller is better) 253.1
BIC (smaller is better) 264.9
Log Likelihood -116.5
AIC (larger is better) -126.5
BIC (larger is better) -132.4
Parameter Estimates
Standard
Parameter Estimate Error DF t Value Pr > |t|
Minus twice the log likelihood for this model is 295.4 (output not shown) and the likelihood ratio test statistic to test for equal average poppy counts among the treatments is
$$\Lambda = 295.4 - 233.1 = 62.3.$$
The p-value for this statistic is Pr(χ²₅ ≥ 62.3) < 0.0001. Hence not all treatments have the same average poppy count and we can proceed with pairwise comparisons based on treatment averages. These averages can be calculated and compared with the estimate statement of the nlmixed procedure. For example, to compare the predicted counts for treatments A and B, A and C, and D and E we add the statements
estimate 'count(A)-count(B)' exp(intcpt+0.25*bl1+0.25*bl2+0.25*bl3+tA) -
exp(intcpt+0.25*bl1+0.25*bl2+0.25*bl3+tB);
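/* The remaining two comparisons follow the same pattern (a sketch;
   the labels mirror the book's first estimate statement): */
estimate 'count(A)-count(C)' exp(intcpt+0.25*bl1+0.25*bl2+0.25*bl3+tA) -
                             exp(intcpt+0.25*bl1+0.25*bl2+0.25*bl3+tC);
estimate 'count(D)-count(E)' exp(intcpt+0.25*bl1+0.25*bl2+0.25*bl3+tD) -
                             exp(intcpt+0.25*bl1+0.25*bl2+0.25*bl3+tE);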
to the nlmixed code. The linear predictors for each treatment take averages over the block
effects prior to exponentiation. The table added by these three statements to the proc nlmixed
output shown earlier follows.
Output 6.24.
Additional Estimates
Standard
Label Estimate Error DF t Value Pr > |t|
“The new methods occupy an altogether higher plane than that in which
ordinary statistics and simple averages move and have their being.
Unfortunately, the ideas of which they treat, and still more, the many
technical phrases employed in them, are as yet unfamiliar. The arithmetic
they require is laborious, and the mathematical investigations on which
the arithmetic rests are difficult reading even for experts... This new
departure in science makes its appearance under conditions that are
unfavourable to its speedy recognition, and those who labour in it must
abide for some time in patience before they can receive sympathy from the
outside world.” Sir Francis Galton.
7.1 Introduction
7.2 The Laird-Ware Model
7.2.1 Rationale
7.2.2 The Two-Stage Concept
7.2.3 Fixed or Random Effects
7.3 Choosing the Inference Space
7.4 Estimation and Inference
7.4.1 Maximum and Restricted Maximum Likelihood
7.4.2 Estimated Generalized Least Squares
7.4.3 Hypothesis Testing
7.5 Correlations in Mixed Models
7.5.1 Induced Correlations and the Direct Approach
7.5.2 Within-Cluster Correlation Models
7.5.3 Split-Plots, Repeated Measures, and the Huynh-Feldt Conditions
7.6 Applications
7.6.1 Two-Stage Modeling of Apple Growth over Time
• A linear mixed effects model contains fixed and random effects and is linear
in these effects. Models for subsampling designs or split-plot-type models
are mixed models.
A distinction was made in §1.7.5 between fixed, random, and mixed effects models based on
the number of random variables and fixed effects involved in the statistical model. A mixed
effects model — or mixed model for short — contains fixed effects as well as at least two
random variables (one of which is the obligatory model error). Mixed models arise quite fre-
quently in designed experiments. In a completely randomized design with subsampling, for
example, $t$ treatments are assigned at random to $rt$ experimental units. A random subsample
of $n$ observations is then drawn from every experimental unit. This design is practical if an
experimental unit is too large to be measured in its entirety; for example, a field plot contains
twelve rows of a particular crop but only three rows per plot can be measured and analyzed.
Soil samples are often randomly divided into subsamples prior to laboratory analysis. The
statistical model for such a design can be written as
$$Y_{ijk} = \mu + \tau_i + e_{ij} + d_{ijk} \qquad [7.1]$$
$$i = 1,\dots,t;\; j = 1,\dots,r;\; k = 1,\dots,n,$$
where $e_{ij} \sim (0, \sigma_e^2)$ and $d_{ijk} \sim (0, \sigma_d^2)$ are zero mean random variables representing the experi-
mental and observational errors, respectively. If the treatment effects $\tau_i$ are fixed, i.e., the
levels of the treatment factor were predetermined and not chosen at random, this is a mixed
model.
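In The SAS® System, a model such as [7.1] can be fit with proc mixed. The statements below
are a minimal sketch; the data set name crd and the variables trt (treatment), eu (experimental
unit identifier), and y (response) are hypothetical names, not from the original example.
    proc mixed data=crd;
      class trt eu;
      model y = trt;      /* fixed treatment effects tau_i                   */
      random eu(trt);     /* experimental errors e_ij enter as random effects */
    run;                  /* the observational error d_ijk is the residual    */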
In the past fifteen years, mixed linear models have risen to great importance in statistical
modeling because of their tremendous flexibility. As will be demonstrated shortly, mixed
models are more general than standard regression and classification models and contain the
latter. In the analysis of designed experiments mixed models have been in use for a long time.
They arise naturally through the process of randomization as shown in the introductory
example. In agricultural experiments, the most important traditional mixed models are those
for subsampling and split-plot type designs (Example 7.1).
Data from subsampling and split-plot designs are clustered structures (§2.4.1). In the for-
mer, experimental units are clusters for the subsamples and in the latter whole-plots are clus-
ters for the subplot treatments. In general, mixed models arise very naturally in situations
where data are clustered or hierarchically organized. This is by no means restricted to de-
signed experiments with splits or subsampling. Longitudinal studies and repeated measures
[Figure 7.1. Randomized complete block design with four blocks (= replicates) for
management strategies M1 to M3; the figure sketches the random arrangement of M1, M2,
and M3 within each replicate.]
This experiment involves two separate stages of randomization. The management types
are assigned at random to the fields and, independently thereof, the crops are assigned to
the plots within a field. The variability associated with the fields should be independent
of the variability associated with the plots. Ignoring the management types and
focusing on the crop alone, the design is a randomized complete block with $3 \times 4 = 12$
blocks of size $3$ and the model is
$$Y_{ijk} = \mu + \rho_{ij} + \alpha_k + d_{ijk}, \qquad [7.2]$$
$k = 1,\dots,3$, where $\rho_{ij}$ is the effect of the $ij$th block, $\alpha_k$ is the effect of the $k$th crop type,
and $d_{ijk}$ is the experimental error associated with the plots.
[Figure 7.2. Split-plot design for management strategies (M1 to M3) and crop types
(corn, soybean, wheat); each field (whole-plot) within Replicates I to IV is split into
three plots to which the crops are randomized.]
Both RCBDs are analysis of variance models with a single error term. The mixed model
comes about when the two randomizations are combined into a single model, letting
$\rho_{ij} = \rho_j + \tau_i + e_{ij}$,
$$Y_{ijk} = \mu + \rho_j + \tau_i + e_{ij} + \alpha_k + (\tau\alpha)_{ik} + d_{ijk}. \qquad [7.3]$$
The two random variables depicting experimental error on the field and plot level are
$e_{ij}$ and $d_{ijk}$, respectively, and the fixed effects are $\rho_j$ for the $j$th replicate, $\tau_i$ for the $i$th
management strategy, $\alpha_k$ for the $k$th crop type, and $(\tau\alpha)_{ik}$ for their interaction. This is a
classical split-plot design where the whole-plot factor management strategy is arranged
in a randomized block design with four replications (blocks) and the subplot factor crop
type has three levels.
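Model [7.3] maps directly onto proc mixed syntax. The following statements are a sketch
under the assumption that the data set is named splitplot with variables rep, mgmt, crop,
and response y (hypothetical names):
    proc mixed data=splitplot;
      class rep mgmt crop;
      model y = rep mgmt crop mgmt*crop;  /* rho_j, tau_i, alpha_k, (tau alpha)_ik */
      random rep*mgmt;                    /* whole-plot error e_ij                 */
    run;                                  /* subplot error d_ijk is the residual   */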
The distinction between longitudinal and repeated measures data adopted here is as
follows: if data are collected repeatedly on experimental material to which treatments were
applied initially, the data structure is termed a repeated measure. Data collected repeatedly
over time in an observational study are termed longitudinal. A somewhat different distinction
between repeated measures and longitudinal data in the literature assumes that
cluster effects do not change with time in repeated measures models, while time as an
explanatory variable of within-cluster variation related to growth, development, and aging
assumes a central role with longitudinal data (Rao 1965, Hayes 1973). In designed experi-
ments involving time the assumption of time-constant cluster effects is often not tenable.
Treatments are applied initially and their effects may wear off over time, changing the
relationship among treatments as the experiment progresses. Treatment-time interactions are
important aspects of repeated measures experiments and the investigation of trends over time
and how they change among treatments can be the focus of the investigation.
The appeal of mixed models lies in their flexibility to handle diverse forms of hierarchi-
cally organized data. Depending on application and circumstance the emphasis of data
analysis will differ. In designed experiments treatment comparisons, the estimation of treat-
ment effects and contrasts, and the estimation of sources of variability come to the fore. In re-
peated measures analyses investigating the interactions of treatment and time and modeling
the response trends over time play an important role in addition to treatment comparisons. For
longitudinal data emphasis is on developing regression-type models that account for cluster-
to-cluster and within-cluster variation and on estimating the mean response for the population
average and/or for the specific clusters. In short, designed experiments with clustered data
structure emphasize the between-cluster variation, longitudinal data analyses emphasize the
within-cluster variation, and repeated measures analyses place more emphasis on either one
depending on the goals of the analysis.
Mixed models are an efficient vehicle for separating between-cluster and within-cluster
variation, a separation that is essential in the analysis of clustered data. In a completely randomized design
(CRD) with subsampling,
$$Y_{ijk} = \mu + \tau_i + e_{ij} + d_{ijk},$$
where $e_{ij}$ denotes the experimental error ($EE$) associated with the $j$th replication of the $i$th
treatment and $d_{ijk}$ is the observational (subsampling) error among the subsamples from an
experimental unit, between-cluster variation of units treated alike is captured by $\mathrm{Var}[e_{ij}]$ and
within-cluster variation by $\mathrm{Var}[d_{ijk}]$. To gauge whether treatments are effective, the mean
square due to treatments should be compared to the magnitude of variation among experimen-
tal units treated alike, $MS(EE)$, not the observational error mean square. Analysis based on
a model with a single error term, $Y_{ijk} = \mu + \tau_i + \epsilon_{ijk}$, assuming the $\epsilon_{ijk}$ are independent,
would be inappropriate and could lead to erroneous conclusions about treatment performance.
We demonstrate this effect with the application in §7.6.3.
Not recognizing variability on the cluster and within-cluster level is dangerous from
another point of view. In the CRD with subsampling the experimental errors $e_{ij}$ are uncorrela-
ted, as are the observational errors $d_{ijk}$, owing to the random assignment of treatments to
experimental units and the random selection of samples from the units. Furthermore, the $e_{ij}$
and $d_{ijk}$ are not correlated with each other. Does that imply that the responses $Y_{ijk}$ are not
correlated? Some basic covariance operations provide the answer,
$$\mathrm{Cov}[Y_{ijk}, Y_{ijk'}] = \mathrm{Cov}[e_{ij} + d_{ijk},\; e_{ij} + d_{ijk'}] = \mathrm{Cov}[e_{ij}, e_{ij}] = \mathrm{Var}[e_{ij}] \neq 0.$$
While observations from different experimental units are uncorrelated, observations from the
same cluster are correlated. Ignoring the hierarchical structure of the two error terms by put-
ting $e_{ij} + d_{ijk} = \epsilon_{ijk}$ and assuming the $\epsilon_{ijk}$ are independent ignores correlations among the
experimental outcomes. Since $\mathrm{Cov}[Y_{ijk}, Y_{ijk'}] = \mathrm{Var}[e_{ij}] > 0$, the observations are positively
correlated. If correlations are ignored, $p$-values will be too small (§2.5.2), even if the $\epsilon_{ijk}$ re-
presented variability of experimental units treated alike, which they do not. In longitudinal and
repeated measures data, correlations enter the data more directly. Since measurements are
collected repeatedly on subjects or units, these measurements are likely autocorrelated.
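The strength of the within-cluster association in the subsampling design follows directly
from the covariance just derived; dividing by the variance of an observation gives the
intraclass correlation. The display below makes this step explicit:
$$\mathrm{Corr}[Y_{ijk}, Y_{ijk'}] = \frac{\mathrm{Var}[e_{ij}]}{\mathrm{Var}[e_{ij}] + \mathrm{Var}[d_{ijk}]} = \frac{\sigma_e^2}{\sigma_e^2 + \sigma_d^2}, \qquad k \neq k'.$$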
Besides correlations among the observations, clustered data provide another challenge for
the analyst, who must decide whether the emphasis is cluster-specific or population-average
inference. In a longitudinal study, for example, a natural focus of investigation is trends over
Example 7.2. Gregoire, Schabenberger, and Barrett (1995) analyze data from a longitu-
dinal study of naturally grown Douglas fir (Pseudotsuga menziesii (Mirb.) Franco) stands
(plots) scattered throughout the western Cascades and coastal range of Washington and
Oregon in the northwestern United States. Plots were visited repeatedly between 1970 and
1982, a minimum of 6 and a maximum of 10 times. Measurement intervals for a
given plot varied between 1 and 3 years. The works of Gregoire (1985, 1987) discuss
the data more fully.
[Figure 7.3. Height-age profiles of ten Douglas fir (Pseudotsuga menziesii) stands; stand
height (m) plotted against stand age (years). Data kindly provided by Dr. Timothy G.
Gregoire, School of Forestry and Environmental Studies, Yale University. Used with
permission.]
Figure 7.3 shows the height of the socially dominant trees vs. the stand age for ten of
the 65 stands. Each stand depicted represents a cluster of observations. The stands
differed in age at the onset of the study. The height development over the range of years
during which observations were taken is almost linear for the ten stands. However, the
slopes of the trends differ, as do the maximum heights achieved. This could be due
to increased natural variability in height development with age or to differences in
micro-site conditions. Growing sites may be more homogeneous among younger than
among older stands, a feature often found in man-made forests. Figure 7.4 shows the
sample means for each stand (cross-hairs) and the population-averaged trend derived
from a simple linear regression model
$$H_{ij} = \beta_0 + \beta_1\,\mathrm{age}_{ij} + e_{ij},$$
where $H_{ij}$ is the $j$th height for the $i$th stand. To predict the height of a stand at a given
age not in the data set, the population-average trend would be used. However, to predict
the height of a stand for which data were collected, a more precise and accurate predic-
tion should be possible if information about the stand's trend relative to the population
trend is utilized. Focusing on the population average only, residuals are measured as
deviations between observed values and the population trend. Cluster-specific predic-
tions utilize smaller deviations between observed values and the specific trend.
[Figure 7.4. Population-average trend and stand- (cluster-) specific means for Douglas
fir data; stand height (m) against stand age (years). The dotted line represents population-
average prediction of heights at a given age.]
fits a separate slope $\beta_{1i}$ for each stand. A total of eleven fixed effects have to be estimated:
ten slopes and one intercept. If the intercepts also vary by stand, the fixed effects
regression model becomes
$$H_{ij} = \beta_{0i} + \beta_{1i}\,\mathrm{age}_{ij} + e_{ij}.$$
[Figure 7.5. Observed stand heights and ages at initial visit and linear trend; stand height
at initial visit (m) against stand age (years).]
Early approaches to modeling of longitudinal and repeated measures data employed this
philosophy: to separate the data into as many subsets as there are clusters and fit a model
separately to each cluster. Once the individual estimates $\hat\beta_{01},\dots,\hat\beta_{0,10}$ and $\hat\beta_{11},\dots,\hat\beta_{1,10}$ were
obtained, the population-average estimates were calculated as some weighted average of the
cluster-specific intercepts and slopes. This two-step approach is inefficient as it ignores infor-
mation contributed by other clusters in the estimation process and leads to parameter prolif-
eration. While the data points from cluster 3 do not contribute information about the slope in
cluster 1, the information in cluster 3 nevertheless can contribute to the estimation of the var-
iance of observations about the cluster means. As the complexity of the individual trends and
the cluster-to-cluster heterogeneity increases, the approach becomes impractical. If the num-
ber of observations per cluster is small, it may turn out to be actually impossible. If, for
iance of observations about the cluster means. As the complexity of the individual trends and
the cluster-to-cluster heterogeneity increases, the approach becomes impractical. If the num-
ber of observations per cluster is small, it may turn out to be actually impossible. If, for
example, the individual trends are quadratic, and only two observations were collected on a
particular cluster, the model cannot be fit to that cluster's data. If the mean function is non-
linear, fitting separate regression models to each cluster is plagued with numerical problems
as nonlinear models require fairly large amounts of information to produce stable parameter
estimates. It behooves us to develop an approach to data analysis that allows cluster-specific
and population-average inference simultaneously without parameter proliferation. This is
achieved by allowing effects in the model to vary at random rather than treating them as
fixed. These ideas are cast in the Laird-Ware model.
• The Laird-Ware model is a two-stage model for clustered data where the
first stage describes the cluster-specific response and the second stage
captures cluster-to-cluster heterogeneity by randomly varying parameters
of the first-stage model.
• Most linear mixed models can be cast as Laird-Ware models, even if the
two-stage concept may not seem natural at first, e.g., split-plot designs.
7.2.1 Rationale
Although mixed model procedures were developed by Henderson (1950, 1963, 1973), Gold-
berger (1962), and Harville (1974, 1976a, 1976b) prior to the seminal article of Laird and
Ware (1982), it was the Laird and Ware contribution that showed the wide applicability of
linear mixed models and provided a convenient framework for parameter estimation and in-
ference. Although their discussion focused on longitudinal data, the applicability of the Laird-
Ware model to other clustered data structures is easily recognized. The basic idea is as fol-
lows. The probability distribution for the measurements within a cluster has the same general
form for all clusters, but some or all of the parameters defining this distribution vary ran-
domly across clusters. First, define $\mathbf{Y}_i$ to be the $(n_i \times 1)$ vector of observations for the $i$th
cluster. In a designed experiment, $\mathbf{Y}_i$ represents the data vector collected from a single experi-
mental unit. Typically, in this case, additional subscripts will be needed to identify replica-
tions, treatments, whole-plots, sub-plots, etc. For the time being, it is assumed without loss of
generality that the single subscript $i$ identifies an individual cluster. For the leftmost
(youngest) Douglas fir stand in Figure 7.3, for example, 10 observations were collected at
ages 14, 15, 16, 17, 18, 19, 22, 23, 24, 25. The measured heights for this sixth stand in the
data set are assembled in the response vector
$$\mathbf{Y}_6 = \begin{bmatrix} Y_{61}\\ Y_{62}\\ Y_{63}\\ Y_{64}\\ Y_{65}\\ Y_{66}\\ Y_{67}\\ Y_{68}\\ Y_{69}\\ Y_{6,10} \end{bmatrix} = \begin{bmatrix} 11.60\\ 12.50\\ 13.41\\ 13.59\\ 15.19\\ 15.82\\ 18.23\\ 19.57\\ 20.43\\ 20.68 \end{bmatrix}.$$
The Laird-Ware model assumes that the average behavior of the clusters is the same for
all clusters, varied only by cluster-specific explanatory variables. In matrix notation this is
$$\mathrm{E}[\mathbf{Y}_i] = \mathbf{X}_i\boldsymbol\beta,$$
where $\mathbf{X}_i$ is an $(n_i \times p)$ design or regressor matrix and $\boldsymbol\beta$ is a $(p \times 1)$ vector of regression
coefficients. Observe that clusters share the same parameter vector $\boldsymbol\beta$, but clusters can have
different values of the regressor variables. If $\mathbf{X}_i$ contains a column of measurement times, for
example, these do not have to be the same across clusters. The Laird-Ware model easily
accommodates unequal spacing of measurements. Also, the number of cluster elements, $n_i$,
can vary from cluster to cluster. Clusters with the same set of regressors $\mathbf{X}_i$ do not elicit the
same response, as suggested by the common parameter $\boldsymbol\beta$.
To allow clusters to vary in the effect of the explanatory variables on the outcome we can
put $\mathrm{E}[\mathbf{Y}_i\,|\,\mathbf{b}_i] = \mathbf{X}_i(\boldsymbol\beta + \mathbf{b}_i) = \mathbf{X}_i\boldsymbol\beta + \mathbf{X}_i\mathbf{b}_i$. The $\mathbf{b}_i$ in this expression determine how much
the $i$th cluster population-average response $\mathbf{X}_i\boldsymbol\beta$ must be adjusted to capture the cluster-
specific behavior $\mathbf{X}_i\boldsymbol\beta + \mathbf{X}_i\mathbf{b}_i$. In a practical application not all of the explanatory variables
have effects that vary among clusters and we can add generality to the model by putting
$\mathrm{E}[\mathbf{Y}_i\,|\,\mathbf{b}_i] = \mathbf{X}_i\boldsymbol\beta + \mathbf{Z}_i\mathbf{b}_i$, where $\mathbf{Z}_i$ is an $(n_i \times k)$ design or regressor matrix. In this formula-
tion not all the columns of $\mathbf{X}_i$ are repeated in $\mathbf{Z}_i$, and on occasion one may place explanatory
variables in $\mathbf{Z}_i$ that are not part of $\mathbf{X}_i$ (although this is much less frequent than the opposite
case where the columns of $\mathbf{Z}_i$ are a subset of the columns of $\mathbf{X}_i$).
The expectation was reckoned conditionally, because $\mathbf{b}_i$ is a vector of random variables.
We assume that $\mathbf{b}_i$ has mean $\mathbf{0}$ and variance-covariance matrix $\mathbf{D}$. Laird and Ware (1982)
term this a two-stage model. The first stage specifies the conditional distribution of $\mathbf{Y}_i$, given
the $\mathbf{b}_i$, as
$$\mathbf{Y}_i\,|\,\mathbf{b}_i \sim G(\mathbf{X}_i\boldsymbol\beta + \mathbf{Z}_i\mathbf{b}_i,\; \mathbf{R}_i). \qquad [7.6]$$
In the second stage it is assumed that the $\mathbf{b}_i$ have a Gaussian distribution with mean $\mathbf{0}$ and
variance matrix $\mathbf{D}$, $\mathbf{b}_i \sim G(\mathbf{0}, \mathbf{D})$. The random effects $\mathbf{b}_i$ are furthermore assumed indepen-
dent of the errors $\mathbf{e}_i$. The (marginal) distribution of the responses then is also Gaussian:
$$\mathbf{Y}_i \sim G(\mathbf{X}_i\boldsymbol\beta,\; \mathbf{R}_i + \mathbf{Z}_i\mathbf{D}\mathbf{Z}_i'). \qquad [7.7]$$
Combining the two stages, the model can be written as
$$\mathbf{Y}_i = \mathbf{X}_i\boldsymbol\beta + \mathbf{Z}_i\mathbf{b}_i + \mathbf{e}_i. \qquad [7.8]$$
Model [7.8] is a classical mixed linear model. It contains a fixed effect mean structure given
by $\mathbf{X}_i\boldsymbol\beta$ and a random structure given by $\mathbf{Z}_i\mathbf{b}_i + \mathbf{e}_i$. The $\mathbf{b}_i$ are sometimes called random
effects if $\mathbf{Z}_i$ is a design matrix consisting of 0's and 1's, and random coefficients if $\mathbf{Z}_i$ is a
regressor matrix. We will refer to $\mathbf{b}_i$ simply as the random effects.
The extent to which clusters vary about the population-average response is expressed by
the variability of the $\mathbf{b}_i$. If $\mathbf{D} = \mathbf{0}$, the model reduces to a fixed effects regression or classifica-
tion model. Taking expectations over the distribution of the random effects, one arrives at
$$\mathrm{E}[\mathbf{Y}_i] = \mathrm{E}[\mathrm{E}[\mathbf{Y}_i\,|\,\mathbf{b}_i]] = \mathrm{E}[\mathbf{X}_i\boldsymbol\beta + \mathbf{Z}_i\mathbf{b}_i] = \mathbf{X}_i\boldsymbol\beta$$
$$\mathrm{Var}[\mathbf{Y}_i] = \mathbf{R}_i + \mathbf{Z}_i\mathbf{D}\mathbf{Z}_i' = \mathbf{V}_i. \qquad [7.10]$$
The marginal variance follows from the standard result by which unconditional variances can
be derived from conditional expectations:
$$\mathrm{Var}[Y] = \mathrm{E}[\mathrm{Var}[Y\,|\,X]] + \mathrm{Var}[\mathrm{E}[Y\,|\,X]].$$
Applying this to the mixed model [7.8] under the assumption that $\mathrm{Cov}[\mathbf{b}_i, \mathbf{e}_i] = \mathbf{0}$ leads to
$$\mathrm{E}[\mathrm{Var}[\mathbf{Y}_i\,|\,\mathbf{b}_i]] + \mathrm{Var}[\mathrm{E}[\mathbf{Y}_i\,|\,\mathbf{b}_i]] = \mathbf{R}_i + \mathbf{Z}_i\mathbf{D}\mathbf{Z}_i'.$$
In contrast to [7.9], [7.10] expresses the marginal or population-average mean and variance of
cluster $i$.
The Laird-Ware model [7.8] is quite general. If the design matrix for the random effects
is absent, $\mathbf{Z}_i = \mathbf{0}$, the Laird-Ware model reduces to the classical linear regression model.
Similarly, if the random effects do not vary, i.e.,
$$\mathrm{Var}[\mathbf{b}_i] = \mathbf{D} = \mathbf{0},$$
all random effects must be exactly $\mathbf{b}_i \equiv \mathbf{0}$ since $\mathrm{E}[\mathbf{b}_i] = \mathbf{0}$, and the model reduces to a linear
regression model
$$\mathbf{Y}_i = \mathbf{X}_i\boldsymbol\beta + \mathbf{e}_i.$$
If the fixed effects coefficient vector $\boldsymbol\beta$ is zero, the model becomes a random effects model
$$\mathbf{Y}_i = \mathbf{Z}_i\mathbf{b}_i + \mathbf{e}_i.$$
To motivate the latter consider the following experiment.
Example 7.3. Twenty laboratories are randomly selected from a list of laboratories pro-
vided by the Association of Official Seed Analysts (AOSA). Each laboratory receives 4
bags of 100 seeds each, selected at random from a large lot of soybean seeds. The
laboratories perform germination tests on the seeds, separately for each of the bags, and
report the results back to the experimenter. A statistical model to describe the variability
of germination test results must accommodate laboratory-to-laboratory differences and
inhomogeneities in the seed lot. The results from two different laboratories may differ
even if they perform exactly the same germination tests with the same precision and
care. A model for this situation is
$$Y_{ij} = \mu + \alpha_i + e_{ij},$$
where $Y_{ij}$ is the germination percentage reported by the $i$th laboratory for the $j$th 100-
seed sample it received. $\mu$ is the overall germination percentage of the seed lot. $\alpha_i$ is a
random variable with mean $0$ and variance $\sigma_\alpha^2$ measuring the lab-specific deviation
from the overall germination percentage. $e_{ij}$ is a random variable with mean $0$ and
variance $\sigma^2$ measuring intralaboratory variability due to the four samples within a
laboratory.
Since apart from the grand mean $\mu$ all terms in the model are random, this is a random
effects model. In terms of the components of the Laird-Ware model we can define a
cluster to consist of the four samples sent to a laboratory and let $\mathbf{Y}_i = [Y_{i1},\dots,Y_{i4}]'$.
Then our model for the $i$th laboratory is
$$\mathbf{Y}_i = \mathbf{1}_4\mu + \mathbf{1}_4\alpha_i + \mathbf{e}_i,$$
that is, $\mathbf{X}_i = \mathbf{Z}_i = \mathbf{1}_4$, a $(4 \times 1)$ vector of ones, with $\boldsymbol\beta = \mu$ and $\mathbf{b}_i = \alpha_i$.
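A random effects model of this kind could be fit in proc mixed as sketched below; the data
set name germ and the variable names lab and pct are assumptions for illustration only.
    proc mixed data=germ;
      class lab;
      model pct = ;    /* the fixed part consists of the grand mean mu only */
      random lab;      /* lab effects alpha_i ~ (0, sigma_alpha^2)          */
    run;               /* e_ij is the residual with variance sigma^2        */
The empty effects list on the model statement fits an intercept-only fixed part, so the two
variance components sigma_alpha^2 and sigma^2 carry all remaining structure.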
where $Y_{ij}$ denotes the yield per plant of variety $i$ grown at density $x_{ij}$. The parameters $\alpha$, $\beta$,
and $\theta$ were initially assumed to vary among the varieties in a deterministic manner, i.e., were
fixed. We could also cast this model in the mixed model framework. Since we are concerned
with linear models in this chapter, we concentrate on the inverse plant yield $Y_{ij}^{-1}$ and its rela-
tionship to plant density for the data shown in Table 5.15 and Figure 5.34 (p. 288). It
seems reasonable to assume that $Y^{-1}$ is linearly related to density for any of the three
varieties. The general model is
$$Y_{ij}^{-1} = \alpha_i + \beta_i x_{ij} + e_{ij},$$
which in two-stage form becomes
$$Y_{ij}^{-1} = (\alpha + b_{1i}) + (\beta + b_{2i})x_{ij} + e_{ij}. \qquad [7.11]$$
In this formulation $\alpha$ and $\beta$ are the population parameters and $b_{1i}$, $b_{2i}$ measure the degree to
which the population-averaged intercept ($\alpha$) and slope ($\beta$) must be modified to accommodate
the $i$th variety's response. These are the cluster effects. The second stage constitutes the
assumption that $b_{1i}$ and $b_{2i}$ are randomly drawn from a universe of possible values for the
intercept and slope adjustment. In other words, it is assumed that $b_{1i}$ and $b_{2i}$ are random
variables with mean zero and variances $\sigma_1^2$ and $\sigma_2^2$, respectively. Assume $\sigma_2^2 = 0$ for the
moment. A random variable whose variance is zero is a constant that takes on its mean value,
which in this case is zero. If $\sigma_2^2 = 0$ the model reduces to
$$Y_{ij}^{-1} = \alpha + b_{1i} + \beta x_{ij} + e_{ij},$$
stating that varieties differ in the relationship between inverse plant yield and plant density
only in their intercept, not their slope. This is a model with parallel trends among varieties.
Imagine there are 30 varieties ($i = 1,\dots,30$). The test of slope equality if the $\beta_i$ are fixed
effects is based on the hypothesis $H_0$: $\beta_1 = \beta_2 = \dots = \beta_{30}$, a twenty-nine degree of freedom
hypothesis. In the mixed model setup the test of slope equality involves only a single param-
eter, $H_0$: $\sigma_2^2 = 0$. Even in this nonlongitudinal setting, the two-stage concept is immensely
appealing if we view varietal differences as random disturbances about a conceptual average
variety.
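A random intercept and slope model such as [7.11] can be fit in proc mixed as sketched
below. The data set name onion and the variables variety, density, and invyield are
hypothetical names for this illustration.
    proc mixed data=onion;
      class variety;
      model invyield = density / s;              /* population intercept and slope */
      random intercept density / subject=variety /* b_1i and b_2i                  */
                                 type=vc s;      /* type=vc keeps D diagonal       */
    run;
The single-parameter test of slope homogeneity, H0: sigma_2^2 = 0, can then be approached
by comparing this fit against one whose random statement omits density.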
We have identified a population-average model for relating inverse plant yield to density,
$$\mathrm{E}[Y_{ij}^{-1}] = \alpha + \beta x_{ij},$$
and how to modify the population average with random effects to achieve a cluster-specific
(= variety-specific) model parsimoniously. For five hypothetical varieties Figure 7.6 shows
the flexibility of the mixed model formulation for model [7.11] under the following
assumptions:
• $\sigma_1^2 = \sigma_2^2 = 0$. This is a purely fixed effects model where all varieties share the same
dependency on plant density (Figure 7.6a).
• $\sigma_2^2 = 0$. Varieties vary in intercept only (Figure 7.6b).
• $\sigma_1^2 = 0$. Varieties vary in slope only (Figure 7.6c).
• $\sigma_1^2 \neq 0$, $\sigma_2^2 \neq 0$. Varieties differ in slope and intercept (Figure 7.6d).
[Figure 7.6. Fixed and mixed model trends (inverse plant yield against plant density) for
five hypothetical varieties. Purely fixed effects model (a; 2 parameters $\alpha$, $\beta$), randomly
varying intercepts (b; 3 parameters $\alpha$, $\beta$, $\sigma_1^2$), randomly varying slopes (c; 3 parameters
$\alpha$, $\beta$, $\sigma_2^2$), randomly varying intercepts and slopes (d; 4 parameters $\alpha$, $\beta$, $\sigma_1^2$, $\sigma_2^2$). The
population-averaged trend is shown as a dashed line in panels (b) to (d). The same
differentiation in cluster-specific effects as in (d) with a purely fixed effects model would
have required 10 parameters. Parameter counts exclude $\mathrm{Var}[e_{ij}] = \sigma^2$.]
If both intercept and slope vary at random among varieties, we have $\mathbf{Z}_i = \mathbf{X}_i$. If only the
intercepts vary, $\mathbf{Z}_i$ is the first column of $\mathbf{X}_i$. If only the slopes vary at random among
where $\rho_j$ $(j = 1,2)$ are the whole-plot block effects, $\tau_i$ $(i = 1,2)$ are the whole-plot treatment
effects, $e_{ij}$ are the whole-plot experimental errors, $\alpha_k$ $(k = 1,2)$ are the sub-plot treatment
effects, $(\tau\alpha)_{ik}$ are the interactions, and $e_{ijk}$ denotes the sub-plot experimental errors. Using
matrices and vectors the model can be expressed as follows:
$$\begin{bmatrix} Y_{111}\\ Y_{112}\\ Y_{121}\\ Y_{122}\\ Y_{211}\\ Y_{212}\\ Y_{221}\\ Y_{222} \end{bmatrix} =
\begin{bmatrix}
1 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0\\
1 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0\\
1 & 0 & 1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0\\
1 & 0 & 1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0\\
1 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0\\
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1\\
1 & 0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0\\
1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} \mu\\ \rho_1\\ \rho_2\\ \tau_1\\ \tau_2\\ \alpha_1\\ \alpha_2\\ (\tau\alpha)_{11}\\ (\tau\alpha)_{12}\\ (\tau\alpha)_{21}\\ (\tau\alpha)_{22} \end{bmatrix} +
\begin{bmatrix}
1 & 0 & 0 & 0\\ 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1\\ 0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} e_{11}\\ e_{12}\\ e_{21}\\ e_{22} \end{bmatrix} +
\begin{bmatrix} e_{111}\\ e_{112}\\ e_{121}\\ e_{122}\\ e_{211}\\ e_{212}\\ e_{221}\\ e_{222} \end{bmatrix},$$
or
$$\mathbf{Y} = \mathbf{X}\boldsymbol\beta + \mathbf{Z}\mathbf{b} + \mathbf{e}.$$
This is a mixed model with four clusters of size two; pairs of adjacent rows in the display
belong to the same cluster (whole-plot). In the notation of the Laird-Ware model we
identify for the first whole-plot, for example, $\mathbf{Y}_1 = [Y_{111}, Y_{112}]'$, $\mathbf{X}_1$ the first two rows of
$\mathbf{X}$, $\mathbf{Z}_1 = [1, 1]'$, and $\mathbf{b}_1 = e_{11}$. The overall random effects design matrix is block-diagonal,
$$\mathbf{Z} = \begin{bmatrix} \mathbf{Z}_1 & \mathbf{0} & \mathbf{0} & \mathbf{0}\\ \mathbf{0} & \mathbf{Z}_2 & \mathbf{0} & \mathbf{0}\\ \mathbf{0} & \mathbf{0} & \mathbf{Z}_3 & \mathbf{0}\\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{Z}_4 \end{bmatrix}.$$
This seems like a very tedious exercise. Fortunately, computer software such as proc mixed
of The SAS® System handles the formulation of the X and Z matrices. What the user needs to
know is which effects of the model are fixed (part of X), and which effects are random (part
of Z).
In the previous examples we focused on casting the models in the mixed model frame-
work by specifying X3 , Z3 , and b3 . Little attention was paid to the variance-covariance
matrices D and R3 . In split-plot and subsampling designs these matrices are determined by the
randomization and sampling protocol. In repeated measures and longitudinal studies the
modeler must decide whether random effects/coefficients in $\mathbf{b}_i$ are independent ($\mathbf{D}$ diagonal)
or not and must decide on the structure of $\mathbf{R}_i$. With $n_i$ observations per cluster, and if all
observations within a cluster are correlated and have unequal variances, there are $n_i$ variances
and $n_i(n_i - 1)/2$ covariances to be estimated in $\mathbf{R}_i$. To reduce the number of parameters in $\mathbf{D}$
and R3 these matrices are usually parameterized and highly structured. In §7.5 we examine
popular parsimonious parametric structures. The next example shows how, starting from a
simple model, accommodating the complexities of a real study leads to a mixed model in
which the modeler makes successive adjustments to the fixed and random parts of the model,
including the $\mathbf{D}$ and $\mathbf{R}$ matrices, always with an eye toward parsimony of the final model.
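In proc mixed, the $\mathbf{D}$ matrix is structured through the type= option of the random statement
and $\mathbf{R}_i$ through the repeated statement. As a sketch (subject and variable names hypothetical):
    random intercept time / subject=cluster type=un;  /* unstructured 2x2 D block  */
    repeated / subject=cluster type=ar(1);            /* autoregressive R_i        */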
Example 7.4. Soil nitrate levels and their dependence on the presence or absence of
mulch shoots are investigated on bare soils and under alfalfa management. The treatment
structure of the experiment is a $2 \times 2$ factorial of the factors cover (alfalfa/none (bare soil))
and mulch (shoots applied/shoots not applied). Treatments are arranged in three
complete blocks, each accommodating four plots. Each plot receives one of the four
possible treatment combinations. There is a priori evidence that the two factors do not
interact. The basic linear statistical model for this experiment is given by
$$Y_{ijk} = \mu + \rho_i + \alpha_j + \beta_k + e_{ijk},$$
where $i = 1,\dots,3$ indexes the blocks, $\alpha_j$ are the effects of shoot application, $\beta_k$ the
effects of cover type, and the experimental errors $e_{ijk}$ are independent and identically
distributed random variables with mean $0$ and variance $\sigma^2$. The variance of an individ-
ual observation is $\mathrm{Var}[Y_{ijk}] = \sigma^2$.
In order to reduce the costs of the study, soil samples are collected on each plot at four
randomly chosen locations. The variability of an observation $Y_{ijkl}$, where $l = 1,\dots,4$.
The revised model must accommodate the two sources of random variation across plots
and within plots. This is accomplished by adding another random effect,
$$Y_{ijkl} = \mu + \rho_i + \alpha_j + \beta_k + e_{ijk} + f_{ijkl},$$
where $f_{ijkl} \sim (0, \sigma_p^2)$. This is a mixed model with fixed part $\mu + \rho_i + \alpha_j + \beta_k$ and
random part $e_{ijk} + f_{ijkl}$. It is reasonable to assume by virtue of randomization that the
two random effects are independent and also that
$$\mathrm{Cov}[f_{ijkl}, f_{ijkl'}] = 0.$$
There are no correlations among the measurements within a plot. The $\mathbf{D}$ matrix of the model
in Laird-Ware form will be diagonal.
It is imperative to the investigators to study changes in nitrate levels over time. To this
end soil samples at the four randomly chosen locations within a plot are collected in
five successive weeks. The data now have a repeated measurement structure
in addition to a subsampling structure. First, the fixed effects part must be modified to
accommodate systematic changes in nitrate levels over time. Treating time as a
continuous variable, coded as the number of days $t$ since the initial measurement, the
fixed effects part can be revised as
$$\mathrm{E}[Y_{ijklm}] = \mu + \rho_i + \alpha_j + \beta_k + \gamma t_{im},$$
where $t_{im}$ is the time point at which all plots in block $i$ were measured. If the measure-
ment times differ across plots, the variable $t$ would receive subscript $ijk$ instead. The
random effects structure is now modified to (a) incorporate the variability of
measurements at the same spatial location over time; (b) account for residual temporal
autocorrelation among the repeated measurements.
A third random component, $g_{ijklm} \sim (0, \sigma_t^2)$, is added so that the model becomes
$$Y_{ijklm} = \mu + \rho_i + \alpha_j + \beta_k + \gamma t_{im} + e_{ijk} + f_{ijkl} + g_{ijklm}.$$
With five measurements over time there are 10 unique correlations per sampling loca-
tion: $\mathrm{Corr}[Y_{ijkl1}, Y_{ijkl2}]$, $\mathrm{Corr}[Y_{ijkl1}, Y_{ijkl3}]$, $\dots$, $\mathrm{Corr}[Y_{ijkl4}, Y_{ijkl5}]$. Furthermore, it is rea-
sonable that measurements should be more highly correlated the closer together they
were taken in time. Choosing a correlation model that depends explicitly on the time of
measurement can be accomplished with only a single parameter. The temporal corre-
lation model chosen is
$$\mathrm{Corr}[g_{ijklm}, g_{ijklm'}] = \exp\{-\delta|t_{im} - t_{im'}|\}.$$
If in the context of repeated measures data each soil sample location within a plot is
considered a cluster, $\sigma_t^2$ describes the within-cluster heterogeneity and $\sigma^2 + \sigma_p^2$ the
between-cluster heterogeneity.
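A proc mixed implementation of this final model might look as follows. This is a sketch
only; the data set nitrate and the variables block, mulch, cover, loc, day, and no3 are
assumed names, and note that proc mixed parameterizes the exponential structure as
exp{-d/theta}, the reciprocal of the delta parameterization above.
    proc mixed data=nitrate;
      class block mulch cover loc;
      model no3 = block mulch cover day;     /* mu, rho_i, alpha_j, beta_k, gamma*t */
      random block*mulch*cover               /* plot error e_ijk                    */
             loc(block*mulch*cover);         /* location error f_ijkl               */
      repeated / subject=loc(block*mulch*cover)
                 type=sp(exp)(day);          /* exponential temporal correlation    */
    run;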
In Laird-Ware notation, for the first sampling location in the plot of block 1 that received
mulch level 1 and cover level 2, the fixed effects components (in the full-rank reparameteri-
zation) are
$$\mathbf{X}_{1123} = \begin{bmatrix} 1 & 1 & 0 & 1 & 0 & t_{11}\\ 1 & 1 & 0 & 1 & 0 & t_{12}\\ 1 & 1 & 0 & 1 & 0 & t_{13}\\ 1 & 1 & 0 & 1 & 0 & t_{14}\\ 1 & 1 & 0 & 1 & 0 & t_{15} \end{bmatrix}, \qquad \boldsymbol\beta = \begin{bmatrix} \mu + \rho_3 + \alpha_2 + \beta_2\\ \rho_1 - \rho_3\\ \rho_2 - \rho_3\\ \alpha_1 - \alpha_2\\ \beta_1 - \beta_2\\ \gamma \end{bmatrix}.$$
When appealing to the two-stage concept one assumes that some effects or coefficients of the
population-averaged model vary randomly from cluster to cluster. This requires in theory that
there is a population or universe of coefficients from which the realizations in the data can be
drawn (Longford 1993). In the onion plant density example, it is assumed that intercepts
and/or slopes in the universe of varieties vary at random around the average values $\alpha$ and/or
$\beta$. Conceptually, this does not cause much difficulty if the varieties were selected at random
and stochastic variation between clusters can be reasoned. In many cases the clusters are not
selected at random and the question of whether an effect is fixed or random is not clear-cut.
Imagine, for example, that the same plant density experiment is performed at various loca-
tions. If locations were predetermined, rather than randomly selected, can we still attribute
differences in variety performance from location to location to stochastic effects, or are these
fixed effects? Some modelers would argue that location effects are deterministic, fixed effects
because upon repetition of the experiment the same locations would be selected and the same
locational effects should operate on the outcome. Others consider locations as surrogates of
different environments and consider environmental effects to be stochastic in nature.
Repetition of the experiment even at the same locations will produce different outcomes due
to changes in the environmental conditions and locational effects should thus be treated as
random. A similar discrepancy of opinion applies to the nature of seasonal effects. Are the
effects of years considered fixed or random? The years in which an experiment is conducted
are most likely not a random sample from a list of possible years. Experiments are conducted
when experimental areas can be secured, funds, machinery, and manpower are available.
According to the acid test that declares factors as fixed if their levels were predetermined,
seasonal effects would then be fixed effects.
The cited test that determines factors as random if their levels are selected by a random
mechanism falls under the first criterion. The sampling mechanism itself makes the effect ran-
dom (Kempthorne 1975). Searle (1971, p. 383) subscribes to the second criterion, that of con-
fining inferences to the levels at hand. If one is interested in conclusions about varietal per-
formance for the specific years and locations in a multiyear, multilocation variety trial,
location and year effects would be fixed. If conclusions are to be drawn about the population
of locations at which the experiment could have been conducted in particular years, location
effects would be random and seasonal effects would be fixed. Finally, if inferences are to per-
tain to all locations in any season, then both factors would be random. We agree with the
notion implied by Searle's (and Eisenhart's second) criterion that it very much depends on the
context whether an effect is considered random or not. Robinson (1991) concludes similarly
when he states that “The choice of whether a class of effects is to [be] treated as fixed or ran-
dom may vary with the question which we are trying to answer.” His criterion, replacing both
(i) and (ii) above, is to ask whether the effects in question come from a probability distribu-
tion. If they do, they are random, otherwise they are fixed. Robinson's criterion does not
appeal to any sample or inference model and is thus attractive. It is noteworthy that Searle et
al. (1992, p. 16) placed more emphasis on the random sampling mechanism than Searle
(1971, p. 383). The latter reference reads
“In considering these points the important question is that of inference: are inferences going to be
drawn from these data about just these levels of the factor? "Yes" then the effects are considered
as fixed effects. "No" then, presumably, inferences will be made not just about the levels
occurring in the data but about some population of levels of the factor from which those in the data
are presumed to have come; and so the effects are considered as being random.”
We emphasize that for the purpose of analysis it is often reasonable to consider effects as
random for some questions, and as fixed for others within the same investigation. Assume
locations were selected at random from a list of possible locations in a variety trial, so that
there is no doubt that they are random effects. One question of interest is which variety is
highest yielding across all possible locations. Another question may be whether varieties A
and B show significant yield differences at the particular locations used. Under Eisenhart's
second criterion one should treat location effects as random for the first analysis and as fixed
for the second analysis, which would upset Eisenhart's first criterion. Fortunately, mixed
models provide a way out of this dilemma. Within the same analysis we can choose with
respect to the random effects different inference spaces, depending on the question at hand
(see §7.3). Even if an effect is random conclusions can be drawn pertaining only to the factor
levels actually used and the effects actually observed (Figure 7.7).
[Figure 7.7. An effect whose influence on the outcome is stochastic is random; an effect
whose influence on the outcome is deterministic is fixed.]
Other arguments have been brought to bear to solve the fixed vs. random debate more or
less successfully. We want to dispense with two of these. The fact that the experimenter does
not know with certainty how a particular treatment will perform at a given location does not
imply a random location effect. This argument would necessarily lead to all effects being con-
sidered random since prior to the experiment none of the effects is known with certainty.
Another line of argument considers those effects random that are not under the experimenter's
control, such as block and environmental effects. Under this premise the only fixed effects
model is that of a completely randomized design and all treatment factors would be fixed.
These criteria are neither practical nor sensible. Considering block and other experimental
effects (apart from treatment effects) as random, even if their selection was deterministic,
provided their effect on the outcome has a stochastic nature, yields a reasonable middle ground
in our opinion. Treatment factors are obviously random only when the treatments are chosen
by some random mechanism, for example, when entries are selected at random for a variety
trial from a larger list of possible entries. If treatment levels are predetermined, treatment
effects are fixed. The interested reader can find a wonderful discourse of these and other
issues related to analysis of variance in general and the fixed/random debate in Kempthorne
(1975).
$$Y_{ijk} = \mu + \alpha_i + \tau_j + (\alpha\tau)_{ij} + e_{ijk}, \qquad [7.12]$$
where the $\alpha_i$ $(i = 1,\dots,a)$ are random environmental effects and $\tau_j$ $(j = 1,\dots,t)$ are the
fixed treatment effects, e.g., entries (genotypes) in a variety trial. There are $k$ replications of
each environment $\times$ entry combination and $(\alpha\tau)_{ij}$ represents genotype $\times$ environment inter-
action. We observe that because the $\alpha_i$ are random variables, the interaction is also a random
quantity. If all effects in [7.12] were fixed, inferences about entry performance would apply
to the particular environments (locations) that are selected in the study, but not to other
environments that could have been chosen. This inference space is termed the narrow space
by McLean, Sanders, and Stroup (1991). The narrow inference space can also be chosen in
the mixed effects model to evaluate and compare the entries. If genotype performance is of
interest in the particular environments in which the experiment was performed and for the
particular genotype $\times$ environment interaction, the narrow inference space applies. On other
occasions one might be interested in conclusions about the entries that pertain to the universe
of all possible environments that could have been chosen for the study. Entry performance is
then evaluated relative to potential environmental effects and random genotype $\times$ environ-
ment interactions. McLean et al. (1991) term this the broad inference space and conclude
that it is the appropriate reference for inference if environmental effects are hard to specify.
The broad inference space has no counterpart in fixed effects models.
A third inference space, situated between the broad and narrow spaces has been termed
the intermediate space by McLean et al. (1991). Here, one appeals to specific levels of some
random effects, but to the universe of all possible levels with respect to other random effects.
In model [7.12] an intermediate inference space applies if one is interested in genotype per-
formance in specific environments but allows the genotype $\times$ environment interaction to
vary at random from environment to environment. For purposes of inference, one would fix
$\alpha_i$ and allow $(\alpha\tau)_{ij}$ to vary. When treatment effects $\tau_j$ are fixed, the interaction is a random
effect since the environmental effects $\alpha_i$ are random. It is our opinion that the intermediate in-
ference space is not meaningful in this particular model. When focusing on a particular
environmental effect the interaction should be fixed at the appropriate level too. If, however,
the treatment effects were random, too, the intermediate inference space where one focuses
on the performance of all genotypes in a particular environment is meaningful.
In terms of testable hypotheses or "estimable" functions in the mixed model we are con-
cerned with linear combinations of the model terms. To demonstrate the distinction between
the three inference spaces, we take into account the presence of $\mathbf{b}$, not just $\boldsymbol\beta$, in specifying
these linear combinations. An "estimable" function is now written as
$$\mathbf{A}\boldsymbol\beta + \mathbf{M}\mathbf{b} = \mathbf{L}\begin{bmatrix} \boldsymbol\beta\\ \mathbf{b} \end{bmatrix},$$
where $\mathbf{L} = [\mathbf{A}, \mathbf{M}]$. Since estimation of parameters should be distinguished from prediction
of random variables, we use quotation marks. Setting $\mathbf{M}$ to $\mathbf{0}$, the function becomes $\mathbf{A}\boldsymbol\beta$, an
estimable function, because $\mathbf{A}$ is a matrix of constants. No reference is made to specific ran-
dom effects and thus the inference is broad. By selecting the entries of $\mathbf{M}$ such that $\mathbf{M}\mathbf{b}$ repre-
sents averages over the appropriate random effects, the narrow inference space is chosen and
one should refer to $\mathbf{A}\boldsymbol\beta + \mathbf{M}\mathbf{b}$ as a predictable function. An intermediate inference space is
constructed by averaging some random effects, while setting the coefficients of $\mathbf{M}$ pertaining
to other random effects to zero. We illustrate these concepts with an example from Milliken
and Johnson (1992, p. 285).
Example 7.8. Machine Productivity. Six employees are randomly selected from the
work force of a company that has plans to replace the machines in one of its factories.
Three candidate machine types are evaluated. Each employee operates each of the
machines in a randomized order. Milliken and Johnson (1992, p. 286) chose the mixed
model
$$Y_{ijk} = \mu + \alpha_i + \tau_j + (\alpha\tau)_{ij} + e_{ijk},$$
where $\tau_j$ represents the fixed effect of machine type $j$ $(j = 1,\dots,3)$, $\alpha_i$ the random
effect of employee $i$ $(i = 1,\dots,6$; $\alpha_i \sim G(0, \sigma_\alpha^2))$, $(\alpha\tau)_{ij}$ the machine $\times$ employee
interaction (a random effect with mean $0$ and variance $\sigma_{\alpha\tau}^2$), and $e_{ijk}$ represents experi-
mental errors associated with employee $i$ operating machine $j$ at the $k$th time. The out-
come $Y_{ijk}$ was a productivity score. The data for this experiment appear in Table 23.1
of Milliken and Johnson (1992) and are reproduced on the CD-ROM.
If we want to estimate the mean of machine type 1, for example, we can do this in a
broad, narrow, or intermediate inference space. The corresponding expected values are
$$\text{Broad:}\quad \mathrm{E}[Y_{i1k}] = \mu + \tau_1$$
$$\text{Narrow:}\quad \mathrm{E}[Y_{i1k}\,|\,\alpha_i, (\alpha\tau)_{i1}] = \mu + \tau_1 + \frac{1}{6}\sum_{i=1}^{6}\alpha_i + \frac{1}{6}\sum_{i=1}^{6}(\alpha\tau)_{i1}$$
$$\text{Intermediate:}\quad \mathrm{E}[Y_{i1k}\,|\,\alpha_i] = \mu + \tau_1 + \frac{1}{6}\sum_{i=1}^{6}\alpha_i$$
from the mixed model analysis. In §7.4 the necessary details of this estimation and
prediction process are provided. For now we take for granted that the estimates and
BLUPs can be obtained in The SAS® System with the statements
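(The statements are not reproduced in this excerpt; the following is a plausible reconstruction,
assuming the data set productivity with variables machine, person, and score.)
    proc mixed data=productivity;
      class machine person;
      model score = machine / s;            /* fixed machine effects            */
      random person machine*person / s;     /* employee and interaction effects */
    run;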
The /s option on the model statement prints the estimates of all fixed effects, the /s
option on the random statement prints the estimated BLUPs for the random effects. The
latter are reproduced from SAS® output in Table 7.1.
The solutions for the random effects sum to zero across the employees. Hence, when
the solutions are substituted into the formulas above for the narrow and intermediate
means and averaged, we have, for example,
$$\hat\mu + \hat\tau_1 + \frac{1}{6}\sum_{i=1}^{6}\hat\alpha_i = \hat\mu + \hat\tau_1.$$
The broad, narrow, and intermediate estimates of the means will not differ. Provided
that the $\mathbf{D}$ matrix is nonsingular, this will hold in general for linear mixed models. Our
prediction of the average production score for machine 1 does not depend on whether
we refer to the six employees actually used in the experiment or the population of all
company employees from which the six were randomly selected. So wherein lies the
difference? Although the point estimates do not differ, the variability of the estimates
will differ greatly. Since the mean estimate in the intermediate inference space,
$$\hat\mu + \hat\tau_1 + \frac{1}{6}\sum_{i=1}^{6}\hat\alpha_i,$$
involves the random quantities $\hat\alpha_i$, its variance will exceed that of the mean estimate in
the narrow inference space, which is just $\hat\mu + \hat\tau_1$. By the same token the variance of
estimates in the broad inference space will exceed that of the estimates in the interme-
diate space. By appealing to the population of all employees, the additional uncertainty
that stems from the random selection of employees must be accounted for. The pre-
cision of broad and narrow inference will be identical if the random effects variance $\sigma_\alpha^2$
is $0$, that is, if there is no heterogeneity among employees with respect to productivity
scores.
Estimates and predictions of various quantities obtained with proc mixed of
The SAS® System are shown in the next table, from which the impact of choosing the
inference space on estimator/predictor precision can be inferred.
When appealing to the narrow or intermediate inference spaces, coefficients for the ran-
dom effects that are being held fixed are added after the | in the estimate statements. If
no random effects coefficients are specified, the $\mathbf{M}$ matrix in the linear combination
$\mathbf{A}\boldsymbol\beta + \mathbf{M}\mathbf{b}$ is set to zero and the inference space is broad. Notice that least squares
means calculated with the lsmeans statement of the mixed procedure are always broad.
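As an illustration, estimate statements along the following lines produce the broad and
narrow machine-1 means of Output 7.1. This is a sketch, not the authors' original code;
the coefficient lists assume six employees and the interaction ordering implied by the
class statement above (person varying fastest within machine).
    estimate 'Mach. 1 Mean (Broad)'  intercept 1 machine 1 0 0;
    estimate 'Mach. 1 Mean (Narrow)' intercept 6 machine 6 0 0 |
             person 1 1 1 1 1 1
             machine*person 1 1 1 1 1 1 / divisor=6;
Dropping the machine*person coefficients from the second statement yields the
intermediate mean.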
The abridged output follows.
Output 7.1.
The Mixed Procedure
Model Information
Data Set WORK.PRODUCTIVITY
Dependent Variable score
Covariance Structure Variance Components
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Containment
Estimates
Standard
Label Estimate Error DF t Value Pr >|t|
(1) Mach. 1 Mean (Broad) 52.3556 2.4858 10 21.06 <.0001
(2) Mach. 1 Mean (Interm) 52.3556 1.5394 10 34.01 <.0001
(3) Mach. 1 Mean (Narrow) 52.3556 0.2266 10 231.00 <.0001
(4) Mach. 2 Mean (Broad) 60.3222 2.4858 10 24.27 <.0001
(5) Mach. 2 Mean (Interm) 60.3222 1.5394 10 39.19 <.0001
(6) Mach. 2 Mean (Narrow) 60.3222 0.2266 10 266.15 <.0001
(7) Mac. 1 vs. Mac. 2 (Broad) -7.9667 2.1770 10 -3.66 0.0044
(8) Mac. 1 vs. Mac. 2 (Narrow)-7.9667 0.3205 10 -24.86 <.0001
(9) Person 1 BLUP 60.9064 0.3200 10 190.32 <.0001
(10) Person 2 BLUP 57.9951 0.3200 10 181.22 <.0001
(11) Person 1 - Person 2 2.9113 0.4524 36 6.43 <.0001
The table of Covariance Parameter Estimates displays the estimates of the variance
components of the model: $\hat\sigma_\alpha^2 = 22.858$, $\hat\sigma_{\alpha\tau}^2 = 13.909$, $\hat\sigma^2 = 0.925$. Because the data
are balanced and proc mixed estimates variance-covariance parameters by restricted
maximum likelihood (by default), these estimates coincide with the method-of-moment
estimates derived from expected mean squares and reported in Milliken and Johnson
(1992, p. 286).
It is seen from the table of Estimates that the means in the broad, intermediate, and
narrow inference spaces are identical; for example, $52.3556$ is the estimate for the mean
production score of machines of type 1, regardless of inference space. The standard
errors of the three estimates are largest in the broad inference space and smallest in the
narrow inference space. The same holds for differences of the means. Notice that if one
were to analyze the data as a fixed effects model, the estimates for (1) through (8)
would be identical. Their standard errors would be incorrect, however. Estimates (9)
and (10) are predictions of random effects, and (11) is the prediction of the difference
of two random effects. If one were to incorrectly specify the model as a fixed effects
model, the estimates and their standard errors would be incorrect for (9) - (11).
contains a fair number of unknown quantities that must be calculated from data. Parameters
of the model are $\boldsymbol\beta$, $\mathbf{D}$, and $\mathbf{R}_i$; these must be estimated. The random effects $\mathbf{b}_i$ are not param-
eters, but random variables. These must be predicted in order to calculate cluster-specific
trends and perform cluster-specific inferences. If $\hat{\boldsymbol\beta}$ is an estimator of $\boldsymbol\beta$ and $\hat{\mathbf{b}}_i$ is a predictor
of $\mathbf{b}_i$, the population-averaged prediction of $\mathbf{Y}_i$ is
$$\hat{\mathbf{Y}}_i = \mathbf{X}_i\hat{\boldsymbol\beta}$$
and the cluster-specific prediction is calculated as
$$\hat{\mathbf{Y}}_i = \mathbf{X}_i\hat{\boldsymbol\beta} + \mathbf{Z}_i\hat{\mathbf{b}}_i.$$
Henderson (1950) derived estimating equations for $\boldsymbol\beta$ and $\mathbf{b}$ known as the mixed model equa-
tions. We derive the equations in §A7.7.1 and their solutions in §A7.7.2. Briefly, the mixed
model equations are
$$\begin{bmatrix} \mathbf{X}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{R}^{-1}\mathbf{Z}\\ \mathbf{Z}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{R}^{-1}\mathbf{Z} + \mathbf{B}^{-1} \end{bmatrix}\begin{bmatrix} \hat{\boldsymbol\beta}\\ \hat{\mathbf{b}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{R}^{-1}\mathbf{y}\\ \mathbf{Z}'\mathbf{R}^{-1}\mathbf{y} \end{bmatrix}, \qquad [7.13]$$
where the vectors and matrices of the individual clusters were properly stacked and arranged
to eliminate the subscript $i$ and $\mathbf{B}$ is a block-diagonal matrix whose diagonal blocks consist of
the matrix $\mathbf{D}$ (see §A7.7.1 for details). The solutions are
$$\hat{\boldsymbol\beta} = \left(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y} \qquad [7.14]$$
$$\hat{\mathbf{b}} = \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}). \qquad [7.15]$$
The estimate $\hat{\boldsymbol\beta}$ is a generalized least squares estimate. Furthermore, the predictor $\hat{\mathbf{b}}$ is the best
linear unbiased predictor (BLUP) of the random effects $\mathbf{b}$ (§A7.7.2). Properties of these ex-
pressions are easily established. For example,
$$\mathrm{E}[\hat{\boldsymbol\beta}] = \boldsymbol\beta, \qquad \mathrm{Var}[\hat{\boldsymbol\beta}] = \left(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\right)^{-1}, \qquad \mathrm{E}[\hat{\mathbf{b}}] = \mathbf{0}.$$
Since $\mathbf{b}$ is a random variable, more important than evaluating $\mathrm{Var}[\hat{\mathbf{b}}]$ is the variance of the
prediction error $\hat{\mathbf{b}} - \mathbf{b}$, which can be derived after some tedious calculations (Harville 1976a,
Laird and Ware 1982) as
$$\mathrm{Var}[\hat{\mathbf{b}} - \mathbf{b}] = \mathbf{B} - \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}\mathbf{Z}\mathbf{B} + \mathbf{B}\mathbf{Z}'\mathbf{V}^{-1}\mathbf{X}\left(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{Z}\mathbf{B}. \qquad [7.16]$$
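For illustration, the GLS estimate [7.14] and the BLUP [7.15] can be computed directly in
SAS/IML. The module below is a minimal sketch; it assumes the user supplies X, Z, y, and
the covariance matrices B and R (e.g., evaluated at estimates of the covariance parameters).
    proc iml;
    start mme_solve(X, Z, y, B, R);
      V    = Z*B*Z` + R;                /* marginal variance V = ZBZ' + R  */
      Vi   = inv(V);
      beta = inv(X`*Vi*X) * X`*Vi*y;    /* GLS estimate, equation [7.14]   */
      b    = B*Z`*Vi*(y - X*beta);      /* BLUP of random effects, [7.15]  */
      return(beta // b);                /* stacked solutions               */
    finish mme_solve;
    quit;
A call such as sol = mme_solve(X, Z, y, B, R); returns the stacked fixed effects estimates
and random effects predictions.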
The problem is solved by first profiling $\boldsymbol\beta$ out of the equation. To this end, $\boldsymbol\beta$ in [7.18] is re-
placed with $\left(\mathbf{X}'\mathbf{V}(\boldsymbol\theta)^{-1}\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{V}(\boldsymbol\theta)^{-1}\mathbf{y}$ and the resulting expression is minimized with respect
to $\boldsymbol\theta$. If derivatives of the profiled log-likelihood with respect to one element of $\boldsymbol\theta$ depend on
other elements of $\boldsymbol\theta$, the process is iterative. On occasion, some or all covariance parameters
can be estimated in noniterative fashion. For example, if $\mathbf{R} = \sigma^2\mathbf{R}_*$ and $\mathbf{B} = \sigma^2\mathbf{B}_*$, with $\mathbf{R}_*$
and $\mathbf{B}_*$ known, then
$$\hat\sigma^2 = \frac{1}{n}\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}\right)'\left(\mathbf{Z}\mathbf{B}_*\mathbf{Z}' + \mathbf{R}_*\right)^{-1}\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}\right).$$
Upon convergence of the algorithm, the final iterate $\hat{\boldsymbol\theta}_M$ is the maximum likelihood estimate
of the covariance parameters, and
$$\hat{\boldsymbol\beta}_M = \left(\mathbf{X}'\mathbf{V}(\hat{\boldsymbol\theta}_M)^{-1}\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{V}(\hat{\boldsymbol\theta}_M)^{-1}\mathbf{Y} \qquad [7.19]$$
is the maximum likelihood estimator of $\boldsymbol\beta$. Since maximum likelihood estimators (MLEs)
have certain optimality properties (for example, they are asymptotically the most efficient
estimators), substituting the MLE of $\boldsymbol\theta$ in the generalized least squares estimate for $\boldsymbol\beta$ has
much appeal. The predictor for the random effects is calculated as
$$\hat{\mathbf{b}}_M = \mathbf{B}(\hat{\boldsymbol\theta}_M)\mathbf{Z}'\mathbf{V}(\hat{\boldsymbol\theta}_M)^{-1}\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}_M\right). \qquad [7.20]$$
and the values $\hat\beta$ and $\hat\sigma^2$ that maximize $L(\beta, \sigma^2\,|\,\mathbf{y})$ necessarily minimize
$$-2l(\beta, \sigma^2; \mathbf{y}) = \varphi(\beta, \sigma^2; \mathbf{y}) = n\ln\{2\pi\} + n\ln\sigma^2 + \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \beta)^2.$$
Setting derivatives with respect to $\beta$ and $\sigma^2$ to zero leads to two equations
$$①:\; \partial\varphi(\beta, \sigma^2; \mathbf{y})/\partial\beta = -\frac{2}{\sigma^2}\sum_{i=1}^{n}(y_i - \beta) \equiv 0$$
$$②:\; \partial\varphi(\beta, \sigma^2; \mathbf{y})/\partial\sigma^2 = \frac{n}{\sigma^2} - \frac{1}{\sigma^4}\sum_{i=1}^{n}(y_i - \beta)^2 \equiv 0.$$
Solving ① yields $\hat\beta = n^{-1}\sum_{i=1}^{n} y_i = \bar{y}$. Substituting for $\beta$ in ② and solving yields
$\hat\sigma^2_M = n^{-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$. The estimate of $\beta$ is the familiar sample mean, but the estimate of
the variance parameter is not the sample variance $s^2 = (n-1)^{-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$. Since the
sample variance is an unbiased estimator of $\sigma^2$ under random sampling from any distribution,
we see that $\hat\sigma^2_M$ has bias
$$\mathrm{E}[\hat\sigma^2_M] - \sigma^2 = \mathrm{E}\left[\frac{n-1}{n}S^2\right] - \sigma^2 = \frac{n-1}{n}\sigma^2 - \sigma^2 = -\frac{1}{n}\sigma^2.$$
If $\beta$ were known, there would be only one estimating equation (②) and the MLE for $\sigma^2$ would
be
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \beta)^2 = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \mathrm{E}[Y_i])^2,$$
an unbiased estimator.
Consider now the vector of deviations from the sample mean,
$$\mathbf{U}_{(n-1)\times 1} = \begin{bmatrix} Y_1 - \bar{Y}\\ Y_2 - \bar{Y}\\ \vdots\\ Y_{n-1} - \bar{Y} \end{bmatrix}, \qquad [7.21]$$
with $\mathrm{Var}[\mathbf{U}] = \sigma^2\mathbf{L}$, $\mathbf{L} = \mathbf{I}_{n-1} - n^{-1}\mathbf{J}_{n-1}$.
Applying Theorem 8.3.4 in Graybill (1969, p. 190), the inverse of this matrix turns out to
have a surprisingly simple form,
$$\mathrm{Var}[\mathbf{U}]^{-1} = \frac{1}{\sigma^2}\begin{bmatrix} 2 & 1 & \cdots & 1\\ 1 & 2 & \cdots & 1\\ \vdots & & \ddots & \vdots\\ 1 & 1 & \cdots & 2 \end{bmatrix} = \frac{1}{\sigma^2}\left(\mathbf{I}_{n-1} + \mathbf{J}_{n-1}\right).$$
Also, $|\mathbf{L}| = 1/n$ and some algebra shows that $\mathbf{U}'\mathbf{L}^{-1}\mathbf{U} = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$, the residual sum of
squares. The likelihood for $\mathbf{U}$ is called the restricted likelihood of $\mathbf{Y}$ because $\mathbf{U}$ is restricted to
have mean $\mathbf{0}$. It can be written as
$$L(\sigma^2; \mathbf{u}) = \frac{|\sigma^2\mathbf{L}|^{-1/2}}{(2\pi)^{(n-1)/2}}\exp\left\{-\frac{1}{2\sigma^2}\mathbf{u}'\mathbf{L}^{-1}\mathbf{u}\right\}$$
and is no longer a function of the mean $\beta$. Minus twice the restricted log likelihood becomes
$$-2l(\sigma^2; \mathbf{u}) = \varphi(\sigma^2; \mathbf{u}) = (n-1)\ln\{2\pi\} + \ln\{n\} + (n-1)\ln\sigma^2 + \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \bar{y})^2. \qquad [7.22]$$
Setting the derivative of $\varphi(\sigma^2; \mathbf{u})$ with respect to $\sigma^2$ to zero, one obtains the estimating equa-
tion that implies the residual maximum likelihood estimate:
$$\frac{\partial\varphi(\sigma^2; \mathbf{u})}{\partial\sigma^2} = \frac{n-1}{\sigma^2} - \frac{1}{\sigma^4}\sum_{i=1}^{n}(y_i - \bar{y})^2 \equiv 0 \;\Longleftrightarrow\; \hat\sigma^2_R = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2.$$
The REML estimator for $\sigma^2$ is the sample variance and hence unbiased.
The choice of $\mathbf{U}$ in [7.21] corresponds to a particular matrix $\mathbf{K}$ such that $\mathbf{U} = \mathbf{K}\mathbf{Y}$. We
can express $\mathbf{U}$ formally as $\mathbf{K}\mathbf{Y}$ where
$$\mathbf{K} = \left[\,\mathbf{I}_{n-1}\;\; \mathbf{0}_{(n-1)\times 1}\,\right] - \frac{1}{n}\left[\,\mathbf{J}_{n-1}\;\; \mathbf{1}_{(n-1)\times 1}\,\right].$$
If $\mathrm{E}[\mathbf{Y}] = \mathbf{X}\boldsymbol\beta$, $\mathbf{K}$ needs to be chosen such that $\mathbf{K}\mathbf{Y}$ contains no term in $\boldsymbol\beta$. This is equivalent
to removing the mean and considering residuals. The alternative name of residual maximum
likelihood derives from this notion. Fortunately, as long as $\mathbf{K}$ is chosen to be of full row rank
and $\mathbf{K}\mathbf{X} = \mathbf{0}$, the REML estimates do not depend on the particular choice of error contrasts.
In the simple constant mean model $Y_i = \beta + e_i$ we could define an orthogonal contrast matrix
$$\mathbf{C}_{(n-1)\times n} = \begin{bmatrix} 1 & -1 & 0 & \cdots & 0 & 0\\ 1 & 1 & -2 & \cdots & 0 & 0\\ \vdots & & & & & \vdots\\ 1 & 1 & 1 & \cdots & 1 & -(n-1) \end{bmatrix}$$
and a diagonal matrix $\mathbf{D}_{(n-1)\times(n-1)} = \mathrm{Diag}\left\{(i + i^2)^{-1/2}\right\}$. Letting $\mathbf{K} = \mathbf{D}\mathbf{C}$ and $\mathbf{U} = \mathbf{K}\mathbf{Y}$, then
$\mathrm{E}[\mathbf{U}] = \mathbf{0}$, $\mathrm{Var}[\mathbf{U}] = \sigma^2\mathbf{D}\mathbf{C}\mathbf{C}'\mathbf{D} = \sigma^2\mathbf{I}_{n-1}$, $\mathbf{u}'\mathbf{u} = \sum_{i=1}^{n}(y_i - \bar{y})^2$, and minus twice the log
likelihood of $\mathbf{U}$ is
$$\varphi(\sigma^2; \mathbf{u}) = (n-1)\ln\{2\pi\} + (n-1)\ln\sigma^2 + \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \bar{y})^2. \qquad [7.23]$$
Apart from the constant $\ln\{n\}$ this expression is identical to [7.22] and minimization of either
function will lead to the same REML estimator of $\sigma^2$. For more details on REML estimation
see Harville (1974) and Searle et al. (1992, Ch. 6.6). Two generic methods for constructing
the $\mathbf{K}$ matrix are described in §A7.7.3.
REML estimates of variance components and covariance parameters have less bias than
maximum likelihood estimates and in certain situations (e.g., certain balanced designs) are
unbiased. In a balanced completely randomized design with subsampling, fixed treatment
effects, and $n$ subsamples per experimental unit, for example, it is well-known that the
observational error mean square and experimental error mean square have expectations
$$\mathrm{E}[MS(OE)] = \sigma_o^2$$
$$\mathrm{E}[MS(EE)] = \sigma_o^2 + n\sigma_e^2,$$
where $\sigma_o^2$ and $\sigma_e^2$ denote the observational and experimental error variances, respectively. The
ANOVA method of estimation (Searle et al. 1992, Ch. 4.4) equates mean squares to their
expectations and solves for the variance components. From the above equations we derive the
estimators
$$\hat\sigma_o^2 = MS(OE)$$
$$\hat\sigma_e^2 = \frac{1}{n}\left\{MS(EE) - MS(OE)\right\}.$$
These estimators are unbiased by construction and identical to the REML estimators in this
case (for an application see §7.6.3). A closer look at $\hat\sigma_e^2$ shows that this quantity could
possibly be negative. Likelihood estimators must be values in the parameter space. Since
$\sigma_e^2 \geq 0$, a value $\hat\sigma_e^2 < 0$ is considered only a solution to the likelihood estimation problem, but
not a likelihood estimate. Unfortunately, to retain unbiasedness, one has to allow for the
possibility of a negative value. One should choose $\max\{\hat\sigma_e^2, 0\}$ as the REML estimator in-
stead. While this introduces some bias, it is the appropriate course of action. Corbeil and
Searle (1976) derive solutions for the ML and REML estimates for four standard classifica-
tion models when data are balanced and examine the properties of the solutions. They call the
solutions "ML estimators" or "REML estimators," acknowledging that ignoring the positivity
requirement does not produce true likelihood estimators. Lee and Kapadia (1984) examine the
bias and variance of ML and REML estimators for one of Corbeil and Searle's models for
which the REML solutions are unbiased. This is the balanced two-way mixed model without
interaction,
$$Y_{ij} = \mu + \alpha_i + \beta_j + e_{ij}, \qquad (i = 1,\dots,a;\; j = 1,\dots,b).$$
Here, $\alpha_i$ could correspond to the effects of a fixed treatment factor with $a$ levels and $\beta_j$ to the
random effects of a random factor with $b$ levels, $\beta_j \sim G(0, \sigma_b^2)$. Observe that there is only a
single observation per combination of the two factors, that is, the design is nonreplicated. The
experimental errors are assumed independent Gaussian with mean $0$ and variance $\sigma^2$. Table
7.3, adapted from Lee and Kapadia (1984), shows the bias, variance, and mean square error
of the maximum likelihood and restricted maximum likelihood estimators of $\sigma^2$ and $\sigma_b^2$ for
$a = 6$, $b = 10$. ML estimators have the smaller variability throughout, but show non-negli-
gible negative bias, especially if the variability of the random effect is small relative to the
error variability. In terms of the mean square error (Variance + Bias$^2$), REML estimators of
$\sigma^2$ are superior to ML estimators but the reverse is true for estimates of $\sigma_b^2$. Provided $\sigma_b^2$
accounts for at least 50% of the response variability, REML estimators are essentially un-
biased, since then the probability of obtaining a negative solution for $\sigma_b^2$ tends quickly to zero.
Returning to mixed models of the Laird-Ware form, it must be noted that the likelihood
for KY in REML estimation does not contain any information about the fixed effects ".
REML estimation will produce estimates for ) only. Once these estimates have been obtained
we again put the substitution principle to work. If s
)V is the REML estimate of ) , the fixed
effects are estimated as
"
s V Xw VÐs
" )V Ñ" X Xw VÐs
)V Ñ" y, [7.24]
bV BÐs
s )V ÑZw VÐs s V .
) V Ñ" y X" [7.25]
Because the elements of ) were estimated, [7Þ24] is no longer a generalized least squares
(GLS) estimate. Because the substituted estimate s
)V is not a maximum likelihood estimate,
[7Þ24] is also not a maximum likelihood estimate. Instead, it is termed an Estimated GLS
(EGLS) estimate.
Table 7.3. Bias ($B$), variance ($Var$), and mean square error ($MSE$) of ML and REML
estimates in a balanced, two-way mixed linear model without replication†
(fixed factor A has 6, random factor B has 10 levels)

Estimates of $\sigma^2$:
                                    ML                        REML
$Var[\beta_j]/Var[Y_{ij}]$      $B$      $Var$    $MSE$     $B$      $Var$    $MSE$
0.1                          -0.099   0.027   0.037    0.010   0.033   0.034
0.3                          -0.071   0.017   0.022    0.011   0.021   0.022
0.5                          -0.050   0.009   0.011    0.000   0.011   0.011
0.7                          -0.030   0.003   0.004    0.000   0.004   0.004
0.9                          -0.010   0.000   0.000    0.000   0.000   0.000

Estimates of $\sigma_b^2$:
                                    ML                        REML
$Var[\beta_j]/Var[Y_{ij}]$      $B$      $Var$    $MSE$     $B$      $Var$    $MSE$
0.1                          -0.001   0.010   0.010    0.010   0.012   0.012
0.3                          -0.029   0.031   0.032    0.001   0.039   0.039
0.5                          -0.050   0.062   0.064    0.000   0.076   0.076
0.7                          -0.070   0.101   0.106    0.000   0.106   0.125
0.9                          -0.090   0.151   0.159    0.000   0.187   0.187

† Adapted from Table 1 in Lee and Kapadia (1984). With permission of the International
Biometric Society.
Because REML estimation is based on the likelihood principle and REML estimators
have lower bias than maximum likelihood estimators, we prefer REML for parameter estima-
tion in mixed models over maximum likelihood estimation and note that it is the default
method of the mixed procedure in The SAS® System.
8 " 8
"
s"
s IKPW "Xw3 V
"
w s " w s" w s"
3 X3 "X3 V3 Y3 X V X X V Y. [7.27]
3" 3"
The ML [7.19], REML [7.24], and EGLS [7.27] estimators of the fixed effects are of the
same general form; they differ only in how V is estimated. EGLS is appealing when V
can be estimated quickly, preferably with a noniterative method. Vonesh and Chinchilli
(1997, Ch. 8.2.4) argue that in applications with a sufficient number of observations and
when interest lies primarily in β, little efficiency is lost. Two basic noniterative methods are
outlined in §A7.7.4 for the case where within-cluster observations are uncorrelated and
homoscedastic, that is, R_i = σ²I. The first method estimates D and σ² by the method of
moments and predicts the random effects with the usual formulas such as [7.15], substituting
V̂ for V. The second method estimates the random effects b_i by regression methods first and
calculates an estimate D̂ from the b̂_i.
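
To fix ideas, the algebra of the EGLS estimate [7.27] is a few lines in SAS/IML®; the
matrices below are made-up numbers for illustration only:

proc iml;
   X    = {1 1, 1 2, 1 3, 1 4};        /* fixed-effects design matrix */
   y    = {2.1, 2.4, 2.2, 2.9};        /* response vector */
   Vhat = I(4) + 0.5*J(4,4,1);         /* an estimated marginal covariance */
   Vinv = inv(Vhat);
   beta = inv(X`*Vinv*X)*X`*Vinv*y;    /* EGLS estimate, cf. [7.27] */
   print beta;
quit;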
Of concern is a testable hypothesis of the same form as in the fixed effects linear model,
namely H₀: Aβ = d. For cases (i) and (ii) we develop in §A7.7.5 that for Gaussian random
effects b_i and within-cluster errors e_i exact tests exist. Briefly, in case (i) the statistic

W = \left(A\hat{\beta} - d\right)'\left[A\left(X'V^{-1}X\right)^{-}A'\right]^{-1}\left(A\hat{\beta} - d\right)    [7.28]

is distributed under the null hypothesis as a Chi-squared variable with r(A) degrees of
freedom, where r(A) denotes the rank of the matrix A. Similarly, in case (ii) we have

W = \left(A\hat{\beta} - d\right)'\left[A\left(X'V^{-1}X\right)^{-}A'\right]^{-1}\left(A\hat{\beta} - d\right)/\sigma^2 \sim \chi^2_{r(A)};

then

F_{obs} = \left(A\hat{\beta} - d\right)'\left[A\left(X'V^{-1}X\right)^{-}A'\right]^{-1}\left(A\hat{\beta} - d\right)\big/\left\{r(A)\hat{\sigma}^2\right\}    [7.30]

is distributed as an F random variable. For a single-degree-of-freedom hypothesis
H₀: a'β = d the equivalent statistic

t_{obs} = \left(a'\hat{\beta} - d\right)\big/\,\text{ese}\left(a'\hat{\beta}\right), \qquad F_{obs} = t^2_{obs},

is distributed as a t random variable with N − r(X) degrees of freedom. A 100(1 − α)%
confidence interval for a'β can be constructed accordingly.
The problematic case is (iii), where the marginal variance-covariance matrix is unknown
and more than just a scalar constant must be estimated from the data. The proposal is to
replace V with V(θ̂) in [7.28] and σ̂²V(θ̂) in [7.30] and to use as test statistics

W = \left(A\hat{\beta} - d\right)'\left[A\left(X'V(\hat{\theta})^{-1}X\right)^{-}A'\right]^{-1}\left(A\hat{\beta} - d\right)    [7.32]

and

F_{obs} = W/r(A).    [7.33]
To examine the small-sample behavior of these statistics we consider a simulated repeated
measures experiment in which all units are measured at the same time intervals. The linear
mixed model for this experiment is

Y_{ijk} = \mu + \tau_i + e_{ij} + t_k + (\tau t)_{ik} + d_{ijk},    [7.34]

where τ_i measures the effect of treatment i and the e_ij are independent experimental errors
associated with replication j of treatment i. The terms t_k and (τt)_ik denote time effects and
treatment × time interactions. Finally, the d_ijk are random disturbances among the serial
measurements from an experimental unit. It is assumed that these disturbances are serially
correlated according to

Corr[d_{ijk}, d_{ijk'}] = \exp\left\{-\frac{|t_k - t_{k'}|}{\phi}\right\}.    [7.35]

This is known as the exponential correlation model (§7.5). The parameter φ determines the
strength of the correlation of two disturbances |t_k − t_k'| time units apart. Observations from
different experimental units were assumed to be uncorrelated in keeping with the random
assignment of treatments to experimental units. We simulated the experiments for various
values of φ (Table 7.4).
The treatment × time cell mean structure is shown in Figure 7.8. The treatment means
were chosen so that there was no marginal time effect and no marginal treatment effect. Also,
there was no difference between the treatments at times 2, 4, or 5. In each of the 1,000 realizations
of the experiment we tested the hypotheses

① H₀: no treatment main effect
② H₀: no time main effect
③ H₀: no treatment effect at time 2
④ H₀: no treatment effect at time 4.

These null hypotheses are true, and at the 5% significance level the nominal Type-I error rate
of the tests should be 0.05. An appropriate test procedure will be close to this nominal rate
when average rejection rates are calculated across the 1,000 repetitions.
The following tests were performed:
• The exact Chi-square test based on [7.28], where the true values of the covariance para-
meters were used. These values were chosen as Var[e_ij] = σ²_e = 1, Var[d_ijk] =
σ²_d = 1, and φ according to Table 7.4;
• The asymptotic Chi-square test based on [7.32], where the restricted maximum likeli-
hood estimates of σ²_e, σ²_d, and φ were substituted;
• The asymptotic F test based on [7.33], where the restricted maximum likelihood
estimates of σ²_e, σ²_d, and φ were substituted;
• The F test based on Kenward and Roger (1997), employing a bias correction in the esti-
mation of Var[β̂] coupled with a degree-of-freedom adjusted F test.
Figure 7.8. Treatment × time cell means in repeated measures simulation based on model
[7.34].
The proc mixed statements that produce these tests are as follows.
/* Analysis with correct covariance parameter estimates */
/* Exact Chi-square test [7.28] */
proc mixed data=sim noprofile;
class rep tx t;
model y = tx t tx*t / Chisq ;
random rep(tx);
repeated /subject=rep(tx) type=sp(exp)(time);
/* First parameter is Var[rep(tx)] */
/* Second parameter is Var[e] */
/* Last parameter is range, passed here as a macro variable */
parms (1) (1) (&phi) / hold=1,2,3;
by repetition;
run;
/* Kenward-Roger F Tests */
proc mixed data=sim;
class rep tx t;
model y = tx t tx*t / ddfm=KenwardRoger;
random rep(tx);
repeated /subject=rep(tx) type=sp(exp)(time);
by repetition;
run;
Table 7.5. Simulated Type-I error rates for exact and asymptotic Chi-square and F tests and
the Kenward-Roger adjusted F test (KR-F denotes F_obs in the Kenward-Roger test and
KR-df the denominator degrees of freedom for KR-F)

 φ    H₀   Exact χ² [7.28]   Asymp. χ² [7.32]   Asymp. F [7.33]   KR-F    KR-df
1/3   ①        0.051             0.130              0.051         0.055     8.1
      ②        0.049             0.092              0.067         0.056    24.9
      ③        0.051             0.103              0.077         0.051    12.4
      ④        0.056             0.101              0.085         0.055    12.4
 1    ①        0.055             0.126              0.048         0.051     8.3
      ②        0.049             0.091              0.067         0.054    25.6
      ③        0.060             0.085              0.069         0.055    18.1
      ④        0.055             0.081              0.067         0.053    18.1
 2    ①        0.054             0.106              0.046         0.055     8.6
      ②        0.049             0.091              0.067         0.056    26.2
      ③        0.055             0.075              0.059         0.054    22.3
      ④        0.049             0.073              0.058         0.055    22.3
 3    ①        0.053             0.107              0.039         0.049     8.8
      ②        0.049             0.092              0.070         0.060    26.6
      ③        0.049             0.075              0.053         0.048    24.5
      ④        0.048             0.069              0.051         0.048    24.5
 4    ①        0.056             0.105              0.038         0.050     9.0
      ②        0.049             0.090              0.070         0.061    27.1
      ③        0.053             0.068              0.052         0.049    25.8
      ④        0.042             0.066              0.047         0.044    25.8
The results are displayed in Table 7.5. The exact Chi-square test maintains the nominal
Type-I error rate, as it should. The fluctuations around 0.05 are due to simulation variability.
Increasing the number of repetitions will decrease this variability. When REML estimators
are substituted for θ, the asymptotic Chi-square test performs rather poorly. The Type-I errors
are substantially inflated; in many cases they are more than doubled. The asymptotic F test
performs better, but its Type-I errors are typically somewhat inflated. Notice that with
increasing strength of the serial correlation (increasing φ) the inflation is less severe for the
asymptotic tests, as was also noted by Kenward and Roger (1997). The bias- and degree-of-
freedom-adjusted Kenward-Roger test performs extremely well. The actual Type-I errors are
very close to the nominal error rate of 0.05. Even with sample sizes as small as 60, one should
consider this test a suitable procedure if exact tests of the linear hypothesis do not exist.
The tests discussed so far are based on the exact (V(θ) known) or asymptotic (V(θ)
unknown) distribution of the fixed effects estimates β̂. Tests of any model parameters, β or θ,
can also be conducted based on the likelihood ratio principle, provided the hypothesis being
tested is a simple restriction on (β, θ) so that the restricted model is nested within the full
model. If (β̂_M, θ̂_M) are the maximum likelihood estimates in the full model and (β̃_M, θ̃_M) are
the maximum likelihood estimates under the restricted model, the likelihood-ratio test statistic

\Lambda = 2\left\{l(\hat{\beta}_M, \hat{\theta}_M; y) - l(\tilde{\beta}_M, \tilde{\theta}_M; y)\right\}    [7.36]

has an asymptotic χ² distribution with degrees of freedom equal to the number of restrictions
imposed. In REML estimation the hypothesis would be imposed on the covariance parameters
only (more on this below), and the (residual) likelihood ratio statistic is

\Lambda = 2\left\{l(\hat{\theta}_R; y) - l(\tilde{\theta}_R; y)\right\}.
Likelihood ratio tests are our test of choice to test hypotheses about the covariance
parameters (provided the two models are nested). For example, consider again model [7.34]
with equally spaced repeated measurements. The correlation of the random variables d_ijk can
also be expressed as

Corr[d_{ijk}, d_{ijk'}] = \rho^{|t_k - t_{k'}|},    [7.37]

where ρ = exp{−1/φ} in [7.35]. This is known as the first-order autoregressive correlation
model (see §7.5.2 for details about the genesis of this model). For a single replicate the
correlation matrix now becomes

Corr[d_{ij}] = \begin{bmatrix}
1 & \rho & \rho^2 & \rho^3 & \rho^4 \\
\rho & 1 & \rho & \rho^2 & \rho^3 \\
\rho^2 & \rho & 1 & \rho & \rho^2 \\
\rho^3 & \rho^2 & \rho & 1 & \rho \\
\rho^4 & \rho^3 & \rho^2 & \rho & 1
\end{bmatrix}.
A test of H₀: ρ = 0 would address the question of whether the within-cluster disturbances (the
repeated measures errors) are independent. In that case the mixed effects structure would be
identical to that of a standard split-plot design in which the temporal measurements comprise
the sub-plot treatments. To test H₀: ρ = 0 with the likelihood ratio test, the model is fit first
with the autoregressive structure and then with an independence structure (ρ = 0). Twice the
difference of their (residual) log likelihoods is then compared against the cutoffs of a Chi-
squared distribution with one degree of freedom. We illustrate the process with one of the
repetitions from the simulation experiment. The full model is fit with the mixed procedure of
The SAS® System:
proc mixed data=sim;
class rep tx t;
model y = tx t tx*t;
random rep(tx);
repeated /subject=rep(tx) type=ar(1);
run;
The repeated statement indicates that the combinations of the replication and treatment
variables, which identify the experimental units, are the clusters which are considered
independent. All observations that share the same replication and treatment values are con-
sidered the within-cluster observations and are correlated according to the autoregressive
model (type=ar(1), Output 7.2). Notice the table of Covariance Parameter Estimates,
which reports the estimates of the covariance parameters σ̂²_e = 1.276, σ̂²_d = 0.0658, and
ρ̂ = 0.2353. The table of Fit Statistics shows the residual log likelihood l(θ̂_R; y) =
−29.3 as Res Log Likelihood.
Output 7.2.
The Mixed Procedure
Model Information
Dimensions
Covariance Parameters 3
Columns in X 30
Columns in Z 12
Subjects 1
Max Obs Per Subject 60
Observations Used 60
Observations Not Used 0
Total Observations 60
Fit Statistics
Res Log Likelihood -29.3
Akaike's Information Criterion -32.3
Schwarz's Bayesian Criterion -33.0
-2 Res Log Likelihood 58.6
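
The reduced model with independent within-cluster errors (ρ = 0) is obtained by simply
omitting the repeated statement; for example:

proc mixed data=sim;
   class rep tx t;
   model y = tx t tx*t;
   random rep(tx);     /* only Var[rep(tx)] and the residual variance remain */
run;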
Observe that the reduced model has only two covariance parameters, σ²_e and σ²_d. Its
residual log likelihood is l(θ̃_R; y) = −29.6559. Since one parameter was removed, the
likelihood ratio test compares

Λ = 2(−29.3 − (−29.65)) = 0.7

against a χ² distribution with one degree of freedom. The p-value for this test is
Pr(χ²₁ ≥ 0.7) = 0.4028, and H₀: ρ = 0 cannot be rejected. The p-value for this likelihood-ratio
test can be conveniently computed with The SAS® System:
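/* a minimal sketch; the data set and variable names are ours */
data pvalue;
   lambda = 0.7;                  /* observed likelihood ratio statistic    */
   p = 1 - probchi(lambda, 1);    /* upper tail of a Chi-square with 1 df   */
run;
proc print data=pvalue; run;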
Model Information
Data Set WORK.SIM
Dependent Variable y
Covariance Structure Variance Components
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Containment
Dimensions
Covariance Parameters 2
Columns in X 30
Columns in Z 12
Subjects 1
Max Obs Per Subject 60
Observations Used 60
Observations Not Used 0
Total Observations 60
Fit Statistics
Res Log Likelihood -29.6
Akaike's Information Criterion -31.6
Schwarz's Bayesian Criterion -32.1
-2 Res Log Likelihood 59.3
In the preceding test we used the restricted maximum likelihood objective function for
testing H₀: ρ = 0. This is justified since the hypothesis was about a covariance parameter. As
discussed in §7.4.1, the restricted likelihood contains information about the covariance
parameters only, not about the fixed effects β. To test hypotheses about β based on the likeli-
hood ratio principle one should fit the model by maximum likelihood. With the mixed
procedure this is accomplished by adding the method=ml option to the proc mixed statement,
e.g.,
proc mixed data=sim method=ml;
class rep tx t;
model y = tx t tx*t;
random rep(tx);
repeated /subject=rep(tx) type=ar(1);
run;
At times the full and reduced models are not nested; that is, the restricted model cannot
be obtained from the full model by simply constraining or setting to zero some of its param-
eters. For example, to compare whether the correlations between the repeated measurements
follow the exponential model

Corr[d_{ijk}, d_{ijk'}] = \exp\left\{-\frac{|t_k - t_{k'}|}{\phi}\right\}

or the spherical model

Corr[d_{ijk}, d_{ijk'}] = \left[1 - \frac{3}{2}\left(\frac{|t_k - t_{k'}|}{\phi}\right) + \frac{1}{2}\left(\frac{|t_k - t_{k'}|}{\phi}\right)^3\right] I\left(|t_k - t_{k'}| \le \phi\right),

one cannot nest one model within the other. A different test procedure is needed. The method
commonly used relies on comparing overall goodness-of-fit statistics of the competing
models (Bozdogan 1987, Wolfinger 1993a). The most important ones are Akaike's informa-
tion criterion (AIC, Akaike 1974) and Schwarz' criterion (Schwarz 1978). Both are functions
of the (restricted) log likelihood with penalty terms added for the number of covariance
parameters. In some releases of proc mixed a smaller value of the AIC or Schwarz' criterion
indicates a better fit; notice that there are other versions of these two criteria where larger
values indicate a better fit. One cannot associate degrees of significance or p-values with
these measures. They are interpreted in a greater/smaller-is-better sense only. For the two
models fitted above, AIC is −32.3 for the model with autoregressive error terms and −31.6 for
the model with independent error terms. The AIC criterion leads to the same conclusion as
the likelihood-ratio test: the independence model fits this particular set of data better. We
recommend using the AIC or Schwarz criterion for models with the same fixed effects terms
but different, non-nested covariance structures.
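
Judging from the Fit Statistics in the two outputs above, the criterion in this release appears
to be computed as the residual log likelihood penalized by the number of covariance
parameters q, so that larger values are better:

AIC_{AR(1)} = -29.3 - 3 = -32.3, \qquad AIC_{indep} = -29.66 - 2 \approx -31.6.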
In terms of matrices and vectors, Y_i = X_iβ + Z_ib_i + e_i, the model for the ith cluster is
written as

Y_i = X_i\beta + \mathbf{1}b_i + e_i.    [7.38]

Notice that the Z_i matrix is the first column of the X_i matrix (a random intercept). If the error
terms e_ij are homoscedastic and uncorrelated and Cov[b_i, e_ij] = 0, the marginal variance-
covariance matrix of the observations from the ith cluster is then

Var[Y_i] = \sigma^2_b Z_iZ_i' + \sigma^2 I
= \sigma^2_b\begin{bmatrix}1\\1\\\vdots\\1\end{bmatrix}\begin{bmatrix}1&1&\cdots&1\end{bmatrix} + \sigma^2\begin{bmatrix}1&0&\cdots&0\\0&1&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&1\end{bmatrix}
= \begin{bmatrix}\sigma^2_b+\sigma^2 & \sigma^2_b & \cdots & \sigma^2_b\\ \sigma^2_b & \sigma^2_b+\sigma^2 & \cdots & \sigma^2_b\\ \vdots & \vdots & \ddots & \vdots\\ \sigma^2_b & \sigma^2_b & \cdots & \sigma^2_b+\sigma^2\end{bmatrix}
= \sigma^2_b J + \sigma^2 I.
Let ρ = σ²_b/(σ²_b + σ²) denote the ratio between the variability among clusters and the
variability of an observation. Var[Y_i] can then also be expressed as

Var[Y_i] = \left(\sigma^2_b + \sigma^2\right)\begin{bmatrix}1&\rho&\cdots&\rho\\ \rho&1&\cdots&\rho\\ \vdots&\vdots&\ddots&\vdots\\ \rho&\rho&\cdots&1\end{bmatrix}.    [7.40]

If instead the within-cluster variance-covariance matrix is modeled directly as

R_i = \phi\begin{bmatrix}1&\rho&\cdots&\rho\\ \rho&1&\cdots&\rho\\ \vdots&\vdots&\ddots&\vdots\\ \rho&\rho&\cdots&1\end{bmatrix},    [7.41]
then the model also has a compound-symmetric correlation structure. Such a model could
arise with clusters of size four if one draws blood samples from each leg of a heifer and ρ
measures the correlation of serum concentrations among the samples from a single animal. No
leg takes precedence over any other leg; they are exchangeable, with no particular order. In
this model the within-cluster variance-covariance matrix was targeted directly to capture the
correlations among the observations from the same cluster; we therefore call it direct
modeling of the correlations. Observe that if φ = σ²_b + σ² there is no difference in the
marginal variability between [7.40] and [7.41]. In §7.6.3 we examine a subsampling design
and show that modeling a random intercept according to [7.40] and modeling the
exchangeable structure [7.41] directly lead to the same inference.
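
As an illustration of that equivalence, the following two proc mixed runs (the data set and
variable names are generic) should produce identical marginal fits:

/* induced correlations: random intercept for each cluster */
proc mixed data=yourdata;
   class cluster tx;
   model y = tx;
   random intercept / subject=cluster;
run;

/* direct modeling: exchangeable within-cluster correlations */
proc mixed data=yourdata;
   class cluster tx;
   model y = tx;
   repeated / subject=cluster type=cs;
run;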
There is also a mixture approach where some correlations are induced through random
effects and the variance-covariance matrix R_i of the within-cluster errors is also structured. In
the linear mixed model with random intercepts above ([7.38]), assume that the measurements
from a cluster are repeated observations collected over time. It is then reasonable to assume
that the measurements from a given cluster are serially correlated, i.e., that R_i is not a diagonal
matrix. If it is furthermore sensible to posit that observations close together in time are more
highly correlated than those far apart, one may put

R_i = \sigma^2\left[\rho^{|t_{ij} - t_{ij'}|}\right]_{j,j'}.

Notice that with ρ > 0 the within-cluster correlations approach zero with increasing tem-
poral separation and the marginal correlations approach σ²_b/(σ²_b + σ²). Decaying correlations
with increasing separation are typically reasonable. Because mixed models induce correla-
tions through the random effects, one can quite simply achieve marginal correlations that are
functions of a chosen metric. If correlations are to depend on time, for example, simply include
a time variable as a column of Z. The resulting marginal correlation structure may not be
meaningful, however. We illustrate with an example.
Example 7.5. The growth pattern of an experimental soybean variety is studied. Eight
plots are seeded and the average leaf weight per plot is assessed at weekly intervals fol-
lowing germination. If t measures time since seeding in days, the average growth is
modeled as the quadratic polynomial

Y_{ij} = \beta_0 + \beta_1 t_{ij} + \beta_2 t^2_{ij} + e_{ij},

where Y_ij is the average leaf weight per plant on plot i measured at time t_ij. The double
subscript for the time variable t allows measurement occasions to differ among plots.
Figure 7.9 shows data for the growth of soybeans on the eight plots, simulated after
Figure 1.2 in Davidian and Giltinan (1995). In this experiment a plot serves as a cluster,
and the eight trends differ in their linear gradients (slopes). It is thus reasonable to add
random coefficients to β₁. The mixed model becomes

Y_{ij} = \beta_0 + \left(\beta_1 + b_{1i}\right)t_{ij} + \beta_2 t^2_{ij} + e_{ij}.    [7.42]
Figure 7.9. Simulated soybean leaf weight profiles fitted to data from eight experimental
plots.
If the within-cluster errors are independent, reasonable when the leaf weight is obtained
from a random sample of plants from plot i at time t_ij, the model in matrix formulation
is Y_i = X_iβ + Z_ib_i + e_i, Var[e_i] = σ²I, with quantities defined as follows:

Y_i = \begin{bmatrix}Y_{i1}\\ \vdots\\ Y_{in_i}\end{bmatrix},\quad
X_i = \begin{bmatrix}1 & t_{i1} & t^2_{i1}\\ \vdots & \vdots & \vdots\\ 1 & t_{in_i} & t^2_{in_i}\end{bmatrix},\quad
\beta = \begin{bmatrix}\beta_0\\ \beta_1\\ \beta_2\end{bmatrix},\quad
Z_i = \begin{bmatrix}t_{i1}\\ \vdots\\ t_{in_i}\end{bmatrix},\quad
b_i = b_{1i}.

The covariance structure depends on the time variable, which is certainly a meaningful
metric for the correlations. Whether the correlations are suitable functions of that metric
can be argued. Between any two leaf weight measurements on the same plot we have

Corr[Y_{ij}, Y_{ij'}] = \frac{\sigma^2_b t_{ij}t_{ij'}}{\sqrt{\left(\sigma^2_b t^2_{ij} + \sigma^2\right)\left(\sigma^2_b t^2_{ij'} + \sigma^2\right)}}.

For illustration let σ² = σ²_b = 1 and the repeated measurements be coded t_i1 = 1,
t_i2 = 2, t_i3 = 3, and so forth. Then Corr[Y_i1, Y_i2] = 2/√10 = 0.63, Corr[Y_i1, Y_i3] =
3/√20 = 0.67, and Corr[Y_i1, Y_i4] = 4/√34 = 0.68. The correlations are not decaying
with temporal separation; they increase. Also, two time points equally spaced apart do
not have the same correlation. For example, Corr[Y_i2, Y_i3] = 6/√50 = 0.85, which
exceeds the correlation between time points 1 and 2. If one were to code time as
t_i1 = 0, t_i2 = 1, t_i3 = 2, …, the correlations of any observations with the first time
point would be uniformly zero.
If correlations are subject to modeling, they should be modeled directly on the within-
cluster level, i.e., through the R_i matrix. Interpretation of the correlation pattern should then
be confined to cluster-specific inference. Trying to pick up correlations by choosing columns
of the Z_i matrix that are functions of the correlation metameter will not necessarily lead to a
meaningful marginal correlation model, or one that can be interpreted with ease.
In what follows the within-cluster variance-covariance matrix is written as

Var[e_i] = R_i(\alpha),

where α is a vector of parameters determining the correlations among, and the variances of,
the elements of e_i. In previous notation we labeled θ the vector of covariance parameters in
Var[Y_i] = Z_iDZ_i' + R_i. Hence α contains only those covariance parameters that are not con-
tained in Var[b_i] = D. In many applications α will be a two-element vector, containing one
parameter to model the within-cluster correlations and one parameter to model the within-
cluster variances.
We would be remiss not to mention approaches which account for serial correlations but
make no assumptions about the structure of Var[e_i]. One is the multivariate repeated measures
approach (Cole and Grizzle 1966, Crowder and Hand 1990, Vonesh and Chinchilli 1997),
sometimes labeled multivariate analysis of variance (MANOVA). This approach is restrictive
if data are unbalanced or missing or covariates are varying with time. The mixed model
approach based on the Laird-Ware model is more general in that it allows clusters of unequal
sizes, unequal spacing of observation times or locations, and missing observations. The
MANOVA approach essentially uses an unstructured variance-covariance matrix, which is
one of the structures open to investigation in the Laird-Ware model (Jennrich and Schluchter
1986).
An (n × n) covariance matrix contains n(n + 1)/2 unique elements, and if a cluster
contains 6 measurements, say, up to 6·7/2 = 21 parameters need to be estimated in addition
to the fixed effects parameters and variances of any random effects. Imposing structure on the
variance-covariance matrix beyond an unstructured model requires far fewer parameters. In
the remainder of this section it is assumed for the sake of simplicity that a cluster contains
n_i = 4 elements and that measurements are collected in time. Some of these structures will
reemerge when we are concerned with spatial data in §9.
The simplest covariance structure is the independence structure,

Var[e_i] = \sigma^2 I = \sigma^2\begin{bmatrix}1&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&0&1\end{bmatrix}.    [7.43]

Apart from the scalar σ², no additional parameters need to be estimated. The compound-
symmetric or exchangeable structure is

R_i(\alpha) = \sigma^2\begin{bmatrix}1&\rho&\rho&\rho\\ \rho&1&\rho&\rho\\ \rho&\rho&1&\rho\\ \rho&\rho&\rho&1\end{bmatrix},\quad \alpha = \left[\sigma^2, \rho\right]'.    [7.44]
It appears that only one random effect has been specified, the whole-plot experimental
error rep(A). Proc mixed will add a second, residual error term automatically, which corre-
sponds to the sub-plot experimental error. The default containment method of assigning
degrees of freedom will ensure that F tests are formulated correctly, that is, whole-plot
effects are tested against the whole-plot experimental error variance and sub-plot effects and
interactions are tested against the sub-plot experimental error variance. If comparisons of the
treatment means are desired, we recommend adding the ddfm=satterth option to the model
statement. This will invoke the Satterthwaite approximation where necessary, for example,
when comparing whole-plot treatments at the same level of the sub-plot factor (see §7.6.5 for
an example):
proc mixed data=yourdata;
class rep A B;
model y = rep A B A*B / ddfm=satterth;
random rep(A);
run;
As detailed in §7.5.3 this error structure will give rise to a marginal compound-
symmetric structure. The direct approach to modeling a split-plot design is to specify the
compound-symmetric model through the repeated statement:
proc mixed data=yourdata;
class rep A B;
model y = A B A*B / ddfm=satterth;
repeated / subject=rep(A) type=cs;
run;
In longitudinal or repeated measures studies where observations are ordered along a time
scale, the compound symmetry structure is often not reasonable. Correlations for pairs of time
points are not the same and hence not exchangeable. The analysis of repeated measures data
as split-plot-type designs assuming compound symmetry is, however, very common in
practice. In §7.5.3 we examine under which conditions this is an appropriate analysis.
The first, rather crude, modification is not to specify anything about the structure of the
correlation matrix and to estimate all unique elements of R_i(α). This unstructured variance-
covariance matrix can be expressed as

R_i(\alpha) = \begin{bmatrix}\sigma^2_1 & \sigma_{12} & \sigma_{13} & \sigma_{14}\\ \sigma_{21} & \sigma^2_2 & \sigma_{23} & \sigma_{24}\\ \sigma_{31} & \sigma_{32} & \sigma^2_3 & \sigma_{34}\\ \sigma_{41} & \sigma_{42} & \sigma_{43} & \sigma^2_4\end{bmatrix}.    [7.45]

Here, σ_jj' = σ_j'j is the covariance between observations j and j' within a cluster. This is not a
parsimonious structure; there are n_i(n_i + 1)/2 parameters that need to be estimated. In
many repeated measures data sets, where the sequence of temporal observations is short, in-
sufficient information is available to estimate all the correlations with satisfactory precision.
Furthermore, there is no guarantee that the correlations will decrease with temporal separa-
tion. If that is a reasonable stipulation, other models should be employed. The unstructured
model is fit by proc mixed with the type=un option of the repeated statement.
The large number of parameters in the unstructured parameterization can be reduced by
introducing constraints. For example, one may assume that all c-step correlations are
identical,

\rho_{jj'} = \rho_c \quad \text{if } |j - j'| = c.

This leads to banded, also called Toeplitz, structures if the diagonal elements are the same. A
Toeplitz matrix of order k has k − 1 off-diagonals filled with the same element. A 2-banded
Toeplitz parameterization is

R_i(\alpha) = \sigma^2\begin{bmatrix}1&\rho_1&0&0\\ \rho_1&1&\rho_1&0\\ 0&\rho_1&1&\rho_1\\ 0&0&\rho_1&1\end{bmatrix},\quad \alpha = \left[\sigma^2, \rho_1\right]',    [7.46]

and a 3-banded Toeplitz parameterization is

R_i(\alpha) = \sigma^2\begin{bmatrix}1&\rho_1&\rho_2&0\\ \rho_1&1&\rho_1&\rho_2\\ \rho_2&\rho_1&1&\rho_1\\ 0&\rho_2&\rho_1&1\end{bmatrix},\quad \alpha = \left[\sigma^2, \rho_1, \rho_2\right]'.    [7.47]

A 2-banded Toeplitz structure may be appropriate if, for example, measurements are taken at
weekly intervals, but correlations do not extend past a period of seven days. In a turfgrass
experiment where mowing clippings are collected weekly but a fast-acting growth regulator is
applied every ten days, an argument can be made that correlations do not persist over more
than two measurement intervals.
Unstructured correlation models can also be banded by setting the elements in off-diagonal
cells more than k − 1 positions from the main diagonal to zero. The 2-banded unstructured
parameterization is

R_i(\alpha) = \begin{bmatrix}\sigma^2_1 & \sigma_{12} & 0 & 0\\ \sigma_{21} & \sigma^2_2 & \sigma_{23} & 0\\ 0 & \sigma_{32} & \sigma^2_3 & \sigma_{34}\\ 0 & 0 & \sigma_{43} & \sigma^2_4\end{bmatrix},\quad \alpha = \left[\sigma^2_1, \ldots, \sigma^2_4, \sigma_{12}, \sigma_{23}, \sigma_{34}\right]'.    [7.48]

The 1-banded unstructured matrix

R_i(\alpha) = \begin{bmatrix}\sigma^2_1 & 0 & 0 & 0\\ 0 & \sigma^2_2 & 0 & 0\\ 0 & 0 & \sigma^2_3 & 0\\ 0 & 0 & 0 & \sigma^2_4\end{bmatrix}

is appropriate for independent observations which differ in their variability.
These structures are fit in proc mixed with the following options of the repeated statement:
type=Toep(2) /* 2-banded Toeplitz */
type=Toep(3) /* 3-banded Toeplitz */
type=un(2) /* 2-banded unstructured */
type=un(3) /* 3-banded unstructured */
type=un(1) /* 1-banded unstructured */.
We now turn to models where the correlations decrease with temporal separation of the
measurements. One of the more popular models is borrowed from the analysis of time series
data. Assume that a present observation at time t, Y(t), is related to the immediately
preceding observation at time t − 1 through the relationship

Y(t) = \rho Y(t-1) + e(t).    [7.51]

The e(t)'s are uncorrelated, identically distributed random variables with mean 0 and variance
σ²_e, and ρ is the autoregressive parameter of the time-series model. In the vernacular of time
series analysis the e(t) are called the random innovations (random shocks) of the process.
This model is termed the first-order autoregressive (AR(1)) time series model since an out-
come Y(t) is regressed on the immediately preceding observation.
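
To make such realizations concrete, a minimal SAS data step (seed and parameter values
are ours) that generates one series according to [7.51]:

data ar1;
   y = 0;                                 /* start the series at its mean      */
   do t = 1 to 100;
      y = 0.5*y + sqrt(0.2)*rannor(1234); /* rho = 0.5, Var[e(t)] = 0.2        */
      output;
   end;
   keep t y;
run;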
Figure 7.10. Realizations of first-order autoregressive time series with σ²_e = 0.2. a) white
noise process; b) ρ = 0.5; c) random walk.
Figure 7.10 shows three realizations of first-order autoregressive models with ρ = 0, 0.5,
1.0, and mean 0. As ρ increases, series of positive and negative deviations from the mean
become longer, a sign of positive serial autocorrelation. To study the correlation structure
implied by this and other models we examine the covariance and correlation function of the
process (see §2.5.1 and §9). For a stationary AR(1) process we have −1 < ρ < 1, and the
function

C(k) = Cov[Y(t), Y(t+k)]

measures the covariance between two observations k time units apart. It is appropriately
called the covariance function. Note that C(0) = Cov[Y(t), Y(t)] = Var[Y(t)] is the var-
iance of an observation. Under stationarity, the covariances C(k) depend on the temporal
separation only, not on the time origin. Time points spaced five units apart are correlated by
the same amount, whether the first time point was a Monday or a Wednesday. The
correlation function R(k) is written simply as R(k) = C(k)/C(0).
The covariance and correlation function of the AR(1) time series [7.51] are derived in
§A7.7.6. The elementary recursive relationship is C(k) = ρC(k−1) = ρ^k C(0) for k > 0.
Rearranging, one obtains the correlation function as R(k) = C(k)/C(0) = ρ^k. The auto-
regressive parameter thus measures the strength of the correlation of observations one time
unit apart, the lag-one correlation. For longer temporal separation the correlation is a power
of ρ where the exponent equals the number of temporal lags. Since the lags k are discrete, so
is the correlation function. For positive ρ the correlations step down every time k increases
(Figure 7.11); for negative ρ, positive and negative correlations alternate, eventually con-
verging to zero. Negative correlations of adjoining observations are not the norm and are
usually indicative of an incorrect model for the mean function. Jones (1993, p. 54) cites as an
example a process with circadian rhythm. If daily measurements are taken early in the morning
and late at night, it is likely that the daily rhythm affects the response and must be accounted
for in the mean function. Failure to do so may result in model errors with negative serial
correlation.
Figure 7.11. Correlation functions for first-order autoregressive processes with ρ = 0.1, 0.3,
and 0.7.
In terms of the four-element cluster i and its variance-covariance matrix R_i(α), the AR(1)
process is depicted as

R_i(\alpha) = \sigma^2\begin{bmatrix}1&\rho&\rho^2&\rho^3\\ \rho&1&\rho&\rho^2\\ \rho^2&\rho&1&\rho\\ \rho^3&\rho^2&\rho&1\end{bmatrix},\quad \alpha = \left[\sigma^2, \rho\right]',    [7.52]

where C(0) = σ². The AR(1) model in longitudinal data analysis dates back to Potthoff and
Roy (1964) and is popular for several reasons. The model is parsimonious; the correlation
matrix is defined by a single parameter ρ. The model is easy to fit to data. Numerical prob-
lems in iterative likelihood or restricted likelihood estimation of α can often be reduced by
specifying an AR(1) correlation model rather than some of the more complicated models
below. Missing observations within a cluster are not a problem. For example, if it was
planned to take measurements at times 1, 2, 3, 4 but the third measurement was unavailable or
destroyed, the row and column associated with the third observation are simply deleted from
the correlation matrix:

R_i(\alpha) = \sigma^2\begin{bmatrix}1&\rho&\rho^3\\ \rho&1&\rho^2\\ \rho^3&\rho^2&1\end{bmatrix}.
The AR(1) model is fit in proc mixed with the type=ar(1) option of the repeated statement.
The actual measurement times do not enter the correlation matrix in the AR(1) process
with discrete lag, only information about whether a measurement occurred after or before
another measurement. It is sometimes labeled a discrete autoregressive process for this reason,
and implicit is the assumption that the measurements are equally spaced. Sometimes there is
no basic interval at which observations are taken and observations are unequally spaced. In
this case the underlying metric of sampling within a cluster must be continuous. Note that
unequal spacing is not a sufficient condition to distinguish continuous from discrete proc-
esses. Even if the measurements are equally spaced, the underlying metric may still be
continuous. The leaf area of perennial flowers in a multiyear study, collected on a few days
throughout the years at irregular intervals, should not be viewed as discrete daily data with
most observations missing, but as unequally spaced observations collected in continuous time
with no observations missing. For unequally spaced time intervals the AR(1) model is not the
best choice. Assume measurements were gathered at days 1, 2, 6, and 11. The AR(1) model
assumes that the correlation between the first and second measurement (spaced one day apart)
equals that between the second and third measurement (spaced four days apart). Although
there are two pairs of measurements with lag 5, their correlations are not identical. The corre-
lation between the day 1 and day 6 measurements is ρ², that between the day 6 and the day 11
measurements is ρ.
With unequally spaced data the actual measurement times should be taken into account in
the correlation model. Denote by t_ij the jth time at which cluster i was observed. We allow
these time points to vary from cluster to cluster. The continuous analog of the discrete AR(1)
process has correlation function

C(k)/C(0) = \exp\left\{-\frac{|t_{ij} - t_{ij'}|}{\phi}\right\}    [7.53]

and is called the continuous AR(1) or the exponential correlation model (Diggle 1988, 1990;
Jones and Boadi-Boateng 1991; Jones 1993; Gregoire et al. 1995). The lag between two
observations is measured as the absolute difference of the measurement times t_ij and
t_ij', k = |t_ij − t_ij'|. Denoting C(0) = σ² yields the covariance function

C(k) = \sigma^2\exp\left\{-\frac{|t_{ij} - t_{ij'}|}{\phi}\right\}.    [7.54]

For the four-element cluster the within-cluster variance-covariance matrix becomes

R_i(\alpha) = \sigma^2\begin{bmatrix}
1 & e^{-|t_{i1}-t_{i2}|/\phi} & e^{-|t_{i1}-t_{i3}|/\phi} & e^{-|t_{i1}-t_{i4}|/\phi}\\
e^{-|t_{i2}-t_{i1}|/\phi} & 1 & e^{-|t_{i2}-t_{i3}|/\phi} & e^{-|t_{i2}-t_{i4}|/\phi}\\
e^{-|t_{i3}-t_{i1}|/\phi} & e^{-|t_{i3}-t_{i2}|/\phi} & 1 & e^{-|t_{i3}-t_{i4}|/\phi}\\
e^{-|t_{i4}-t_{i1}|/\phi} & e^{-|t_{i4}-t_{i2}|/\phi} & e^{-|t_{i4}-t_{i3}|/\phi} & 1
\end{bmatrix},\quad \alpha = \left[\sigma^2, \phi\right]'.    [7.55]
Figure 7.12. Correlation functions C(k)/C(0) in the exponential correlation model with
φ = 1, 1.5, 2, and 3, plotted against the lag |t_ij − t_ij'|.
The magnitude of φ depends on the units in which time was measured. Since the tem-
poral lags and φ appear in the exponent of the correlation matrix, numerical overflows or
underflows can occur depending on the temporal units used. For example, if time was
measured in seconds and the software package fails to report an estimate of φ, the iterative
estimation algorithms can be helped by rescaling the time variable into minutes or hours. In
proc mixed of The SAS® System this correlation model is fit with the type=sp(exp)(time)
option of the repeated statement. sp() denotes the family of spatial correlation models proc
mixed provides; sp(exp) is the exponential model. In the second set of parentheses are listed
the numeric variables in the data set that contain the coordinate information. In the spatial
context, one would list the longitude and latitude of the sample locations, e.g.,
type=sp(exp)(xcoord ycoord). In the temporal setting only one coordinate is needed, the
time of measurement.
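
Such a rescaling is a one-line data step; the data set and variable names here are generic:

data rescaled;
   set yourdata;
   hours = time/3600;   /* time recorded in seconds, rescaled to hours */
run;

The rescaled variable is then listed as the coordinate, e.g., type=sp(exp)(hours).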
A reparameterization of the exponential model is obtained by setting exp{−1/φ} = ρ:

C(k)/C(0) = \rho^{|t_{ij} - t_{ij'}|}.    [7.56]

SAS® calls this the power model. This terminology is unfortunate because the power model
in spatial data analysis is known as a different covariance structure (see §9.2.2). The specifi-
cation as type=sp(pow)(time) in SAS® as a spatial covariance structure suggests that
sp(pow) refers to the spatial power model. Instead, it refers to [7.56]. It is our experience that
numerical difficulties encountered when fitting the exponential model in the form [7.54] can
often be overcome by changing to the parameterization [7.56]. From [7.56] the close resem-
blance of the continuous and discrete AR(1) models is readily established. If observations are
equally spaced, that is,

|t_{ij} - t_{ij'}| = c|j - j'|

for some constant c, the continuous and discrete autoregressive correlation models produce
the same correlation function at lag k.
A second model for continuous, equally or unequally spaced observations is called the
gaussian model (Figure 7.13). It differs from the exponential model only in the square in the
exponent,

C(k)/C(0) = \exp\left\{-\frac{(t_{ij} - t_{ij'})^2}{\phi^2}\right\}.    [7.57]

The name must not imply that the gaussian correlation model deserves veneration similar
to the Gaussian probability model. Stein (1999, p. 25) points out that "Nothing could be far-
ther from the truth." The practical range for the gaussian correlation model is φ√3. For the
same practical range as in the exponential model, correlations are more persistent over short
ranges; they decrease less rapidly (compare the model with φ = 6/√3 in Figure 7.13 to that
with φ = 2 in Figure 7.12). Stochastic processes whose autocorrelation follows the gaussian
model are highly continuous and smooth (see §9.2.3). It is difficult to imagine physical proc-
esses of this kind. We use lowercase spelling when referring to the correlation model to avoid
confusion with the Gaussian distribution. From where does the model get its name? A sto-
chastic process with covariance function

C(k) = c\exp\left\{-\beta k^2\right\}

has spectral density

f(s) = \frac{c}{2\sqrt{\pi\beta}}\exp\left\{-s^2/(4\beta)\right\},

which resembles in functional form the Gaussian probability density function. The gaussian
model is fit in proc mixed with the type=sp(gau)(time) option of the repeated statement.
Figure 7.13. Correlation functions in the gaussian correlation model. The model with φ = 6/√3
has the same practical range as the model with φ = 2 in Figure 7.12.
A final correlation model for data in continuous time, with or without equal spacing, is the
spherical model

C(k)/C(0) = \begin{cases}1 - \dfrac{3}{2}\left(\dfrac{|t_{ij} - t_{ij'}|}{\phi}\right) + \dfrac{1}{2}\left(\dfrac{|t_{ij} - t_{ij'}|}{\phi}\right)^3 & |t_{ij} - t_{ij'}| \le \phi\\[1ex] 0 & |t_{ij} - t_{ij'}| > \phi.\end{cases}    [7.58]

In contrast to the exponential and gaussian models the spherical structure has a true range. At
lag φ the correlation is exactly zero and remains zero thereafter (Figure 7.14).
The spherical model is less smooth than the gaussian correlation model but more so than
the exponential model. The spherical model is probably the most popular model for
autocorrelated data in geostatistical applications (see §9.2.2). To Stein (1999, p. 52), this
popularity is a mystery that he attributes to the simple functional form and the “mistaken
belief that there is some statistical advantage in having the autocorrelation function being
exactly 0 beyond some finite distance." This correlation model is fit in proc mixed with the
type=sp(sph)(time) option of the repeated statement.
The exponential, gaussian, and spherical models for processes in continuous time, as well
as the discrete AR(1) model, assume stationarity of the variance of the within-cluster errors.
Models that allow for heterogeneous variances, such as the unstructured models, have many
parameters. A class of flexible correlation models which allows for nonstationarity of the
within-cluster variances and for changes in the correlations without parameter proliferation was
first conceived by Gabriel (1962) and is known as the ante-dependence models. Both
continuous and discrete versions of ante-dependence models exist, each in different orders.
Figure 7.14. Spherical correlation models with ranges equal to the practical ranges for the
exponential models in Figure 7.12.
Following Kenward (1987) and Machiavelli and Arnold (1994), the discrete version of a
first-order ante-dependence model can be expressed as

R_i(\alpha) = \begin{bmatrix}
\sigma^2_1 & \sigma_1\sigma_2\rho_1 & \sigma_1\sigma_3\rho_1\rho_2 & \sigma_1\sigma_4\rho_1\rho_2\rho_3\\
\sigma_2\sigma_1\rho_1 & \sigma^2_2 & \sigma_2\sigma_3\rho_2 & \sigma_2\sigma_4\rho_2\rho_3\\
\sigma_3\sigma_1\rho_2\rho_1 & \sigma_3\sigma_2\rho_2 & \sigma^2_3 & \sigma_3\sigma_4\rho_3\\
\sigma_4\sigma_1\rho_3\rho_2\rho_1 & \sigma_4\sigma_2\rho_3\rho_2 & \sigma_4\sigma_3\rho_3 & \sigma^2_4
\end{bmatrix}.    [7.59]

Zimmerman and Núñez-Antón (1997) termed it AD(1). For cluster size n_i the discrete AD(1)
model contains 2n_i − 1 parameters, but offers nearly the flexibility of a completely
unstructured model with n_i(n_i + 1)/2 parameters. Zimmerman and Núñez-Antón (1997)
discuss extensions to accommodate continuous correlation processes in ante-dependence
models. Their continuous first-order model, termed the structured ante-dependence model
(SAD(1)), is given by

Corr[e_{ij}, e_{ij'}] = \rho^{f(t_{ij},\lambda) - f(t_{ij'},\lambda)}, \quad j' < j
Var[e_{ij}] = \sigma^2 g(t_{ij}, \psi), \quad j = 1, \ldots, n_i,    [7.60]

where f(t_ij, λ) and g(t_ij, ψ) are functions of the measurement times or locations which
depend on parameter vectors λ and ψ. Zimmerman and Núñez-Antón advocate choosing f(•)
from the family of Box-Cox transformations (see §5.6.2),

f(t_{ij}, \lambda) = \begin{cases}\left(t^{\lambda}_{ij} - 1\right)/\lambda & \lambda \ne 0\\ \ln\{t_{ij}\} & \lambda = 0.\end{cases}

If λ = 1 and g(t_ij, ψ) = 1, the power correlation model results. See Zimmerman and Núñez-
Antón (1997) for higher-order ante-dependence models and Machiavelli and Arnold (1994)
for variable-order models. The SAD(·) models alleviate some shortcomings of the stationary
continuous models. In growth studies, variability often increases with time. Heteroscedas-
ticity of the within-cluster residuals is already incorporated in ante-dependence models. Also,
equidistant observations do not necessarily have the same correlation, as is implied by the
stationary models discussed above. The discrete first-order ante-dependence model can be fit
with proc mixed as type=ante(1).
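
A sketch of such a fit (the data set and variable names are generic; the repeated effect must
be a class variable so that proc mixed can order the measurements within a unit):

proc mixed data=yourdata;
   class unit time;
   model y = time;
   repeated time / subject=unit type=ante(1);
run;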
a) Replicate I, split-plot:          b) Replicate I, repeated measures:

   A1 | B1 B3 B4 B2                     A1 | T1 T2 T3 T4
   A3 | B4 B2 B1 B3                     A3 | T1 T2 T3 T4
   A2 | B4 B3 B2 B1                     A2 | T1 T2 T3 T4
Figure 7.15. Replicate in split-plot design with three levels of the whole-plot and four levels
of the sub-plot treatment factor (a) and single block in repeated measures design with three
treatments and four re-measurements (b).
Both replicates have the same number of observations. For each whole-plot there are four
sub-plot treatments in the split-plot design and four remeasurements in Figure 7.15b. The
split-plot analysis proceeds by calculating the analysis of variance of the design and then
formulating appropriate test statistics to test for factor A main effects, B main effects, and
A × B interactions, followed by tests of treatment contrasts or other post-ANOVA pro-
cedures. The analysis of variance table for the split-plot design with r replicates is based on
the linear mixed model

Y_{ijk} = \mu + \rho_j + \alpha_i + e_{ij} + \beta_k + (\alpha\beta)_{ik} + e_{ijk},

where the ρ_j (j = 1, …, r) are the whole-plot replication (block) effects, the α_i (i = 1, …, a) are the
whole-plot treatment effects, e_ij is the whole-plot experimental error, the β_k (k = 1, …, b) are the
sub-plot treatment effects, the (αβ)_ik are the interactions, and e_ijk denotes the sub-plot experi-
mental errors. Letting σ²_e denote the whole-plot experimental error variance and σ² the sub-
plot experimental error variance, Table 7.6 shows the analysis of variance and expected mean
squares. From the expected mean squares it is seen that the appropriate test statistics for main
effects and interaction tests are

F_A = MS(A)/MS(whole-plot error),  F_B = MS(B)/MS(sub-plot error),
F_{A×B} = MS(A × B)/MS(sub-plot error).
First we observe that the sub-plot treatments are randomized to the whole-plots, but the
repeated measurements cannot be randomized. Time point T₁ occurs before T₂, which occurs
before T₃, and so forth. Is this difference substantial enough to throw off the analysis, though?
To approach an answer it is worthwhile to study the correlation pattern that the split-plot de-
sign implies. Whole-plot errors are independent due to randomization of the whole-plot treat-
ments, and so are sub-plot errors. Independent randomizations to whole- and sub-plots also
establish that Cov[e_ij, e_ijk] = 0. Since sub-plot errors are nested within whole-plots this is the
same setting as in the random intercept model of §7.5.1, and one arrives at a compound-
symmetric structure for the observations from the same whole-plot,

Var[Y_{ij·}] = σ²_e J + σ² I.

For the repeated measures layout in Figure 7.15b the corresponding model is

Y_{ijk} = μ + α_i + e_{ij} + τ_k + (ατ)_{ik} + d_{ijk},

where τ_k represents the time effects and (ατ)_ik the treatment × time interactions. A test of
H₀: τ₁ = ⋯ = τ_t via a regular analysis of variance F test of the form

F_obs = MS(Time)/MS(Error(Time))
is valid if the variance-covariance matrix of the "sub-plot" errors e_ij = [e_ij1, …, e_ijt]' can be
expressed in the form

Var[e_ij] = λI_t + γ1'_t + 1_tγ',

where γ is a vector of parameters and λ is a constant. Similarly, the F test for the whole-plot
factor H₀: α₁ = ⋯ = α_a is valid if the whole-plot errors e_j = [e_1j, …, e_aj]' have a variance-
covariance matrix which can be expressed as

Var[e_j] = λ*I_a + γ*1'_a + 1_aγ*'.

The variance-covariance matrix is then said to meet the Huynh-Feldt conditions. We note in
passing that a correct analysis via split-plot ANOVA requires that the condition be met for
every random term in the model (Milliken and Johnson 1992, p. 325). Two special cases of
variance-covariance matrices that meet the Huynh-Feldt conditions are independence and
compound symmetry. The combination of λ = σ² and γ = 0 yields the independence struc-
ture σ²I; the combination of λ = σ² and γ = ½[σ²_t, …, σ²_t]' yields a compound-symmetric
structure σ²I + σ²_tJ.
Analyzing repeated measures data with split-plot models implicitly assumes a compound
symmetric or Huynh-Feldt structure which may not be appropriate. If correlations decay over
time, for example, the compound symmetric model is not a reasonable correlation model.
Two different courses of action can then be taken. One relies on making adjustments to the
degrees of freedom for test statistics in the split-plot analysis, the other focuses on modeling
the variance-covariance structure of the within-cluster errors. We comment on the adjustment
method first.
Box (1954b) developed the measure ε for the deviation from the Huynh-Feldt conditions.
ε is bounded between (t − 1)⁻¹ and 1, where t is the number of repeated measurements. The
degrees of freedom for tests of Time main effects and Treatment × Time interactions are
multiplied by ε; the observed F statistic for the Time main effect is then not compared against
the critical value F_{α, t−1, a(r−1)(t−1)} but against F_{α, ε(t−1), εa(r−1)(t−1)}. Unfortunately,
ε is unknown. A conservative approach is to set ε equal to its lower bound. The critical value
then would be F_{α, 1, a(r−1)} for the Time main effect test and F_{α, a−1, a(r−1)} for the test
of Treatment × Time interactions. The reduction in test power that results from the conservative
adjustment is disconcerting. Less power is sacrificed by estimating ε from the data. The
estimator is discussed in Milliken and Johnson (1992, p. 355). Briefly, let

\hat{g}_{kk'} = \frac{1}{a(r-1)}\sum_{i,j}\left(y_{ijk} - \bar{y}_{ij\cdot} - \bar{y}_{i\cdot k} + \bar{y}_{i\cdot\cdot}\right)\left(y_{ijk'} - \bar{y}_{ij\cdot} - \bar{y}_{i\cdot k'} + \bar{y}_{i\cdot\cdot}\right).

Then

\hat{\varepsilon} = \left(\sum_{k=1}^{t}\hat{g}_{kk}\right)^{2}\Big/\left\{(t-1)\sum_{k,k'}\hat{g}^2_{kk'}\right\}.    [7.61]

This adjustment factor for the degrees of freedom is less conservative than using the lower
bound of ε, but more conservative than an adjustment proposed by Huynh and Feldt (1976).
The estimated Box epsilon [7.61] and the Huynh-Feldt adjustment are calculated by proc glm
of The SAS® System if the repeated statement of that procedure is used. [7.61] is labeled
Greenhouse-Geisser Epsilon on the proc glm output (Greenhouse and Geisser 1959). To
decide whether any adjustment is necessary, i.e., whether the dispersion of the data deviates
significantly from the Huynh-Feldt conditions, a test of sphericity can be invoked. This test is
available through the printE option of the repeated statement in proc glm. If the sphericity
test is rejected, a degree of freedom adjustment is deemed necessary.
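
For a design with one treatment factor and four repeated measurements, a sketch of such a
proc glm analysis (the data set and variable names are ours; the responses must be in the
multivariate, one-row-per-unit layout):

proc glm data=rmdata;
   class tx;
   model y1-y4 = tx;            /* four repeated measurements per unit  */
   repeated time 4 / printe;    /* sphericity test; Greenhouse-Geisser  */
run;                            /* and Huynh-Feldt epsilons are printed */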
This type of repeated measures analysis attempts to coerce the analysis into a split-plot
model framework and, if that framework does not apply, uses fudge factors to adjust the end
result (critical values or p-values). If the sphericity assumption (i.e., the Huynh-Feldt
conditions) is violated, the basic problem from our standpoint is that the statistical model
undergirding the analysis is not correct. We prefer a more direct, and hopefully more
intuitive, approach to modeling repeated measures data. Compound symmetry is a disper-
sion/correlation structure that comes about in the split-plot model through nested, indepen-
dent random components. In a repeated measures setting there is often a priori knowledge or
theory about the correlation pattern over time. In a two-year study of an annual crop it is
reasonable to assume that within a growing season measurements are serially correlated and
that correlations wear off with temporal separation. Secondly, it may be reasonable to assume
that the measurements at the beginning of the second growing season are independent of the
responses at the end of the previous season. Rather than relying on a compound symmetric or
Huynh-Feldt correlation structure, we can employ a correlation structure whose behavior is
consistent with these assumptions. Of the large number of such correlation models some were
discussed in §7.5. Equipped with a statistical package capable of fitting data with the chosen
correlation model and a method to distinguish the goodness-of-fit of competing correlation
models, the research worker can develop a statistical model that describes more closely the
structure of the data without relying on fudge factors.
7.6 Applications
The applications of linear mixed models for clustered data we entertain in this section are as
varied as the Laird-Ware model. We consider traditional growth studies with linear mixed
regression models as well as designed experiments with multiple random effects. §7.6.1 is a
study of apple growth over time where we are interested in predicting population-averaged
and apple-specific growth trends. The empirical BLUPs will play a key role in estimating the
cluster-specific trends. Because measurements on individual apples were collected repeatedly
in time we also pay attention to the possibility of serial correlation in the model residuals.
This application is intended to underscore the two-stage concept in mixed modeling. §7.6.2 to
§7.6.5 are experimental situations where the statistical model contains multiple random terms
for different reasons. In §7.6.2 we analyze data from an on-farm trial where identical experi-
mental designs are laid out on randomly selected farms. The random selection of farms results
in a mixed model containing random experimental errors, random farm effects, and random
interactions. The estimated BLUPs are again key quantities on which inferences involving
particular farms are based.
Subsampling of experimental units also gives rise to clustered data structures and a
nesting of experimental and observational error sources. The liabilities of not recognizing the
subsampling structure are discussed in §7.6.3 along with the correct analysis based on a linear
mixed model. A very special case of an experimental design with a linear mixed model arises
when block effects are random effects, for example, when locations are chosen at random.
The on-farm experiment in §7.6.2 can be viewed as a design of that nature. If blocks are in-
complete in the sense that the size of the block cannot accommodate all treatments, mixed
model analyses are more powerful than fixed model analyses because of their ability to
recover treatment contrasts from comparisons across blocks. This recovery of interblock
information is straightforward with the mixed procedure in SAS® and we apply the
techniques to a balanced incomplete block design (BIB) in §7.6.4. Finally, a common method
for clustering data in experimental designs is the random assignment of treatments to experi-
mental units of different size. The split-type designs that result have a linear mixed
model representation. Experimental units can be of different sizes if one group of units is ar-
ranged within a larger unit (splitting) or if units are arranged perpendicular to each other
(stripping). A combination of both techniques that is quite common in agricultural field ex-
periments is the split-strip-plot design that we analyze in §7.6.5.
Modeling the correlations among repeated observations directly is our preferred method
over inducing correlations through random effects or random coefficients. The selection of a
suitable correlation structure based on AIC and other fit criteria is the objective of analyzing a
factorial treatment structure with repeated measurements in §7.6.6. The model we arrive at is
a mixed model because it contains random effects for the experimental errors corresponding
to experimental units and serially correlated observational disturbances observed over time
within each unit.
Many growth studies involve nonlinear models (§5, §8) or polynomials. How to deter-
mine which terms in a growth model are to be made random and which are to be kept fixed is
the topic of §7.6.7, concerned with the water usage of horticultural trees. The comparison of
treatments in a growth study with complex subsampling design is examined in §7.6.8.
Figure 7.16. Observed diameters over a 12-week period for 16 of the 80 apples. Data kindly
provided by Dr. Ross E. Byers, Alson H. Smith, Jr. AREC, Virginia Polytechnic Institute and
State University, Winchester, Virginia. Used with permission.
Figure 7.16 suggests that the trends are linear for each apple. A naïve approach to
estimating the population-averaged and cluster-specific trends is as follows. Let

Y_{ij} = \beta_{0i} + \beta_{1i}t_{ij} + e_{ij}

denote a simple linear regression for the data from apple i = 1, …, 80. We fit this linear
regression separately for each apple and obtain the population-averaged estimates of the
overall intercept β₀ and slope β₁ by averaging the apple-specific estimates. In this averaging
one can calculate equally weighted averages or take the precision of the apple-specific esti-
mates into account. The equally weighted average approach is implemented in SAS® as
follows (Output 7.4).
/* variable time is coded as 1,2,...,6 corresponding to weeks 2,4,...,12 */
proc reg data=apples outest=est noprint;
model diam = time;
by tree apple;
run;
proc means data=est noprint;
var intercept time;
output out=PAestimates mean=beta0 beta1 std=sebeta0 sebeta1;
run;
title 'Naive Apple-Specific Estimates';
proc print data=est(obs=20) label;
var tree apple _rmse_ intercept time;
run;
title 'Naive PA Estimates';
proc print data=PAEstimates; run;
Output 7.4.
Naive Apple-Specific Estimates
Root mean
squared
Obs tree apple error Intercept time
Naive PA Estimates
Notice that for Apple 14 on tree 1, which is the 37th apple in the data set, the
coefficient estimate β̂_{1,37} is 0 because only a single observation had been collected on that
apple. The apple-specific predictions can be calculated from the coefficients in Output 7.4, as
can the population-averaged growth trend. We notice, however, that the ability to do this re-
quired estimation of 160 mean parameters, one slope and one intercept for each of eighty
apples. Counting the estimation of the residual variance in each model, we have a total of 240
estimated parameters. We can contrast the population-averaged and the apple-specific predic-
tions easily from Output 7.4. The growth of the average apple is predicted as

ỹ = β̃₀ + β̃₁t = 2.83467 + 0.028195t

and that for apple 1 on tree 1, for example, as

ỹ₁ = 2.88333 + 0.01t = (β̃₀ + 0.04866) + (β̃₁ − 0.018195)t.

The quantities 0.04866 and −0.018195 are the adjustments made to the population-averaged
estimates of intercept and slope to obtain the predictions for the first apple. We use
the ~ notation here because the averages of the apple-specific fixed effects are not the
generalized least squares estimates.
Fitting the apple-specific trends by fitting a model to the data from each apple and then
averaging the estimates is a two-stage approach, literally. The cluster-specific estimates are
obtained in the first stage and the population average is determined in the second stage. The
two-stage concept in the Laird-Ware model framework leads to the same end result, estimates
of the population-averaged trend and estimates of the cluster-specific trends. It does, however,
require fewer parameters. Consider the first-stage model

Y_{ij} = \left(\beta_0 + b_{0i}\right) + \left(\beta_1 + b_{1i}\right)t_{ij} + e_{ij},    [7.62]

where Y_ij is the measurement taken at time t_ij for apple number i. We assume for now that
the e_ij are uncorrelated Gaussian random errors with zero mean and constant variance σ², al-
though the fact that Y_i1, Y_i2, … are repeated measurements on the same apple should give us
pause. We will return to this issue later. To formulate the second stage of the Laird-Ware
model we postulate that b_0i and b_1i are Gaussian random variables with mean 0 and variances
σ²₀ and σ²₁, respectively. They are assumed not to be correlated with each other and are also
not correlated with the error terms e_ij. This is obviously a mixed model of Laird-Ware form
with 5 parameters (β₀, β₁, σ²₀, σ²₁, σ²). With the mixed procedure of The SAS® System, this is
accomplished through the statements
accomplished through the statements
proc mixed data=apples;
class tree apple;
model diam = time / s;
random intercept time / subject=tree*apple s;
run;
By default the variance components σ²₀, σ²₁, and σ² will be estimated by restricted maximum
likelihood. The subject=tree*apple option identifies the unique combinations of the data
set variables apple and tree as clusters. Both variables are needed here since the apple
variable is numbered consecutively starting at 1 for each tree. Technically more correct would
be to write subject=apple(tree), since the apple identifiers are nested within trees, but the
analysis will be identical to the one above. In writing subject= options in proc mixed, the
user must only provide a variable combination that uniquely identifies the clusters. The /s
option of the model statement requests a printout of the fixed effects estimates β̂₀ and β̂₁; the
same option of the random statement requests a printout of the solutions for the random
effects.
Output 7.5.
The Mixed Procedure
Model Information
Data Set WORK.APPLES
Dependent Variable diam
Covariance Structure Variance Components
Subject Effect tree*apple
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Containment
Dimensions
Covariance Parameters 3
Columns in X 2
Columns in Z Per Subject 2
Subjects 80
Max Obs Per Subject 6
Observations Used 451
Observations Not Used 29
Total Observations 480
Fit Statistics
-2 Res Log Likelihood -1897.7
AIC (smaller is better) -1891.7
AICC (smaller is better) -1891.6
BIC (smaller is better) -1884.5
In fitting this model it was assumed that the $e_{ij}$ are uncorrelated. This may not be tenable since the measurements from the same apple are taken sequentially in time. To investigate whether there is a significant serial correlation we perform a likelihood ratio test. We fit model [7.62] but assume that the $e_{ij}$ follow a first-order autoregressive model. Since the measurement occasions are equally spaced, this is a reasonable approach. Recall from Output 7.5 that minus twice the restricted (residual) log likelihood of the model with uncorrelated errors is $-1897.7$. We accomplish fitting a model with AR(1) errors by adding the repeated statement as follows:
proc mixed data=apples noitprint;
class tree apple;
model diam = time / s;
random intercept time / subject=tree*apple;
repeated / subject=tree*apple type=ar(1);
run;
The estimate of the autocorrelation coefficient, the correlation between diameter measurements on the same apple two weeks apart, is $\hat\rho = 0.3825$ (Output 7.6). It appears fairly substantial, but is adding the autocorrelation to the model a significant improvement? The negative of twice the residual log likelihood in this model is $-1910.5$ and the likelihood ratio test statistic comparing the models with and without AR(1) correlation is $1910.5 - 1897.7 = 12.8$. The $p$-value for the hypothesis that the autoregressive parameter is zero is thus $\Pr(\chi^2_1 \geq 12.8) = 0.00035$. Adding the AR(1) serial correlation does significantly improve the model. The impact of adding the AR(1) correlations is primarily on the standard errors of all estimated quantities. The population-averaged estimates as well as the BLUPs for the random effects change very little; hence the impact on the predicted values is minor (not necessarily so the impact on the precision of the predicted values).
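As an aside, the $p$-value of such a likelihood ratio test is easily computed with the probchi function of The SAS® System; a minimal sketch:

data lrt;
   lr = 1910.5 - 1897.7;      /* likelihood ratio statistic, = 12.8 */
   p  = 1 - probchi(lr, 1);   /* Pr(chi-square with 1 df >= 12.8) = 0.00035 */
run;
proc print data=lrt; run;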
Output 7.6.
The Mixed Procedure
Dimensions
Covariance Parameters 4
Columns in X 2
Columns in Z Per Subject 2
Subjects 80
Max Obs Per Subject 6
Observations Used 451
Observations Not Used 29
Total Observations 480
Fit Statistics
-2 Res Log Likelihood          -1910.5
Solution for Fixed Effects
                          Standard
Effect       Estimate     Error       DF    t Value    Pr > |t|
Intercept    2.8321       0.01068     79    265.30     <.0001
time         0.02875      0.001017    78    28.28      <.0001
Solution for Random Effects
                                        Std Err
Effect    tree    apple    Estimate     Pred      DF    t Value    Pr > |t|
Figure 7.17. Apple-specific predictions (solid lines) from mixed model [7.62] with AR(1)
correlated error terms for the same apples shown in Figure 7.16. Dashed lines show popula-
tion-averaged prediction, circles are raw data.
Once we have settled on the correlation model for these data we should go back and re-evaluate whether the two random effects are in fact needed. Deleting them in turn one obtains a residual log likelihood of 947.75 for the model without random slope, 946.3 for the model without random intercept, and 946.05 for the model without any random effects. Any one likelihood ratio test against the full model in Output 7.6 is significant. Both random effects remain in the model along with the AR(1) serial correlation.
The predictions from this model trace the observed growth profiles very closely (Figure 7.17) and the deviation of the solid from the dashed line in Figure 7.17 is an indication of how strongly a particular apple differs from the average apple. It is clear that most of the apple-to-apple variation is in the actual size of the apples (heterogeneity in intercepts), not in their growth rates (heterogeneity in slopes).
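Whether apples from the same tree also resemble each other can be examined by adding random intercepts and slopes at the tree level. A sketch of proc mixed statements consistent with the dimensions reported in Output 7.7 (the exact statements, including the noitprint and update options, are an assumption here):

proc mixed data=apples noitprint update;
   class tree apple;
   model diam = time / s;
   random intercept time / subject=tree;           /* tree-level random effects */
   random intercept time / subject=apple(tree);    /* apple-level random effects */
   repeated / subject=apple(tree) type=ar(1);      /* serial correlation within apples */
run;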
The update option of the proc mixed statement writes to the log window what proc mixed is currently doing. Fitting models with many random effects can be time consuming and it is then helpful to find out whether the procedure is still processing.
Notice that there are now two more covariance parameters corresponding to a random intercept and a random slope for the trees (Output 7.7). A likelihood ratio test whether the addition of the two random effects improved the model has test statistic $1917.1 - 1910.5 = 6.6$ on two degrees of freedom with $p$-value $\Pr(\chi^2_2 \geq 6.6) = 0.037$. Notice, however, that the variance component for the tree-specific random intercept is practically zero. AIC (smaller is better) is calculated as minus twice the residual log likelihood plus twice the number of covariance parameters. The AIC adjustment was made for only five, not six, covariance parameters; the estimate for the variance of tree-specific random intercepts was set to zero. The data do not support that many random effects. Also, the variances of the random slopes on the tree and the apple level add up to 0.000053, which should be compared to $\hat\sigma_1^2 = 0.00005$ in Output 7.6.
Output 7.7.
The Mixed Procedure
Dimensions
Covariance Parameters 6
Columns in X 2
Columns in Z Per Subject 26
Subjects 10
Max Obs Per Subject 72
Observations Used 451
Observations Not Used 29
Total Observations 480
Fit Statistics
-2 Res Log Likelihood          -1917.1
Solution for Fixed Effects
                          Standard
Effect       Estimate     Error        DF    t Value    Pr > |t|
Intercept 2.8322 0.01066 9 265.58 <.0001
time 0.02870 0.001623 9 17.68 <.0001
lacking control over experimental units and conditions one should anticipate such farm × treatment interaction and allow for replication of the treatments within a farm. In contrast to research station experimentation, where locations at which to apply the treatments are often chosen deliberately to reflect certain conditions of interest or because of availability, the farms for on-farm research are often chosen at random to represent the population of farms (conditions) in the region where technology transfer is to take place. If we think of farms as stochastic locational or environmental effects, these will have to enter any statistical model as random effects (§7.2.3). Treatments, chosen deliberately to reflect current practice and the technology to be transferred, are fixed effects, by contrast. As a consequence, statistical models for the analysis of data from on-farm experimentation are typically mixed models.
Consider the (hypothetical) data in Table 7.7 representing wheat yields from eight on-farm block designs, each with three blocks and two treatments. The farms were selected at random from a list of all farms in the area where the new treatment (B) is to be tested. On each farm we have a randomized block design
$$Y_{ij} = \mu + \tau_i + \rho_j + e_{ij},$$
where $\tau_i$ $(i = 1, 2)$ is the effect of the $i$th treatment and $\rho_j$ $(j = 1, 2, 3)$ denotes the block effects. One could analyze eight separate RCBDs to determine the effectiveness of the treatments by farm. These farm-specific analyses would have little power since each RCBD provides only two degrees of freedom for the experimental error. Also, nothing would be learned about the treatment × farm interaction. A more suitable analysis combines all the data into a single analysis.
Table 7.7. Data from on-farm trials conducted as randomized block designs
with three blocks on each of eight farms
             Block 1           Block 2           Block 3
Farm        A       B         A       B         A       B
1        30.86   33.31     30.32   30.94     32.31   35.24
2        31.39   27.87     30.62   25.25     29.93   21.79
3        39.22   41.95     38.96   43.38     35.39   41.09
4        37.19   30.97     36.10   32.55     35.85   33.04
5        24.98   23.39     22.04   24.50     22.93   23.24
6        28.06   28.69     27.98   25.68     25.13   25.88
7        27.82   37.23     25.32   34.45     26.52   32.49
8        29.41   30.98     26.63   30.71     29.60   30.63
The model for this analysis is that of a replicated RCBD with farm effects, treatment effects, block effects nested within farms (since block 1 on farm 1 is a different physical entity than block 1 on farm 2), and treatment × farm interactions. Since the eight farms are a random sample of farms, their effects enter the model as random ($\phi_k$). The treatment effects ($\tau_i$) are fixed and the interaction between farms and treatments ($(\phi\tau)_{ik}$) is random since it involves the random farm effects. The complete model for the analysis is
$$Y_{ijk} = \mu + \phi_k + \rho_{j(k)} + \tau_i + (\phi\tau)_{ik} + e_{ijk}.$$
Because the farm effects are random it is reasonable to also treat the block effects nested
within farms as random variables. To obtain a test of the treatment effects and estimates of all
variance components by restricted maximum likelihood we use the proc mixed code
proc mixed data=onfarm;
class farm block tx;
model yield = tx /ddfm=satterth;
random farm block(farm) farm*tx;
run;
Only the treatment effect tx is listed in the model statement since it is the only fixed
effect in the model. The Satterthwaite approximation is invoked here because exact tests may
not be available in complex mixed models such as this one.
The largest variance component estimate is $\hat\sigma^2_\phi = 19.9979$ (Output 7.8). The variance between farms is twenty times larger than the variation within farms ($\hat\sigma^2_\rho = 0.9848$). The test for differences among the treatments is not significant ($F_{obs} = 0.3$, $p = 0.6003$). Based on this test one would conclude that the new treatment does not increase or decrease yield over the current standard. It is possible, however, that the marginal treatment effect is masked by an interaction. If, for example, the new treatment (B) outperforms the standard on some farms but performs more poorly than the standard on other farms, it is conceivable that the treatment averages across farms are not very different. To address this question we need to test the significance of the interaction between treatments and farms. Since the interaction terms $(\phi\tau)_{ik}$ are random variables, the farm*tx effect appears as a covariance parameter, not as a fixed effect. Fitting a reduced model without the interaction we can calculate a likelihood ratio test statistic to test $H_0\colon \sigma^2_{\phi\tau} = 0$. The proc mixed code
proc mixed data=onfarm;
class farm block tx;
model yield = tx /ddfm=satterth;
random farm block(farm) ;
run;
produces a -2 Res Log Likelihood of 250.4 (output not shown). The difference between this value and the -2 Res Log Likelihood of 223.8 for the full model is asymptotically the realization of a Chi-square random variable with one degree of freedom. The $p$-value of the likelihood ratio test of $H_0\colon \sigma^2_{\phi\tau} = 0$ is thus $\Pr(\chi^2_1 \geq 26.6) < 0.0001$. There is a significant interaction between farms and treatments.
Output 7.8.
The Mixed Procedure
Model Information
Data Set WORK.ONFARM
Dependent Variable yield
Covariance Structure Variance Components
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Satterthwaite
Dimensions
Covariance Parameters 4
Columns in X 3
Columns in Z 48
Subjects 1
Max Obs Per Subject 48
Observations Used 48
Observations Not Used 0
Total Observations 48
Fit Statistics
Res Log Likelihood -111.9
Akaike's Information Criterion -115.9
Schwarz's Bayesian Criterion -116.1
-2 Res Log Likelihood 223.8
This interaction would normally be investigated with interaction slices by farms producing
separate tests of the treatment difference for each farm. Since the interaction term is random
this is not possible (slices require fixed effects). However, the best linear unbiased predictors
for the treatment means on each farm can be calculated with the procedure. These are the
quantities on which treatment comparisons for a given farm should be based. The statements
estimate 'Blup Farm 1 tx A' intercept 1 tx 1 0 | farm 1 0 0 0 0 0 0 0
farm*tx 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
estimate 'Blup Farm 1 tx B' intercept 1 tx 0 1 | farm 1 0 0 0 0 0 0 0
farm*tx 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
for example, estimate the BLUPs for the two treatments on farm 1. Notice the vertical slash after tx 1 0 which narrows the inference space by fixing the farm effects to that of farm 1. BLUPs for other farms are calculated similarly by shifting the coefficients for the farm and farm*tx effects to the appropriate positions. For example,
estimate 'Blup Farm 3 tx A' intercept 1 tx 1 0 | farm 0 0 1 0 0 0 0 0
         farm*tx 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0;
estimate 'Blup Farm 3 tx B' intercept 1 tx 0 1 | farm 0 0 1 0 0 0 0 0
         farm*tx 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0;
estimates the treatment BLUPs for the third farm. The coefficients that were shifted compared to the estimate statements for farm 1 are the third coefficient of the farm effect and the fifth and sixth coefficients of the farm*tx effect. The BLUPs so obtained for the two treatments on all farms follow in Output 7.9. Comparing the Estimate values there with the entries in Table 7.7 (p. 475), it is evident that the EBLUPs are the sample means for each treatment calculated across the blocks on a particular farm (apart from roundoff error; the values in Table 7.7 were rounded to two decimal places).
Of interest is of course a comparison of these means by farm. In other words, are there
farms where the new treatment outperforms the current standard and farms where the reverse
is true? Since the treatment effect was not significant in the analysis of the full model but the
likelihood ratio test for the interaction was significant, we almost expect such a relationship.
The following estimate statements contrast the two treatments for each farm (Output 7.10).
estimate 'Tx eff. on Farm 1' tx 1 -1 | farm*tx 1 -1;
estimate 'Tx eff. on Farm 2' tx 1 -1 | farm*tx 0 0 1 -1;
estimate 'Tx eff. on Farm 3' tx 1 -1 | farm*tx 0 0 0 0 1 -1;
estimate 'Tx eff. on Farm 4' tx 1 -1 | farm*tx 0 0 0 0 0 0 1 -1;
estimate 'Tx eff. on Farm 5' tx 1 -1 | farm*tx 0 0 0 0 0 0 0 0 1 -1;
estimate 'Tx eff. on Farm 6' tx 1 -1 | farm*tx 0 0 0 0 0 0 0 0 0 0 1 -1;
estimate 'Tx eff. on Farm 7' tx 1 -1 | farm*tx 0 0 0 0 0 0 0 0 0 0 0 0 1 -1;
estimate 'Tx eff. Farm 8' tx 1 -1 | farm*tx 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 -1;
Output 7.9.
Estimates
Standard
Label Estimate Error DF t Value Pr > |t|
On farms 2 and 4 the old treatment significantly outperforms the new treatment. On farms 3, 7, and 8 the new treatment significantly outperforms the current standard, however (at the 5% significance level). This reversal of the treatment effects masked the treatment main effect. Whereas the recommendation based on the treatment main effect would have been that one may as well stick with the old treatment and not transfer technology from experiment station research, the analysis of the interaction shows that for farms 3, 7, and 8 (and by implication farms in the target region that are alike) the new treatment holds promise.
Output 7.10.
Estimates
Standard
Label Estimate Error DF t Value Pr > |t|
Tx eff. on Farm 1 -1.9364 1.0117 17.6 -1.91 0.0720
Tx eff. on Farm 2 5.3249 1.0117 17.6 5.26 <.0001
Tx eff. on Farm 3 -4.1009 1.0117 17.6 -4.05 0.0008
Tx eff. on Farm 4 3.9200 1.0117 17.6 3.87 0.0011
Tx eff. on Farm 5 -0.4200 1.0117 17.6 -0.42 0.6831
Tx eff. on Farm 6 0.2409 1.0117 17.6 0.24 0.8145
Tx eff. on Farm 7 -7.7718 1.0117 17.6 -7.68 <.0001
Tx eff. Farm 8 -2.1535 1.0117 17.6 -2.13 0.0477
It is tempting to analyze these data with a two-way factorial analysis of variance based on the linear model
$$Y_{ijk} = \mu_{ij} + e_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + e_{ijk}, \qquad [7.64]$$
assuming that $e_{ijk}$ is the "experimental" error for the $k$th experimental unit receiving level $i$ of factor $A$ and level $j$ of factor $B$. As will be shown shortly, $e_{ijk}$ in [7.64] is the observational error and the experimental error has been confounded with the treatment means. Since The SAS® System cannot know whether repeated values in the data set that share the same treatment assignment represent subsamples or replicates, an analysis of the data in Table 7.8 with model [7.64] will appear to be successful. Using proc glm, significant main effects of factors $A$ and $B$ ($p = 0.0004$ and $0.0144$) and a nonsignificant interaction are inferred (Output 7.11).
data noreps;
input A B plant y;
datalines;
1 1 1 3.5
1 1 2 4.0
1 1 3 3.0
1 1 4 4.5
1 2 1 5.0
1 2 2 5.5
1 2 3 4.0
... and so forth ...
;;
run;
Output 7.11.
Dependent Variable: y
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 5 47.34375000 9.46875000 6.94 0.0009
Error 18 24.56250000 1.36458333
Corrected Total 23 71.90625000
Notice that the $F$ statistics on which the $p$-values are based are obtained by dividing the main effects or interaction mean square by the mean square error of 1.3645. This mean square error is based on 18 degrees of freedom, $4 - 1 = 3$ degrees of freedom for the subsamples in each of 6 experimental units. This analysis is clearly wrong, since the experimental error in this design has $t(r-1)$ degrees of freedom, where $t$ denotes the number of treatments and $r$ the number of replications for each treatment. Since each of the $t = 6$ treatments was assigned to only one pot, we have $r = 1$ and $t(r-1) = 0$. What SAS® terms the Error source in this model is the observational error and $\hat\sigma^2 = 1.3645$ is an estimate of the observational error variance. The correct model for the subsampling design contains separate random terms for experimental and observational error. In the two-factor design we obtain
$$Y_{ijkl} = \mu_{ij} + e_{ijk} + d_{ijkl} \qquad [7.65]$$
$$\mathrm{Var}[e_{ijk}] = \sigma^2_e, \quad \mathrm{Var}[d_{ijkl}] = \sigma^2_d,$$
where $k = 1, \ldots, r$ indexes the replications, $e_{ijk}$ is the experimental error as defined above, and $d_{ijkl}$ is the observational (subsampling) error for subsample $l = 1, \ldots, n$ on replicate $k$. $\sigma^2_e$ and $\sigma^2_d$ are the experimental and observational error variances, respectively. If $r = 1$, as for the data in Table 7.8, the model becomes
$$Y_{ijl} = \mu_{ij} + e_{ij} + d_{ijl} = \mu^*_{ij} + d_{ijl} \qquad [7.66]$$
and the experimental error is now confounded with the treatments. This is model [7.64] where $e_{ijk}$ is replaced with $d_{ijl}$ and $\mu_{ij}$ is replaced with $\mu^*_{ij}$. Because $\mu_{ij}$ and $e_{ij}$ in [7.66] have the same subscripts, the two sources of variability are confounded. The only random variation that can be estimated is the variance of $d_{ijl}$, the observational error. Finally, the observational error mean square is not the correct denominator for $F$-tests (Table 7.9). The statistic $MS(\text{Treatment})/MS(\text{Obs.\ Error})$ thus is not a test statistic for the absence of treatment effects ($f(\mu^2_{ij}) = 0$) but for the simultaneous absence of treatment effects and the experimental error, a nonsensical proposition.
Table 7.10. Expected mean squares in a completely randomized design with subsampling ($t$ denotes number of treatments, $r$ number of replicates, and $n$ number of subsamples)
Source of Variation           DF            E[MS]
Treatments (Tx)               $t-1$         $\sigma^2_d + n\sigma^2_e + f(\tau_i^2)$
Experimental Error (EE)       $t(r-1)$      $\sigma^2_d + n\sigma^2_e$
Observational Error (OE)      $tr(n-1)$     $\sigma^2_d$
Notice that the experimental error degrees of freedom are not affected by the number of subsamples, and that $F_{obs} = MS(\text{Tx})/MS(\text{EE})$ is the test statistic for testing treatment effects.
The data in Table 7.11, taken from Steel, Torrie, and Dickey (1997, p. 159), represent a $3 \times 2$ factorial treatment structure arranged in a completely randomized design with $r = 3$ replicates and $n = 4$ subsamples per experimental unit. From a large group of plants four were randomly assigned to each of 18 pots. Six treatments were then randomly assigned to the pots such that each treatment was replicated three times. The treatments consisted of all possible combinations of three hours of daylight (8, 12, 16 hrs) and two levels of night temperature (low, high). The outcome of interest was the stem growth of mint plants grown in nutrient solution under the assigned conditions. The experimental units are the pots since treatments were assigned to those. Stem growth was measured for each plant in a pot, hence there are four subsamples per experimental unit.
Table 7.11. One-week stem growth of mint plants data from Steel et al. (1997, p. 159)
Low Night Temperature
            8 hrs                   12 hrs                  16 hrs
Plant   Pot 1  Pot 2  Pot 3    Pot 1  Pot 2  Pot 3    Pot 1  Pot 2  Pot 3
1        3.5    2.5    3.0      5.0    3.5    4.5      5.0    5.5    5.5
2        4.0    4.5    3.0      5.5    3.5    4.0      4.5    6.0    4.5
3        3.0    5.5    2.5      4.0    3.0    4.0      5.0    5.0    6.5
4        4.5    5.0    3.0      3.5    4.0    5.0      4.5    5.0    5.5
where the /345 and .3456 are zero-mean uncorrelated random variables. The analysis of
variance is shown in Table 7.12.
The analysis of variance can be obtained with proc glm of The SAS® System (Output
7.12):
proc glm data=mintstems;
class hour night pot;
model growth = hour night hour*night pot(hour*night);
run; quit;
The sequential (Type I) and partial (Type III) sums of squares are identical because the design is orthogonal. Notice that the source denoted Error is again the observational error, as can be seen from the associated degrees of freedom, and the experimental error is modeled as pot(hour*night). The $F$ statistics calculated by proc glm are obtained by dividing the mean square of a source of variability by the mean square for the Error source; hence they use the observational error mean square as a denominator and are incorrect. The two error mean square estimates in Output 7.12 are
$$\hat\sigma^2_d = 0.9340$$
$$\hat\sigma^2_d + n\hat\sigma^2_e = 2.1527,$$
hence dividing by the observational mean square error is detrimental in two ways. The $F$ statistic is inflated and the $p$-value is calculated from an $F$ distribution with incorrect (too many) degrees of freedom.
The correct tests can be obtained in two ways with proc glm. One can add a random
statement indicating which terms of the model statement are random variables and The SAS®
System will construct the appropriate test statistics based on the formulas of expected mean
squares. Alternatively one can use the test statement if the correct error term is known. The
two methods lead to the following procedure calls (output not shown).
proc glm data=mintstems;
class hour night pot;
model growth = hour night hour*night pot(hour*night);
random pot(hour*night) / test;
run; quit;
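The version based on the test statement might read as follows (a sketch; the h= option lists the effects to be tested and e= names the error term):

proc glm data=mintstems;
   class hour night pot;
   model growth = hour night hour*night pot(hour*night);
   /* test main effects and interaction against the experimental error */
   test h=hour night hour*night e=pot(hour*night);
run; quit;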
Output 7.12.
The GLM Procedure
Number of observations 72
A more elegant approach is to use proc mixed which is specifically designed for mixed
models. The statements to analyze the two-way factorial with subsampling are
proc mixed data=mintstems;
class hour night pot;
model growth = hour night hour*night;
random pot(hour*night);
run; quit;
or
proc mixed data=mintstems;
class hour night pot;
model growth = hour night hour*night;
random intercept / subject=pot(hour*night);
run; quit;
The two versions of proc mixed code differ only in the form of the random statement and
yield identical results. The second form explicitly defines the experimental units
pot(hour*night) as clusters and the columns of the Z3 matrix as having an intercept only.
The residual maximum likelihood estimates of the variance components $\sigma^2_e$ and $\sigma^2_d$ are $\hat\sigma^2_e = 0.3047$ and $\hat\sigma^2_d = 0.9340$, respectively (Output 7.13).
Output 7.13.
The Mixed Procedure
Model Information
Dimensions
Covariance Parameters 2
Columns in X 12
Columns in Z 18
Subjects 1
Max Obs Per Subject 72
Observations Used 72
Observations Not Used 0
Total Observations 72
Type 3 Tests of Fixed Effects
               Num     Den
Effect         DF      DF      F Value    Pr > F
hour 2 12 5.18 0.0239
night 1 12 70.45 <.0001
hour*night 2 12 1.32 0.3038
The latter estimate is labeled as Residual in the Covariance Parameter Estimates table. Since the data are completely balanced these estimates are identical to the method of moments estimators implied by the analysis of variance. From $\hat\sigma^2_d + n\hat\sigma^2_e = 2.1527$ and $\hat\sigma^2_d = 0.9340$ one obtains the moment estimator of the experimental error variance as
$$\hat\sigma^2_e = (2.1527 - 0.9340)/4 = 0.3047.$$
Results of the main effects and interaction tests are shown in the Type 3 Tests of Fixed Effects table. The $F$ statistics are identical to those obtained in proc glm if one uses the correct mean square error term there. For example, from Output 7.12 one obtains
$$F_{obs} = \frac{MS(\text{Hour})}{MS(\text{EE})} = \frac{11.1493}{2.1527} = 5.18$$
$$F_{obs} = \frac{MS(\text{Night})}{MS(\text{EE})} = \frac{151.670}{2.1527} = 70.45$$
$$F_{obs} = \frac{MS(\text{Hour} \times \text{Night})}{MS(\text{EE})} = \frac{2.8368}{2.1527} = 1.32.$$
These are the $F$ statistics shown in Output 7.13. Also notice that the denominator degrees of freedom are set to the correct degrees of freedom associated with the experimental error (Table 7.12).
The marginal correlation structure in the subsampling model [7.67] is compound symmetric: observational errors are nested within experimental errors. Adding the v=list option to the random statement of proc mixed requests a printout of the (estimated) marginal variance-covariance matrices of the clusters (subjects) in list. For example,
random intercept / subject=pot(hour*night) v=1;
requests a printout of the variance-covariance matrix for the first cluster (Output 7.14). It is easy to verify that this matrix is of the form
$$\hat\sigma^2_e \mathbf{J}_4 + \hat\sigma^2_d \mathbf{I}_4,$$
where $\mathbf{J}_4$ is the $4 \times 4$ matrix of ones and $\mathbf{I}_4$ the $4 \times 4$ identity matrix.
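With $\hat\sigma^2_e = 0.3047$ and $\hat\sigma^2_d = 0.9340$ the diagonal entries of this matrix are $0.3047 + 0.9340 = 1.2387$ and the off-diagonal entries are $0.3047$; written out,
$$\hat\sigma^2_e \mathbf{J}_4 + \hat\sigma^2_d \mathbf{I}_4 = \begin{bmatrix} 1.2387 & 0.3047 & 0.3047 & 0.3047 \\ 0.3047 & 1.2387 & 0.3047 & 0.3047 \\ 0.3047 & 0.3047 & 1.2387 & 0.3047 \\ 0.3047 & 0.3047 & 0.3047 & 1.2387 \end{bmatrix},$$
which is what Output 7.14 should display, up to rounding.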
Output 7.14.
Estimated V Matrix for pot(hour*night) 1 8 High
The same analysis can thus be obtained by modeling the marginal variance-covariance
matrix VarcY3 d directly as a compound symmetric matrix. Replacing the random statement
with a repeated statement and choosing the appropriate covariance structure (type=cs), the
statements
proc mixed data=mintstems noitprint;
class hour night pot;
model growth = hour night hour*night;
repeated / sub=pot(hour*night) type=cs r=1;
run; quit;
lead to the same results as in Output 7.13 and Output 7.14, only the covariance parameter
Intercept in Output 7.13 has been renamed to CS (Output 7.15).
Output 7.15.
The Mixed Procedure
Class Level Information
Class     Levels    Values
hour      3         8 12 16
night     2         Hig Low
pot       3         1 2 3
Dimensions
Covariance Parameters 2
Columns in X 12
Columns in Z 0
Subjects 18
Max Obs Per Subject 4
Observations Used 72
Observations Not Used 0
Total Observations 72
Fit Statistics
This design is balanced in two ways. Each treatment is replicated the same number of times (6) throughout the experiment and each pair of treatments appears together in a block the same number of times (3). For example, treatments B and C appear together in blocks 1, 7, and 8; treatments A and B appear in blocks 5, 6, and 7. As a result, all treatment comparisons will be made with the same precision in the experiment. However, because of the incompleteness, block and treatment effects are not orthogonal. Whether block effects are removed or not prior to assessing treatment effects is critical. To see this, consider a comparison of treatments A and B. The naïve approach is to base this comparison on the two arithmetic averages $\bar{y}_A$ and $\bar{y}_B$. Their difference is not an estimate of the treatment effect, however, since these are averages calculated over different blocks. $\bar{y}_A$ is calculated from information in blocks 2, 5, 6, 7, 9, 10 and $\bar{y}_B$ from information in blocks 1, 3, 5, 6, 7, 8. The difference $\bar{y}_A - \bar{y}_B$ carries not only information about differences between the treatments but also about block effects. To obtain a fair comparison of the treatments unaffected by the block effects, the treatment sum of squares must be adjusted for the block effects and treatment means are not estimated as arithmetic averages. A statistical model for the design in Table 7.13 is
$$Y_{ij} = \mu + \rho_i + \tau_j + e_{ij},$$
where the $\rho_i$ are block effects $(i = 1, \ldots, 10)$, the $\tau_j$ are the treatment effects $(j = 1, \ldots, 5)$, and the $e_{ij}$ are independent experimental errors with mean $0$ and variance $\sigma^2$. The only difference between this linear model and one for a randomized complete block design is that not all combinations $ij$ are possible. The appropriate estimate of the mean of the $j$th treatment in the incomplete design is $\hat\mu + \hat\tau_j$, where carets denote the least squares estimates. $\hat\mu + \hat\tau_j$ is also known as the least squares mean for treatment $j$. In fact, these estimates are always appropriate; in a balanced design it turns out that the least squares estimates are identical to the arithmetic averages. The question thus should not be when one should use the least squares means for treatment comparisons, but when one can rely on arithmetic means.
We illustrate the effect of nonorthogonality with data from a balanced incomplete block design reported by Cochran and Cox (1957, p. 448). Thirteen hybrids of corn were arranged in a field experiment in blocks of size $k = 4$ such that each pair of treatments appeared once in a block throughout the experiment and each treatment was replicated four times. This arrangement requires $b = 13$ blocks.
Table 7.14. Experimental layout of BIB in Cochran and Cox (1957, p. 448)†
(showing yield of corn in pounds per plot)
Block      Yields of the four hybrids assigned to the block
1          25.3    19.9    29.0    24.6
2          23.0    19.8    33.3    22.7
3          16.2    19.3    31.7    26.6
4          27.3    27.0    35.6    17.4
5          23.4    30.5    30.8    32.4
6          30.6    32.4    27.2    32.8
7          34.7    31.1    25.7    30.5
8          34.4    32.4    33.3    36.9
9          38.2    32.9    37.3    31.3
10         28.7    30.7    26.9    35.3
11         36.6    31.1    31.1    28.4
12         31.8    33.7    27.8    41.1
13         30.3    31.5    39.3    26.7
†
Cochran, W.G. and Cox, G.M. (1957), Experimental Design, 2nd Edition. Copyright © 1957 by
John Wiley and Sons, Inc. This material is used by permission of John Wiley and Sons, Inc.
We obtain the analysis of variance for these data with proc glm (Output 7.16).
proc glm data=cornyld;
class block hybrid;
model yield = block hybrid;
lsmeans hybrid / stderr;
means hybrid;
run; quit;
Output 7.16.
Number of observations 52
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 24 1017.929231 42.413718 2.13 0.0298
Error 27 538.217500 19.933981
Corrected Total 51 1556.146731
Standard
hybrid yield LSMEAN Error Pr > |t|
Level of ------------yield------------
hybrid N Mean Std Dev
1 4 35.3250000 2.75121185
2 4 29.8000000 2.40277617
3 4 30.0000000 6.92194578
4 4 28.0500000 5.50424079
5 4 30.7250000 2.55783111
6 4 28.0750000 6.08187197
7 4 31.7750000 6.57133929
8 4 31.8000000 3.38526218
9 4 28.1000000 2.25831796
10 4 28.1750000 8.00848508
11 4 22.4250000 5.01489448
12 4 27.9000000 4.06939799
13 4 34.9750000 6.09555849
What SAS® terms Type I SS and Type III SS are sequential and partial sums of squares, respectively. Sequential sums of squares are the sum of squares contributions of sources given that the variability of the previously listed sources has been accounted for. The sequential block sum of squares of 689.38 is the sum of squares among block averages and the sequential hybrid sum of squares of 328.54 is the contribution of the treatment variability after adjusting for block effects. Inferences about treatment effects are to be based on the partial sums of squares. The nonorthogonality of this design is evidenced by the fact that the Type I SS and the Type III SS differ; in an orthogonal design, the two sets of sums of squares would be identical. Whenever the design is nonorthogonal, great care must be exercised to estimate treatment means properly. The list of least squares means shows the estimates $\hat\mu + \hat\tau_j$ that are adjusted for the block effects. Notice that all least squares means have the same standard error, since every treatment is replicated the same number of times. The final part of the output shows the result of the means statement. These are the arithmetic sample averages of the observations for a particular treatment, which do not estimate treatment means unbiasedly unless every treatment appears in every block. One must not base treatment comparisons on these quantities in a nonorthogonal design. The column Std Dev is the standard deviation of the four observations for each treatment. It is not the standard deviation of a treatment mean based on the analysis of variance.
An analysis of an incomplete block design such as the proc glm analysis above is termed an intra-block analysis; it obtains treatment information by comparing block-adjusted least squares estimates. Yates (1936, 1940) coined the term, along with the term inter-block analysis for an analysis that also recovers treatment information contained in the block totals (averages). In incomplete block designs contrasts of block averages also contain contrasts among the treatments. To see this, consider blocks 1 and 3 in Table 7.13. The first block contains treatments B, C, and E; the third block contains treatments B, D, and E. If $\bar{Y}_{1.}$ denotes the average in block 1 and $\bar{Y}_{3.}$ the average in block 3, then we have
$$E[\bar{Y}_{1.}] = \mu + \rho_1 + \tfrac{1}{3}(\tau_B + \tau_C + \tau_E)$$
$$E[\bar{Y}_{3.}] = \mu + \rho_3 + \tfrac{1}{3}(\tau_B + \tau_D + \tau_E).$$
The difference of the block averages contains information about the treatments, namely $E[\bar{Y}_{1.} - \bar{Y}_{3.}] = \rho_1 - \rho_3 + \tfrac{1}{3}(\tau_C - \tau_D)$. Unfortunately, this is not just a contrast among treatments, but involves the effects of the two blocks. The solution to uncovering the inter-block information is to let the block effects be random (with mean $0$), since then $E[\bar{Y}_{1.} - \bar{Y}_{3.}] = 0 - 0 + \tfrac{1}{3}(\tau_C - \tau_D) = \tfrac{1}{3}(\tau_C - \tau_D)$, a contrast between treatment effects. The linear mixed model for the incomplete block design now becomes
$$Y_{ij} = \mu + \rho_i + \tau_j + e_{ij}, \qquad e_{ij} \sim (0, \sigma^2), \quad \rho_i \sim (0, \sigma^2_\rho),$$
where the $e_{ij}$ and $\rho_i$ are independent. The term $\rho_1 - \rho_3 + \tfrac{1}{3}(\tau_C - \tau_D)$ now represents the conditional (narrow inference space) comparison of the two block means and the unconditional (broad inference space) comparison is $E[\bar{Y}_{1.} - \bar{Y}_{3.}] = E[E[\bar{Y}_{1.} - \bar{Y}_{3.} \mid \rho]] = \tfrac{1}{3}(\tau_C - \tau_D)$.
For the corn hybrid experiment of Cochran and Cox (1957, p. 448) the inter-block
analysis is carried out with the following proc mixed statements.
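A minimal sketch of the call (without the estimate statements discussed below) is:

proc mixed data=cornyld;
   class block hybrid;
   model yield = hybrid;
   random block;        /* random block effects recover inter-block information */
   lsmeans hybrid;      /* block-adjusted means in the broad inference space */
run;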
The inter-block analysis is invoked by moving the block term from the model statement
to the random statement. The estimate statements are not necessary unless one wants to esti-
mate treatment means in the narrow or intermediate inference spaces (§7.3). The lsmeans
statement requests block-adjusted estimates of the hybrid means in the broad inference space.
In Output 7.17 we notice that the $F$ statistic for hybrid differences in the mixed analysis ($F_{obs} = 1.67$) has changed from the intra-block analysis in proc glm ($F_{obs} = 1.37$). This reflects the additional treatment information recovered by the inter-block analysis. Furthermore, the estimates of the treatment means have changed as compared to the least squares means reported by proc glm. The additional information recovered from the block averages surfaces here again. In the example of §7.3 it was noted that the estimates of factor means would be the same if all factors had been considered fixed, and that only the standard errors would differ between the fixed effects and mixed effects analyses. That statement was correct there because the design was completely balanced and hence orthogonal. In a nonorthogonal incomplete block design both the estimates of the treatment means and their standard errors differ between the fixed effects and mixed effects analyses.
Output 7.17.
The Mixed Procedure
Model Information
Data Set WORK.CORNYLD
Dependent Variable yield
Covariance Structure Variance Components
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Containment
Dimensions
Covariance Parameters 2
Columns in X 14
Columns in Z 13
Subjects 1
Max Obs Per Subject 52
Observations Used 52
Observations Not Used 0
Total Observations 52
Estimates
Standard
Label Estimate Error DF t Value Pr > |t|
hybrid 1 broad 34.1712 2.4447 27 13.98 <.0001
hybrid 1 narrow 34.1712 2.3475 27 14.56 <.0001
hybrid 1 in block 1 32.6735 2.9651 27 11.02 <.0001
hybrid 1 in block 2 31.2751 2.9651 27 10.55 <.0001
hybrid 1 in block 7 34.1635 2.6960 27 12.67 <.0001
hybrid 1 vs hybrid2 5.1305 3.3331 27 1.54 0.1354
The first two estimate statements produce the hybrid 1 estimate in the broad and narrow inference spaces and show that the lsmeans statement operates in the broad inference space. The third through fifth estimate statements show how to estimate the hybrid mean in a particular block. Notice that hybrid 1 did not appear in blocks 1 or 2 in the experiment but did in block 7. Nevertheless, we are able to predict how well the hybrid would have done in blocks 1 and 2, although this prediction is less precise than the prediction of the hybrid's performance in blocks where the hybrid was observed. In an intra-block analysis where blocks are fixed, it is not possible to differentiate a hybrid's performance by block.
• Plant Population (60, 120, 180, 240, 300 thousand per acre)
• Row Spacing (9 inches, 18 inches).
Figure 7.18. Experimental layout for a single block in the soybean yield experiment. Cultivars (varieties) were assigned to the large experimental units (plots); row spacings and population densities to perpendicular strips within the plots. The experiment was brought to our attention and the data were made kindly available by Dr. David Holshouser, Tidewater Agricultural Research and Extension Center, Virginia Polytechnic Institute and State University. Used with permission.
Although the experiment was conducted in four site-years, we consider only a single site-
year here. At each site four replications of the cultivars were arranged in a randomized block
design. Because of technical limitations, it was decided to apply the row spacing and popula-
tion densities in strips within a cultivar experimental unit (plot). It was determined at random
which side (strip) of the plot received *" spacing. Then the population densities were assigned
randomly to five strips running perpendicular to the row spacing strips. Figure 7.18 displays a
schematic layout of one of the four blocks in the experiment.
The factors Row Spacing and Population Density are a split of the experimental unit to which a cultivar is assigned, but are not arranged in a $2 \times 5$ factorial structure. Considering the cultivar experimental units, Row Spacing and Population Density form a strip-plot (split-block) design with 16 blocks (replications). Each cultivar experimental unit serves as a replicate for the split-block design of the other two factors. We call this design a split-strip-plot design.
There are experimental units of four different sizes in this experiment, hence the linear
model will contain four different experimental error sources of variability associated with the
plot, the columns, the rows, and their intersection. Before engaging in an analysis of data
from a complex experiment such as this, it is helpful to develop the source of variability and
degree of freedom decomposition. Correct specification of the programming statements can
then be more easily checked. As is good practice for designs with a split, the whole-plot and
subplot design analysis of variance can be developed separately. On the whole-plot (Cultivar)
level we have a simple randomized complete block design of four treatments in four blocks
(Table 7.15). The sub-plot source and degree of freedom decomposition regards each experi-
mental unit in the whole-plot design as a replicate. Hence, there are $ar - 1 = 15$ replicate degrees of freedom for the sub-plot analysis, which is a strip-plot (split-block) design.
Upon combining the whole-plot and sub-plot analyses, the Replicate source in Table 7.16 is replaced with the whole-plot decomposition in Table 7.15. Furthermore, interactions between the whole-plot factor Cultivar and all sub-plot factors are added. The degrees of freedom for the interactions are removed from the corresponding sub-plot errors Error(2) through Error(4). The degrees of freedom for Error(2), for example, are obtained as
$$df_{Error(2)} = df_{Replicate \times RowSp.} - df_{Cultivar \times RowSp.} = (ar-1)(b-1) - (a-1)(b-1) = (ar - a)(b-1) = a(r-1)(b-1),$$
and similarly for the other sub-plot error terms. The linear model for this experiment has as many terms as there are rows in Table 7.17. In two steps the model can be defined as
$$Y_{ijkl} = \mu_{ijk} + r_l + e^{(1)}_{il} + e^{(2)}_{ijl} + e^{(3)}_{ikl} + e^{(4)}_{ijkl} \qquad [7.68]$$
$$\mu_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \gamma_k + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk}.$$
$\mu_{ijk}$ denotes the mean of the treatment combination of the $i$th cultivar $(i = 1, \ldots, 4)$, $j$th row spacing $(j = 1, 2)$, and $k$th population $(k = 1, \ldots, 5)$. It is decomposed into a grand mean $(\mu)$, main effects of Cultivar $(\alpha_i)$, Row Spacing $(\beta_j)$, Population $(\gamma_k)$, and their respective interactions. The first line of model [7.68] expresses the observation $Y_{ijkl}$ as a sum of the mean $\mu_{ijk}$ and various random components. $r_l$ is the random effect of the $l$th block (whole-plot), assumed $G(0, \sigma^2_r)$. $e^{(1)}_{il}$ is the experimental error on the whole-plot, assumed $G(0, \sigma^2_1)$; $e^{(2)}_{ijl}$ is the experimental error on a row spacing strip, assumed $G(0, \sigma^2_2)$; $e^{(3)}_{ikl}$ is the experimental error on a population density strip, assumed $G(0, \sigma^2_3)$; and finally, $e^{(4)}_{ijkl}$ is the experimental error on the intersection of perpendicular strips, assumed $G(0, \sigma^2_4)$ (see Figure 7.18). All random components are independent by virtue of independent randomizations.
Table 7.17. Sources of variability and degrees of freedom in soybean yield example
Source                                      df
Block                                       $r-1 = 3$
Cultivar                                    $a-1 = 3$
Error(1)                                    $(r-1)(a-1) = 9$
Row Spacing                                 $b-1 = 1$
Cultivar × Row Sp.                          $(a-1)(b-1) = 3$
Error(2)                                    $a(r-1)(b-1) = 12$
Population                                  $c-1 = 4$
Cultivar × Population                       $(a-1)(c-1) = 12$
Error(3)                                    $a(r-1)(c-1) = 48$
Row Sp. × Population                        $(b-1)(c-1) = 4$
Cultivar × Row Sp. × Population             $(a-1)(b-1)(c-1) = 12$
Error(4)                                    $a(r-1)(b-1)(c-1) = 48$
Total                                       $arbc - 1 = 159$
We consider the blocks random in this analysis for two reasons. First, we posit that the blocks are only a smaller subset of the possible conditions over which inferences are to be drawn. Secondly, the Block × Cultivar interaction serves as the experimental error term on the whole-plot. How can this interaction be random if the Cultivar and Block factors are fixed? Some research workers adopt the viewpoint that this apparent inconsistency should not be of concern. A block*cultivar term will be used in the SAS® code only to generate the necessary error term. We do remind the reader, however, that treating Block × Cultivar interactions as random and blocks as fixed corresponds to choosing an intermediate inference space. As discussed in §7.3 this choice results in smaller standard errors (and $p$-values) compared to the broad inference space in which random effects are allowed to vary. Treating the blocks as random in the analysis could be viewed as a somewhat conservative approach.
In a three-factor experiment it is difficult to roadmap the analysis from main effect and interaction tests to contrasts, multiple comparisons, slices, etc. Whether marginal mean comparisons are meaningful depends on which factors interact. The first step in the analysis is thus to produce tests of the main effects and interactions (Output 7.18).
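A sketch of proc mixed statements consistent with the covariance parameters and the Satterthwaite degrees-of-freedom method reported in Output 7.18 (the exact statements are an assumption):

proc mixed data=soybeanyield;
   class rep cultivar rowspace tpop;
   model yield = cultivar rowspace cultivar*rowspace
                 tpop cultivar*tpop rowspace*tpop
                 cultivar*rowspace*tpop / ddfm=satterth;
   random rep rep*cultivar rep*cultivar*rowspace rep*cultivar*tpop;
run;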
Output 7.18.
The Mixed Procedure
Model Information
Data Set WORK.SOYBEANYIELD
Dependent Variable YIELD
Covariance Structure Variance Components
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Satterthwaite
Covariance Parameters 5
Columns in X 90
Columns in Z 132
Subjects 1
Max Obs Per Subject 160
Observations Used 154
Observations Not Used 6
Total Observations 160
Covariance Parameter Estimates
Cov Parm                      Estimate
REP                           3.0368
REP*CULTIVAR                  0.4524
REP*CULTIVAR*ROWSPACE         1.2442
REP*CULTIVAR*TPOP             2.4215
Residual                      3.9276
Fit Statistics
Type 3 Tests of Fixed Effects
               Num     Den
Effect         DF      DF      F Value    Pr > F
The estimates of the variance components are shown in the Covariance Parameter Estimates table as $\hat\sigma^2_r = 3.037$, $\hat\sigma^2_1 = 0.452$, $\hat\sigma^2_2 = 1.244$, $\hat\sigma^2_3 = 2.421$, and $\hat\sigma^2_4 = 3.928$. The denominator degrees of freedom for the $F$ statistics in the Type 3 Tests of Fixed Effects table were adjusted by the Satterthwaite procedure due to the missingness of observations. For complete data we would have expected denominator degrees of freedom of 9, 12, 12, 48, 48, 48, and 48. At the 5% significance level the three-way interaction, the Population Density main effect, the Cultivar × Row Spacing interaction, and the Cultivar main effect are significant. Because of the significance of the three-way interaction, the two-way interactions that appear nonsignificant may be masked, and similarly for the Row Spacing main effect.
The next step in the analysis is to investigate the interaction pattern more closely. Because of the significance of the three-way interaction, we start there. Figure 7.19 shows the estimated three-way cell means (least squares means) for the Cultivar × Population × Row Spacing combinations. Since the factor Population Density is quantitative, trends of soybean yield in density are investigated later via regression contrasts. The Row Spacing effect for a given population density and cultivar combination, and the Cultivar effect for a given density and row spacing, can be assessed by slicing the three-way interaction. To this end add the statement
lsmeans cultivar*rowspace*tpop / slice=(cultivar*tpop tpop*rowspace);
to the proc mixed code (Output 7.19). The first block of tests in the Tests of Effect Slices table compares 9-inch vs. 18-inch row spacing for cultivar AG3601 at various population densities; the second block does the same for cultivar AG3701, and so forth. The last block compares cultivars for a given combination of population density and row spacing.
Figure 7.19. Three-way least squares means for factors Cultivar, Row Spacing, and Population Density (panels AG3601, AG3701, AG4601, AG4701; horizontal axis: population density; vertical axis: yield in bushels/acre). Solid line refers to 9-inch spacing, dashed line to 18-inch spacing.
For cultivar AG3601 it is striking that there is no spacing effect at 60,000 plants per acre ($p = 0.8804$), but there are significant spacing effects at all greater population densities. This effect is also visible in Figure 7.19. For the other cultivars the row spacing effects are absent with two exceptions: AG4601 and AG4701 at 120,000 plants per acre ($p = 0.0030$ and $0.0459$, respectively). The last block of tests reveals that at 9-inch spacing there are significant differences among the cultivars at any population density (e.g., $p = 0.0198$ at 60,000 plants per acre). For the wider row spacing, cultivar effects are mostly absent.
Output 7.19.
Tests of Effect Slices
Num Den
Effect CULTIVAR TPOP ROWSPACE DF DF F Value Pr > F
To investigate the nature of the yield dependency on Population Density, the information
in the table of slices is very helpful. It suggests that for cultivars AG3701, AG4601, and
AG4701 it is not necessary to distinguish trends among row spacings. Determining the nature
of the trend averaged across row spacings for these cultivars is sufficient. The levels of the
factor Population Density are equally spaced and we use the standard orthogonal polynomial
coefficients to test for linear, quadratic, cubic, and quartic trends. The following twelve
contrast statements are added to the proc mixed code to test trends for AG3701, AG4601,
and AG4701 across row spacings (Output 7.20).
contrast 'AG3701 quartic ' tpop 1 -4 6 -4 1
cultivar*tpop 0 0 0 0 0 1 -4 6 -4 1;
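The remaining statements follow the same pattern, with only the position of the coefficients in the cultivar*tpop list changing with the cultivar. For example, the linear and quadratic trend contrasts for AG4601 (the third cultivar) could be sketched as:

contrast 'AG4601 linear   ' tpop -2 -1 0 1 2
         cultivar*tpop 0 0 0 0 0 0 0 0 0 0 -2 -1 0 1 2;
contrast 'AG4601 quadratic' tpop 2 -1 -2 -1 2
         cultivar*tpop 0 0 0 0 0 0 0 0 0 0 2 -1 -2 -1 2;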
Output 7.20.
Contrasts
Num Den
Label DF DF F Value Pr > F
The contrast statements to discern the row-spacing-specific trends for cultivar AG3601 are more involved (Output 7.21); for example, for the quartic and linear trends at 9-inch spacing:
contrast "AG3601 quart., 9inch" tpop 1 -4 6 -4 1
         cultivar*tpop 1 -4 6 -4 1
         tpop*rowspace 1 0 -4 0 6 0 -4 0 1 0
         cultivar*tpop*rowspace 1 0 -4 0 6 0 -4 0 1 0;
contrast "AG3601 linear, 9inch" tpop -2 -1 0 1 2
         cultivar*tpop -2 -1 0 1 2
         tpop*rowspace -2 0 -1 0 0 0 1 0 2 0
         cultivar*tpop*rowspace -2 0 -1 0 0 0 1 0 2 0;
For this cultivar at 9-inch row spacing, yield depends on population density in quadratic and linear fashion. At 18-inch row spacing, yield is hardly responsive to changes in the population density; the slight yield increase at 18-inch spacing (Figure 7.19) is evident only in the marginally significant linear trend ($p = 0.0526$).
Output 7.21.
Contrasts
Num Den
Label DF DF F Value Pr > F
This analysis of the three-way interaction leads to the overall conclusion that only for cultivar AG3601 is row spacing of importance for a given population density. Does this conclusion prevail when yields are averaged across the different population densities? A look at the significant Cultivar × Rowspace interaction confirms this. Figure 7.20 shows the corresponding two-way least squares means and the $p$-values from slicing this interaction by cultivar (bottom margin) and by spacing (right margin). Significant differences exist among varieties for 9-inch spacing ($p < 0.0001$) but not for 18-inch spacing ($p = 0.4204$). Averaged across the population densities, only variety AG3601 shows a significant yield difference among the two row spacings ($p = 0.0008$). The $p$-values in Figure 7.20 were obtained with the statement (Output 7.22)
lsmeans cultivar*rowspace / slice=(cultivar rowspace);
Figure 7.20. Two-way Cultivar × Row Spacing least squares means (vertical axis: yield in bushels/acre; horizontal axis: variety). $p$-values from slices of the two-way interaction are shown in the margins.
Output 7.22.
Tests of Effect Slices
                                                Num    Den
Effect               CULTIVAR  TPOP  ROWSPACE   DF     DF    F Value    Pr > F
CULTIVAR*ROWSPACE AG3601 1 11.8 19.80 0.0008
CULTIVAR*ROWSPACE AG3701 1 11.8 0.03 0.8759
CULTIVAR*ROWSPACE AG4601 1 10.8 1.29 0.2804
CULTIVAR*ROWSPACE AG4701 1 10.8 0.42 0.5323
CULTIVAR*ROWSPACE 9 3 17.3 14.79 <.0001
CULTIVAR*ROWSPACE 18 3 17.7 0.99 0.4204
Finally, we can raise the question which varieties differ significantly from each other at 9-inch row spacing. The previous slice shows that the question is not of interest at 18-inch spacing. The statement
lsmeans cultivar*rowspace / diff;
compares all Cultivar × Row Spacing combinations in pairs and produces many comparisons that are not of interest. The trimmed output that follows shows only those comparisons where the spacing factor was held fixed at 9 inches. Cultivar AG3601 is significantly higher yielding (the Estimates are positive) than any of the other cultivars.
The basic model for the analysis of these data comprises fixed effects for the Surface and Rainfall Rate effects, temporal effects, and all possible two-way interactions and the one three-way interaction. Random effects are associated with the replicates, stemming from the random assignment of treatments to the columns, and with within-column disturbances over time. The model can be expressed as
$$Y_{ijkl} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + e_{ijk} + \tau_l + (\alpha\tau)_{il} + (\beta\tau)_{jl} + (\alpha\beta\tau)_{ijl} + f_{ijkl}, \qquad [7.69]$$
where
$\alpha_i$ is the effect of the $i$th surface type $(i = 1, \ldots, 3)$.
[Figure: estimated mean pH by day of measurement (days 5, 21, 40, 57) for the hardwood, sand, and spruce surfaces.]
If this basic model is accepted, the first question that needs to be addressed is that of the possible correlation model to be used for the $f_{ijkl}$ from the same column. We fit seven different correlation models and compare their respective AIC and Schwarz criteria (Table 7.19). These criteria point to several correlation models. The AR(1), exponential, gaussian, and compound symmetric models have similar fit statistics. The AIC criterion is calculated as the residual log likelihood minus the number of covariance parameters ("larger is better" version). The unstructured models appear to fit the data well as judged by the negative of the residual log likelihood (which we try to minimize). Their large number of covariance parameters carries a hefty penalty in the determination of AIC, however.
The power model [7.56] is a reparameterization of the exponential model [7.54]; hence, their fit statistics are identical. The AR(1) and the exponential model are identical if the data are equally spaced. For the data considered here, the measurement intervals of 16, 19, and 17 days are almost identical, which explains the small difference in AIC between the two models. The ante-dependence model is also penalized substantially because of its many covariance parameters. Should we ignore the possibility of serial correlation and continue with a compound symmetry model, akin to a split-plot design where the whole-plot factor is a $2 \times 3$ factorial? Fortunately, the power and compound symmetry models are nested. A test of $H_0\colon \rho = 0$ in the power model can be carried out as a likelihood ratio test. The test statistic is $LR = 24.0 - 21.6 = 2.4$ and the probability that a Chi-squared random variable with one degree of freedom exceeds 2.4 is $\Pr(\chi^2_1 \geq 2.4) = 0.121$. At the 5% level, we cannot reject the hypothesis. To continue with a compound symmetry model implies acceptance of $H_0$, and in our opinion the $p$-value is not large enough to rule out the possibility of a Type II error. We are not sold on independence of the repeated measurements. For the gaussian model [7.57] the likelihood ratio statistic would be even greater, but the two models are not directly nested. The gaussian model approaches compound symmetry as the parameter $\alpha$ approaches $0$; at exactly $0$ the model is no longer defined. It appears from Table 7.19 that the gaussian model is the model of choice for these data. Because of its high degree of regularity at short lag distances, we do not recommend it in general for most repeated measures data (see comments in §7.5.2). Since it outperforms the other parsimonious models we use it here.
Table 7.19. Akaike's information criterion and Schwarz' criterion for various covariance models (the last column contains the number of covariance parameters)
Covariance Model             AIC       Schwarz    −2 Res. Log Likelihood    Parameters
Compound Symmetry            −14.0     −14.9      24.0                      2
Unstructured†                −18.0     −22.9      13.9                      11
Unstructured(2)              −15.9     −19.4      15.7                      8
Exponential, [7.54]          −13.8     −15.1      21.6                      3
Power, [7.56]                −13.8     −15.1      21.6                      3
Gaussian, [7.57]             −13.2     −14.5      20.3                      3
AR(1), [7.52]                −13.9     −15.2      21.8                      3
Antedependence, [7.59]       −17.5     −21.0      19.0                      8
† The unstructured model led to a second derivative matrix of the log likelihood function which was not positive definite. It is not considered further for these data.
The proc mixed code to analyze this repeated measures design follows. The model statement contains the main effects and interactions of the three factors Rainfall Rate, Surface Material, and Time. The random statement identifies the experimental errors $e_{ijk}$ and the repeated statement instructs proc mixed to treat the observations from the same replicate as correlated according to the gaussian covariance model. The r=1 and rcorr=1 options of the repeated statement request a printout of the R matrix and the correlation matrix for the first subject in the data set. The initial sorting of the data set is good practice to ensure that the observations are ordered by increasing time of measurement within each experimental unit. The variable t created in the data set is used in the type=sp(gau)(t) option to represent the temporal distance between repeated observations on the same experimental unit. It is measured in number of weeks since the initial measurement (time itself is measured in days).
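A sketch of these steps (the exact form of the sort and the week conversion t = (time - 5)/7 are assumptions; the fuller call with lsmeans options appears later in this section):

proc sort data=leachate;
   by surface rainfall rep time;
run;
data leachate; set leachate;
   t = (time - 5)/7;    /* temporal distance in weeks since the first measurement */
run;
proc mixed data=leachate;
   class rep surface rainfall time;
   model ph = surface rainfall surface*rainfall
              time time*surface time*rainfall time*surface*rainfall;
   random rep(surface*rainfall);
   repeated / subject=rep(surface*rainfall) type=sp(gau)(t) r=1 rcorr=1;
run;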
Output 7.24.
The Mixed Procedure
Model Information
Data Set WORK.LEACHATE
Dependent Variable PH
Covariance Structures Variance Components, Spatial Gaussian
Subject Effect REP(SURFACE*RAINFAL)
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Containment
Dimensions
Covariance Parameters 3
Columns in X 60
Columns in Z 18
Subjects 1
Max Obs Per Subject 72
Observations Used 72
Observations Not Used 0
Total Observations 72
Type 3 Tests of Fixed Effects
               Num     Den
Effect         DF      DF      F Value    Pr > F
and the correlation is $\exp\{-2.285^2/2.587^2\} = 0.458$. These values can be found in the first row, second column of the Estimated R Matrix for Subject 1 and the Estimated R Correlation Matrix for Subject 1. The continuity of the gaussian covariance model near the origin lets correlations decay rather quickly. The correlation between the first and third measurements ($t = 5.0$ weeks) is only 0.0238. The estimated practical range of the correlation process is $\sqrt{3}\,\hat\phi = 4.48$ weeks. After four and a half weeks, leachate pHs from the same soil column are essentially uncorrelated.
The variance component for the experimental error is rather small. If the fixed effects part of the model contains numerous effects and the within-cluster correlations are also modeled, one will frequently find that the algorithm fails to provide estimates for some or all effects in the random statement. The likelihood solutions for the covariance parameters of these effects are outside the permissible range. After dropping the random statement (or individual effects in the random statement) and retaining the repeated statement, the algorithm then often converges successfully. While this is a reasonable approach, one should be cautioned that this effectively alters the statistical model being fit. Dropping the random statement here corresponds to the assumption that $\sigma^2_e \equiv 0$, i.e., there is no experimental error variability associated with the experimental unit and all stochastic variation arises from the within-cluster process.
For some covariance models, combining a random and repeated statement is impossible,
since the effects are confounded. For example, compound symmetry is implied by two nested
random effects. In the absence of a repeated statement a residual error is always added to the
model and hence the statements
proc mixed data=whatever;
class ...;
model y = ... ;
random rep(surface*rainfall);
run;
and
proc mixed data=whatever;
class ...;
model y = ... ;
repeated / subject=rep(surface*rainfall) type=cs;
run;
will fit the same model. Combining the random and repeated statement with type=cs will
lead to aliasing of one of the variance components.
Returning to the example at hand, we glean from the table of Type 3 Tests of Fixed Effects in Output 7.24 that the Surface × Time interaction, the Time main effect, the Rainfall main effect, and the Surface main effect are significant at the 5% level.
Figure 7.22. Surface least squares means by day of measurement. Estimates of the marginal
surface means are shown along the vertical axis with the same symbols as the trends over
time.
The next step is to investigate the significant effects, starting with the two-way interac-
tion. The presence of the interaction is not surprising considering the graph of the two-way
least squares means (Figure 7.22). After studying the interaction pattern we can also conduct
comparisons of the marginal surface means, since the trends over time do not criss-cross
wildly. The proc mixed code that follows requests least squares means and their differences
for the two rainfall levels (there will be only one difference), slices of the surface*time
interaction by either factor, and differences of the Surface × Time least squares means that are
shown in Figure 7.22.
ods listing close;
proc mixed data=Leachate;
class rep surface rainfall time;
model ph = surface rainfall surface*rainfall
time time*surface time*rainfall time*surface*rainfall;
random rep(surface*rainfall);
repeated / subject=rep(surface*rainfall) type=sp(gau)(t) r=1 rcorr=1;
   lsmeans rainfall / diff;
   lsmeans surface*time / slice=(time surface) diff;
   ods output diffs=diffs slices=slices;
run;
The ods listing close statement prior to the proc mixed call suppresses printing of
procedural output to the screen. Instead, the output of interest (slices and least squares mean
differences) is saved to data sets (diffs and slices) with the ods output statement. We do
this for two reasons: (1) parts of the proc mixed output, such as the Dimensions table and the
Type 3 Tests of Fixed Effects, have already been studied above; (2) the set of least
squares mean differences for interactions contains many comparisons not of interest. For
example, a comparison of the spruce surface at day 5 vs. the sand surface at day 40 is hardly
meaningful (it is also not a simple effect!). The data step following the proc mixed code processes
the data set containing the least squares mean differences. It deletes two-way mean com-
parisons that do not correspond to simple effects, calculates the least significant differences
(LSD) at the 5% level for all comparisons, and flags the significance of the comparisons
at the 5%, 2.5%, and 1% levels.
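The data step itself is not reproduced in the text; the following sketch implements the described logic. The variables surface, time, _surface, _time, StdErr, DF, and Probt are the standard columns of the data set created by ods output diffs=; the significance labels are an assumption:

data diffs2;
   set diffs;
   /* delete interaction comparisons that are not simple effects,
      i.e., comparisons in which both factors change levels */
   if (surface ne _surface) and (time ne _time) then delete;
   lsd = tinv(0.975,df)*stderr;              /* 5% least significant difference */
   length sig $ 3;
   if      probt < 0.01  then sig = '***';   /* significant at the 1% level     */
   else if probt < 0.025 then sig = '**';    /* significant at the 2.5% level   */
   else if probt < 0.05  then sig = '*';     /* significant at the 5% level     */
   else    sig = ' ';
run;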
The slices show that at any time point there are significant differences in pH among the
three surfaces (Output 7.25). From Figure 7.22 we surmise that these differences are probably
not between the sand and hardwood surface, but are between these surfaces and the spruce
surface. The least squares mean differences confirm that. The first observation among the
least squares mean differences compares the marginal Rainfall means, the next three compare
the marginal Surface means that are graphed along the vertical axis of Figure 7.22. Compari-
sons of the Surface × Time means start with the fifth observation. For example, the difference
between estimated means of hardwood and sand surfaces at day 5 is 0.2767 with an LSD of
0.3103. At any given time point hardwood and sand surfaces are not statistically different at
the 5% level, whereas the spruce surface leads to significantly higher acidity of the leachate.
Notice that at any point in time the least significant difference (LSD) to compare surfaces is
0.3103. The least significant differences to compare time points for a given surface depend on
the temporal separation, however. If the repeated measurements were uncorrelated, there
would be only one LSD to compare surfaces at a given time point and one LSD to compare
time points for a given surface.
Output 7.25.
Tests of Effect Slices
Num Den
Obs Effect SURFACE TIME DF DF FValue ProbF
[Figure 7.23: mean water usage in four panels (Age 1/Species 1, Age 1/Species 2, Age 2/Species 1, Age 2/Species 2) over days 150 to 300 of measurement.]
Figure 7.23. Age- and species-specific averages for the water usage data. The dashed line
represents an exploratory loess fit to suggest a parametric polynomial model. Data kindly
made available by Dr. Roger Harris and Dr. Robert Witmer, Department of Horticulture,
Virginia Polytechnic Institute and State University. Used with permission.
A model for the age- and species-specific trends over time based on Figure 7.23 could be
the following. Let Y_ijk denote the water usage at time t_k of the average tree in age group i
(i = 1, 2) for species j (j = 1, 2). Define a_i to be a binary regressor (dummy) variable taking
on value 1 if an observation is from age group 1 and 0 otherwise. Similarly, define s_j as a
dummy variable taking on value 1 for species 1 and value 0 for species 2. Then,

E[Y_ijk] = β₀ + β₁a_i + β₂s_j + β₃a_i s_j + β₄t_k + β₅t_k² + β₆a_i t_k + β₇a_i t_k²
         + β₈s_j t_k + β₉s_j t_k² + β₁₀a_i s_j t_k + β₁₁a_i s_j t_k².   [7.70]
This model appears highly parameterized, but is really not. It allows for separate linear and
quadratic trends among the four groups. The intercepts, linear, and quadratic slopes are
constructed from β₀ through β₁₁ according to Table 7.20.
Table 7.20. Intercepts and Gradients for Age and Species Groups in Model [7.70]
Group              a_i   s_j   Intercept        Linear Gradient    Quadratic Gradient
Age 1, Species 1    1     1    β₀+β₁+β₂+β₃     β₄+β₆+β₈+β₁₀       β₅+β₇+β₉+β₁₁
Age 1, Species 2    1     0    β₀+β₁           β₄+β₆              β₅+β₇
Age 2, Species 1    0     1    β₀+β₂           β₄+β₈              β₅+β₉
Age 2, Species 2    0     0    β₀              β₄                 β₅
Model [7.70] is our notion of a population-average model that permits inference about
the effects of age, species, and their interaction. It does not accommodate the sequence of
measurements for individual trees, however. First, we notice that if the parameters of
model [7.70] were varied on a tree-by-tree basis, the total number of parameters in the mean
function alone would be 10 × 12 = 120, an unreasonable number. Second, not all of the trees
will deviate significantly from the population average, and finding out which ones are different
is a time-consuming exercise in a model with that many parameters. The hierarchical, two-stage mixed
model idea comes to our rescue. Rather than modeling tree-to-tree variability within each
group through a large number of fixed effects, we allow one or more of the parameters in
model [7.70] to vary at random among trees. This approach is supported by the random selec-
tion of trees within each age class and for each species. The BLUPs of these random effects
then can be used to (i) assess whether a tree differs in its water usage significantly from the
group average and (ii) to produce tree-specific predictions of water usage. To decide which of
the polynomial parameters to vary at random, we first focus on a single group, trees of
species 2 at age 2. The observed water usage over time for the 10 trees in this group is shown
in Figure 7.24. If we adopt a quadratic response model for the average tree in this group we
can posit random coefficients for the intercept, the linear, and the quadratic gradient. It is un-
likely that the model will support all three parameters being random.
Either a random intercept, random linear slope, or both seem possible. Nevertheless, we
commence modeling with the largest possible model

Y_mk = (β₀ + b_m0) + (β₁ + b_m1)t_k + (β₂ + b_m2)t_k² + e_mk,   [7.71]

where m denotes the tree (m = 1, 2, ..., 10), k the time point, and b_m0 through b_m2 are random
effects (coefficients) assumed independent of the model disturbances e_mk and independent of
each other. The variances of these random effects are denoted σ₀² through σ₂², respectively.
The SAS® proc mixed code to fit model [7.71] to the data from age group 2, species 2
produces Output 7.26.
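The original statements are not reproduced here; a minimal sketch consistent with the output (the data set name age2sp2 and response wu appear in Output 7.27, and treecnt is the subject variable; the three random coefficients are independent by default, type=vc):

proc mixed data=age2sp2;
   class treecnt;
   model wu = t t*t / s;                        /* quadratic trend for the average tree  */
   random intercept t t*t / subject=treecnt s;  /* three independent random coefficients */
run;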
Figure 7.24. Longitudinal observations for ten trees of species 2 in age group 2. The
trajectories suggest quadratic effects with randomly varying slope or intercept.
Output 7.26.
Model Information
Dimensions
Covariance Parameters 4
Columns in X 3
Columns in Z Per Subject 3
Subjects 10
Max Obs Per Subject 25
Observations Used 239
Observations Not Used 0
Total Observations 239
Standard Z
Cov Parm Subject Estimate Error Value Pr Z
Intercept treecnt 0.2691 0.1298 2.07 0.0190
t treecnt 0 . . .
t*t treecnt 0 . . .
Residual 0.1472 0.01382 10.65 <.0001
Fit Statistics
-2 Res Log Likelihood 262.5
AIC (smaller is better) 266.5
AICC (smaller is better) 266.6
BIC (smaller is better) 267.1
The time variable (measured in days since Jan 01) has been rescaled to allow estimates of
the fixed effects and BLUPs to be shown on the output with sufficient decimal places. The
estimates for the variance components σ₁² and σ₂² are zero (Output 7.26). This is an indication
that not all three random effects can be supported by these data, as was already anticipated.
The -2 Res Log Likelihood of 262.5 is the same value one would achieve in a model where
only the intercept varies at random. Fitting models with either σ₂² = 0 or σ₁² = 0 also leads to
zero estimates for the linear or quadratic variance component (apart from the intercept). We
interpret this as evidence that the data support only one of the parameters being random, not
that the intercept being random necessarily provides the best fit. Next all models with a single
random effect are investigated, as well as the purely fixed effects model:
]75 "! ,7
!
"" >5 "# >#5 /75
]75 "! "" ,7
"
>5 "# >#5 /75
]75 "! "" >5 "# ,7
# #
>5 /75
]75 "! "" >5 "# >#5 /75 .
The -2 Res Log Likelihoods of these models are, respectively, 262.5, 280.5, 314.6, and
461.8. The last model is the fixed effects model without any random effects, and likelihood
ratio tests can be constructed to test the significance of any of the random components:

H₀: σ₀² = 0:  Λ = 461.8 − 262.5 = 199.3,  p < 0.0001
H₀: σ₁² = 0:  Λ = 461.8 − 280.5 = 181.3,  p < 0.0001
H₀: σ₂² = 0:  Λ = 461.8 − 314.6 = 147.2,  p < 0.0001.

Incorporating any of the random effects provides a significant improvement in fit over a
purely fixed effects model. The largest improvement (smallest -2 Res Log Likelihood) is
obtained with a randomly varying intercept.
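Only the random statement changes among these fits; a sketch (the purely fixed effects model simply omits the statement):

random intercept / subject=treecnt;  /* sigma_0^2 only: -2 Res Log Likelihood = 262.5 */
random t         / subject=treecnt;  /* sigma_1^2 only: 280.5 */
random t*t       / subject=treecnt;  /* sigma_2^2 only: 314.6 */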
So far, it has been assumed that the model disturbances e_mk are uncorrelated. Since the
data are longitudinal in nature, it is conceivable that residual serial correlation remains even
after inclusion of one or more random effects. To check this possibility, we fit a model with
an exponential correlation structure, since the measurement times are not quite equally
spaced.
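A sketch of such a fit, consistent with Output 7.27 (the name of the day-scale time variable, time, is an assumption based on the combined-data code shown later in this section):

proc mixed data=age2sp2;
   class treecnt;
   model wu = t t*t / s;
   random intercept / subject=treecnt s;           /* randomly varying intercept     */
   repeated / subject=treecnt type=sp(exp)(time);  /* exponential serial correlation */
run;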
A further drop in minus twice the residual log likelihood, from 262.5 to 251.3, is confirmed
by Output 7.27. The difference Λ = 262.5 − 251.3 = 11.2 is significant with a p-value of
Pr(χ₁² ≥ 11.2) = 0.0008. Adding an autoregressive correlation structure for the e_mk signifi-
cantly improved the model fit. It should be noted that the p-value from the likelihood ratio
test differs from the p-value of 0.0067 reported by proc mixed in the Covariance Parameter
Estimates table. The p-value reported there is for a Wald-type test statistic obtained from
comparing a Z statistic against an asymptotic standard Gaussian distribution. The test
statistic is simply the estimate of the covariance parameter divided by its standard error. Esti-
mates of variances and covariances are usually far from Gaussian-distributed and the Wald-
type tests for covariance parameters produced by the covtest option of the proc mixed state-
ment are not very reliable. We prefer the likelihood ratio test whenever possible.
Output 7.27.
The Mixed Procedure
Model Information
Data Set WORK.AGE2SP2
Dependent Variable wu
Covariance Structures Variance Components,
Spatial Exponential
Subject Effects treecnt, treecnt
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Containment
Dimensions
Covariance Parameters 3
Columns in X 3
Columns in Z Per Subject 1
Subjects 10
Max Obs Per Subject 25
Observations Used 239
Observations Not Used 0
Total Observations 239
Fit Statistics
-2 Res Log Likelihood 251.3
AIC (smaller is better) 257.3
AICC (smaller is better) 257.4
BIC (smaller is better) 258.2
The /s options on the model and random statements of proc mixed yield printouts of the
fixed effects estimates and the BLUPs. We obtain β̂₀ = −11.22, β̂₁ = 12.71, and
β̂₂ = −2.914 for the fixed effects estimates. Water usage of the average tree in this group is
thus predicted as ŷ = −11.22 + 12.71t − 2.914t². The best linear unbiased predictors of
b = [b_10, b_20, ..., b_m0]′ are displayed as Solutions for Random Effects. For example, the tree-
specific prediction for tree #1 is

ŷ = −11.22 + 0.1930 + 12.71t − 2.914t².

The BLUPs are significantly different from zero (at the 5% level) for trees 4, 5, 6, 7, 8, and 9.
Figure 7.25. Predictions from the random intercept model with continuous AR(1) errors for
species 2 at age 2. The population-average fit is shown as a dashed line, cluster-specific
predictions as solid lines.
The tree-specific trends show a noticeable deviation from the population-averaged pre-
diction for trees that have a significant BLUP (Figure 7.25). For any of these trees it is
obvious that the tree-specific prediction provides a much better fit to the data than the popu-
lation average.
The previous proc mixed runs and inferences were for one of the four groups only. Next
we need to combine the data across the age and species groups which entails adding a mixed
effects structure to model [7.70]. Based on what was learned from investigating the Age 2,
Species 2 group, it is tempting to add a random intercept and an autoregressive structure for
the within-tree disturbances and leave it at that. The mixed model analyst will quickly notice
when dealing with combined data sets that random effects that may not be estimable for a
subset of the data can successfully be estimated based on a larger set of data. In this appli-
cation it turned out that upon combining the data from the four groups not only a random
intercept, but also a random linear slope could be estimated. The statements
proc mixed data=wateruse noitprint;
class age species treecnt;
model wu = age*species age*species*t age*species*t*t / s;
estimate 'Intercpt Age1-Age2 = Age Main' age*species 1 1 -1 -1;
estimate 'Intercpt Sp1 -Sp2 = Sp. Main' age*species 1 -1 1 -1;
estimate 'Intercpt Age*Sp = Age*Sp. ' age*species 1 -1 -1 1;
random intercept t /subject=age*species*treecnt s;
repeated /subject=age*species*treecnt type=sp(exp)(time);
run;
fit the model with random intercept and random (linear) slope and an autoregressive correla-
tion structure for the within-tree errors (Output 7.28). The subject= options of the random
and repeated statements identify the units that are to be considered uncorrelated in the
analysis. Observations with different values of the variables age, species, and treecnt are
considered to be from different clusters and hence uncorrelated. Any set of observations with
the same values of these variables is considered correlated. The use of the age and species
variables in the class statement allows a more concise expression of the fixed effects part of
model [7.70]. The term age*species fits the four intercepts, the term age*species*t the four
linear slopes, and so forth. The first two estimate statements compare the intercept estimates
between ages 1 and 2, and between species 1 and 2. These are inquiries into the Age and Species
main effects. The third estimate statement tests for the Age × Species interaction averaged across
time.
This model achieves a -2 Res Log Likelihood of 995.8. Removing the repeated state-
ment and treating the repeated observations on the same tree as uncorrelated, twice the
negative of the residual log likelihood becomes 1074.2. The likelihood ratio statistic
1074.2 − 995.8 = 78.4 indicates a highly significant temporal correlation among the repeated
measurements. The estimate of the correlation parameter is φ̂ = 4.7217. Since the measure-
ment times are coded in days, this estimate implies that water usage exhibits temporal correla-
tions over 3φ̂ = 14.16 days. Although there are 953 observations in the data set, notice that
the test statistics for the fixed effects estimates are associated with 36 degrees of freedom, the
number of clusters (subjects, 40) minus the number of estimated covariance parameters (4).
The tests for main effects and interactions at the end of the output suggest no differences
in the intercepts between ages 1 and 2, but differences in intercepts between the species.
Because of the significant interaction between Age Class and Species we decide to retain all
four fixed effect intercepts in model [7.70]. Similar tests can be performed to determine
whether the linear and quadratic gradients differ among the species and ages.
Output 7.28.
Class Level Information
age 2 1 2
species 2 1 2
treecnt 40 1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33
34 35 36 37 38 39 40
Fit Statistics
Standard
Effect age species Estimate Error DF t Value Pr > |t|
Std Err
Effect age species treecnt Estimate Pred DF t Value Pr > |t|
Intercept 1 1 1 -0.04041 0.2731 869 -0.15 0.8824
t 1 1 1 0.04677 0.1180 869 0.40 0.6921
Intercept 1 1 2 -0.03392 0.2738 869 -0.12 0.9014
t 1 1 2 0.07863 0.1181 869 0.67 0.5057
Intercept 1 1 3 0.08246 0.2731 869 0.30 0.7628
t 1 1 3 -0.05849 0.1180 869 -0.50 0.6204
Intercept 1 1 4 0.07395 0.2731 869 0.27 0.7866
t 1 1 4 -0.01493 0.1180 869 -0.13 0.8994
Intercept 1 1 5 -0.1568 0.2731 869 -0.57 0.5660
t 1 1 5 -0.07856 0.1180 869 -0.67 0.5059
Estimates
Standard
Label Estimate Error DF t Value Pr > |t|
Intercpt Age1-Age2 = Age Main 2.3498 2.3551 36 1.00 0.3251
Intercpt Sp1 -Sp2 = Sp. Main -8.7630 2.3551 36 -3.72 0.0007
Intercpt Age*Sp = Age*Sp. -6.8769 2.3551 36 -2.92 0.0060
There were four replicate plots for each treatment. In 1997, mature melons were harvested on
each plot at days 81, 84, 88, and 91. The average yield per plot was converted into yield per
hectare and added to the previous yield. Figure 7.26 shows the mean cumulative yields in
tons ha⁻¹ for the four treatments in 1997. The cumulative yield increases over time since it
is obtained from adding positive yield figures. In the absence of mulching the cumulative
yields are considerably lower compared to the red or black mulch applications. There seems
to be little difference in yields between the two plastic mulch types.
These data display several interesting features. The experimental units are the plots to
which a particular treatment was applied. The mature melons harvested at a particular day
represent observational units. The number of matured melons varies from plot to plot and
hence the number of subsamples from which the yield per hectare is calculated differs. If the
variability of melon weights is homogeneous across all plots, the variability of observed
yields per hectare increases with the number of melons harvested. Also, the cumulative yields
which are the focus of the analysis (Figure 7.26) are not independent, even if the muskmelons
matured independently on a given plot. The cumulative yield at day )) is the sum of the
cumulative yield at day )% and the yield observed at day )). To build a statistical model for
these data the two issues of subsampling and correlated responses must be kept in mind.
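For example, the cumulative yields can be accumulated from the per-harvest yields in a short data step (a sketch; the per-day yield variable yield and the day variable day are assumed names):

proc sort data=melons97; by tx rep day; run;
data melons97;
   set melons97;
   by tx rep;
   if first.rep then cumyld = 0;  /* restart the accumulation for each plot  */
   cumyld + yield;                /* sum statement: accumulate within rep*tx */
run;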
[Figure 7.26: average cumulative yield (about 10 to 50 tons ha⁻¹) versus days 80 to 90 after planting; the Irrig. Black Mulch series lies well above the Irrigated and Not irrigated series.]
Figure 7.26. Average cumulative yields vs. days after planting in the muskmelon study for 1997.
The data for this experiment were kindly provided by Dr. Norris L. Powell, Tidewater
Agricultural Research and Extension Center, Virginia Polytechnic Institute and State
University. Used with permission.
The investigators were particularly interested in the following questions and hypotheses:
① In the absence of mulch, is there a benefit of irrigation?
② Is there a difference in cumulative yields between red and black plastic mulch?
③ Is there a benefit of mulching beyond irrigation?
④ What is the mulch effect?
⑤ Can mulching shorten the growing period, i.e., is the yield of mulched plots at the
   beginning of the observation period comparable to the yield of unmulched plots at the
   end of the observation period?
These questions have a temporal component; in the case of ①, for example, there may be
a beneficial effect late in the season rather than early in the season. In building a statistical
model for this experiment we commence by formulating a model for the yield per hectare on
a given plot at a given point in time (Y_ijk):

Y_ijk = μ_ijk + e_ij + d_ijk.   [7.72]
In [7.72], μ_ijk denotes the mean of the ith treatment (i = 1, ..., 4) on replicate (plot) j at the
kth harvesting day after planting (k = 1, ..., 4). The e_ij are experimental errors associated
with replicate (plot) j of treatment i, and d_ijk is the subsampling error from harvesting n_ijk
melons on replicate j of treatment i at time k. If σ_m² is the variability of melon weights
(converted to hectare basis) and Var[e_ij] = σ² is the experimental error variability, then

Var[Y_ijk] = σ² + n_ijk σ_m².
Model [7.72] is a relatively straightforward mixed model, but the Y_ijk are not the outcome of
interest. Rather, we wish to analyze the cumulative yields

U_ij1 = Y_ij1
U_ij2 = Y_ij1 + Y_ij2
⋮
U_ijk = Σ_{p=1}^{k} Y_ijp.
That the U_ijk are correlated even if the Y_ijk are independent is easy to establish. For example,
Cov[U_ij1, U_ij2] = Cov[Y_ij1, Y_ij1 + Y_ij2] = Var[Y_ij1]. The accumulation of yields results in an
accumulation of the experimental errors and the subsampling errors. The resulting correlation
structure is rather complicated and difficult to code in a statistical computing package. As an
alternative, we choose the following route: an unstructured variance-covariance matrix for the
accumulated d_ijk is combined with a random effect for the experimental errors. The mean
model is a cubic response model with different trends for each treatment. This mean model
will produce the same estimates of treatment means at days 81, 84, 88, and 91 as a block
design analysis but also allows for estimating treatment differences at other time points. The
comparisons ① through ⑤ are coded with estimate statements of proc mixed. The variable t in
the code that follows is 80 days less than time after planting. The first harvesting time thus
coincides with t = 1. Divisors in estimate statements are used here to ensure that the resulting
estimates are properly scaled. For example, a statement such as
estimate 'A1+A2 vs. B1+B2' tx 1 1 -1 -1;
compares the sum of treatment means for A1 and A2 to the sum for B1 and B2. Using a divisor
has no effect on the significance level of the comparison, and the statement

estimate 'Average(A1 A2) vs. Average(B1 B2)' tx 1 1 -1 -1 / divisor=2;

will produce the same p-value. The estimate itself will be the difference of sums in the first
case and the difference of averages in the second case.
proc mixed data=melons97 convh=0.002 noitprint;
class rep tx;
model cumyld = tx tx*t tx*t*t tx*t*t*t / noint;
random rep*tx; /* experimental error */
repeated /subject=rep*tx type=un;
/* ① at days 81, 84, 88, 91 corresponding to t=1, 4, 8, 11 */
estimate 'Bare vs. Irrig (81)' tx 1 -1 0 0 tx*t 1 -1 0 0
tx*t*t 1 -1 0 0 tx*t*t*t 1 -1 0 0;
estimate 'Bare vs. Irrig (84)' tx 1 -1 0 0 tx*t 4 -4 0 0
tx*t*t 16 -16 0 0 tx*t*t*t 64 -64 0 0;
estimate 'Bare vs. Irrig (88)' tx 1 -1 0 0 tx*t 8 -8 0 0
tx*t*t 64 -64 0 0 tx*t*t*t 512 -512 0 0;
estimate 'Bare vs. Irrig (91)' tx 1 -1 0 0 tx*t 11 -11 0 0
tx*t*t 121 -121 0 0 tx*t*t*t 1331 -1331 0 0;
/* ⑤ */
estimate 'Irrig(91) - Mulch(81)' tx 0 2 -1 -1 tx*t 0 22 -1 -1
tx*t*t 0 242 -1 -1 tx*t*t*t 0 2662 -1 -1 /divisor=2;
run;
Fit Statistics
Res Log Likelihood -190.3
Akaike's Information Criterion -201.3
Schwarz's Bayesian Criterion -205.5
-2 Res Log Likelihood 380.5
Estimates
Standard
Label Estimate Error DF t Value Pr > |t|
①
Bare vs. Irrig (81) 1.6425 10.2688 36 0.16 0.8738
Bare vs. Irrig (84) -5.9495 7.1567 36 -0.83 0.4113
Bare vs. Irrig (88) -4.8272 6.4511 36 -0.75 0.4592
Bare vs. Irrig (91) -10.3727 7.5759 36 -1.37 0.1794
②
IrrBl vs. IrrRed (81) 4.5288 10.2688 36 0.44 0.6618
IrrBl vs. IrrRed (84) 0.7070 7.1567 36 0.10 0.9219
IrrBl vs. IrrRed (88) 2.8887 6.4511 36 0.45 0.6570
IrrBl vs. IrrRed (91) 0.7420 7.5759 36 0.10 0.9225
③
Irrig vs. Mulch (81) -25.8126 8.8931 36 -2.90 0.0063
Irrig vs. Mulch (84) -28.7948 6.1979 36 -4.65 <.0001
Irrig vs. Mulch (88) -29.5604 5.5868 36 -5.29 <.0001
Irrig vs. Mulch (91) -26.3408 6.5609 36 -4.01 0.0003
④
Mulch vs. NoMulch (81) -24.9914 7.2611 36 -3.44 0.0015
Mulch vs. NoMulch (84) -31.7695 5.0605 36 -6.28 <.0001
Mulch vs. NoMulch (88) -31.9740 4.5616 36 -7.01 <.0001
Mulch vs. NoMulch (91) -31.5271 5.3570 36 -5.89 <.0001
⑤
Irrig(91) - Mulch(81) -5.5644 7.4202 36 -0.75 0.4582
8.1 Introduction
8.2 Nonlinear and Generalized Linear Mixed Models
8.3 Toward an Approximate Objective Function
8.3.1 Three Linearizations
8.3.2 Linearization in Generalized Linear Mixed Models
8.3.3 Integral Approximation Methods
8.4 Applications
8.4.1 A Nonlinear Mixed Model for Cumulative Tree Bole Volume
8.4.2 Poppy Counts Revisited — a Generalized Linear Mixed Model
for Overdispersed Count Data
8.4.3 Repeated Measures with an Ordinal Response
8.1 Introduction
Box 8.1 NLMMs and GLMMs
• Models for data that are clustered or otherwise call for the inclusion of
random effects do not necessarily have a linear mean function.
• Generalized linear mixed models (GLMMs) arise when clustered data are
modeled where the (conditional) response has a distribution in the
exponential family.
Chapters 5 and 6 discussed general nonlinear models and generalized linear models (GLM)
for independent data. It was emphasized in §6 that GLMs are special cases of nonlinear
models where a linear predictor is placed inside a nonlinear function. Chapter 7 digressed
from the nonlinear model theme by introducing linear models for clustered data where the
observations within a cluster are possibly correlated. To capture cluster-to-cluster as well as
within-cluster variability we appealed to the idea of randomly varying cluster effects which
gave rise to the Laird-Ware model
Y_i = X_iβ + Z_ib_i + e_i.
Recall that the b3 are random effects or coefficients, with mean 0 and variance-covariance
matrix D, that vary across clusters and the e3 are the within-cluster errors. Throughout
Chapter 7 it was assumed that the mean function is linear. There are situations, however,
where the model calls for the inclusion of random effects and the mean function is nonlinear.
Figure 8.1 shows the cumulative bole volume profiles of three yellow poplar (Lirioden-
dron tulipifera L.) trees that are part of a data set of 336 randomly selected trees. The volume
of a bole was obtained by felling the tree, delimbing the bole and cutting it into four-foot-long
sections. The volume of each section was determined by geometric principles assuming a
circular shape of the bole cross-section and accumulated with the volumes of lower sections.
This process is repeated to the top of the tree bole if total-bole volume is the desired response
variable, or to the point where the bole diameter has tapered to the merchantable diameter, if
merchantable volume is the response variable. If d_ij is the cross-sectional diameter of the bole
of the ith tree at the jth height of measurement, then r_ij = 1 − d_ij/max(d_ij) is termed the
complementary diameter. It is zero at the stump and approaches one at the tree tip.
The trees differ considerably in size (total height and breast height diameter), hence their
total cumulative volumes differ greatly. The general shapes of the tree profiles are similar,
however, suggesting a sigmoidal increase of cumulative volume with increasing complemen-
tary diameter. These data are furthermore clustered. Each of the n = 336 trees represents a
cluster of observations. Since the individual bole segments were cut at equal four-foot inter-
vals, the number of observations within a cluster (n_i) varies from tree to tree. A cumulative
bole volume model for these data will need a nonlinear mean function that captures the sig-
moidal behavior of the response and accounts for differences in size and shape of the trees.
[Figure 8.1: cumulative volume (ft³) profiles for trees No. 5, No. 151, and No. 308.]
Figure 8.1. Cumulative volume profiles of three yellow poplar (Liriodendron tulipifera L.)
trees as a function of the complementary bole diameter. Data kindly provided by Dr. David
Loftis, USDA Forest Service, originally collected by Dr. Donald E. Beck (see Beck 1963).
Mixed models with nonlinear mean function also come about when generalized linear
models are applied to data with clustered structure. Recall the Hessian fly experiment of
§6.7.2 where 16 varieties were arranged in a randomized block design with four blocks and
the outcome was the proportion of plants on an experimental unit infested with the Hessian
fly. The model applied there was
log(π_ij / (1 − π_ij)) = μ + τ_i + ρ_j,

where π_ij is the probability of infestation for variety i in block j, and τ_i, ρ_j are the treatment
and block effects, respectively. Expressed as a statistical model for the observed data this
model can be written as
"
]34 /34 , [8.1]
" expe . 73 34 f
where ]34 is the proportion of infested plants and /34 is a random variable with mean ! and
variance 134 a" 134 bÎ834 (a so-called shifted binomial random variable). If the blocks were
not predetermined but selected at random, then the 34 are random effects and model [8.1] is a
nonlinear mixed model. It is nonlinear because the linear predictor . 73 34 is inside a
nonlinear function, and it is a mixed model because of fixed treatment effects and random
block effects.
The problem of overdispersion in generalized linear models was discussed in §6.6.
Different strategies for modeling overdispersed data were (i) the addition of extra scale
parameters that alter the dispersion but not the mean of the response, (ii) parameter mixing,
and (iii) generalized linear mixed models. An extra scale parameter was used to model the
Hessian fly data in §6.7.2. For the poppy count data in §6.7.8 parameter mixing was applied.
Recall that Y_ij denoted the number of poppies for treatment i in block j. It was assumed that,
given λ_ij, the average number of poppies per unit area, the counts Y_ij | λ_ij were Poisson
distributed and that λ_ij was a Gamma-distributed random variable. This led to a Negative
Binomial model for the poppy counts ]34 . Because this distribution is in the exponential
family of distributions (§6.2.1), the model could be fit as a standard generalized linear model.
The parameter mixing approach is intuitive because a quantity assumed to be fixed in a
reference model is allowed to vary at random thereby introducing more uncertainty in the
marginal distribution of the response and accounting for the overdispersion in the data. The
generalized linear mixed model approach is equally intuitive. Let Y_ij denote the poppy count
for treatment i in block j and assume that Y_ij | d_ij are Poisson distributed with mean

λ_ij = exp{μ + τ_i + ρ_j + d_ij}.

The model for the log intensity of poppy counts appears to be a classical model for a
randomized block design with error term d_ij. In fact, we specify that the d_ij are independent
Gaussian random variables with mean 0 and variance σ². Compare this to the standard
Poisson model without overdispersion, Y_ij ~ Poisson(exp{μ + τ_i + ρ_j}). The uncertainty in
d_ij increases the uncertainty in Y_ij. The model

Y_ij | d_ij ~ Poisson(exp{μ + τ_i + ρ_j + d_ij})   [8.2]
is also a parameter mixing model. If the distribution of the d_ij is carefully chosen, one can
average over it analytically to derive the marginal distribution of Y_ij on which inference is
based. In §6.7.8 the distribution of λ_ij was chosen as Gamma because the marginal distribu-
tion then had a known form. In other situations the marginal distribution may be difficult to
obtain or intractable. We then start with a generalized linear mixed model such as [8.2] and
approximate the marginal distribution (see §8.3.2). The poppy count data are modeled with a
generalized linear mixed model in §8.4.2.
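As a preview, a model such as [8.2] is specified in proc nlmixed along the following lines (a sketch; the data set poppies, count variable count, and experimental unit identifier unit are placeholder names, and the treatment and block effects, coded with if .. else .. statements in §8.4.2, are collapsed into the intercept here):

proc nlmixed data=poppies;
   parms mu=1 logsd=0;               /* starting values must be supplied       */
   eta    = mu + d;                  /* linear predictor with random effect d  */
   lambda = exp(eta);                /* conditional Poisson mean, as in [8.2]  */
   model count ~ poisson(lambda);
   random d ~ normal(0,exp(2*logsd)) subject=unit;  /* variance via log std. dev. */
run;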
The vector x_ij is a vector of regressor or design variables and θ is a vector of fixed effects.
The vector b_i denotes a vector of random effects (or coefficients) modeling the cluster-to-
cluster heterogeneity. As in §7, the b_i are zero-mean random variables with variance-covar-
iance matrix D. The e_ij are within-cluster errors with mean 0 which might be correlated. Be-
cause computational difficulties in fitting nonlinear models are greater than those in fitting
linear models it is commonly assumed that the within-cluster errors are uncorrelated, but this
When combining the responses for a particular cluster in a vector Y_i = [Y_i1, ..., Y_in_i]′,
models [8.3] and [8.4] will be written as

Y_i = f(x, θ, b_i) + e_i
Y_i = g^{-1}(X_iβ + Z_ib_i) + e_i.
Maximum (or restricted maximum) likelihood estimation is based on the marginal distribution
of Y_i, which turns out to be multivariate Gaussian if the b_i and e_i are Gaussian-distributed.
Even when b_i and e_i are Gaussian, the distribution of Y_i in the general model
Y_i = f(x, θ, b_i) + e_i is not necessarily Gaussian, since Y_i is no longer a linear combination of
Gaussian random variables. Even deriving the marginal mean and variance of Y_i proves to be
a difficult undertaking. If the distribution of the within-cluster errors e_i is known, finding the
conditional distribution of Y_i | b_i is simple. The marginal distribution of Y_i, which is key in
statistical inference, remains elusive in the nonlinear mixed model. The various approaches
put forth in the literature differ in the technique and rationale applied to approximate this mar-
ginal distribution.
where f_{y,b}, f_{y|b}, and f_b denote the joint, conditional, and marginal probability density (mass)
functions, respectively. The marginal distribution of Y_i is obtained by integrating this joint
density over the distribution of the random effects,
If the distribution f_y(y_i) is known, the maximum likelihood principle can be invoked to obtain
estimates of the unknown parameters. Unfortunately, this distribution is usually intractable.
The relevant approaches to arrive at a solution to this problem can be classified into two
broad categories, linearization and integral approximation methods. Linearization methods
replace the nonlinear mixed model with an approximate linear model. They are also called
pseudo-data methods since the response being modeled is not Y_i but a function thereof. Some
linearization methods are parametric in the sense that they assume a distribution for the
pseudo-response. Other linearization methods are semi-parametric in the sense that they esti-
mate the parameters without distributional assumptions beyond the first two moments of the
pseudo-response. Methods of integral approximation assume a particular distribution for the
random effects b_i and for the conditional distribution of Y_i | b_i and approximate the integral
[8.5] by numerical techniques (quadrature methods or Monte Carlo integration).
Before proceeding we need to point out that linearization and integral approximation
techniques are not the only possible methods for estimating the parameters of a nonlinear
mixed model. For example, if the random effects b_i enter the model linearly, Vonesh and
Carter (1992) apply iteratively reweighted least squares estimation akin to that in a general-
ized linear model. Parameter estimates in nonlinear mixed models can also be obtained in a
two-stage approach known as the individual estimates method. In the first stage the non-
linear model is fit to each cluster separately and in the second stage the cluster-specific esti-
mates are combined (averaged) to arrive at population-averaged values. Success of inference
based on individual estimates depends on having sufficient measurements on each cluster to
estimate the nonlinear response reliably and on methods for combining the individual esti-
mates into population average estimates efficiently. One advantage of the linearization
methods is that they do not depend on the ability to fit the model to each cluster separately.
Clusters that do not contribute information about the entire response profile, for example because of
missing values, will nevertheless contribute to estimation in linearization methods. Earlier
applications of the individual estimates method suffered from not borrowing strength across
clusters in the estimation process (see, e.g., Korn and Whittemore 1997 and Biging 1985).
Davidian and Giltinan (1993) improved the method considerably by fitting the nonlinear
model separately to each subject but using weight matrices that are estimated across subjects.
For more details on two-stage methods that build on individual estimates derived from each
cluster the reader is referred to Ch. 5. in Davidian and Giltinan (1995) and references therein.
The Gauss-Newton method (§A5.10.3) for fitting a nonlinear model (for independent data)
starts with a least squares objective function to be minimized, the residual sum of squares
S(θ) = (y − f(x, θ))′(y − f(x, θ)). It then approximates the mean function f(x, θ) by a first-
order Taylor series about θ̂ and substitutes the approximate model back into S(θ). This yields
an approximated residual sum of squares to be minimized. In nonlinear mixed models a
similar rationale can be applied. Approximate the conditional mean function by a Taylor
series about some value chosen for θ and some value chosen for b_i. This leads to an approxi-
mate linear mixed model whose parameters are then estimated by one of the techniques from
§7. What are sometimes termed the first- and second-order expansion methods (Littell et al.
1996) differ in whether the mean is expanded about the mean of the b_i, E[b_i] = 0, or the esti-
mated BLUP b̂_i.
For the discussion that follows we consider the stacked form of the model to eliminate
the cluster subscript i. Let Y = [Y₁′, ..., Y_n′]′, e = [e₁′, ..., e_n′]′, and denote the model vector as

f(x, θ, b) = [f(x₁₁, θ, b₁), f(x₁₂, θ, b₁), ..., f(x₁ₙ₁, θ, b₁), f(x₂₁, θ, b₂), ..., f(x_nn_n, θ, b_n)]′.

If b_i has mean 0 and variance-covariance matrix D, then we refer to b = [b₁′, ..., b_n′]′ as the
vector of random effects across all clusters, a random variable with mean 0 and variance-co-
variance matrix B = Diag{D}. In the sequel we present the rationale behind the various
linearization methods. Detailed formulas and derivations can be found in §A8.5.1.
The first linearization method expands f(x, θ, b) about a current estimate θ̂ of θ and the
mean of b, E[b] = 0. Littell et al. (1996, p. 463) term this the approximate first-order method.
We prefer to call it the population-average (PA) expansion. A subject-specific (SS) expansion
is obtained as a linearization about θ̂ and a predictor of b, commonly chosen to be the esti-
mated BLUP b̂. Littell et al. (1996, p. 463) refer to it as the approximate second-order method.
Finally, the generalized estimating equations of Zeger, Liang, and Albert (1988) can be adapt-
ed for the case of a nonlinear mixed model with continuous response based on an expansion
about E[b] = 0 only (see §A8.5.2). We term this case the GEE expansion. The three lineariza-
tions lead to the approximate models shown in Table 8.1.
Table 8.1. Approximate models implied by the three linearizations
Expansion   Expansion locus               Approximate model
PA          θ̂, E[b] = 0                  Ŷ ≈ X̂θ + Ẑb + e
SS          θ̂, b̂                         Ỹ ≈ X̃θ + Z̃b + e
GEE         E[b] = 0 (b̂ not necessary)   Y ≈ f(x, θ, 0) + Zb + e

The matrices X̂, Ẑ, X̃, and Z̃ are matrices of derivatives of the function f(x, θ, b). For the
PA expansion they are defined as

X̂ = ∂f(x, θ, b)/∂θ′ |_{θ̂, 0},    Ẑ = ∂f(x, θ, b)/∂b′ |_{θ̂, 0},

with X̃ and Z̃ defined analogously but evaluated at θ̂ and b̂.
For a model without random effects, for example, the expansion about θ̂ gives

Y ≈ f(x, θ̂) + ∂f(x, θ)/∂θ′ (θ − θ̂) + e = f(x, θ̂) + F(θ − θ̂) + e,

so that the pseudo-response satisfies

Ỹ ≡ Y − f(x, θ̂) + Fθ̂ = Fθ + e.
After an estimate of ) is obtained, the pseudo-response Y and the derivative matrix F are up-
dated and the linear model for the next iteration is obtained. This process continues until a
convergence criterion is met.
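For a fixed effects model, one cycle of this pseudo-data scheme can be sketched in a few lines of SAS/IML (a toy illustration under an assumed mean function f(x, θ) = θ₁(1 − exp{−θ₂x}) and made-up data):

proc iml;
   x = {1, 2, 3, 4, 5, 6};               /* made-up regressor */
   y = {2.1, 3.4, 4.3, 4.9, 5.2, 5.4};   /* made-up response  */
   theta = {5, 0.5};                     /* starting values   */
   crit  = 1;
   do iter = 1 to 50 while(crit > 1e-10);
      fval = theta[1]*(1 - exp(-theta[2]*x));       /* f(x, theta-hat)         */
      Fmat = (1 - exp(-theta[2]*x)) ||              /* derivative matrix F     */
             (theta[1]*x#exp(-theta[2]*x));
      ytilde   = y - fval + Fmat*theta;             /* pseudo-response Y-tilde */
      thetanew = solve(Fmat`*Fmat, Fmat`*ytilde);   /* linear least squares    */
      crit  = max(abs(thetanew - theta));
      theta = thetanew;
   end;
   print theta;                          /* Gauss-Newton estimates at convergence */
quit;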
moment estimators for the unknowns in V. Because these estimates depend on the current
solution θ̂, the process remains iterative. After an update of θ̂, given a current estimate of V,
V̂ is re-estimated, followed by another update of θ̂, and so forth. Because the moment estima-
tors are noniterative this procedure usually converges rather quickly. The performance of the
method of moments estimators will improve with increasing number of clusters. The esti-
mates so obtained can be used as starting values for a parametric fit of the nonlinear mixed
model based on a PA or SS expansion assuming Gaussian pseudo-response. Gregoire and
Schabenberger (1996a, 1996b) note that the predicted cluster-specific profiles from a GEE
and REML fit were nearly identical and indistinguishable on a graph of plotted response
curves. These authors conclude that the GEE estimation method constitutes a viable estima-
tion approach in its own right and is more than just a vehicle to produce starting values for
the other linearization methods.
To incorporate the stochastic nature of the response the observational model is expressed as
random deviations of Y_i | b_i from its mean,

Y_i | b_i = g^{-1}(X_iβ + Z_ib_i) + e_i.   [8.7]

The conditional distribution of the Y_ij | b_i is chosen as an exponential family member (this can
be relaxed) with variance function Var[e_ij] = ψh(μ). In many situations ψ will be one (see
Table 6.2), although the method of Wolfinger and O'Connell allows for estimation of an extra
scale parameter. The variance-covariance matrix of e_i can be expressed as
Var[e_i] = ψA_i^{1/2}C_iA_i^{1/2}, where A_i is a diagonal matrix of variance function evaluations and
C_i a within-cluster correlation matrix. Linearizing g^{-1}(·) about the current estimates leads
to the pseudo-model

Ỹ_i | b_i ≈ g^{-1}(η̂_i) + Δ̂_i(X_iβ + Z_ib_i − η̂_i) + e_i.   [8.8]

The role of the within-cluster error dispersion matrix R is now played by Δ_i^{-1}A_i^{1/2}C_iA_i^{1/2}Δ_i^{-1}.
Because of the linearity of the linear predictor, the Z and X matrices in [8.9] are the same
matrices as in model [8.7]. They do not depend on the expansion locus and/or current solu-
tions of the fixed and random effects. The linear predictor η̂_i and the gradient matrix Δ_i are
evaluated at the current solutions, however. The process of fitting a generalized linear mixed
model based on this linearization is thus again doubly iterative. The parameters in D and C
are estimated iteratively by maximum or restricted maximum likelihood as in any Gaussian
linear mixed model. Once these estimates and updates of β̂ have been obtained, the pseudo-
mixed model is recalculated for the next iteration and the process is repeated. Noniterative
methods of estimating the covariance parameters are available. For details see Schabenberger
and Gregoire (1996).
which is needed to calculate the likelihood for the entire data. Assuming that clusters are
independent, this likelihood is simply the product f_y(y) = ∏_{i=1}^{n} f_y(y_i). The linearizations
combined with a Gaussian assumption for the errors e_i lead to pseudo-models where
f_{y|b}(y_i|b_i), f_{ỹ|b}(ỹ_i|b_i), or f_{e|b}(e_i|b_i) are Gaussian. Consequently, the marginal distributions
f_y(y_i), f_ỹ(ỹ_i), and f_e(e_i) are Gaussian if b_i ~ G(0, D) and the problem of calculating the
high-dimensional integral in [8.10] is defused. The linearization methods rest on the assump-
tion that the values which maximize the approximate likelihoods

f_y(y) = ∏_{i=1}^{n} f_y(y_i)

f_ỹ(ỹ) = ∏_{i=1}^{n} f_ỹ(ỹ_i)

f_e(e) = ∏_{i=1}^{n} f_e(e_i)
8.4 Applications
We have given linearization methods a fair amount of discussion, although we prefer integral
approximations. Until recently, most statistical software capable of fitting nonlinear mixed
models relied on linearizations. Only a few specialized packages performed integral
approximations. The %nlinmix and %glimmix macros distributed by SAS Institute
(http://ftp.sas.com/techsup/download/stat/) perform the SS and PA linearizations discussed in
§8.3.1 and §8.3.2. The text by Littell et al. (1996) gives numerous examples on their usage.
With Release 8.0 of The SAS® System the nlmixed procedure has become available.
Although it has been used in previous chapters to fit various nonmixed models it was
specifically designed to fit nonlinear and generalized linear mixed models by integral
approximation methods. For models that can be fit by either the %nlinmix/%glimmix macros
or the nlmixed procedure, we prefer the latter. Although integral approximations are
computationally intensive, the nlmixed procedure is highly efficient and converges reliably
and faster (in our experience) than the linearization-based macros. It furthermore allows the
optimization of a general log-likelihood function which opens up the possibility of modeling
mixed models with conditional distributions that are not in the exponential family or not al-
ready coded in the procedure. Among the conditional distributions currently (as of Release
8.01) available in proc nlmixed are the Gaussian, Bernoulli, Binomial, Gamma, Negative Bi-
nomial, and Poisson distribution. Among the limitations of the procedure is the restriction to
one level of random effects nesting. Only one level of clustering is possible but multiple ran-
dom effects at the cluster level are permitted. The random effects distribution is restricted to
Gaussian. Since the basic syntax of the procedure resembles that of proc nlin, the user has to
supply starting values for all parameters and the coding of classification variables can be
tedious (see §8.4.2 and §8.4.3 for applications). Furthermore, there is no support for modeling
within-cluster correlations in the conditional distributions. Since proc mixed provides this
possibility through the repeated statement and the linearization algorithms essentially call a
linear mixed model procedure repeatedly, random effects and within-cluster correlations can
be accommodated in linearization approaches. In that case the %glimmix and %nlinmix macro
should be used. It has been our experience, however, that after modeling the heterogeneity
across clusters through random effects the data do not support further modeling of within-
cluster correlations in many situations. The combination of random effects and a nondiagonal
R matrix in nonlinear mixed models appears to invite convergence troubles. Finally, it should
be noted that the integral [8.10] is that of the marginal likelihood of y, not that of Ky, say.
The nlmixed procedure performs approximate maximum likelihood inference and no REML
alternative is available. A REML approach is conceivable if, for example, one were to
approximate the distribution of u_i = Ky_i,
Open questions are the nature of the conditional distribution if y_i | b_i is distributed in the expo-
nential family, the transformation Ky_i that yields E[Ky_i] = 0, and how to obtain the fixed
effects estimates. Another REML approach would be to assume a distribution for θ and inte-
grate over the distributions of b_i and θ (Wolfinger 2001, personal communication). Only
when the number of fixed effects is large would the difference in bias between REML and
ML estimation likely be noticeable. And with a large number of fixed effects, quadrature
methods are then likely to prove too cumbersome computationally.
For the applications that follow we have chosen a longitudinal study and two designed
experiments. The yellow poplar cumulative tree-bole volume data are longitudinal in nature.
Instead of a temporal metric, the observations were collected along a spatial metric, the tree
bole. In §8.4.1 the cumulative bole volume, a continuous response, is modeled with a non-
linear volume-ratio model and population-averaged vs. tree-specific predictions are
compared. The responses for applications §8.4.2 and §8.4.3 are not Gaussian and not con-
tinuous. In §8.4.2 the poppy count data are revisited and the overdispersion is modeled with a
Poisson/Gaussian mixing model. In contrast to the Poisson/Gamma mixing model of §6.7.8
this model does not permit the analytic derivation of the marginal distribution, and we use a
generalized linear mixed model approach with integral approximation. The Poisson/Gaussian
model can be thought of as an extension of the Poisson generalized linear model to the mixed
model framework and to clustered data. In §8.4.3 we extend the proportional odds model for
ordinal response to the clustered data framework. There we analyze the data from a repeated
measures experiment where the outcome was a visual rating of plant quality.
shape are apparent after adjusting for tree size and plotting the relative cumulative volumes
V_d/V₀ (Figure 8.3).
[Figure 8.2: cumulative volume (ft³, 0 to 250) versus complementary diameter r_ij (0.00 to 1.00), paneled by total height class (12 to 74 ft, 74 to 88 ft, 88 to 95 ft, ...).]
Figure 8.2. Cumulative volume profiles for yellow poplar trees graphed against the comple-
mentary diameter r_ij = 1 − d_ij/max(d_ij). d_ij denotes the cross-sectional bole diameter of tree
i at the jth height of measurement (i = 1, ..., n = 336). Trees are grouped by total tree height
(ft).
[Figure 8.3: relative cumulative volume V_d/V₀ (0.00 to 1.00) versus complementary diameter r_ij, in nine total-height panels: 12 to 74, 74 to 88, 88 to 95, 95 to 99, 99 to 104, 104 to 109, 109 to 115, 115 to 120, and 120 to 139 ft.]
Figure 8.3. Relative cumulative volume V_d/V₀ for 336 yellow poplar trees graphed against
complementary diameter. Trees are grouped by total tree height (ft).
To develop models for V₀ and R_d with the intent to fit the two simultaneously while
accounting for tree-to-tree differences in size and shape of the volume profiles, we start with
a model for the total bole volume V₀. A simple model relating V₀ to easily obtainable tree size
variables is

V_i0 = β₀ + β₁ (D_i²H_i / 1000) + e_i.

For the yellow poplar data this model fits very well (Figure 8.4) although there is some evi-
dence of heteroscedasticity. The variation in total bole volume for small trees is less than that
for larger trees. An ordinary least squares fit yields β̂₀ = 1.0416, β̂₁ = 2.2806, and
R² = 0.99. The regressor was scaled by the factor 1,000 so that the estimates are of similar
magnitude.
Figure 8.4. Simple linear regression of total tree volume V₀ against D²H. The data set to
which this regression is fit contains n = 336 observations, one per tree (R² = 0.99).
To develop a model for the ratio term R_d, one can think of R_d as a mathematical switch-
on function in the terminology of §5.8.6. Since these functions range between 0 and 1 and
switch-on behavior is usually nonlinear, a good place to start the search for a ratio model is
with the cumulative distribution function (cdf) of a continuous random variable. Gregoire and
Schabenberger (1996b) modified the cdf of a Type-I extreme value random variable
where t′ = t/1000. The R_d term is always positive and tends to one as d → 0. The logical
constraints (V_d > 0, V_d ≤ V₀, and V_{d=0} = V₀) any reasonable volume-ratio model must obey
are thus guaranteed. The fixed effects volume-ratio model for the cumulative volume of the
ith tree up to diameter d_j now becomes

V_idj = V_i0 × R_idj = (β₀ + β₁ D_i²H_i/1000) exp{−β₂ t′_ij exp{β₃ t_ij}} + e_ij.   [8.13]
The yellow poplar data are modeled in Gregoire and Schabenberger (1996b) with non-
linear mixed models based on linearization methods and generalized estimating equations.
Here we fit, by quadrature integral approximation in proc nlmixed, the same basic model that
these authors selected as superior from a number of models differing in the type and number
of random effects. Tree-to-tree heterogeneity is accounted for as
variability in size, reflected in variations in total volume, and as variability in shape of the
volume profile. The former calls for inclusion of random tree effects in the total volume com-
ponent V_i0, the latter for random effects in the ratio term. The model selected by Gregoire and
V_idj = V_i0 × R_idj = (β₀ + (β₁ + b_1i) D_i²H_i/1000) exp{−(β₂ + b_2i) t′_ij exp{β₃ t_ij}} + e_ij,

where the b_1i model random slopes in the total-volume equation and the b_2i model the rate of
change and point of inflection in the ratio terms. The variances of these random effects are
denoted σ₁² and σ₂², respectively. The within-cluster errors e_ij are assumed homoscedastic and
uncorrelated Gaussian random variables with mean 0 and variance σ².
The model is fit in proc nlmixed with the statements that follow. The starting values
were chosen as the converged iterates from the REML fit based on linearization. The
conditional distribution f_{y|b}(y|b) is specified in the model statement of the procedure. In
contrast to other procedures in The SAS® System, proc nlmixed uses syntax to denote
distributions akin to our mathematical formalism. The statement V_idj | b ~ G(V_i0 R_idj, σ²) is
translated into model cumv ~ normal(TotV*R,resvar);. The random statement specifies the
distribution of b_i. Since there are two random effects in the model, two means, two
variances, and one covariance must be specified. The statement

random u1 u2 ~ normal([0,0],[varu1,0,varu2]) subject=tn;

is the translation of

b_i ~ G( [0, 0]′, D = [σ₁², 0; 0, σ₂²] ),  i = 1, ..., n;  Cov[b_i, b_j] = 0.
The predict statements calculate predicted values for each observation in the data set. The
first of the two statements evaluates the mean without considering the random effects. This is
the approximate population-average mean response after taking a Taylor series of the model
about E[b]. The second predict statement calculates the cluster-specific predictions.
proc nlmixed data=ypoplar tech=newrap;
parms beta0=0.25 beta1=2.3 beta2=2.87 beta3=6.7 resvar=4.8
      varu1=0.023 varu2=0.245; /* resvar denotes sigma^2, varu1 sigma_1^2, varu2 sigma_2^2 */
X = dbh*dbh*totht/1000;
TotV = beta0 + (beta1+u1)*X;
R = exp(-(beta2+u2)*(t/1000)*exp(beta3*t));
model cumv ~ normal(TotV*R,resvar);
random u1 u2 ~ normal([0,0],[varu1,0,varu2]) subject=tn out=EBlups;
predict (beta0+beta1*X)*exp(-beta2*t/1000*exp(beta3*t)) out=predPA;
predict TotV*R out=predB;
run;
Specifications
Data Set WORK.YPOPLAR
Dependent Variable cumv
Distribution for Dependent Variable Normal
Random Effects u1 u2
Distribution for Random Effects Normal
Subject Variable tn
Optimization Technique Newton-Raphson
Integration Method Adaptive Gaussian Quadrature
Dimensions
Observations Used 6636
Observations Not Used 0
Total Observations 6636
Subjects 336
Max Obs Per Subject 32
Parameters 7
Quadrature Points 1
Parameters
b0 b1 b2 b3 resvar varu1 varu2 NegLogLike
0.25 2.3 2.87 6.7 4.8 0.023 0.245 15535.9783
Iteration History
Iter Calls NegLogLike Diff MaxGrad Slope
1 18 15532.1097 3.868562 9.49243 -7.48218
2 27 15532.0946 0.015093 0.021953 -0.0301
3 36 15532.0946 3.317E-7 2.185E-7 -6.64E-7
NOTE: GCONV convergence criterion satisfied.
Fit Statistics
-2 Log Likelihood 31064
AIC (smaller is better) 31078
AICC (smaller is better) 31078
BIC (smaller is better) 31105
Parameter Estimates
Standard
Parameter Estimate Error DF t Value Pr > |t| Alpha Lower Upper
b0 0.2535 0.1292 334 1.96 0.0506 0.05 -0.00070 0.5078
b1 2.2939 0.01272 334 180.38 <.0001 0.05 2.2689 2.3189
b2 2.7529 0.06336 334 43.45 <.0001 0.05 2.6282 2.8775
b3 6.7480 0.02237 334 301.69 <.0001 0.05 6.7040 6.7920
resvar 4.9455 0.08923 334 55.42 <.0001 0.05 4.7700 5.1211
varu1 0.02292 0.00214 334 10.69 <.0001 0.05 0.0187 0.0271
varu2 0.2302 0.02334 334 9.86 <.0001 0.05 0.1843 0.2761
Since the starting values are the converged values of a linearization followed by REML
estimation, the integral approximation method converges after only three iterations
(Output 8.1). The estimates of the fixed effects are β̂ = [0.2535, 2.2939, 2.7529, 6.7480]′ and
the variance component estimates are σ̂² = 4.9455, σ̂₁² = 0.02292, and σ̂₂² = 0.2302. Notice that the
degrees of freedom equal the number of clusters minus the number of random effects in the
model (apart from e_ij). The asymptotic 95% confidence intervals for the variances of the ran-
dom effects do not include zero, and based on this evidence one would conclude that the in-
clusion of the random effects improved the model fit. A better test can be obtained by fitting
the models with no random effects or only one random effect and comparing minus twice the
log likelihoods (Table 8.3).
Table 8.3. Minus twice log likelihoods for various models fit to the yellow poplar data
(models differ only in the number of random effects)
Model   Random Effects   −2 Log Likelihood
①       b₁ and b₂        31,064 (Output 8.1)
②       b₁               35,983
③       b₂               39,181
④       none             43,402 (Output 8.2)
The model with two random effects has the smallest value for minus twice the log likeli-
hood and is a significant improvement over any of the other models. Note that ④ is the
purely fixed effects model which fits only a population-average curve and does not take into
account any clustering. Fit statistics and parameter estimates for model ④ are shown in
Output 8.2. Since this model (incorrectly) assumes that all observations are independent, its
degrees of freedom are no longer equal to the number of clusters minus the number of covar-
iance parameters. The estimate of the residual variation is considerably larger than in the ran-
dom effects model ①. Residuals are measured against the population average in model ④
and against the tree-specific predictions in model ①.
Output 8.2.
Fit Statistics
Parameter Estimates
Standard
Parameter Estimate Error DF t Value Pr > |t|
We selected four trees (5, 151, 279, and 308) from the data set to show the difference
between the population-averaged and cluster-specific predictions (Figure 8.5). The trees vary
appreciably in size and total volume. The population average fits fairly well to tree #279 and
the lower part of the bole of tree #151. For the medium to large sized tree #5 the PA predic-
tions overestimate the cumulative volume in the tree bole. For the large tree #308, the popula-
tion average overestimates the volume in the lower parts of the tree bole where most of the
valuable timber is accrued. Except for the smallest tree, the tree-specific predictions provide
an excellent fit to the data. An operator that processes high-grade timber in a sawmill where
adjustments of the cutting tools on a tree-by-tree basis are feasible, would use the tree-
specific cumulative volume profiles to maximize the output of high-quality lumber. If
adjustments to the saws on an individual tree basis are not economically feasible, because the
timber is of lesser quality, for example, one can use the population-average profiles to
determine the settings.
Figure 8.5. Population-averaged (dashed line) and cluster-specific (solid line) predictions of
cumulative volume (ft³) for four of the 336 trees. Panel headings are the tree identifiers:
5 (D = 23 in, H = 108 ft, V0 = 125.8 ft³), 151 (D = 18.9 in, H = 91 ft, V0 = 73.64 ft³),
279 (D = 7 in, H = 26 ft, V0 = 7.59 ft³), 308 (D = 23.2 in, H = 134 ft, V0 = 166.8 ft³).
where τ_i denotes treatment and ρ_j block effects, showed considerable overdispersion. The
overdispersion problem was tackled there by assuming that Y_ij, the poppy count for treatment
i in block j, was not a Poisson random variable with mean λ_ij = exp{η_ij}, but that λ_ij was a
Gamma random variable. The conditional distribution Y_ij | λ_ij was modeled as a Poisson(λ_ij)
random variable. This construction allowed the analytic derivation of the marginal probability
mass function,
which turned out to follow the Negative Binomial law. Since this distribution is a member of
the exponential family (Table 6.2), the resulting model could be fit as a generalized linear
model. We used proc nlmixed to estimate the parameters of the model, not because this was a
mixed model, but because of a problem associated with the dist=negbin option of proc
genmod (in the SAS® release we used; the problem has since been corrected) and in anticipa-
tion of fitting the model that follows.
An alternative approach to the Poisson/Gamma mixing procedure is to assume that the
linear predictor is a linear mixed model,

η_ij = μ + τ_i + ρ_j + d_ij,    [8.15]

where the d_ij are independent Gaussian random variables with mean 0 and variance σ_d².
These additional random variables introduce extra variability into the system associated with
the experimental units. Conditional on d_ij, the poppy counts are again modeled as Poisson
random variables. The marginal distribution of the counts in the Poisson/Gaussian mixing
model is elusive, however; the integral [8.14] cannot be evaluated in closed form. It can be
approximated by the methods of §8.3.3. This generalized linear mixed model becomes

Y_ij | λ_ij ~ Poisson(λ_ij)
λ_ij = exp{μ + τ_i + ρ_j + d_ij}
d_ij ~ G(0, σ²).
The proc nlmixed code to fit this model is somewhat lengthy, because treatment and
block effects are classification variables and must be coded inside the procedure. The block
of if .. else .. statements in the code below sets up the linear predictor for the various
combinations of block and treatment effects. The last level of either factor is set to zero and
its effect is absorbed into the intercept. This parameterization coincides with that of proc
genmod. The variance of the d_ij is not estimated directly, because σ² is bounded from below
by zero. Instead, we estimate the logarithm of the standard deviation, which can range over
the whole real line (parameter logsig).
proc nlmixed data=poppies df=14;
  parameters intcpt=3.4 bl1=0.3 bl2=0.3 bl3=0.3
             tA=1.5 tB=1.5 tC=1.5 tD=1.5 tE=1.5
             logsig=0;
  if block=1 then linp = intcpt + bl1;
  else if block=2 then linp = intcpt + bl2;
  else if block=3 then linp = intcpt + bl3;
  else if block=4 then linp = intcpt;
  if treatment = 'A' then linp = linp + tA;
  else if treatment = 'B' then linp = linp + tB;
  else if treatment = 'C' then linp = linp + tC;
  else if treatment = 'D' then linp = linp + tD;
  else if treatment = 'E' then linp = linp + tE;
  lambda = exp(linp + d);                          /* conditional Poisson mean */
  model count ~ poisson(lambda);
  random d ~ normal(0,exp(2*logsig)) subject=plot;
  estimate 'lsmean A' intcpt + tA + (bl1+bl2+bl3)/4; /* analogous statements for B-F */
  estimate 'sigma2' exp(2*logsig);
run;
The statement lambda = exp(linp + d); calculates the conditional Poisson mean λ_ij.
Notice that the random effect d does not appear in the parameters statement; only the disper-
sion of the d_ij is a parameter of the model. The model statement informs the procedure that
the counts are modeled (conditionally) as Poisson random variables and the random statement
determines the distribution of the random effects. Only the normal() keyword can be used in
the random statement. The first argument of normal() defines the mean of the d_ij, the second
the variance σ². Since we are also interested in the estimate of σ², this value is calculated in
the last of the estimate statements. The other estimate statements calculate the "least squares"
means of the treatments on the log scale, averaged across the random effects. The
subject=plot statement identifies the experimental unit as the cluster, which yields a model
with a single observation per cluster (Dimensions table in Output 8.3). The degrees of
freedom were set to coincide with the Negative Binomial analysis in §6.7.8. The Poisson
model without overdispersion had 15 deviance degrees of freedom. The Negative Binomial
model estimated one additional parameter (labeled k there). Similarly, the Poisson/Gaussian
model adds one parameter, the variance of the d_ij.
The initial negative log likelihood of 145.18 calculated from the starting values improved
during the 21 iterations that followed. The converged negative log likelihood was 116.89. The
important question is whether the addition of the random variables d_ij to the linear predictor
improved the model over the standard Poisson generalized linear model. If H₀: σ² = 0 can be
rejected, the Poisson/Gaussian model is superior. From the result of the last estimate state-
ment it is seen that the approximate 95% confidence interval for σ² is [0.0260, 0.1582], and
one would conclude that there is extra variation among the experimental units beyond that
accounted for by the Poisson law. A better approach is to fit a Poisson model with linear pre-
dictor η_ij = μ + τ_i + ρ_j and to compare the log likelihoods of the two models with a likeli-
hood ratio test. For the reduced model one obtains a negative log likelihood of 204.978. The
likelihood ratio test statistic Λ = 409.95 − 233.8 = 176.15 is highly significant.
To compare this model against the Negative Binomial model in §6.7.8 a likelihood ratio
test cannot be employed because the Poisson/Gamma and the Poisson/Gaussian models are
not nested. For the comparison among non-nested models AIC can be used. The information
criteria for the two models are very close (Poisson/Gamma: AIC = 253.1, Poisson/Gaussian:
AIC = 253.8). Note that since both models have the same number of parameters, the
difference of their AIC values equals twice the difference of their log likelihoods. From a
statistical point of view either model may be chosen.
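As a check on the bookkeeping, AIC adds twice the number of parameters to minus twice the log likelihood; with the converged value above and the 10 parameters reported in Output 8.3 (a worked equation):

AIC = 2 × 116.89 + 2 × 10 ≈ 253.8.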
Output 8.3.
The NLMIXED Procedure
Specifications
Data Set WORK.POPPIES
Dependent Variable count
Distribution for Dependent Variable Poisson
Random Effects d
Distribution for Random Effects Normal
Subject Variable plot
Optimization Technique Dual Quasi-Newton
Integration Method Adaptive Gaussian Quadrature
Dimensions
Observations Used 24
Observations Not Used 0
Total Observations 24
Subjects 24
Max Obs Per Subject 1
Parameters 10
Quadrature Points 1
Parameters
intcpt bl1 bl2 bl3 tA tB tC tD tE logsig NegLogLike
3.4 0.3 0.3 0.3 1.5 1.5 1.5 1.5 1.5 0 145.186427
Iteration History, Fit Statistics, and Parameter Estimates
(tables omitted)
Output 8.4. Estimates (Est) and standard errors (StdErr) by model: Poisson GLM (Est¹,
StdErr⁴), Poisson with multiplicative overdispersion (StdErr⁵), Poisson/Gaussian (Est²,
StdErr⁶), and Poisson/Gamma (Est³, StdErr⁷). (Table values omitted.)
A multiplicative overdispersion factor does not alter the estimates, only their precision.
Comparing columns 4 and 5 of Output 8.4 shows the extent to which the regular GLM over-
states the precision of the estimates. Estimates of the parameters as well as the treatment
means (on the log scale) are close for all methods. The standard errors of the two mixing
models are also very close.
An interesting aspect of the generalized linear mixed model is the ability to predict the
cluster-specific responses. Since each experimental unit serves as a cluster (of size one), this
corresponds to predicting the plot-specific poppy counts. If all effects in the model were
fixed, the term d_ij would represent the block × treatment interaction. The model would be
saturated with a deviance of zero and predicted counts would coincide with the observed
counts. In the generalized linear mixed model the d_ij are random variables and only a single
degree of freedom is lost to the estimation of their variance (compared to 3 × 5 = 15 degrees of
freedom for a fixed effects interaction). After calculating the predictors d̂_ij of the random
effects, cluster-specific predictions of the counts are obtained as

Ŷ_ij | d_ij = exp{μ̂ + τ̂_i + ρ̂_j + d̂_ij},

by adding a predict statement to the code above (a sketch follows).
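A minimal sketch of that statement (the output data set name cspred is hypothetical):

predict exp(linp + d) out=cspred;   /* cluster-specific: includes the predicted random effect d */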
Taking a Taylor series of exp{μ + τ_i + ρ_j + d_ij} about E[d_ij] = 0, the marginal
average E[Y_ij] can be approximated as exp{μ + τ_i + ρ_j} and estimated as
exp{μ̂ + τ̂_i + ρ̂_j}. These PA predictions can be obtained with the nlmixed statement
predict exp(linp) out=papred;
Figure 8.6. Predicted vs. observed poppy counts in the Poisson/Gaussian and Poisson/Gamma
(= Negative Binomial) mixing models. The points for treatment A and treatment B in block 3
are labeled in the graph.
The population-averaged and cluster-specific predictions are plotted against the observed
poppy counts in Figure 8.6. The cluster-specific predictions for the Poisson/Gaussian model
are very close to the 45° line, but the model is not saturated; the predictions do not reproduce
the data. There are still fourteen degrees of freedom left! The Negative Binomial model based
on Poisson/Gamma mixing does not provide the opportunity to predict poppy counts on the
plot level because the conditional distribution is not involved at any stage of estimation. The
PA predictions of the two mixing models are very similar, as is expected from the agreement
of their parameter estimates (Output 8.4). The predicted values for treatments A and B in
block 3 do not concur well with the observed counts, however. There appears to be block ×
treatment interaction, which is even more evident when plotting the PA predicted counts by
treatments against blocks (Figure 8.7).
Figure 8.7. Population-averaged predicted counts by treatment plotted against blocks
(series labels A, C/D, and E/F are visible in the graph).
It appears that the probability of observing a low rating decreases over time and the proba-
bility of excellent turf quality appears to be largest in July. Varietal differences in the rating
distributions seem to be minor. To confirm the presence or absence of varietal effects, trends
over time, and possibly variety × time interactions, a proportional odds model containing
these effects is fit. A standard model ignoring the possibility of correlations over time can be
fit with the genmod procedure in The SAS® System (see §§6.5.2, 6.7.4, and 6.7.5 for
additional code examples):
data counts;
input rating $ variety month count;
datalines;
low 1 5 4
med 1 5 10
xce 1 5 4
low 1 7 1
med 1 7 9
xce 1 7 8
med 1 9 12
xce 1 9 6
and so forth ...
;;
run;
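A set of genmod statements along the following lines fits the model described next (a sketch; the class coding and options are assumptions consistent with the Model Information and Type 3 analysis in Output 8.5):

proc genmod data=counts;
  class variety;
  freq count;
  model rating = variety month month*month variety*month /
        dist=multinomial link=cumlogit type3;
run;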
This model incorporates both linear and quadratic time effects because the data in Table 8.4
suggest that the rating distributions in July may be different from those in May or September.
The term variety*month models differences in the linear slopes among the varieties.
The fit achieves a −2 log likelihood of 483.12 (Output 8.5). From the LR Statistics
For Type 3 Analysis table it is seen that only the linear and quadratic effects in time appear to
be significant. Varietal differences in the rating distributions appear to be absent
(p = 0.7299) and trends over time appear not to differ among the five varieties (p = 0.7810).
Output 8.5.
The GENMOD Procedure
Model Information
Data Set WORK.COUNTS
Distribution Multinomial
Link Function Cumulative Logit
Dependent Variable rating
Frequency Weight Variable count
Observations Used 42
Sum Of Frequency Weights 264
Probabilities Modeled Pr( Low Ordered Values of rating )
Response Profile
Ordered Ordered
Level Value Count
1 low 36
2 med 137
3 xce 91
Since this model does not account for correlations over time, it is difficult to say whether
these findings persist when the temporal correlations are incorporated in a model, in particular
because modeling the correlations through random effects will change not only the standard
error estimates but also the estimates of the model coefficients themselves. Positive autocorre-
lation leads to overdispersed data, and one approach to remedy the situation is to formulate a
mixed proportional odds model where, given some random effects, the data follow a POM,
and to perform maximum likelihood inference based on the marginal distribution of the data.
This indirect approach of modeling correlations (see §7.5.1 for the distinction between direct
and induced correlation models) is reasonable in models for correlated data where the mean
function is nonlinear.
Using the same regressors and fixed effects as in the previous fit, we now add a random
effect that models the plot-to-plot variability. This is reasonable since treatments have been
assigned at random to plots, extra variation is likely to be related to excess variation among
the experimental units, and the plots have been remeasured (they are the clusters). This fit
is obtained with the nlmixed procedure in SAS®. We note in passing that because proc
nlmixed uses an integral approximation based on quadrature, this modeling approach is
identical to the one put forth by Jansen (1990) for ordinal data with overdispersion and
Hedeker and Gibbons (1994) for clustered ordinal data.
The data must be set up differently for the nlmixed operation, however. The counts data
set used with proc genmod lists for all varieties and months the number of plots that were
assigned a particular rating (variable count). The data set CountProfiles used to fit the
mixed-model version of the POM contains the number of response profiles over time. The
first three observations show one unique response profile for variety 1: a low rating in May
was followed by two medium ratings in July and September. Two of the 18 plots for this
variety exhibited that particular response profile (variable count). The remaining triplets of
observations in the data set CountProfiles give the response profiles for this and the other
varieties. The sub variable identifies the clusters for this study, corresponding to the plots. It
works in conjunction with the replicate statement of proc nlmixed. The first triplet of
observations is identified as belonging to the same plot (cluster) and the value of the count
variable determines that there are two plots (experimental units) with this response profile.
data CountProfiles;
label rating = '1=low, 2=medium, 3=excellent';
input rating variety month sub count;
datalines;
1 1 5 1 2
2 1 7 1 2
2 1 9 1 2
1 1 5 2 1
2 1 7 2 1
3 1 9 2 1
1 1 5 3 1
3 1 7 3 1
2 1 9 3 1
and so forth ...
;;
run;
proc nlmixed data=CountProfiles;
parms i1=7.14 i2=9.900 /* cutoffs */
v1=1.57 v2=1.26 v3=0.08 v4=1.783 /* variety effects */
m=-2.85 m2=0.197 /* month and month^2 slope */
mv1=-0.15 mv2=-0.17 mv3=0.08 mv4=-0.03 /* Variety spec. slopes */
sd=1; /* standard deviation of random plot errors */
The block of if .. else .. statements sets up the linear predictor apart from the two
cutoffs (parameters i1 and i2) needed to model a three-category ordinal response and the
random plot effect. The latter is added in the linp = linp + ploterror; statement. The
second block of if .. then .. else .. statements calculates the category probabilities
from which the multinomial log likelihood is built. Should a category probability be very
small, a log-likelihood contribution of −10¹⁰⁰ is assigned to avoid computational inaccuracies
when taking the logarithm of a quantity close to zero. The random statement models the plot
errors as Gaussian random variables with mean zero and variance σ² = sd*sd. In the vernac-
ular of mixing models, this is a Multinomial/Gaussian model. The replicate statement
identifies the variable in the data set which indicates the number of response profiles for a
particular variety. This statement must not be confused with the repeated statement of the
mixed procedure. As starting values for proc nlmixed the estimates from Output 8.5 were
chosen; the starting value for the standard deviation of the random plot errors was guessed.
A sketch of these statements follows.
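The programming statements continuing the parms statement above might read as follows (a sketch; the variable names p1, p2, p3, cum2, p, and ll, and the cumulative-logit orientation matching the genmod coding, are assumptions based on the description above):

  if variety=1 then linp = v1 + (m+mv1)*month;
  else if variety=2 then linp = v2 + (m+mv2)*month;
  else if variety=3 then linp = v3 + (m+mv3)*month;
  else if variety=4 then linp = v4 + (m+mv4)*month;
  else linp = m*month;                    /* variety 5 is the benchmark */
  linp = linp + m2*month*month;           /* quadratic time effect */
  linp = linp + ploterror;                /* random plot effect */
  p1 = 1/(1+exp(-(i1+linp)));             /* Pr(low) */
  cum2 = 1/(1+exp(-(i2+linp)));           /* Pr(at most medium) */
  p2 = cum2 - p1;                         /* Pr(medium) */
  p3 = 1 - cum2;                          /* Pr(excellent) */
  if rating=1 then p = p1;
  else if rating=2 then p = p2;
  else p = p3;
  if p > 1e-12 then ll = log(p);
  else ll = -1e100;                       /* guard against log(0) */
  model rating ~ general(ll);
  random ploterror ~ normal(0, sd*sd) subject=sub;
  replicate count;
run;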
The Dimensions table shows that the data have been set up properly (Output 8.6).
Although there are three observations in each response profile, the replicate statement uses
only the last observation in each profile to determine the number of plots that have the partic-
ular profile. The number of clusters is correctly determined as 88 and the number of repeated
measurements as three (Max Obs Per Subject). The adaptive quadrature determined that three
quadrature points provided sufficient accuracy in the integration problem.
The procedure required thirty-four iterations until further updates did not provide an im-
provement in the log likelihood. The −2 log likelihood at convergence of 456.0 is consider-
ably less than that of the independence model (483.1, Output 8.5). The difference of 27.1 is
highly significant (Pr(χ²₁ ≥ 27.1) < 0.0001), an improvement over the independence model
brought about only by the inclusion of the random plot errors.
From the 95% confidence bounds on the parameters it is seen that the linear and quadrat-
ic time effects (m and m2) are significant; their bounds do not include zero. The confidence
interval for the standard deviation of the plot errors also does not include zero, supporting the
finding obtained by the likelihood ratio test that the inclusion of the random plot errors is a
significant improvement of the model.
Output 8.6.
The NLMIXED Procedure
Specifications
Data Set WORK.COUNTPROFILES
Dependent Variable rating
Distribution for Dependent Variable General
Random Effects ploterror
Distribution for Random Effects Normal
Subject Variable sub
Replicate Variable count
Optimization Technique Dual Quasi-Newton
Integration Method Adaptive Gaussian Quadrature
Dimensions
Observations Used 129
Observations Not Used 0
Total Observations 129
Subjects 88
Max Obs Per Subject 3
Parameters 13
Quadrature Points 3
Parameters
i1 i2 v1 v2 v3 v4 m
7.14 9.9 1.57 1.26 0.08 1.783 -2.85
m2 mv1 mv2 mv3 mv4 sd NegLogLike
0.197 -0.15 -0.17 0.08 -0.03 1 233.22016
Iteration History
Iter Calls NegLogLike Diff MaxGrad Slope
1 5 233.216559 0.003601 35.97236 -213.078
2 8 232.974984 0.241575 197.669 -0.05738
3 10 232.018721 0.956263 11.48055 -2.90352
...
33 66 227.995816 0.000395 0.018264 -0.00075
34 68 227.995816 1.656E-7 0.003131 -3.05E-7
Fit Statistics
-2 Log Likelihood 456.0
AIC (smaller is better) 482.0
AICC (smaller is better) 485.2
BIC (smaller is better) 514.2
Parameter Estimates
(values omitted)
The confidence bounds for the varieties and the variety × month interaction terms
include zero, and at first glance one would conclude that there are no varietal effects at work.
Because of the coding of the classification variables, v1, for example, does not measure the
intercept for variety 1, but rather the difference between the intercepts of varieties 1 and 5. To
test the significance of various effects we consider the model whose output is shown in Output
8.6 as the full model and fit various reduced versions of it (Table 8.5).
The likelihood ratio test statistics (Λ) and p-values represent comparisons to the full
model ①. Removing the variety × month interaction from the model does not significantly
impair the fit (p = 0.508), but removing any other combination of effects in addition to the
interaction does worsen the model. Based on these results one could adopt model ② as the
new full model. Since variety effects have not been removed by themselves, one can test their
significance by comparing the −2 log likelihoods of models ② and ⑤. The p-value is calcu-
lated as

p = Pr(χ²₄ ≥ 471.7 − 459.3) = Pr(χ²₄ ≥ 12.4) = 0.014.
Table 8.5. −2 Log likelihoods for various mixed models fitted to the repeated
measures turf ratings (all models contain a random plot effect)

        Fixed Effects
Model   included                     dropped                   −2logL   df†    Λ      p
①       Variety, t, t², Variety × t                            456.0
②       Variety, t, t²               Variety × t               459.3    4      3.3    0.508
③       Variety, t                   Variety × t, t²           472.8    5      16.8   0.005
④       Variety                      Variety × t, t, t²        475.7    6      19.7   0.003
⑤       t, t²                        Variety, Variety × t      471.7    8      15.7   0.046
†: df denotes the degrees of freedom dropped compared to the full model ①.
Output 8.7.
Parameter Estimates
(values omitted)
The model finally selected is ② and its parameter estimates are shown in Output 8.7.
From these estimates the variety-specific probability distributions over time can be calculated
(Figure 8.8). Perhaps surprisingly, the drop in low-rating probabilities is less striking than
appears in Table 8.4. Except for variety 4, the probability of receiving a low rating remains
roughly constant throughout the three-month period. Excellent ratings are most common
around July, but only for varieties 2 and 5 is excellent turf quality in that period more likely
than medium quality.
Figure 8.8. Estimated probabilities of low, medium, and excellent ratings as a function of
month (5 through 9) for the five varieties.
Figure 8.9. Logits of the cumulative probabilities (low rating; at most medium rating) as a
function of month (5 through 9) for the five varieties.
The linear portion of the model is best interpreted on the logit scale (Figure 8.9). The
final model contains varietal differences and linear and quadratic time effects. Variety 4 in
particular has elevated intercepts compared to the other entries. From Output 8.7 we see that
the confidence interval [0.6365, 3.341] for coefficient v4 does not contain zero. Since variety
5 is the benchmark, this implies that varieties 4 and 5 are significantly different in the
elevation of the lines in Figure 8.9. The statements to fit the selected model with proc
nlmixed and to obtain pairwise comparisons of the variety effects (intercepts) follow below.
Results of the pairwise comparisons are shown in Output 8.8. At the 5% significance level
variety 4 is significantly different in the rating probability distributions from varieties 1, 2,
and 5 (p = 0.0284, 0.0027, and 0.0044, respectively).
linp = vv{variety};
linp = linp + m*month + m2*month*month+ ploterror;
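The two statements above construct the linear predictor from an array vv of variety intercepts. A fuller sketch of the statements they belong to (the rounded starting values, the coding of the benchmark variety 5, and the estimate statements are assumptions patterned on the earlier fit):

proc nlmixed data=CountProfiles;
  parms i1=7 i2=10 v1=1.5 v2=1.2 v3=0.1 v4=1.8
        m=-2.8 m2=0.2 sd=1;
  array vv{5} v1 v2 v3 v4 v5;
  v5 = 0;                                /* variety 5 is the benchmark */
  linp = vv{variety};
  linp = linp + m*month + m2*month*month + ploterror;
  p1 = 1/(1+exp(-(i1+linp)));            /* Pr(low) */
  cum2 = 1/(1+exp(-(i2+linp)));          /* Pr(at most medium) */
  p2 = cum2 - p1;
  p3 = 1 - cum2;
  if rating=1 then p = p1;
  else if rating=2 then p = p2;
  else p = p3;
  if p > 1e-12 then ll = log(p); else ll = -1e100;
  model rating ~ general(ll);
  random ploterror ~ normal(0, sd*sd) subject=sub;
  replicate count;
  estimate 'v1 - v2' v1 - v2;            /* pairwise variety comparisons */
  estimate 'v1 - v3' v1 - v3;
  estimate 'v1 - v4' v1 - v4;
  estimate 'v2 - v3' v2 - v3;
  estimate 'v2 - v4' v2 - v4;
  estimate 'v3 - v4' v3 - v4;
run;

The comparisons with variety 5 itself are read directly from the Parameter Estimates table, since v1 through v4 are already differences from the benchmark.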
Output 8.8.
Additional Estimates
(values omitted)
Figure 9.1. Univariate (a) and bivariate (b) random variables and the realization of a
stochastic process in ℝ² (c). In panels (a) and (b) the graph represents the distribution
(process) from which a single realization is drawn (a single realization in (a): y = 1.43;
in (b): y₁ = 1.6, y₂ = 0.5). In panel (c) the graph is the realization.
• When a random field is sampled, samples are drawn from one particular
realization of the random experiment.
Consider Z(s) as a function of the spatial coordinates s that are elements of a set D, which
we call the domain. For now assume that D is a continuous set. To incorporate stochastic
behavior (randomness), Z(s) is considered the realization of a random experiment. To make
the dependence on the random experiment explicit we use the notation Z(s, ω) for the time
being, where ω is the outcome of a particular experiment. Hence, we are really concerned
with a function of two variables, s and ω. This is called a random function because the surface
Figure 9.2. Four realizations of a random function in ℝ². The domain D is continuous and
given by the rectangle (0, 5) × (0, 5). Realizations at s₀ = [2, 3] are shown as dots.
denotes a spatial random field with a two-dimensional domain. The attribute of interest, Z, is
a stochastic process with domain (or index set) D which itself is a subset of ℝ². When we
have in mind the random variable at s, we use Z(s) and denote its realization as z(s). The
vector of all observations is denoted Z(s) = [Z(s₁), …, Z(sₙ)]′. Definition [9.1] is quite
abstract but it can be fleshed out by considering various types of spatial data.
• The domain D is fixed and continuous for geostatistical data, fixed and
discrete for lattice data, and a random set for point pattern data.
• The scientific questions raised differ substantially among the three data
types and specific tools have been developed to address these questions.
Some of the tools are transitive in that they can be applied to any of the data
types; others are particular to a specific spatial data structure.
Many practitioners associate spatial data analysis with terms like geostatistics and methods
such as kriging. Geostatistical data is only one of many spatial data types, which can be de-
fined through the domain D of the random field [9.1]. In the case of geostatistical data the do-
main is a fixed, continuous set; the number of locations at which observations can be made is
not countable. Between any two sample locations sᵢ and sⱼ an infinite number of additional
samples can be placed in theory. Furthermore, there is nothing random about the locations
themselves. Examples of geostatistical data are measuring the electrical conductivity of soil,
yield monitoring a field, and sampling the ore grade of a rock formation. Figure 9.3 (left
panel) shows 72 locations on a shooting range at which the lead concentration was measured.
The shooter location is at coordinate x = 100, y = 0.
Because of the continuity of D, geostatistical data is also referred to as spatial data with
continuous variation. This does not imply that the attribute Z is continuous; the nature of the
attribute, discrete or continuous, is a separate matter from the nature of the domain.
Figure 9.3. Left panel: sampling locations on shooting range at which lead concentrations
were measured. Right panel: Sample grid for collecting wheat kernels in a field with wheat
scab. Lead data kindly provided by Dr. James R. Craig and Dr. Donald Rimstidt, Dept. of
Geological Sciences, Virginia Polytechnic Institute and State University.
Figure 9.4. Sudden infant deaths (SIDs) in North Carolina 1974-1978. Shown are the
number of SIDs relative to the number of live births. These data are included with
S+SpatialStats® .
Figure 9.5. Grain yield data of Mercer and Hall (1911) at Rothamsted Experimental Station.
The area of the squares in each lattice cell is proportional to the grain yield. From Table 6.1
of Andrews and Herzberg (1985). Used with permission.
Geostatistical and lattice data have in common that the domain is fixed, not random. To
develop the idea of a random domain, consider Z(s) to be crop yield and define an indicator
variable U(s) which takes on the value 1 if the yield is below some threshold level c and 0
otherwise,

U(s) = 1   if Z(s) < c
     = 0   otherwise.

If Z(s) is geostatistical, so is U(s). The random function U(s) now returns the values 1 and 0
instead of the continuous attribute yield.
Figure 9.6. Locations in a corn field where yield per unit area is less than some threshold
value.
Now imagine throwing away all the points where the yield is above the threshold and retain-
ing only those locations for which U(s) = 1. Define a new domain D* which consists of
those points where Z(s) < c. Since Z(s) is random, so is D*. We have replaced the attribute
Z(s) (and U(s)) with a degenerate random variable whose domain D* consists of the loca-
tions at which we observe the event of interest, and the focus has switched from studying the
attribute itself to studying the locations (Figure 9.6). Such processes are termed point
processes or point patterns.
Figure 9.7. Spatial distribution of a group of red cockaded woodpeckers in the Fort Bragg
area of North Carolina. Data kindly made available by Dr. Jeff Walters, Dept. of Biology,
Virginia Polytechnic Institute and State University. Used with permission.
Most point patterns do not arise through this transformation of an underlying random
function; they are observed directly. The emergence of plants and the distribution of seeds
are cases in point.
Stationarity in simple terms means that the random field looks similar in different parts of the
domain D; it replicates itself. Consider two observations Z(s) and Z(s + h). The vector h
is a displacement by which we move from location s to location u = s + h; it is referred to as
the lag vector (or lag for short). If the random field is self-replicating, the stochastic proper-
ties of Z(s) and Z(s + h) should be similar. For example, to estimate the covariance between
locations distance h apart, we might consider all pairs (Z(sᵢ), Z(sᵢ + h)) in the estimation
process, regardless of where sᵢ is located. Stationarity is the absence of an origin; the spatial
process has reached a state of equilibrium. Stationarity assumptions are also made for time
series data. There it means that it does not matter when a time shift is considered in terms of
absolute time, only how large the time shift is. In a stationary time series one can talk about a
difference of two days without worrying that the first occasion was a Saturday. In the spatial
context stationarity means the lack of importance of absolute coordinates. There are different
degrees of stationarity, however, and before we can make the various definitions more precise
a few comments are in order.
Since Z(s) is a random variable, it has moments and a distribution. For example,
E[Z(s)] = μ(s) is the mean of Z(s) at location s and Var[Z(s)] is its variance. Analysts and
practitioners often refer to Gaussian random fields, but some care is necessary to be clear
about what is assumed to follow the Gaussian law. A Gaussian random field is defined as a
random function whose finite-dimensional distributions are multivariate Gaussian. That is,
the cumulative distribution function

Pr(Z(s₁) ≤ z₁, …, Z(s_k) ≤ z_k)    [9.2]

is that of a k-variate Gaussian distribution for all k. By the properties of the multivariate
Gaussian distribution (§3.7) this implies that any Z(sᵢ) is a univariate Gaussian random var-
iable. The reverse is not true: that each Z(sᵢ) is Gaussian does not imply that [9.2] is a multi-
variate Gaussian probability. Chilès and Delfiner (1999, p. 17) point out that this leap of faith
is sometimes made. The spatial distribution of Z(s) is defined by the multivariate cumulative
distribution function [9.2], not the marginal distribution of Z(s).
meaning that the spatial distribution is invariant under translation of the coordinates by the
vector h: the random field repeats itself throughout the domain. Geometrically, the spatial
distribution is unchanged when the coordinate system is shifted; invariance under rotation or
stretching of the coordinate system is a separate, stronger requirement. As the name suggests,
strong stationarity is a very strict condition, more restrictive than what is required for many of
the statistical methods that follow. Two important versions of stationarity, second-order
(weak) and intrinsic stationarity, are defined in terms of moments of Z(s).
A random field is second-order stationary if E[Z(s)] = μ and Cov[Z(s), Z(s + h)] =
C(h). The first assumption states that the mean of the random field is constant and does not
depend on location. The second assumption states that the covariance between two observa-
tions is only a function of their spatial separation. The function C(h) is called the covari-
ance function or the covariogram of the spatial process. If a random field is strictly station-
ary it is also second-order stationary, but the reverse is not necessarily true. The reasons are
similar to those that disallow inferring the distribution of a random variable from its mean and
variance alone. An exception is the Gaussian case, where the first two moments determine the
distribution (just as zero covariance implies independence for Gaussian random variables).
Similarly for random fields: if a Gaussian random field is second-order stationary it is also
strictly stationary.
Imagine that we wish to estimate the covariance function C(h) in a second-order
stationary process. For a lag vector h = [35.35, 35.35]′, all pairs of points that are exactly
distance h apart can be utilized (Figure 9.8).
Figure 9.8. The notion of second-order stationarity. Pairs of points separated by the same lag
vector (here, h = [35.35, 35.35]′) provide built-in replication to assess the spatial depen-
dency for the particular choice of h.
While stationarity reflects the lack of importance of absolute coordinates, the direction in
which the lag vector h is assessed still plays an important role. We could not combine the
pairs of observations in Figure 9.8 with pairs whose lag vector is h = [35.35, −35.35]′; these
are separated by the same Euclidean distance but oriented in a different direction.
The Euclidean distance is defined as follows. Let s = [x, y]′, where x and y are the longitude
and latitude, respectively. The Euclidean distance (Figure 9.9) between s₁ and s₂ is then
||s₁ − s₂|| = √((x₁ − x₂)² + (y₁ − y₂)²). Random fields that are stationary but not isotropic
are called anisotropic.
Figure 9.9. Euclidean distance between points A = (x = 2, y = 2) and C = (9, 8):
||s_A − s_C|| = √((2 − 9)² + (2 − 8)²) = 9.219. The Euclidean distance between A and
B = (2, 10) and between A and D = (10, 2) is 8.
Note that we have distinguished between the covariogram C*(h) of the second-order
stationary random field and the isotropic covariogram C(||h||), since these are different
functions. In the sequel we will often refer only to C(·); whether the function depends on h
or on ||h|| is sufficient to distinguish the general from the isotropic case.
In a process with an isotropic covariance function it does not matter how the lag vector
between pairs of points is oriented, only that the Euclidean distance between pairs of points is
the same (Figure 9.10). To visualize the difference between second-order stationary random
fields with isotropic and anisotropic covariance functions, realizations were simulated with
proc sim2d in The SAS® System. The statements (sketched below) generate two sets of
spatial data on a 10 × 10 grid. The first simulate statement generates a random field in which
data points are correlated up to Euclidean distance 5√3 = 8.66 in the East-West direction,
but correlated up to a much smaller distance in the North-South direction.
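A sketch of such statements (the seeds, scale, and anisotropy ratio are illustrative assumptions; with form=gauss, range=5 corresponds to a practical range of 5√3 = 8.66, and the angle= and ratio= options induce geometric anisotropy):

proc sim2d outsim=fields;
  grid x=1 to 10 by 1 y=1 to 10 by 1;
  mean 0;
  /* anisotropic field: long correlation East-West, much shorter North-South */
  simulate numreal=1 seed=30203 scale=2 form=gauss range=5 angle=90 ratio=0.25;
  /* isotropic field: the same practical range in all directions */
  simulate numreal=1 seed=30204 scale=2 form=gauss range=5;
run;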
Figure 9.10. The notion of isotropy. Pairs of points separated by the same Euclidean distance
(here ||h|| = 50) provide built-in replication to assess spatial dependency at that lag. Orienta-
tion of the distance vector is immaterial.
Figure 9.11. Anisotropic (left) and isotropic (right), stationary Gaussian random fields.
The anisotropic field changes its values in the East-West direction more slowly than in
the North-South direction. For the isotropic random field the spatial dependency between data
points develops in the same fashion in all directions. If one were to estimate the covariance
between two points ||h|| = 3 distance units apart, for example, the covariance must be esti-
mated separately in the North-South and the East-West directions in the anisotropic case.
• The semivariogram conveys information about the spatial structure and the
degree of continuity of a random field (§9.2.1, §9.2.3).
for all constants α₁, …, αₙ and spatial locations. This condition guarantees that the variance
of spatial predictions is non-negative. As a consequence it can be shown by the Cauchy-
Schwarz inequality that |C(h)| ≤ C(0), and we have already established that C(0) ≥ 0, since
C(0) is the variance of an observation. In practice, we often consider only covariance func-
tions that have the following additional properties: they are positive and decrease monoton-
ically with increasing lag distance. The semivariogram, in turn, must be conditionally
negative-definite,

Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ αᵢ αⱼ γ(sᵢ − sⱼ) ≤ 0,

for any number of spatial locations and constants α₁, …, αₙ such that Σᵢ₌₁ⁿ αᵢ = 0 (Cressie
1993, p. 86). When the additional conditions (positive, monotonically decreasing) are imposed
on the covariance function, the semivariogram of a second-order stationary random field takes
on a very characteristic shape (Figure 9.12). It rises from the origin monotonically to an upper
asymptote called the sill of the semivariogram. The sill corresponds to Var[Z(s)] = C(0).
When the semivariogram meets the asymptote, the covariance C(h) is zero, since γ(h) =
C(0) − C(h). The distance at which this occurs is called the range of the semivariogram. In
Figure 9.12 the semivariogram approaches the sill only asymptotically. In this case the prac-
tical range is defined as the lag distance at which the semivariogram achieves 95% of the sill;
here, the practical range is ||h|| = 15.
Observations that are spatially separated by more than the range are uncorrelated (or
practically uncorrelated when separated by more than the practical range). Spatial auto-
correlation exists only for pairs of points separated by less than the (practical) range. The
more quickly the semivariogram rises from the origin to the sill, the more quickly the auto-
correlations decline.
Figure 9.12. Semivariogram γ(||h||) of a second-order stationary process; the sill is 10 and
the practical range is 15.
An intrinsically but not second-order stationary random field has a semivariogram that
does not reach an upper asymptote. The semivariance may increase with spatial separation as
in Figure 9.13. Obviously, no range is defined in this case. The increase of the semi-
variogram with ||h|| cannot be arbitrary, however; it must rise less quickly than ||h||², in the
sense that γ(h)/||h||² → 0 as ||h|| → ∞.
Figure 9.13. Semivariogram of an intrinsically but not second-order stationary random field;
the semivariance increases with ||h|| without reaching a sill.
By definition we have γ(0) = 0, but many data sets do not seem to comply with that
property. Figure 9.14 shows an estimate of the semivariogram for the Mercer and Hall grain
yield data plotted in Figure 9.5 (p. 569). The estimated semivariogram values are represented
by the dots and a nonparametric loess smooth of these values was added to the graph.
The semivariogram appears to reach, or at least to approach, an asymptote with increasing lag
distance; the sill is approximately 0.18.
Figure 9.14. Classical semivariogram estimator (see §9.2.4) for Mercer and Hall grain data
(connected dots). Dashed line is loess smooth of the semivariogram estimator.
Notice that the empirical semivariogram commences at ||h|| = 1, since this is the smallest
lag between experimental units. We do not recommend smoothing semivariogram estimates
with standard nonparametric procedures because the resulting fit may not have the required
properties; it may not be conditionally negative-definite (nonparametric semivariogram esti-
mation is discussed in §9.2.4). The loess smooth was added to the figure only to suggest the
overall trend of the empirical semivariogram.
If either of the two components is not zero, the semivariogram exhibits a discontinuity at the
origin. The magnitude of this discontinuity is called the nugget effect. The term stems from
the idea that ore nuggets are dispersed throughout a larger body of rock at distances
smaller than the smallest sample distance. If a semivariogram has nugget θ0 and sill C(0), the
difference C(0) − θ0 is called the partial sill of the semivariogram. The practical range is
then defined as the lag distance at which the semivariogram has achieved θ0 +
0.95(C(0) − θ0) (Figure 9.15).
Figure 9.15. Semivariograms of a second-order stationary process with and without nugget
effect. The no-nugget semivariogram (lower line) has sill 10 and practical range 15; the
nugget semivariogram has θ0 = 4, partial sill C(0) − θ0 = 6, and the same practical range.
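To make the definition concrete with the values in Figure 9.15 (a worked equation):

θ0 + 0.95(C(0) − θ0) = 4 + 0.95 × 6 = 9.7,

so the nugget semivariogram reaches 9.7 at the practical range of 15.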
In the presence of a nugget effect the relationship between semivariogram and covario-
gram must be slightly altered. In the no-nugget model we can put Var[Z(sᵢ)] = C(0) = σ².
For the nugget model define Var[Z(sᵢ)] = C(0) = θ0 + θs, where θs is the partial sill (and
σ² = Var[Z(sᵢ)] = θ0 + θs). The semivariogram can then be expressed as γ(h) = θ0 + θs γ₀(h),
where γ₀(h) is a semivariogram standardized to unit sill.
In the presence of a nugget effect a useful statistic is the Relative Structured Variability
(RSV). It measures the relative elevation of the semivariogram over the nugget effect in
percent:

RSV = 100% × θs/(θ0 + θs) = (1 − θ0/(θ0 + θs)) × 100%.    [9.9]

One interpretation of RSV is as the degree to which variability is spatially structured. The
unstructured part of the variability is due to measurement error and/or microscale variability.
The larger the RSV, the more efficient geostatistical prediction will be compared to methods
of prediction that ignore spatial information, and the greater the continuity of the spatial
process (see §9.2.3).
Nugget-Only Model
The nugget-only model is the semivariogram of a white-noise process, where the Z(sᵢ)
behave like a random sample, all having the same mean and variance, with no correlations
among them. The model is void of spatial structure; the relative structured variability is zero.
The nugget-only model is of course second-order stationary and a valid semivariogram in any
dimension. Nugget-only models are not that uncommon, although analysts keen on applying
techniques from the spatial statistics toolbox, such as kriging, are usually less enthusiastic
when a nugget-only model is obtained. A nugget-only model is appropriate if the smallest
sample distance in the data is greater than the range of the spatial process. Sampling an
attribute whose spatial range is unknown on a regular grid may well lead to a nugget-only
model if the grid points are spaced too far apart.
γ(h; θs) = 0      h = 0
         = θs     h ≠ 0

(The accompanying graph shows a nugget-only semivariogram with θs = 9.)
Linear Model
The linear model is intrinsically stationary with parameters θ0 and β, both of which must be
positive. Covariances or semivariances usually do not change linearly over a large range, but
linear change of the semivariogram near the origin is often reasonable. If a linear semivario-
gram model is found to fit the data in practice, it is possible that one has observed the initial
increase of a second-order stationary model that is linear or close to linear near the origin but
failed to collect samples far enough apart to capture the range and sill of the semivariogram.
A second-order stationary semivariogram model that behaves linearly near the origin is the
spherical model.
γ(h; θ) = 0                 h = 0
        = θ0 + β||h||       h ≠ 0

(The graph shows a linear semivariogram with θ0 = 5 and β = 1.)
Spherical Model
The spherical model is one of the most popular semivariogram models in applied spatial sta-
tistics for second-order stationary random fields. Its two main characteristics are linear behav-
ior near the origin and the fact that at distance α the semivariogram meets the sill and remains
flat thereafter. This sometimes creates a visible kink at ||h|| = α. The spherical semivario-
gram thus has a range α, rather than a practical range. The popularity of the spherical model
in the geostatistical literature is a mystery to Stein (1999, p. 52), who argues that perhaps
"there is a mistaken belief that there is some statistical advantage in having the autocorre-
lation function being exactly zero beyond some finite distance" (α). The fact that γ(h; θ) is
only once differentiable at ||h|| = α can lead to problems in likelihood estimation that relies
on derivatives. Stein (1999, p. 52) concludes that the spherical model is a poor substitute for
the exponential model (see next). He recommends using the square of the spherical model,
γ(h; θ)², instead of γ(h; θ), since the former provides two derivatives on (0, ∞). The behav-
ior of the squared spherical semivariogram near the origin is not linear, however (see §9.2.3
on the effect of the near-origin behavior).
γ(h; θ) = 0                                                  h = 0
        = θ0 + θs[(3/2)(||h||/α) − (1/2)(||h||/α)³]          0 < ||h|| ≤ α
        = θ0 + θs                                            ||h|| > α

(The graph shows a spherical semivariogram with nugget θ0 = 3.)
Exponential Model
The second-order stationary exponential model is a very useful model that has been found to
fit spatial data well in varied applications. It approaches the sill asymptotically as
||h|| → ∞. In the parameterization shown below the parameter α is the practical range of the
semivariogram (Figure 9.12 is an exponential semivariogram without nugget). Often the
model can be found in a parameterization where the exponent is −||h||/α; the practical
range then corresponds to 3α. The SAS® System and S+SpatialStats® use this parameteriza-
tion. For the same range and sill as the spherical model, the exponential model rises more
quickly from the origin and yields autocorrelations at short lag distances smaller than those of
the spherical model. A random field with an exponential semivariogram is less regular (less
continuous) at short distances than one with a spherical semivariogram (§9.2.3).
γ(h; θ) = 0                                     h = 0
        = θ0 + θs(1 − exp{−3||h||/α})           h ≠ 0

(The graph shows an exponential semivariogram with θ0 = 3, θs = 10, α = 25.)
It is easily seen that this is a special case of the covariance model introduced in §7.5.2 for
modeling the within-cluster correlations in repeated measures data (see formula [7.54] on p.
457). There it was referred to as the continuous AR(1) model because of its relationship to a
continuous first-order autoregressive time series. The extra constant 3 was not used there,
since the temporal range is usually of less interest when modeling clustered repeated measures
data than the range is for spatial data. The temporal separation |t_ij − t_ij′| between two
observations from the same cluster is now replaced by the Euclidean distance ||h||. For the
exponential semivariogram to be valid we need θ0 ≥ 0, θs ≥ 0, and α > 0.
Gaussian Model

γ(h; θ) = 0                                       h = 0
        = θ0 + θs(1 − exp{−3||h||²/α²})           h ≠ 0

(The graph shows a gaussian semivariogram with θ0 = 3.)
This is a fairly subtle difference that has considerable implications. The gaussian model
is the most continuous near the origin of the models considered here; in fact, it is infinitely
differentiable at 0. This implies a very smooth, regular spatial process (see §9.2.3). It is so
smooth that knowing the value at a single point and the values of all partial derivatives there
determines the values of the random field at any arbitrary location. Such smoothness is
unrealistic for most processes.
The name should not imply that this semivariogram model deserves veneration in
spatial statistics similar to that rightfully awarded to the Gaussian distribution in classical
statistics. The name stems from the fact that the covariance function of the model,
C(t) = c exp{−αt²}, resembles in functional form the Gaussian probability density function.
Furthermore, one should not assume that the semivariogram of a Gaussian random field (see
§9.1.4 for the definition) is necessarily of this form. It most likely will not be.
As for the exponential model, the parameter α is the practical range, and The SAS®
System and S+SpatialStats® again drop the factor 3 in the exponent. In their parameterization
the practical range is √3 α.
Power Model
This is an intrinsically stationary model, but only for 0 ≤ λ < 2. Otherwise the variogram
would increase faster than ||h||², which is in violation of the intrinsic hypothesis. The
parameter β furthermore must be positive. The linear semivariogram model is a special case
of the power model with λ = 1. Note that the covariance structure called the power model in
proc mixed of The SAS® System is a different construct; it is a reparameterization of the
exponential correlation model rather than the intrinsically stationary model discussed here.
γ(h; θ) = 0                    h = 0
        = θ0 + β||h||^λ        h ≠ 0

(The graph shows a power semivariogram with θ0 = 3, β = 2, λ = 1.3.)
Wave Model

γ(h; θ) = 0                                          h = 0
        = θ0 + θs(1 − α sin(||h||/α)/||h||)          h ≠ 0

(The graph shows a wave semivariogram with θ0 = 0.25, θs = 1.5, α = 25π/180.)
The term ||h||/α is best measured in radians. In the graph above we have chosen
α = 25π/180. The practical range is the value beyond which the peaks and valleys of the
covariogram are no greater than 0.05 C(0), approximately 6.5πα. A process with a wave
semivariogram has some form of periodicity.
Figure 9.16. Simulated Gaussian random fields along a transect that differ in their isotropic
semivariogram structure (panels: nugget-only, exponential, spherical, gaussian; horizontal
axis: position on transect). In all cases the semivariogram has sill 2; the exponential,
spherical, and gaussian semivariograms have range (practical range) 5.
Because the (practical) range of the semivariogram has a convenient interpretation as the
distance beyond which observations are not spatially autocorrelated, it is often interpreted as
a zone of influence, a scale of variability of Z(s), or in terms of the degree of homogeneity
of the process. It must be noted that Var[Z(s)], the variability (scale) of Z(s), is not a func-
tion of spatial location in a second-order stationary process. The variances of observations are
the same everywhere and the process is homogeneous in this sense, regardless of the magni-
tude of the variability. The scale of variability the range refers to is the spatial range over
which the variance of the differences Z(s) − Z(s + h) changes; for distances exceeding the
range, Var[Z(s) − Z(s + h)] is constant. From Figure 9.16 it is also seen that processes with
the same sill and range can differ markedly in their continuity.
The quantities

J₁ = ∫₀^∞ ρ(h) dh    and    J₂ = [2 ∫₀^∞ h ρ(h) dh]^½

are the integral scale measures for one- and two-dimensional processes, respectively, where h
denotes Euclidean distance (h = ||h||) and ρ(h) is the autocorrelation function. Consider a
two-dimensional process with an exponential semivariogram, no nugget, and practical range
α. Then

J₁ = ∫₀^∞ exp{−3h/α} dh = α/3    and    J₂ = [2 ∫₀^∞ h exp{−3h/α} dh]^½ = α√2/3.
Integral scales are used to define the distance over which observations are highly related,
rather than relying on the (practical) range, which is the distance beyond which observations
are not related at all. For a process with gaussian semivariogram and practical range α, by
comparison, one obtains J₁ = 0.5α√(π/3) ≈ 0.51α. The more continuous gaussian process
has a longer integral scale; correlations wear off more slowly. An alternative measure to de-
fine distances over which observations are highly related is obtained by choosing a critical
value of the autocorrelation and solving the correlation function for it. The distance h(α, c)
at which an exponential semivariogram with practical range α (and no nugget) achieves
correlation 0 < c < 1 is h(α, c) = −α ln{c}/3. For the gaussian semivariogram this distance
is h(α, c) = α√(−ln{c}/3). The more continuous process maintains autocorrelations over
longer distances.
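As a worked example of these formulas, take a practical range of α = 25 and a critical correlation of c = 0.5. The exponential structure gives h(25, 0.5) = −25 ln(0.5)/3 ≈ 5.8, whereas the gaussian counterpart gives h(25, 0.5) = 25√(0.693/3) ≈ 12.0; the smoother process holds a correlation of 0.5 out to roughly twice the distance.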
Solie, Raun, and Stone (1999) argue that integral scales provide objective measures for
the distance at which soil and plant variables are highly correlated and are useful when this
distance cannot be determined from subject matter alone. The integral scale these authors
employ is a modification of J₁ in which the autocorrelation function is integrated only to the
(practical) range, J* = ∫₀^α ρ(h) dh. For the exponential semivariogram with no nugget effect,
this yields J* ≈ 0.95α/3.
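The value 0.95α/3 follows directly (a worked equation):

J* = ∫₀^α exp{−3h/α} dh = (α/3)(1 − e⁻³) ≈ (α/3)(0.950) ≈ 0.95α/3.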
where N(h) is the set of location pairs that are separated by the lag vector h and |N(h)| de-
notes the number of unique pairs in the set N(h). Notice that (Z(s₁), Z(s₂)) and (Z(s₂),
Z(s₁)) are the same pair in this calculation and are not counted twice. If the semivariogram of
the random field is isotropic, h is replaced by ||h||. [9.10] is known as the classical semivario-
gram estimator due to Matheron (1962) and is also called the Matheron estimator. Its properties
are generally appealing. It is an unbiased estimator of γ(h) provided the mean of the random
field is constant, and it behaves similarly to the semivariogram: it is an even function, γ̂(h) =
γ̂(−h), and γ̂(0) = 0. A disadvantage of the Matheron estimator is its sensitivity to outliers.
If Z(sⱼ) is an outlying observation, the difference Z(sᵢ) − Z(sⱼ) will be large and squaring the
difference amplifies the contribution to the empirical semivariogram estimate at lag sᵢ − sⱼ. In
addition, outlying observations contribute to the estimation of γ(h) at various lags and exert
their influence on more than one γ̂(h) value. Consider the following hypothetical data, chosen
small to demonstrate the effect. The data represent five locations on a (3 × 4) grid.
Table 9.1. A spatial data set containing an outlying observation Z([3, 4]) = 20

                 Column (y)
Row (x)     1     2     3     4
   1        1     .     .     4
   2        .     2     .     .
   3        3     .     .    20
The observation in row 3, column 4 is considerably larger than the remaining four obser-
vations. What is its effect on the Matheron semivariogram estimator? There are five lag dis-
tances in these data, at ||h|| = √2, 2, √5, 3, and √13 distance units. For each lag there are
exactly two data pairs. For example, the pairs contributing to the estimation of the semivario-
gram at ||h|| = 3 are {Z([1,1]), Z([1,4])} and {Z([3,1]), Z([3,4])}. The variogram
estimates are

2γ̂(√2)  = ½[(1 − 2)² + (2 − 3)²]   = 1
2γ̂(2)   = ½[(1 − 3)² + (4 − 20)²]  = 130
2γ̂(√5)  = ½[(4 − 2)² + (20 − 2)²]  = 164
2γ̂(3)   = ½[(4 − 1)² + (20 − 3)²]  = 149
2γ̂(√13) = ½[(3 − 4)² + (20 − 1)²]  = 181.
Square roots of absolute differences are averaged first and then raised to the fourth power.
The influence of outlying observations is reduced because absolute differences are more
stable than squared differences, and averaging is carried out before converting into the units of
a variance. Note that the attribute robust pertains to outlier contamination of the data; it
should not imply that [9.11] is robust against other violations, such as nonconstancy of the
mean. This estimator is not unbiased for the semivariogram, but the term 0.457 +
0.494/|N(h)| in the denominator reduces the bias considerably. Calculating the robust esti-
mator for the spatial data set with the outlier, one obtains (where 0.457 + 0.494/2 = 0.704)
2γ̄(√2)  = [½(√|1 − 2| + √|2 − 3|)]⁴ / 0.704   = 1.42
2γ̄(2)   = [½(√|1 − 3| + √|4 − 20|)]⁴ / 0.704  = 76.3
2γ̄(√5)  = [½(√|4 − 2| + √|20 − 2|)]⁴ / 0.704  = 90.9
2γ̄(3)   = [½(√|4 − 1| + √|20 − 3|)]⁴ / 0.704  = 104.3
2γ̄(√13) = [½(√|3 − 4| + √|20 − 1|)]⁴ / 0.704  = 73.2.
The influence of the outlier is clearly subdued compared to the Matheron estimator. A
median-based alternative is

γ̃(h) = (1/2) median{(Z(sᵢ) − Z(sⱼ))²} / 0.4549.    [9.12]
The Matheron estimator [9.10] and the robust estimator [9.11] remain the most important esti-
mators of the empirical semivariogram in practice, however.
The precision of an empirical estimator at a given lag depends on the number of pairs
available at that lag that can be averaged or otherwise summarized. Recommendations
that at least 50 (Chilès and Delfiner 1999, p. 38) or 30 (Journel and Huijbregts 1978, p. 194)
unique pairs should be available for every lag vector h or distance ||h|| are common. Even
with 50 pairs the empirical semivariogram can be quite erratic for larger lags, and simulation
studies suggest that the appropriate number of pairs can be considerably larger. Webster and
Oliver (1992) conclude through simulation that at least 200 to 300 observations are required
to estimate a semivariogram reliably. Cressie (1985) shows that the variance of the Matheron
semivariogram estimator can be approximated as

Var[γ̂(h)] ≈ 2γ²(h)/|N(h)|.    [9.13]
As the semivariogram increases, so does the variance of the estimator. When the semivario-
gram is intrinsic but not second-order stationary, the variability of γ̂(h) for large lags can
make it difficult to recognize the underlying structure unless |N(h)| is large. We show in
§A9.9.1 that [9.13] can be a poor approximation to Var[γ̂(h)], which also depends on the
degree of spatial autocorrelation and the spatial arrangement of the sampling locations. This
latter dependence has been employed to determine sample grids and layouts that lead to good
properties of the empirical semivariogram estimator without requiring too many observations.
For details see, for example, Russo (1984), Warrick and Myers (1987), and Zheng and
Silliman (2000).
With irregularly spaced data the number of observations at a given lag may be small;
some lags may even be unique. To collect a sufficient number of pairs, the set N(h) is then
defined as the collection of pairs whose locations are separated by h ± ε or ||h|| ± ε,
where ε is some lag tolerance. In other words, the empirical semivariogram is calculated only
for discrete lag classes, and all observations within a lag class are considered to represent that
particular lag. This introduces two potential problems. The term {Z(sᵢ) − Z(sⱼ)}² is an un-
biased estimator of 2γ(sᵢ − sⱼ), but not of 2γ(sᵢ − sⱼ ± ε), so grouping lags into lag classes
introduces some bias. Furthermore, the empirical semivariogram depends on the width and
number of lag classes, which introduces a subjective element into the analysis.
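Empirical semivariograms with lag classes of this kind can be computed with proc variogram in The SAS® System; a sketch matching the lag class width and count of Figure 9.17 (the data set and variable names are assumptions):

proc variogram data=spatdata outvar=svar;
  coordinates xcoord=x ycoord=y;            /* spatial coordinates */
  compute lagdistance=7 maxlags=13 robust;  /* 13 lag classes of width 7; robust requests [9.11] */
  var z;
run;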
The goal of semivariogram estimation is not to estimate the empirical semivariogram giv-
en by [9.10], [9.11], or [9.12], but to estimate the unknown parameters of a theoretical semi-
variogram model γ(h; θ). The least squares and nonparametric approaches fit the semivario-
gram model to the empirical semivariogram. If [9.10] was calculated at lags h₁, h₂, …, h_k,
then γ̂(h₁), γ̂(h₂), …, γ̂(h_k) serve as the data to which the semivariogram model is fit.
Figure 9.17. A robust empirical estimate of the semivariogram. γ̄(h) was calculated at k =
13 lag classes of width 7. The semivariogram estimates are plotted at the average lag distance
within each class. Connecting the dots does not guarantee that the resulting function is condi-
tionally negative-definite.
OLS requires that the data points are uncorrelated and homoscedastic. Neither assumption is met. For the Matheron estimator, Cressie (1985) showed that its variance is approximately
$$\text{Var}[\hat\gamma(\mathbf{h})] \approx \frac{2\gamma^2(\mathbf{h})}{|N(\mathbf{h})|}. \qquad [9.15]$$
It depends on the true semivariogram value at lag h and the number of unique data pairs at that lag. The $\hat\gamma(\mathbf{h}_i)$ are also not uncorrelated: the same data point Z(s_i) contributes to the empirical semivariogram at several lags, which induces correlation among the estimates.
The problem with the generalized least squares approach lies in the determination of the vari-
ance-covariance matrix V. Cressie (1985, 1993 p. 96) gives expressions from which the off-
diagonal entries of V can be calculated for a Gaussian random field. These are complicated
expressions of the true semivariogram and as a simplification one often resorts to weighted
least squares (WLS) fitting. Here, V is replaced by a diagonal matrix W that contains the variances of the $\hat\gamma(\mathbf{h}_i)$ on the diagonal, and the approximation [9.15] is used to calculate the diagonal entries. The weighted least squares estimates of θ are obtained by minimizing
$$(\hat{\boldsymbol\gamma}(\mathbf{h}) - \boldsymbol\gamma(\mathbf{h};\boldsymbol\theta))'\,\mathbf{W}^{-1}(\hat{\boldsymbol\gamma}(\mathbf{h}) - \boldsymbol\gamma(\mathbf{h};\boldsymbol\theta)), \qquad [9.17]$$
where W = Diag{2γ²(h; θ)/|N(h)|}. We show in §A9.9.2 that this is equivalent to minimizing
$$\sum_{i=1}^{k} |N(\mathbf{h}_i)|\left\{\frac{\hat\gamma(\mathbf{h}_i)}{\gamma(\mathbf{h}_i;\boldsymbol\theta)} - 1\right\}^2, \qquad [9.18]$$
which is (2.6.12) in Cressie (1993, p. 96). If the robust estimator is used instead of the Matheron estimator, $\hat\gamma(\mathbf{h}_i)$ in [9.18] is replaced with $\bar\gamma(\mathbf{h}_i)$. Note that semivariogram models
are typically nonlinear, with the exception of the nugget-only and the linear model, and mini-
mization of these objective functions requires nonlinear methods.
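To make the WLS computation concrete, here is a minimal Python sketch of criterion [9.18] for a spherical semivariogram model; this is our own illustration, and the empirical values (lag, gam_hat, npairs) are hypothetical placeholders, for instance the output of the sketch given earlier in this section:

    import numpy as np
    from scipy.optimize import minimize

    def spherical(h, nugget, sill, rng):
        """Spherical semivariogram with nugget effect; gamma(0) = 0."""
        h = np.asarray(h, dtype=float)
        g = np.where(h < rng,
                     nugget + sill * (1.5 * h / rng - 0.5 * (h / rng) ** 3),
                     nugget + sill)
        return np.where(h == 0, 0.0, g)

    def wls_criterion(theta, lag, gam_hat, npairs):
        """Cressie's criterion [9.18]: sum_i |N(h_i)| (gam_hat_i/gamma_i - 1)^2."""
        gam = np.maximum(spherical(lag, *theta), 1e-10)   # guard against zero
        return np.sum(npairs * (gam_hat / gam - 1.0) ** 2)

    # hypothetical empirical semivariogram at seven lag classes
    lag = np.array([4.0, 11.0, 18.0, 25.0, 32.0, 39.0, 46.0])
    gam_hat = np.array([0.31, 0.72, 0.94, 1.02, 1.04, 1.01, 1.03])
    npairs = np.array([120, 310, 420, 500, 520, 480, 430])

    fit = minimize(wls_criterion, x0=[0.05, 1.0, 25.0],
                   args=(lag, gam_hat, npairs), method="Nelder-Mead")
    nugget, sill, prange = fit.x      # WLS estimates of theta

Because the weights depend on θ through γ(h_i; θ), this criterion remains only an approximation to generalized least squares, as discussed next.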
The weighted least squares method for fitting semivariogram models is very common in
practice. One must keep in mind that minimizing [9.18] is an approximate method. First, [9.15] is only an approximation for the variance of the empirical estimator. Second, W ignores the correlations among the $\hat\gamma(\mathbf{h}_i)$ and is thus a crude stand-in for V; generalized least squares would be preferable if V were known. Delfiner (1976) developed a different weighted
least squares method that is implemented in the geostatistical package BLUEPACK (Delfiner,
Renard, and Chilès 1978). Zimmerman and Zimmerman (1991) compared various semivario-
gram fitting methods in an extensive simulation study and concluded that there is little to
choose between ordinary and weighted least squares. The Gaussian random fields simulated
by Zimmerman and Zimmerman (1991) had a linear semivariogram with nugget effect and a
no-nugget exponential structure. Neither the WLS nor the OLS estimates were uniformly superior
in terms of bias for the linear semivariogram. The weighted least squares method due to
Delfiner (1976) performed very poorly, however, and was uniformly inferior to all other
methods (including the likelihood methods to be discussed next). In the case of the exponential
semivariogram the least squares estimators of the sill exhibited considerable positive bias, in
particular when the spatial dependence was weak.
In WLS and OLS fitting of the semivariogram care should be exercised in the interpreta-
tion of the standard errors for the parameter estimates reported by statistical packages.
Note that instead of the semivariogram we work with the covariance function here, but because the process is second-order stationary, the semivariogram and covariogram are related by
$$\gamma(\mathbf{h};\boldsymbol\theta) = C(\mathbf{0};\boldsymbol\theta) - C(\mathbf{h};\boldsymbol\theta).$$
In short, Z(s) ~ G(μ, Σ(θ)), where θ is the vector containing the parameters of the covariogram. The negative log-likelihood of Z(s) is
$$\ell(\boldsymbol\theta, \boldsymbol\mu; \mathbf{z}(\mathbf{s})) = \frac{n}{2}\ln(2\pi) + \frac{1}{2}\ln|\boldsymbol\Sigma(\boldsymbol\theta)| + \frac{1}{2}(\mathbf{z}(\mathbf{s}) - \boldsymbol\mu)'\boldsymbol\Sigma(\boldsymbol\theta)^{-1}(\mathbf{z}(\mathbf{s}) - \boldsymbol\mu) \qquad [9.19]$$
and maximum likelihood (ML) estimates of ) (and .) are obtained as minimizers of this ex-
pression. Compare this objective function to that for fitting a linear mixed model by maxi-
mum likelihood in §7.4.1 and §A7.7.3. There the objective was to estimate the parameters in
the variance-covariance matrix and the unknown mean vector. The same idea applies here. If
μ = Xβ, where β contains unknown parameters of the mean, maximum likelihood estimation pro-
vides simultaneous estimates of the large-scale mean structure (called the drift in the geostatistical literature) and the spatial dependency. This is an advantage over the indirect least squares methods.
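As an illustration of what the maximization involves, the Python sketch below (ours, not the text's; an exponential covariogram is chosen for concreteness) evaluates the negative log-likelihood [9.19]; a numerical optimizer such as scipy.optimize.minimize can then be applied to it over θ and β:

    import numpy as np

    def exp_covariogram(d, theta):
        """C(h; theta) for an exponential model; theta = (nugget, sill, range)."""
        nugget, sill, rho = theta
        return sill * np.exp(-d / rho) + nugget * (d == 0)

    def neg_loglik(theta, beta, X, coords, z):
        """Negative Gaussian log-likelihood [9.19], Z(s) ~ G(X beta, Sigma(theta))."""
        n = len(z)
        d = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1))
        Sigma = exp_covariogram(d, theta)
        resid = z - X @ beta
        _, logdet = np.linalg.slogdet(Sigma)      # ln |Sigma(theta)|
        quad = resid @ np.linalg.solve(Sigma, resid)
        return 0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)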
Why not use the sum of the individual log-likelihood contributions, [9.21], as the objective function for maximization? Obviously, [9.21] is not a log-likelihood, although the individual terms ℓ(θ; y_i) are. The estimates obtained by maximizing [9.21] cannot be as efficient as ML estimates, which is easily established from key results in estimating function theory (see Godambe 1960, Heyde 1997, and our §A9.9.2). The function [9.21] is called a composite log-likelihood and its derivative,
$$CS(\theta_k; \mathbf{y}) = \sum_{i=1}^{n} \frac{\partial \ell(\theta_k; y_i)}{\partial \theta_k}, \qquad [9.22]$$
is the composite score function for θ_k. Setting the composite score functions for θ₁, …, θ_p to zero and solving the resulting system of equations yields the composite likelihood estimates.
Applying this idea to the problem of estimating the semivariogram, we commence by considering the n(n − 1)/2 unique pairwise differences T_ij = Z(s_i) − Z(s_j). When the Z(s_i) are Gaussian with the same mean and the random field is intrinsically stationary (this is a weaker assumption than assuming an intrinsically stationary Gaussian random field), then T_ij is a Gaussian random variable with mean 0 and variance 2γ(s_i − s_j; θ). We show in §A9.9.2 that the composite score function for the T_ij is
$$CS(\boldsymbol\theta; \mathbf{t}) = \sum_{i=1}^{n-1}\sum_{j>i}\frac{\partial\gamma(\mathbf{s}_i - \mathbf{s}_j;\boldsymbol\theta)}{\partial\boldsymbol\theta}\,\frac{1}{4\gamma^2(\mathbf{s}_i - \mathbf{s}_j;\boldsymbol\theta)}\left\{t_{ij}^2 - 2\gamma(\mathbf{s}_i - \mathbf{s}_j;\boldsymbol\theta)\right\}. \qquad [9.23]$$
#
Although this is a complicated looking expression, it is really the nonlinear weighted least
squares objective function in the model
X34# ##as3 s4 à )b /34 ,
where the /34 are independent random variables with mean ! and variance )# # as3 s4 à )b.
Note the correspondence of VarÒX34# Ó to Cressie's variance approximation for the Matheron
# ahb is an average of the X34 .
estimator [9.15]. The expressions are the same considering that s
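The fit can be carried out by iteratively reweighted least squares: freeze the weights 1/(8γ²) at the current parameter values, minimize the weighted sum of squares, and repeat; at convergence the stationary point solves the composite score equations [9.23]. A Python sketch (our own illustration; the exponential model, all names, and the placeholder data are ours):

    import numpy as np
    from scipy.optimize import minimize

    def exp_semivar(h, theta):
        """Exponential semivariogram; theta = (nugget, sill, range)."""
        nugget, sill, rho = theta
        return np.where(h > 0, nugget + sill * (1.0 - np.exp(-h / rho)), 0.0)

    # squared pairwise differences T_ij^2 and their lags from n = 60 points
    rng = np.random.default_rng(2)
    coords = rng.uniform(0, 50, size=(60, 2))
    z = rng.normal(size=60)                       # placeholder observations
    i, j = np.triu_indices(60, k=1)
    h_ij = np.sqrt(((coords[i] - coords[j]) ** 2).sum(axis=1))
    t2_ij = (z[i] - z[j]) ** 2

    theta = np.array([0.1, 1.0, 10.0])            # starting values
    for _ in range(10):                           # IRLS passes
        w = 1.0 / (8.0 * np.maximum(exp_semivar(h_ij, theta), 1e-10) ** 2)
        obj = lambda th, w=w: np.sum(w * (t2_ij - 2.0 * exp_semivar(h_ij, th)) ** 2)
        theta = minimize(obj, theta, method="Nelder-Mead").x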
The composite likelihood estimator can be calculated easily with a nonlinear regression
package capable of weighted least squares fitting such as proc nlin in The SAS® System.
Obtaining the (restricted) maximum likelihood estimate requires a procedure that can mini-
mize [9.19] or [9.20] such as proc mixed. The minimization problem in ML or REML estima-
tion is numerically much more involved. One of the main problems there is that the matrix Σ(θ) must be inverted repeatedly. For clustered data as in §7, where the variance-covariance matrix is block-diagonal, this is not too cumbersome; the matrix can be inverted block by block. In the case of spatial data Σ(θ) does not have a block-diagonal structure and in general no shortcuts can be taken. Zimmerman (1989) derives some simplifications when the observations are collected on a rectangular or parallelogram grid. Composite likelihood (CL) estimation, on the other hand, replaces the inversion of one large matrix with many inversions of small matrices. The largest matrix to be inverted for a semivariogram model with 3 parameters (nugget, sill, range) is a 3 × 3 matrix. However, CL estimation processes many more data points. With n = 100 spatial observations there are n(n − 1)/2 = 4,950 pairs. That many pairs are hardly needed. It is quite reasonable to remove those pairs from estimation
and the nugget θ₀, sill θ_s, and practical range α are estimated from the data. Our list of
isotropic semivariogram models in §9.2.2 is relatively short. Although many more semi-
variogram models are known, typically users resort to one of the models shown there. In
applications one may find that none of these describes the empirical semivariogram well, for
example, because the random field does not have constant mean, is anisotropic, or consists of
different scales of variation. The latter reason is the idea behind what is termed the linear
model of regionalization in the geostatistical literature (see, for example, Goovaerts 1997,
Ch. 4.2.3). Statistically, it is based on the facts that (i) if C₁(h) and C₂(h) are valid covariance structures in ℝ², then C₁(h) + C₂(h) is also a valid covariance structure in ℝ²; (ii) if C(h) is a valid structure, so is bC(h), provided b > 0. As a consequence, linear combinations of permissible covariance models lead to an overall permissible model. The coefficients in the linear combination must be positive, however. The same results hold for semivariograms.
The linear model of regionalization assumes that the random function Z(s) is a linear combination of p stationary zero-mean random functions. If Y_j(s) is a second-order stationary random function with E[Y_j(s)] = 0, Cov[Y_j(s), Y_j(s + h)] = C_j(h), and a₁, …, a_p are positive constants, then
$$Z(\mathbf{s}) = \sum_{j=1}^{p} a_j Y_j(\mathbf{s}) \qquad [9.24]$$
has covariogram $C_Z(\mathbf{h}) = \sum_{j=1}^{p} a_j^2 C_j(\mathbf{h})$ and semivariogram $\gamma_Z(\mathbf{h}) = \sum_{j=1}^{p} a_j^2 \gamma_j(\mathbf{h})$, provided that the individual processes Y₁(s), …, Y_p(s) are not correlated. If the individual semivariograms γ_j(h) have sill 1, then $\sum_{j=1}^{p} a_j^2$ is the variance of an observation. Covariogram and semivariogram models derived from a regionalization such as [9.24] are called nested models. Every semivariogram model containing a nugget effect is thus a nested model.
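Numerically, a nested model is just a positive combination of components, as in [9.24]. A short Python sketch (our own illustration; the component models, ranges, and coefficients are arbitrary choices):

    import numpy as np

    def nugget_semivar(h):
        """Nugget component with unit sill."""
        return np.where(np.asarray(h, dtype=float) > 0, 1.0, 0.0)

    def spherical_semivar(h, rng):
        """Spherical component with unit sill and range rng."""
        h = np.asarray(h, dtype=float)
        g = np.where(h < rng, 1.5 * h / rng - 0.5 * (h / rng) ** 3, 1.0)
        return np.where(h == 0, 0.0, g)

    def nested_semivar(h, a):
        """gamma_Z(h) = sum_j a_j^2 gamma_j(h); with unit sills,
        sum(a**2) is the variance of an observation."""
        comps = [nugget_semivar(h),
                 spherical_semivar(h, 10.0),    # short-range structure
                 spherical_semivar(h, 60.0)]    # long-range structure
        return sum(aj ** 2 * gj for aj, gj in zip(a, comps))

    h = np.linspace(0.0, 100.0, 201)
    gam = nested_semivar(h, a=np.array([0.3, 0.8, 0.6]))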
The variability of a soil property is related to many causes that have different spatial
scales, each scale integrating variability at all smaller scales (Russo and Jury 1987a). If the
total variability of an attribute varies with the spatial scale or resolution, nested models can
capture this dependency, if properly modeled. Nesting models is thus a convenient way to
construct theoretical semivariogram models that offer greater flexibility than the basic models
in §9.2.2. Nested models are not universally accepted, however. Stein (1999, p. 13) takes
exception to nested models where the individual components are spherical models. A danger
of nesting semivariograms is to model the effects of nonconstancy of the mean on the empiri-
cal semivariogram through a creative combination of second-order stationary and intrinsically
stationary semivariograms. Even if this combination fits the empirical semivariogram well, a
critical assumption of variogram analysis has been violated. Furthermore, the assumption of
mutual independence of the individual random functions Y₁(s), …, Y_p(s) must be evaluated
with great care. Nugget effects that are due to measurement errors are reasonably assumed to
be independent of the other components. But a component describing smaller scale variability
due to soil nutrients may not be independent of a larger scale component due to soil types or
geology.
To increase the flexibility in modeling the semivariogram of stationary isotropic proces-
ses without violating the condition of positive definiteness of the covariogram (conditional
negative definiteness of the semivariogram), nonparametric methods can be employed. The
rationale behind the nonparametric estimators (a special topic in §A9.9.3) is akin to the
nesting of covariogram models in the linear model of regionalization. The covariogram is ex-
pressed as a weighted combination of functions, each of which is a valid covariance function.
Instead of combining theoretical covariogram models, however, the nonparametric approach
combines positive-definite functions that are derived from a spectral representation. These are
termed the basis functions. For data on a transect the basis function is cos(h), for data in the plane it is the Bessel function of the first kind of order zero, and for data in three dimensions it is sin(h)/h (Figure 9.18).
The flexibility of the nonparametric approach is demonstrated in Figure 9.19, which shows semivariograms constructed with m = 5 equally spaced nodes and a maximum lag of h = 10. The functions shown as solid lines have equal weights a_j² = 0.2. The dashed lines are produced by setting a₁² = a₅² = 0.5 and all other weights to zero. The smoothness of the semivariogram decreases with the unevenness of the weights and the number of sign changes of the basis function.
[Figure: the two basis functions plotted against x for 0 ≤ x ≤ 30.]
Figure 9.18. Basis functions for two-dimensional data (solid line, Bessel function of the first kind of order 0) and for three-dimensional data (dashed line, sin(x)/x).
[Figure 9.19: semivariograms γ(h) constructed from the basis functions, plotted against lag distance h for 2 ≤ h ≤ 8; panels arranged by the number of sign changes of the basis function.]
• Reactive effects are modeled through the mean structure, interactive effects
are modeled through the random structure. For geostatistical data inter-
active effects are represented through stationary random processes, for lat-
tice data through autoregressive neighborhood structures.
So far we have been concerned with properties of random fields and the semivariogram or co-
variogram of a stationary process. Although the fitting of a semivariogram entails modeling,
this is only one aspect of representing the structure in spatial data in a manner conducive to a
statistical analysis. The constancy of the mean assumption implied by stationarity, for
example, is not reasonable in many applications. In a field experiment where treatments are
applied to experimental units the variation among units is not just due to spatial variation
about a constant mean but also due to the effects of the treatments. Our view of spatial data
must be extended to accommodate changes in the mean structure, stationarity, and measure-
ment error. One place to start is to decompose the variability in ^asb into various sources.
Following Cressie (1993, Ch. 3.1), we write for geostatistical data
$$Z(\mathbf{s}) = \mu(\mathbf{s}) + W(\mathbf{s}) + \eta(\mathbf{s}) + \epsilon(\mathbf{s}), \qquad [9.26]$$
where μ(s) = x′(s)β is the large-scale trend (the mean structure) and x(s) is a vector of regressor variables that can depend on spatial coordinates alone or on
other explanatory variables and factors. Cliff and Ord (1981, Ch. 6) distinguish between
reaction and interaction models. In a reaction model sites react to outside influences, e.g.,
plants will react to the availability of nutrients in the root zone. Since this availability varies
spatially, plant size or biomass will exhibit a regression-like dependence on nutrient availabil-
ity. It is then reasonable to include nutrient availability as a covariate in the regressor vector
x(s). In an interaction model, sites react not to outside influences but with each other. Neighboring plants compete with each other for resources, for example. In general, when the dominant spatial effects are caused by sites reacting to external forces, these effects should be part of the mean function x′(s)β. Interactive effects (reaction among sites) call for modeling spatial variability through the spatial autocorrelation structure of the error process.
The distinction between reactive and interactive models is useful, but not cut-and-dried.
Significant autocorrelation in the data does not imply an interactive model over a reactive one
or vice versa. Spatial autocorrelation can be spurious if caused by large-scale trends or real if
caused by cumulative small-scale, spatially varying components. The error structure is thus
often thought of as the local structure and the mean is referred to as the global structure. With
increasing complexity of the mean model x′(s)β, for example, as higher-order terms are added to a response surface, the mean will be more spatially variable and more localized. In a two-way row-column layout (randomized block design) where rows and columns interact one could model the data as
$$Z_{ij} = \mu + \alpha_i + \beta_j + \gamma s_1 s_2 + e_{ij}, \qquad e_{ij} \sim iid\,(0, \sigma^2),$$
where α_i denotes row effects, β_j column effects, and s₁, s₂ are the cell coordinates. This model assumes that the term γs₁s₂ removes any residual spatial autocorrelation, hence the errors e_ij are uncorrelated. Alternatively, one could invoke the model
$$Z_{ij} = \mu + \alpha_i + \beta_j + \delta_{ij},$$
where the δ_ij are autocorrelated. One modeler's reactive effect will be another modeler's inter-
active effect.
With geostatistical data the spatial dependency between δ(s_i) and δ(s_j) (the interaction) is modeled through the semivariogram or covariogram of the δ(·) process. If the spatial domain is discrete (lattice data), modifications are necessary since W(s) and η(s) in decomposition [9.26] are processes with a continuous domain. In the simultaneous scheme one instead writes
$$Z(\mathbf{s}_i) = \mu(\mathbf{s}_i) + \sum_{j=1}^{n} b_{ij}\{Z(\mathbf{s}_j) - \mu(\mathbf{s}_j)\} + e(\mathbf{s}_i). \qquad [9.27]$$
The contribution to Z(s_i) made by other sites is a linear combination of residuals at other locations. By convention we put b_ii = 0 in [9.27]. The e(s_i)'s are uncorrelated random errors with mean 0 and variance σ_i². If all b_ij = 0 and μ(s_i) = x′(s_i)β, the model reduces to Z(s_i) = x′(s_i)β + e(s_i), a standard linear regression model. The interaction coefficients b_ij contain information about the strength of the dependence between sites Z(s_i) and Z(s_j). Since Σⱼ₌₁ⁿ b_ij(Z(s_j) − μ(s_j)) is a function of random variables it can be considered part of the error process. In a model for lattice data it can be thought of as replacing the smooth-scale random function W(s) in [9.26]. Model [9.27] is the spatial equivalent of an autoregressive time series model where the current value in the series depends on previous values. In the spatial case we potentially let Z(s_i) depend on all other sites since space is not directed.
More precisely, model [9.27] is the spatial equivalent of a simultaneous time series model,
hence the denomination as a Simultaneous Spatial Autoregressive (SSAR) model. We discuss
SSAR and a further class of interaction models for lattice data, the Conditional Spatial Auto-
regressive (CSAR) models, in §9.6.
Depending on whether the spatial process has a continuous or discrete domain, we now have two types of mean models. Let Z(s) = [Z(s₁), …, Z(s_n)]′ be the (n × 1) vector of the attribute Z at all observed locations, let X(s) be the (n × k) regressor matrix
$$\mathbf{X}(\mathbf{s}) = \begin{bmatrix} \mathbf{x}'(\mathbf{s}_1) \\ \mathbf{x}'(\mathbf{s}_2) \\ \vdots \\ \mathbf{x}'(\mathbf{s}_n) \end{bmatrix},$$
and let δ(s) = [δ(s₁), …, δ(s_n)]′ be the vector of errors in the mean model for geostatistical data. The model can be written as
$$\mathbf{Z}(\mathbf{s}) = \mathbf{X}(\mathbf{s})\boldsymbol\beta + \boldsymbol\delta(\mathbf{s}), \qquad [9.28]$$
where E[δ(s)] = 0 and the variance-covariance matrix of δ(s) contains the covariance function of the δ(·) process. If Cov[δ(s_i), δ(s_j)] = C(s_i − s_j; θ), then
$$\text{Var}[\boldsymbol\delta(\mathbf{s})] = \boldsymbol\Sigma(\boldsymbol\theta) = \begin{bmatrix} C(\mathbf{0};\boldsymbol\theta) & C(\mathbf{s}_1 - \mathbf{s}_2;\boldsymbol\theta) & \cdots & C(\mathbf{s}_1 - \mathbf{s}_n;\boldsymbol\theta) \\ C(\mathbf{s}_2 - \mathbf{s}_1;\boldsymbol\theta) & C(\mathbf{0};\boldsymbol\theta) & \cdots & C(\mathbf{s}_2 - \mathbf{s}_n;\boldsymbol\theta) \\ \vdots & & \ddots & \vdots \\ C(\mathbf{s}_n - \mathbf{s}_1;\boldsymbol\theta) & \cdots & C(\mathbf{s}_n - \mathbf{s}_{n-1};\boldsymbol\theta) & C(\mathbf{0};\boldsymbol\theta) \end{bmatrix}.$$
Note that Σ(θ) is also the variance-covariance matrix of Z(s). Unknown quantities in this model are the vector of fixed effects β in the mean function and the vector θ in the covariance function.
It follows that Var[Z(s)] = (I − B)⁻¹Var[e(s)](I − B′)⁻¹. In applications of the SSAR model it is often assumed that the errors are homoscedastic with variance σ². Then Var[Z(s)] = σ²(I − B)⁻¹(I − B′)⁻¹. Parameters of the SSAR model are the vector of fixed effects β, the residual variance σ², and the entries of the matrix B. While the covariance matrix Σ(θ) depends on only a few parameters with geostatistical data (nugget, sill, range), the matrix B can contain many unknowns. It is not even required that B is symmetric, only that I − B is invertible. For purposes of parameter estimation it is thus required to place some structure on B to reduce the number of unknowns. For example, one can put B = ρW, where W is a matrix selected by the user that identifies which sites are spatially connected and the parameter ρ determines the strength of the spatial dependence. Table 9.2 summarizes the key differences between the mean models for geostatistical and lattice data. How to structure the matrix B in the SSAR model and the corresponding matrix in the CSAR model is discussed in §9.6.
Table 9.2. Mean models for geostatistical and lattice data (with Var[e(s)] = σ²I)

                            Geostatistical Data       Lattice Data
    Model                   Z(s) = X(s)β + δ(s)       Z(s) = X(s)β + B(Z(s) − X(s)β) + e(s)
    E[Z(s)]                 X(s)β                     X(s)β
    Var[Z(s)]               Σ(θ)                      σ²(I − B)⁻¹(I − B′)⁻¹
    Mean parameters         β                         β
    Dependency parameters   θ                         σ², B
A common goal in the analysis of geostatistical data is the mapping of the random function Z(s) in some region of interest. The sampling process produces observations Z(s₁), …, Z(s_n), but Z(s) varies continuously through the domain D. To produce a map of Z(s) requires prediction of Z(·) at unobserved locations s₀. What is commonly referred to as the
geostatistical method consists of the following steps (at least the first 6).
1. Using exploratory techniques, prior knowledge, and/or anything else, posit a model of possibly nonstationary mean plus second-order or intrinsically stationary error for the Z(s) process that generated the data.
2. Estimate the mean function by ordinary least squares, smoothing, or median polishing to detrend the data. If the mean is stationary this step is not necessary. The methods for detrending employed at this step usually do not take autocorrelation into account.
3. Using the residuals obtained in step 2 (or the original data if the mean is stationary), fit a semivariogram model γ(h; θ) by one of the methods in §9.2.4.
4. With statistical estimates of the spatial dependence in hand (from step 3), return to step 2 to re-estimate the parameters of the mean function, now taking into account the spatial autocorrelation.
5. Obtain new residuals from step 4 and iterate steps 2 through 4, if necessary.
6. Predict the attribute Z(·) at unobserved locations and calculate the corresponding mean square prediction errors.
If the mean is stationary or if the steps of detrending the data and subsequent estimation
of the semivariogram (or covariogram) are not iterated, the geostatistical method consists of
only steps 1, 2, 3, and 6. This section is concerned with the last item in this process, the
prediction of the attribute (mapping).
Understanding the difference between predicting Z(s₀), which is a random variable, and estimating the mean of Z(s₀), which is a constant, is essential to gain an appreciation for the geostatistical methods employed to that end. To motivate the problem of spatial prediction, focus first on the classical linear model Y = Xβ + e, e ~ (0, σ²I). What do we mean by a predicted value at a regressor value x₀? We can think of a large number of possible outcomes that share the same set of regressors x₀ and average their response values. This average is an estimate of the mean of Y at x₀, E[Y|x₀]. Once we have fitted the model to data and obtained estimates β̂, the obvious estimate of this quantity is x₀′β̂. What if a predicted value is inter-
preted as the response of the next observation that has regressors x₀? This definition appeals not to infinitely many observations at x₀, but to a single one. Rather than predicting the expected value of Y, Y itself is then of interest. In the spatial context imagine that Z(s) is the soil loss potential at location s in a particular field. If an agronomist is interested in the soil loss potential of a large number of fields with properties similar to the sampled one, the important quantity would be E[Z(s)]. An agronomist interested in the soil loss potential of the sampled field at a location not contained in the sample would want to predict Z(s₀), the actual soil loss. In either case the same predicted value is calculated:
$$\widehat{E}[Y|\mathbf{x}_0] = \mathbf{x}_0'\hat{\boldsymbol\beta}, \qquad \widehat{Y}|\mathbf{x}_0 = \mathbf{x}_0'\hat{\boldsymbol\beta}.$$
The difference between the two predictions does not lie in the predicted value, but in their precision (standard errors). Standard linear model theory, where β is estimated by ordinary least squares, instructs that
$$\text{Var}\big[\widehat{E}[Y|\mathbf{x}_0]\big] = \text{Var}[\mathbf{x}_0'\hat{\boldsymbol\beta}] = \sigma^2\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0 \qquad [9.30]$$
is the variance of the predicted mean at x₀. When predicting random variables one considers the variance of the prediction error $Y|\mathbf{x}_0 - \widehat{Y}|\mathbf{x}_0$ to take into account the variability of the new observation. If the new observation Y|x₀ is uncorrelated with Y, then
$$\text{Var}\big[Y|\mathbf{x}_0 - \widehat{Y}|\mathbf{x}_0\big] = \text{Var}[Y|\mathbf{x}_0] + \sigma^2\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0. \qquad [9.31]$$
Although the same formula (x₀′β̂) is used for both predictions, the uncertainty associated with predicting a random variable ([9.31]) exceeds the uncertainty in predicting the mean ([9.30]). To consider the variance of the prediction error in one case and the variance of the predictor in the other is not arbitrary. Suppose some quantity U is to be predicted and some function f(Y) of the data is used as the predictor. If E[U] = E[f(Y)], the mean square prediction error is
$$MSE[U, f(\mathbf{Y})] = E\big[(U - f(\mathbf{Y}))^2\big] = \text{Var}[U - f(\mathbf{Y})] = \text{Var}[U] + \text{Var}[f(\mathbf{Y})] - 2\,\text{Cov}[U, f(\mathbf{Y})]. \qquad [9.32]$$
In the standard linear model, estimating E[Y|x₀] and predicting Y|x₀ correspond to the following:

    Target U    f(Y)     Var[U]        Var[f(Y)]          Cov[U, f(Y)]   MSE[U, f(Y)]
    E[Y|x₀]    x₀′β̂     0             σ²x₀′(X′X)⁻¹x₀     0              σ²x₀′(X′X)⁻¹x₀
    Y|x₀       x₀′β̂     Var[Y|x₀]     σ²x₀′(X′X)⁻¹x₀     0†             Var[Y|x₀] + σ²x₀′(X′X)⁻¹x₀

† provided Y|x₀ is independent of the observed vector Y on which the estimate β̂ is based.
The variance formulas [9.30] and [9.31] are mean square errors, well-known from linear model theory for uncorrelated data. Under these conditions it turns out that $\widehat{Y}|\mathbf{x}_0 = \mathbf{x}_0'\hat{\boldsymbol\beta}$ is the best linear unbiased predictor of Y|x₀ and that $\widehat{E}[Y|\mathbf{x}_0] = \mathbf{x}_0'\hat{\boldsymbol\beta}$ is the best linear unbiased estimator of E[Y|x₀]. Expression [9.31] applies only, however, if Y|x₀ and Y are not correlated.
Spatial data exhibit spatial autocorrelations which are a function of the proximity of observations. Denote by Z(s₁), …, Z(s_n) the attribute at the observed locations s₁, …, s_n and by s₀ the target location where prediction is desired. If the observations are spatially correlated, then Z(s₀) is also correlated with the observations unless the target location s₀ is further removed from the observed locations than the spatial range (Figure 9.20). We must then ask which function of the data best predicts Z(s₀) and how to measure the mean square predic-
[Figure: map of sample locations (dots) and prediction targets (crosses); y-axis Y-coordinate, 15 to 150.]
Figure 9.20. Observed sample locations (dots). Crosses denote target locations for prediction.
Strength of correlation between observations at target locations and at observed locations
depends on the distance between dots and crosses.
• To find the optimal predictor p(Z; s₀) for Z(s₀) requires a measure for the loss incurred by using p(Z; s₀) for prediction at s₀. Different loss functions result in different best predictors.
• The predictor that minimizes the mean square prediction error is the conditional mean E[Z(s₀)|Z(s)]. If the random field is Gaussian the conditional mean is linear in the observed values.
When a statistical method is labeled as optimal or best, we need to inquire under which conditions optimality holds; there are few methods that are uniformly best. The famous pooled t-test, for example, is a uniformly most powerful test, but only if "uniformly" means "among all tests for comparing the means of two Gaussian populations with common variance." If Z(s₀) is the target quantity to be predicted at location s₀, a measure of the loss incurred by using some predictor p(Z; s₀) for Z(s₀) is required (we use p(Z; s₀) as a shortcut for p(Z(s); s₀)). The most common loss function in statistics is squared error loss,
$$\{Z(\mathbf{s}_0) - p(\mathbf{Z}; \mathbf{s}_0)\}^2, \qquad [9.33]$$
because of its mathematical tractability and simple interpretation. But [9.33] is not directly useful since it is a random quantity that depends on unknowns. Instead, we consider its average, E[{Z(s₀) − p(Z; s₀)}²], the mean square error of using p(Z; s₀) to predict Z(s₀).
This expected value is also called the Bayes risk under squared error. If squared error loss is accepted as the suitable loss function, among all possible predictors the one that should be chosen is that which minimizes the Bayes risk. This turns out to be the conditional expectation p₀(Z; s₀) = E[Z(s₀) | Z(s)]. The minimized mean square prediction error (MSPE) then takes on the following, surprising form:
$$E\big[\{Z(\mathbf{s}_0) - p_0(\mathbf{Z};\mathbf{s}_0)\}^2\big] = \text{Var}[Z(\mathbf{s}_0)] - \text{Var}[p_0(\mathbf{Z};\mathbf{s}_0)]. \qquad [9.34]$$
This is a stunning result since variances are usually added, not subtracted. From [9.34] it is
immediately obvious that the conditional mean must be less variable than the random field at
s₀, because the mean square error is a positive quantity. Perhaps even more surprising is the
fact that the MSPE is small if the variance of the predictor is large. Consider a time series
where the value of the series at time t = 20 is to be predicted (Figure 9.21). Three different types of predictors are used: the sample average ȳ and two nonparametric fits that differ in their smoothness. The sample mean ȳ is the smoothest of the three predictors, since it does not change with time. The loess fit with large bandwidth (dashed line) is less smooth than ȳ and more smooth than the loess fit with small bandwidth (solid line). The less smooth the predictor, the greater its variability and the more closely it will follow the data. The chance that a smooth predictor is close to the unknown observation at t = 20 is smaller than for one of the more variable (more jagged) predictors. The most variable predictor is one that interpolates the data
points (connects the dots). Such a predictor is said to honor the data or to be a perfect
interpolator. In the absence of measurement error the classical kriging predictors have precisely this property: they interpolate the observed data points (see §A9.9.5).
[Figure: time series Z(t) plotted against time t for 0 ≤ t ≤ 40.]
Figure 9.21. Prediction of a target point at t = 20 (circle) in a time series of length 40. The predictors are loess smooths with small (irregular solid line) and large bandwidth (dashed line) as well as the arithmetic average ȳ (horizontal line).
Although the conditional mean E[Z(s₀) | Z(s)] is the optimal predictor of Z(s₀) under squared error loss, it is not the predictor usually applied. E[Z(s₀) | Z(s)] can be a complicated nonlinear function of the observations. A notable exception occurs when Z(s) is a Gaussian random field and [Z(s₀), Z(s)] are jointly multivariate Gaussian distributed. Define the following quantities for a Gaussian random field:
$$E[\mathbf{Z}(\mathbf{s})] = \boldsymbol\mu(\mathbf{s}) \qquad E[Z(\mathbf{s}_0)] = \mu(\mathbf{s}_0)$$
$$\text{Var}[\mathbf{Z}(\mathbf{s})] = \boldsymbol\Sigma \qquad \text{Var}[Z(\mathbf{s}_0)] = \sigma^2 \qquad [9.35]$$
$$\text{Cov}[Z(\mathbf{s}_0), \mathbf{Z}(\mathbf{s})] = \mathbf{c}.$$
The joint distribution of Z(s₀) and Z(s) then can be written as
$$\begin{bmatrix} Z(\mathbf{s}_0) \\ \mathbf{Z}(\mathbf{s}) \end{bmatrix} \sim G\left(\begin{bmatrix} \mu(\mathbf{s}_0) \\ \boldsymbol\mu(\mathbf{s}) \end{bmatrix}, \begin{bmatrix} \sigma^2 & \mathbf{c}' \\ \mathbf{c} & \boldsymbol\Sigma \end{bmatrix}\right).$$
Recalling results from §3.7, the conditional distribution of Z(s₀) given Z(s) is univariate Gaussian with mean $\mu(\mathbf{s}_0) + \mathbf{c}'\boldsymbol\Sigma^{-1}(\mathbf{Z}(\mathbf{s}) - \boldsymbol\mu(\mathbf{s}))$ and variance $\sigma^2 - \mathbf{c}'\boldsymbol\Sigma^{-1}\mathbf{c}$. The optimal predictor under squared error loss is thus
$$p_0(\mathbf{Z}; \mathbf{s}_0) = E[Z(\mathbf{s}_0)|\mathbf{Z}(\mathbf{s})] = \mu(\mathbf{s}_0) + \mathbf{c}'\boldsymbol\Sigma^{-1}(\mathbf{Z}(\mathbf{s}) - \boldsymbol\mu(\mathbf{s})). \qquad [9.36]$$
This important expression is worthy of some comments. First, the conditional mean is a linear
function of the observed data Z(s). Evaluating the statistical properties of the predictor, such as its mean and variance, is thus simple:
$$E[p_0(\mathbf{Z};\mathbf{s}_0)] = \mu(\mathbf{s}_0), \qquad \text{Var}[p_0(\mathbf{Z};\mathbf{s}_0)] = \mathbf{c}'\boldsymbol\Sigma^{-1}\mathbf{c}.$$
Second, the predictor is a perfect interpolator. Assume you wish to predict at all observed locations. To this end, replace in [9.36] μ(s₀) with μ(s) and c′ with Σ. The predictor becomes
$$p_0(\mathbf{Z};\mathbf{s}) = \boldsymbol\mu(\mathbf{s}) + \boldsymbol\Sigma\boldsymbol\Sigma^{-1}(\mathbf{Z}(\mathbf{s}) - \boldsymbol\mu(\mathbf{s})) = \mathbf{Z}(\mathbf{s}).$$
Third, imagine that Z(s₀) and Z(s) are not correlated. Then c = 0 and [9.36] reduces to p₀(Z; s₀) = μ(s₀), the (unconditional) mean at the unsampled location. One interpretation of the optimal predictor is to consider $\mathbf{c}'\boldsymbol\Sigma^{-1}(\mathbf{Z}(\mathbf{s}) - \boldsymbol\mu(\mathbf{s}))$ as the adjustment to the unconditional mean that draws on the spatial autocorrelation between attributes at the unsampled and sampled locations. If Z(s₀) is correlated with observations nearby, then using the information from other locations strengthens our ability to make predictions about the value at the new location (since the MSPE if c = 0 is σ²). Fourth, the variance of the conditional distribution
equals the mean square prediction error.
Two important questions arise. Do the simple form and appealing properties of the best
predictor prevail if the random field is not Gaussian-distributed? If the means μ(s) and μ(s₀) are unknown, are predictors of the form
$$\hat\mu(\mathbf{s}_0) + \mathbf{c}'\boldsymbol\Sigma^{-1}(\mathbf{Z}(\mathbf{s}) - \hat{\boldsymbol\mu}(\mathbf{s}))$$
still best in some sense? To answer these questions we now relate the decision-theoretic setup
in this subsection to the basic kriging methods.
• Kriging predictors are the best linear unbiased predictors under squared
error loss.
Requirement (i) states that the predictors have the general form
$$p(\mathbf{Z}; \mathbf{s}_0) = \sum_{i=1}^{n} \lambda(\mathbf{s}_i)Z(\mathbf{s}_i), \qquad [9.37]$$
where λ(s_i) is a weight associated with the observation at location s_i. Relative to the other weights, λ(s_i) determines how much the observation Z(s_i) contributes to the predicted value at location s₀. To satisfy requirements (ii) and (iii) the weights are chosen to minimize
$$E\left[\left\{Z(\mathbf{s}_0) - \sum_{i=1}^{n} \lambda(\mathbf{s}_i)Z(\mathbf{s}_i)\right\}^2\right]$$
subject to certain constraints that guarantee unbiasedness. These constraints depend on the model assumptions. The three basic kriging methods, simple, ordinary, and universal kriging, are distinguished according to the mean structure of the spatial model
$$Z(\mathbf{s}) = \mu(\mathbf{s}) + \delta(\mathbf{s}).$$
Simple Kriging
The solution to this minimization problem if μ(s) (and thus μ(s₀)) is known is called the simple kriging predictor (Matheron 1971),
$$p_{SK}(\mathbf{Z}; \mathbf{s}_0) = \mu(\mathbf{s}_0) + \mathbf{c}'\boldsymbol\Sigma^{-1}(\mathbf{Z}(\mathbf{s}) - \boldsymbol\mu(\mathbf{s})). \qquad [9.38]$$
The details of the derivation can be found in Cressie (1993, p. 109) and in our §A9.9.4. Note that μ(s₀) in [9.38] is a scalar and Z(s) and μ(s) are vectors (see [9.35] on p. 608 for definitions). The simple kriging predictor is unbiased since E[p_SK(Z; s₀)] = μ(s₀) = E[Z(s₀)], and it bears a striking resemblance to the conditional mean under Gaussianity ([9.36]). Simple kriging is thus the optimal method of spatial prediction (under squared error loss) in a Gaussian random field since p_SK(Z; s₀) equals the conditional mean. No other predictor then has a smaller mean square prediction error, not even when nonlinear functions of the data are permitted. The minimized mean square prediction error, the simple kriging variance, is
$$\sigma_{SK}^2(\mathbf{s}_0) = \sigma^2 - \mathbf{c}'\boldsymbol\Sigma^{-1}\mathbf{c}, \qquad [9.39]$$
where σ² is the variance of the random field at s₀. We assume here that the random field is second-order stationary so that Var[Z(s)] = Var[Z(s₀)] = σ² and the covariance function exists (otherwise C(h) is a nonexisting parameter and [9.38], [9.39] should be expressed in terms of the semivariogram).
Simple kriging is useful in that it determines the benchmark for other kriging methods.
The assumption that the mean is known everywhere is not tenable for most applications. An
exception is the kriging of residuals from a fit of the mean function. If the mean model is
correct the residuals will have a known, zero mean. How much is lost by estimating an un-
known mean can be inferred by comparing the simple kriging variance [9.39] with similar ex-
pressions for the methods that follow.
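In matrix terms the simple kriging computation reduces to a single linear solve. A minimal numpy sketch (ours, not the text's), assuming μ(s), μ(s₀), Σ, c, and σ² have already been built from a covariogram model:

    import numpy as np

    def simple_kriging(z, mu, mu0, Sigma, c, sigma2):
        """Simple kriging predictor [9.38] and kriging variance [9.39]:
        p_SK = mu0 + c' Sigma^{-1} (z - mu);  sigma2_SK = sigma2 - c' Sigma^{-1} c."""
        w = np.linalg.solve(Sigma, c)     # the vector Sigma^{-1} c
        pred = mu0 + w @ (z - mu)
        mspe = sigma2 - w @ c
        return pred, mspe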
which minimizes the mean square prediction error subject to an unbiasedness constraint. This constraint can be found by noticing that $E[p_{OK}(\mathbf{Z};\mathbf{s}_0)] = E[\sum_{i=1}^n \lambda_{OK}(\mathbf{s}_i)Z(\mathbf{s}_i)] = \mu\sum_{i=1}^n \lambda_{OK}(\mathbf{s}_i)$, which must equal μ for p_OK(Z; s₀) to be unbiased. As a consequence the weights must sum to one. This does not imply, by the way, that kriging weights are positive.
If the mean of the random field is μ(s) = x′(s)β, it is not sufficient to require that the kriging weights sum to one. Instead we need
$$E\left[\sum_{i=1}^{n}\lambda_{UK}(\mathbf{s}_i)Z(\mathbf{s}_i)\right] = \sum_{i=1}^{n}\lambda_{UK}(\mathbf{s}_i)\mathbf{x}'(\mathbf{s}_i)\boldsymbol\beta = \mathbf{x}'(\mathbf{s}_0)\boldsymbol\beta.$$
Using matrix/vector notation this constraint can be expressed more elegantly. Write the universal kriging model as Z(s) = X(s)β + δ(s) and the predictor as p_UK(Z; s₀) = λ′Z(s), where λ is the vector of (universal) kriging weights. For p_UK(Z; s₀) to be unbiased we need
$$\boldsymbol\lambda'\mathbf{X} = \mathbf{x}'(\mathbf{s}_0).$$
Minimization of
$$E\left[\left\{Z(\mathbf{s}_0) - \sum_{i=1}^{n}\lambda(\mathbf{s}_i)Z(\mathbf{s}_i)\right\}^2\right] \quad\text{subject to}\quad \boldsymbol\lambda'\mathbf{1} = 1 \;\text{(OK)} \quad\text{or}\quad \boldsymbol\lambda'\mathbf{X} = \mathbf{x}'(\mathbf{s}_0) \;\text{(UK)}$$
to derive the ordinary or universal kriging weights is a constrained optimization problem. It can be solved as an unconstrained minimization problem using one (OK) or more (UK) Lagrange multipliers (see §A9.9.4 for details and derivations). The resulting predictors can be expressed in numerous ways. We prefer
$$p_{UK}(\mathbf{Z};\mathbf{s}_0) = \mathbf{x}'(\mathbf{s}_0)\hat{\boldsymbol\beta}_{GLS} + \mathbf{c}'\boldsymbol\Sigma^{-1}\big(\mathbf{Z}(\mathbf{s}) - \mathbf{X}\hat{\boldsymbol\beta}_{GLS}\big), \qquad [9.40]$$
As a special case of [9.40], where x = 1, the ordinary kriging predictor is obtained:
$$p_{OK}(\mathbf{Z};\mathbf{s}_0) = \hat\mu + \mathbf{c}'\boldsymbol\Sigma^{-1}(\mathbf{Z}(\mathbf{s}) - \mathbf{1}\hat\mu). \qquad [9.42]$$
Here, μ̂ is the generalized least squares estimator of the mean,
$$\hat\mu = (\mathbf{1}'\boldsymbol\Sigma^{-1}\mathbf{1})^{-1}\mathbf{1}'\boldsymbol\Sigma^{-1}\mathbf{Z}(\mathbf{s}) = \frac{\mathbf{1}'\boldsymbol\Sigma^{-1}\mathbf{Z}(\mathbf{s})}{\mathbf{1}'\boldsymbol\Sigma^{-1}\mathbf{1}}. \qquad [9.43]$$
Comparing [9.40] with [9.36], the optimal predictor in a Gaussian random field, we again notice a striking resemblance. The question raised at the end of the previous subsection about the effects of substituting an estimate μ̂(s₀) for the unknown mean can now be answered. Substituting an estimate retains certain best properties of the linear predictor. It remains unbiased provided the estimate for the mean is unbiased and the model for the mean is correct. It remains an exact interpolator, and the predictor has the form of an estimate of the mean (x′(s₀)β̂) adjusted by surrounding values, with adjustments depending on the strength of the spatial correlation (c, Σ). Kriging predictors are obtained only if the generalized least squares estimates [9.41] (or [9.43] for OK) are substituted, however.
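Computationally, ordinary kriging amounts to the GLS step [9.43] followed by the adjustment [9.42], i.e., two linear solves with Σ. A numpy sketch (our own illustration):

    import numpy as np

    def ordinary_kriging(z, Sigma, c):
        """Ordinary kriging: GLS mean [9.43], then the predictor [9.42]."""
        one = np.ones_like(z)
        Si_one = np.linalg.solve(Sigma, one)        # Sigma^{-1} 1
        mu_hat = (Si_one @ z) / (Si_one @ one)      # [9.43]
        w = np.linalg.solve(Sigma, c)               # Sigma^{-1} c
        return mu_hat + w @ (z - one * mu_hat)      # [9.42]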
In the formulations [9.40] and [9.42] it is not immediately obvious what the kriging weights are; some algebra leads to explicit expressions, which can be compared to the kriging error for simple kriging [9.39]. Since
$$(\mathbf{x}(\mathbf{s}_0) - \mathbf{X}'\boldsymbol\Sigma^{-1}\mathbf{c})'(\mathbf{X}'\boldsymbol\Sigma^{-1}\mathbf{X})^{-1}(\mathbf{x}(\mathbf{s}_0) - \mathbf{X}'\boldsymbol\Sigma^{-1}\mathbf{c}) \ge 0,$$
the mean square prediction error of universal kriging exceeds that of simple kriging. Here $\hat{\boldsymbol\beta}_{GLS}$ is the generalized least squares estimator $\hat{\boldsymbol\beta}_{GLS} = (\mathbf{X}'\boldsymbol\Sigma^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol\Sigma^{-1}\mathbf{Z}(\mathbf{s})$. The covariance matrix Σ is usually constructed from a model of the semivariogram utilizing the
simple relationship between covariances and semivariances in second-order stationary ran-
dom fields. The modeling of the semivariogram requires, however, that the random field is mean stationary, i.e., that large-scale structure is absent. If E[Z(s)] = x′(s)β, the large-scale trend must be removed before the semivariogram can be modeled. Failure to do so can severely distort the semivariogram. Figure 9.22 shows empirical semivariograms (dots) calculated from two sets of deterministic data. The left panel is the semivariogram of Z(x) = 1 + 0.5x, where x is a point on a transect. The right-hand panel is the semivariogram of the cubic polynomial Z(x) = 1 − 0.22x + 0.022x² − 0.0013x³. The power model fits the empirical semivariogram in the left panel very well and the Gaussian model provides a decent fit to the semivariogram of the cubic polynomial. The shapes of the semivariograms are due to trend only, however. There is nothing stochastic about the data. One must not conclude based on these graphs that the process on the left is intrinsically stationary and that the process on the right is second-order stationary.
[Figure 9.22: empirical semivariograms (dots) and fitted models (lines) for the two deterministic trends; left panel linear trend, right panel cubic trend; y-axis semivariogram, x-axis lag h from 0 to 15.]
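The drift effect is easy to reproduce. The Python sketch below (our own illustration) computes the Matheron estimator along a unit-spaced transect for the two deterministic functions; because the signs of the cubic's coefficients could not be recovered from the source with certainty, one plausible choice is shown:

    import numpy as np

    def transect_semivariogram(z, max_lag):
        """Matheron estimator along a unit-spaced transect."""
        return np.array([np.mean((z[h:] - z[:-h]) ** 2) / 2.0
                         for h in range(1, max_lag + 1)])

    x = np.arange(0.0, 30.0)
    z_lin = 1.0 + 0.5 * x                                   # linear drift
    z_cub = 1.0 - 0.22 * x + 0.022 * x**2 - 0.0013 * x**3   # assumed cubic signs

    gam_lin = transect_semivariogram(z_lin, 15)   # rises steadily: pure drift
    gam_cub = transect_semivariogram(z_cub, 15)   # smooth rise, yet nothing stochastic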
Certain types of mean nonstationarity (drift) can be inferred from the semivariogram
(Neuman and Jacobson 1984, Russo and Jury 1987b). A linear drift, for example, causes the
semivariogram to increase as in the left-hand panel of Figure 9.22. Using the semivariogram
as a diagnostic procedure for detecting trends in the mean function is dangerous as the drift-
contaminated semivariogram may suggest a valid theoretical semivariogram model. Kriging a
New residuals are obtained and the estimate of the semivariogram is updated. The only downside of this approach is that the residuals obtained from detrending the data do not behave exactly as the random component δ(s) of the model does. Assume that the model is initially detrended by ordinary least squares,
$$\hat{\boldsymbol\beta}_{OLS} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Z}(\mathbf{s}),$$
and the fitted residuals $\hat{\boldsymbol\delta}_{OLS} = \mathbf{Z}(\mathbf{s}) - \mathbf{X}\hat{\boldsymbol\beta}_{OLS}$ are formed. The error process δ(s) of the model has mean 0, variance-covariance matrix Σ, and semivariogram γ_δ(h) = ½Var[δ(s) − δ(s + h)]. The vector of fitted residuals also has mean 0, provided the model for the large-scale structure was correct, but it does not have the proper semivariogram or variance-covariance matrix. Since $\hat{\boldsymbol\delta}(\mathbf{s}) = (\mathbf{I} - \mathbf{H})\mathbf{Z}(\mathbf{s})$, where H is the hat matrix H = X(X′X)⁻¹X′, it is established that
$$\text{Var}[\hat{\boldsymbol\delta}(\mathbf{s})] = (\mathbf{I} - \mathbf{H})\boldsymbol\Sigma(\mathbf{I} - \mathbf{H}) \ne \boldsymbol\Sigma.$$
The fitted residuals exhibit more negative correlations than the error process and the estimate of the semivariogram based on the residuals will be biased. Furthermore, the residuals do not have the same variance as the δ(s) process. It should be noted that if Σ were known and the semivariogram were estimated based on GLS residuals $\hat{\boldsymbol\delta}_{GLS} = \mathbf{Z}(\mathbf{s}) - \mathbf{X}\hat{\boldsymbol\beta}_{GLS}$, the semivariogram estimator would still be biased (see Cressie 1993, p. 166). The bias comes about
because residuals satisfy constraints that are not properties of the error process. For example,
the fitted OLS residuals will sum to zero. The degree to which a semivariogram estimate
derived from fitted residuals is biased depends on the method used for detrending as well as
the method of semivariogram estimation. Since the bias is typically more substantial at large
where $p_{SK}(\hat{\boldsymbol\delta}; \mathbf{s}_0)$ is the simple kriging predictor of the residual at location s₀, which uses the variance-covariance matrix Σ̂ formed from the semivariogram fitted to the residuals. Hence $p_{SK}(\hat{\boldsymbol\delta}; \mathbf{s}_0) = \hat{\mathbf{c}}'\hat{\boldsymbol\Sigma}^{-1}\hat{\boldsymbol\delta}(\mathbf{s})$, making use of the fact that $E[\hat{\boldsymbol\delta}(\mathbf{s})] = \mathbf{0}$. Comparing the naïve predictor [9.47] with the universal kriging predictor [9.46] it is clear that the naïve approach is not equivalent to universal kriging: Σ is not known and the trend is not estimated by GLS. The predictor [9.47] does have some nice properties, however. It is an unbiased predictor in the sense that $E[\tilde{p}(\mathbf{Z};\mathbf{s}_0) - Z(\mathbf{s}_0)] = 0$ and it remains a perfect interpolator. In fact, all methods that are perfect interpolators of the residuals are perfect interpolators of Z(s), even if the trend model is incorrectly specified. Furthermore, this approach does not require iterations. The naïve predictor is not a best linear unbiased predictor, however, and the mean square prediction error should not be calculated by the usual formulas for kriging variances (e.g., [9.45]).
The naïve approach can be generalized in the sense that one might use any suitable method
for trend removal to obtain residuals, krige those, and add the kriged residuals to the esti-
mated trend. This is the rationale behind median polish kriging recommended by Cressie
(1986) for random fields with drift to avoid the operational difficulties of universal kriging.
It must be noted that these difficulties do not arise in maximum likelihood (ML) or re-
stricted maximum likelihood (REML) estimation. In §9.2.4 ML and REML for estimating the
covariance parameters of a second-order stationary spatial process were discussed. Now con-
sider the spatial model
$$\mathbf{Z}(\mathbf{s}) = \mathbf{X}(\mathbf{s})\boldsymbol\beta + \boldsymbol\delta(\mathbf{s}),$$
where δ(s) is a second-order stationary random field with mean 0 and variance-covariance matrix Σ(θ). Under Gaussianity, Z(s) ~ G(X(s)β, Σ(θ)) and estimates of the mean parameters β and the spatial dependency parameters θ can be obtained simultaneously by maximizing the likelihood or restricted likelihood of Z(s). In practice one must choose a parametric covariance model for Σ(θ), which seems to open the same Pandora's box as in the cat and mouse game of universal kriging. The operational difficulties are minor in the case of ML or REML estimation, however. Initially, one should estimate β by ordinary least squares and calculate the empirical semivariogram of the residuals. From a graph of the empirical semivariogram possible parametric models for the semivariogram (covariogram) can be determined. In contrast to the least squares based methods discussed above one does not estimate θ based on the
For prediction, the ordinary kriging predictor
$$p_{OK}(\mathbf{Z};\mathbf{s}_0) = \hat\mu + \mathbf{c}'\boldsymbol\Sigma^{-1}(\mathbf{Z}(\mathbf{s}) - \mathbf{1}\hat\mu)$$
must be calculated for each location at which predictions are desired. Although only the vector c of covariances between Z(s₀) and Z(s) must be recalculated every time s₀ changes, even for moderately sized spatial data sets the inversion (and storage) of the matrix Σ is a formidable problem. A solution to this problem is to consider for prediction of Z(s₀) only observed data points within a neighborhood of s₀, called the kriging neighborhood. As s₀ changes, this is akin to sliding a window across the domain and excluding all points outside the window in calculating the kriging predictor. If n(s₀) = 25 points are in the neighborhood at s₀, then only a (25 × 25) matrix must be inverted. Using a kriging neighborhood rather than all the data is sometimes referred to as local kriging. It has its advantages and disadvantages.
Among the advantages of local kriging is not only computational efficiency; it might also
be reasonable to assume that the mean is at least locally stationary, even if the mean is glo-
bally nonstationary. This justification is akin to the reasoning behind using local linear regres-
sions in the nonparametric estimation of complicated trends (see §4.7). Ordinary kriging
performed locally is another approach to avoid the operational difficulties of universal kriging performed globally. Local kriging essentially assigns kriging weight λ(s_i) = 0 to all points s_i outside the kriging neighborhood. Since the best linear unbiased predictor is obtained by allowing all data points to contribute to the prediction of Z(s₀), local kriging predictors are no longer best. In addition, the user needs to decide on the size and shape of the
kriging neighborhood. This is no trivial task. The optimal kriging neighborhood depends in a
complex fashion on the parameters of the semivariogram, the large-scale trend, and the spatial
configuration of the sampling points.
At first glance it may seem reasonable to define the kriging neighborhood as a circle
around s₀ with radius equal to the range of the semivariogram. This is not a good solution, because although points further removed from s₀ than the range are not spatially autocorrelated with Z(s₀), they are autocorrelated with points that lie within the range from s₀. Chilès and
Delfiner (1999, p. 205) refer to this as the relay effect. A practical solution in our opinion is
to select the radius of the kriging neighborhood as the lag distance up to which the empirical
semivariogram was modeled. One half of the maximum lag distance in the data is a frequent
recommendation. The shape of the kriging neighborhood also deserves consideration. Rules
that define neighborhoods as the n closest points will lead to elongated shapes if the sam-
pling intensity along a transect is higher than perpendicular to the transect. We prefer circular
kriging neighborhoods in general that can be suitably expanded based on some criteria about
for example, use a circular kriging neighborhood with a 10-unit radius. If the number of points in this radius is less than 15, the radius is suitably increased to honor the minpoints= option. If the neighborhood with radius 10 contains more than 35 observations, the radius is similarly decreased. If the neighborhood is defined as the nearest n observations, the predict statement is replaced by (for n = 20) predict var=Z radius=10 numpoints=20;.
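The same neighborhood logic is easy to emulate outside of SAS. The Python sketch below (our own rendering of the radius/minpoints/maxpoints rules just described) returns the indices of the observations that would enter the local kriging system at s₀:

    import numpy as np

    def neighborhood(coords, s0, radius=10.0, minpoints=15, maxpoints=35):
        """Observations used for local kriging at s0 under radius rules."""
        d = np.sqrt(((coords - s0) ** 2).sum(axis=1))
        order = np.argsort(d)                   # nearest first
        inside = order[d[order] <= radius]
        if len(inside) < minpoints:
            return order[:minpoints]            # expand to nearest minpoints
        if len(inside) > maxpoints:
            return inside[:maxpoints]           # shrink to closest maxpoints
        return inside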
Positivity Constraints
In ordinary kriging, the only constraint placed on the kriging weights λ(s_i) is
$$\sum_{i=1}^{n}\lambda(\mathbf{s}_i) = 1,$$
which guarantees unbiasedness. This does not rule out that individual kriging weights may be
negative. For attributes that take on positive values only (yields, weights, probabilities, etc.) a
potential problem lurks here since the prediction of a negative quantity is not meaningful. But
just because some kriging weights are negative does not imply that the resulting predictor
$$\sum_{i=1}^{n}\lambda(\mathbf{s}_i)Z(\mathbf{s}_i)$$
is negative and in many applications the predicted values will honor the positivity require-
ment. To exclude the possibility of negative predicted values additional constraints can be im-
posed on the kriging weights. For example, rather than minimizing
$$E\left[\left\{Z(\mathbf{s}_0) - \sum_{i=1}^{n}\lambda(\mathbf{s}_i)Z(\mathbf{s}_i)\right\}^2\right]$$
subject to $\sum_{i=1}^{n}\lambda(\mathbf{s}_i) = 1$, one can minimize the mean square error subject to $\sum_{i=1}^{n}\lambda(\mathbf{s}_i) = 1$ and λ(s_i) ≥ 0 for all s_i. Barnes and Johnson (1984) solve this minimization problem through
quadratic programming and find that a solution can always be obtained in the case of an un-
known but constant mean, thereby providing an extension of ordinary kriging. The positivity
and the sum-to-one constraint together also ensure that predicted values lie between the
smallest and largest value at observed locations. In our opinion this is actually a drawback.
Unless there is a compelling reason to the contrary, one should allow the predictions to ex-
tend outside the range of observed values. A case where it is meaningful to restrict the range
of the predicted values is indicator kriging (§A9.9.6), where the attribute being predicted is a binary (0, 1) variable. Since the mean of a binary random variable is a probability, predictions outside of the (0, 1) interval are difficult to justify. Cressie (1993, p. 143) calls the extra constraint of positive kriging weights “heavy-handed.”
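In terms of the weights, the constrained problem is a quadratic program: minimize σ² − 2λ′c + λ′Σλ subject to 1′λ = 1 and λ ≥ 0. The sketch below (ours; it substitutes scipy's general-purpose SLSQP solver for the dedicated quadratic programming algorithm of Barnes and Johnson 1984) illustrates the idea:

    import numpy as np
    from scipy.optimize import minimize

    def positive_kriging_weights(Sigma, c):
        """Kriging weights under sum-to-one and nonnegativity constraints.
        The constant sigma2 is dropped since it does not affect the argmin."""
        n = len(c)
        mspe = lambda lam: lam @ Sigma @ lam - 2.0 * (lam @ c)
        cons = ({'type': 'eq', 'fun': lambda lam: lam.sum() - 1.0},)
        res = minimize(mspe, np.full(n, 1.0 / n), method="SLSQP",
                       bounds=[(0.0, None)] * n, constraints=cons)
        return res.x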
• If a spatial data set consists of more than one attribute and stochastic rela-
tionships exist among them, these relationships can be exploited to improve
predictive ability.
• Commonly one attribute, Z₁(s), say, is designated the primary attribute and Z₂(s), …, Z_k(s) are termed the secondary attributes.
The spatial prediction methods discussed thus far predict a single attribute Z(s) at unobserved locations s₀. In most applications, data collection is not restricted to a single attribute. Other variables are collected at the same or different spatial locations, or the same attribute is observed at different time points. Consider the case of two spatially varying attributes Z₁ and Z₂ for the time being. To be general it is not required that Z₁ and Z₂ are observed at the same locations although this will often be the case. The vectors of observations on Z₁ and Z₂ are denoted
$$\mathbf{Z}_1(\mathbf{s}_1) = [Z_1(\mathbf{s}_{11}), \ldots, Z_1(\mathbf{s}_{1n_1})]' \qquad \mathbf{Z}_2(\mathbf{s}_2) = [Z_2(\mathbf{s}_{21}), \ldots, Z_2(\mathbf{s}_{2n_2})]'.$$
If s_{1j} = s_{2j}, then Z₁ and Z₂ are said to be colocated; otherwise the attributes are termed non-colocated. Figure 9.23 shows the sampling locations at which soil samples were obtained in a
chisel-plowed field and the relationship between soil carbon and total soil nitrogen at the
sampled locations. Figure 9.24 shows the relationship between total organic carbon percen-
tage and sand percentage in sediment samples from the Chesapeake Bay collected through the
Environmental Monitoring and Assessment Program (EMAP) of the U.S.-EPA. In both cases
[Figure: left panel, map of sample locations (x, y); right panel, scatterplot of soil carbon versus total soil nitrogen.]
Figure 9.23. Sample locations in a field where carbon and nitrogen were measured in soil
samples (left panel). Relationship between soil carbon and total soil nitrogen (right panel).
Data kindly provided by Dr. Thomas G. Mueller, Department of Agronomy, University of
Kentucky. Used with permission.
[Figure: total organic carbon percentage plotted against sand percentage, 10 to 90.]
Figure 9.24. Total organic carbon percentage as a function of sand percentage collected at 47 base stations in Chesapeake Bay in 1993 through the US Environmental Protection Agency's Environmental Monitoring and Assessment Program (EMAP).
The attributes are usually not symmetric in that one attribute is designated the primary
variable of interest and the other attributes are secondary or auxiliary variables. Without loss
of generality we designate Z₁(s) as the primary attribute and Z₂(s), …, Z_k(s) as the secondary attributes.
Ordinary Cokriging
The goal of cokriging is to find a best linear unbiased predictor of Z₁(s₀), the primary attribute at a new location s₀, based on Z₁(s) and Z₂(s). Extending the notation from ordinary kriging, the predictor can be written as
$$p_1(\mathbf{Z}_1, \mathbf{Z}_2; \mathbf{s}_0) = \sum_{i=1}^{n_1}\lambda_{1i}Z_1(\mathbf{s}_i) + \sum_{j=1}^{n_2}\lambda_{2j}Z_2(\mathbf{s}_j) = \boldsymbol\lambda_1'\mathbf{Z}_1(\mathbf{s}) + \boldsymbol\lambda_2'\mathbf{Z}_2(\mathbf{s}). \qquad [9.48]$$
It is not required that Z₁(s) and Z₂(s) are colocated. Certain assumptions must be made, however. It is assumed that the means of Z₁(s) and Z₂(s) are constant across the domain D and that Z₁(s) has covariogram C₁(h) and Z₂(s) has covariogram C₂(h). Furthermore, there is a cross-covariance function that expresses the spatial dependency between Z₁(s) and Z₂(s + h),
$$C_{12}(\mathbf{h}) = \text{Cov}[Z_1(\mathbf{s}), Z_2(\mathbf{s} + \mathbf{h})]. \qquad [9.49]$$
The unbiasedness requirement implies that E[p₁(Z₁, Z₂; s₀)] = E[Z₁(s₀)] = μ₁, which implies in turn that 1′λ₁ = 1 and 1′λ₂ = 0. Minimizing the mean square prediction error
$$E\{Z_1(\mathbf{s}_0) - p_1(\mathbf{Z}_1, \mathbf{Z}_2; \mathbf{s}_0)\}^2$$
subject to these constraints leads to the cokriging system of equations [9.50],
where Σ₁₁ = Var[Z₁(s)], Σ₂₂ = Var[Z₂(s)], Σ₁₂ = Cov[Z₁(s), Z₂(s)], m₁ and m₂ are Lagrange multipliers, and c₁₀ = Cov[Z₁(s), Z₁(s₀)], c₂₀ = Cov[Z₂(s), Z₁(s₀)]. These equations are solved for λ₁, λ₂, m₁, and m₂ and the minimized mean square prediction error, the cokriging variance, is calculated as
$$\sigma^2_{CK}(\mathbf{s}_0) = \text{Var}[Z_1(\mathbf{s}_0)] - \boldsymbol\lambda_1'\mathbf{c}_{10} - \boldsymbol\lambda_2'\mathbf{c}_{20} - m_1.$$
Cokriging utilizes two types of correlations: spatial autocorrelation due to spatial proximity (Σ₁₁ and Σ₂₂) and correlation among the attributes (Σ₁₂). To get a better understanding of the cokriging system of equations we first consider the special case where the secondary variable is not correlated with the primary variable. Then Σ₁₂ = 0 and c₂₀ = 0, and the cokriging equations reduce to
$$\text{(a)}\quad \boldsymbol\Sigma_{11}\boldsymbol\lambda_1 + \mathbf{1}m_1 = \mathbf{c}_{10}, \qquad \mathbf{1}'\boldsymbol\lambda_1 = 1$$
$$\text{(b)}\quad \boldsymbol\Sigma_{22}\boldsymbol\lambda_2 + \mathbf{1}m_2 = \mathbf{c}_{20} \equiv \mathbf{0}, \qquad \mathbf{1}'\boldsymbol\lambda_2 = 0.$$
Equations (a) are the ordinary kriging equations (see §A9.9.4) and λ₁ will be identical to the ordinary kriging weights. From (b) one obtains $\boldsymbol\lambda_2 = \boldsymbol\Sigma_{22}^{-1}\big(\mathbf{c}_{20} - \mathbf{1}\{\mathbf{1}'\boldsymbol\Sigma_{22}^{-1}\mathbf{c}_{20}\}/\{\mathbf{1}'\boldsymbol\Sigma_{22}^{-1}\mathbf{1}\}\big) = \mathbf{0}$. The cokriging predictor reduces to the ordinary kriging predictor,
$$p_1(\mathbf{Z}_1, \mathbf{Z}_2; \mathbf{s}_0) = p_{OK}(\mathbf{Z}_1; \mathbf{s}_0) = \boldsymbol\lambda_{1,OK}'\mathbf{Z}_1(\mathbf{s}),$$
and σ²_CK(s₀) reduces to σ²_OK(s₀). There is no benefit in using a secondary attribute unless it is correlated with the primary attribute. There is also no harm in doing so.
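Assembling the full cokriging system and solving it is again one linear solve. The numpy sketch below (our own block-matrix assembly of the equations described above, with two Lagrange multipliers m₁ and m₂) returns the weight vectors λ₁ and λ₂:

    import numpy as np

    def ordinary_cokriging_weights(S11, S12, S22, c10, c20):
        """Solve for lambda1, lambda2 subject to 1'lambda1 = 1, 1'lambda2 = 0."""
        n1, n2 = len(c10), len(c20)
        A = np.zeros((n1 + n2 + 2, n1 + n2 + 2))
        A[:n1, :n1] = S11                          # Var[Z1(s)]
        A[:n1, n1:n1 + n2] = S12                   # Cov[Z1(s), Z2(s)]
        A[n1:n1 + n2, :n1] = S12.T
        A[n1:n1 + n2, n1:n1 + n2] = S22            # Var[Z2(s)]
        A[:n1, -2] = 1.0; A[-2, :n1] = 1.0         # constraint 1'lambda1 = 1
        A[n1:n1 + n2, -1] = 1.0; A[-1, n1:n1 + n2] = 1.0   # 1'lambda2 = 0
        b = np.concatenate([c10, c20, [1.0, 0.0]])
        sol = np.linalg.solve(A, b)
        return sol[:n1], sol[n1:n1 + n2]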
Now consider the special case where Z_i(s) is the observation of Z(s) at time t_i, so that the variance-covariance matrix Σ₁₁ describes spatial dependencies at time t₁ and Σ₂₂ spatial dependencies at time t₂. Then Σ₁₂ contains the covariances of the single attribute across space and time and the kriging system can be used to produce maps of the attribute at future points in time.
Extensions of ordinary cokriging to universal cokriging are relatively straightforward in principle. Chilès and Delfiner (1999, Ch. 5.4) consider several cases. If the mean functions E[Z₁(s)] = X₁β₁ and E[Z₂(s)] = X₂β₂ are unrelated, that is, each variable has a mean function of its own and the coefficients are not related, the cokriging system [9.50] is extended accordingly. Universal cokriging has not received much application since it is hard to imagine a situation in which the mean functions of the attributes are unrelated while Σ₁₂ is not a zero matrix. Chilès and Delfiner (1999, p. 301) argue that cokriging is typically performed as ordinary cokriging for this reason.
The error term of this model is a (second-order) stationary spatial process δ₁(s) with mean 0 and covariance function C₁(h) (semivariogram γ₁(h)). The complete model can then be written as
$$Z_1(\mathbf{s}) = \mathbf{x}'(\mathbf{s})\boldsymbol\beta + \delta_1(\mathbf{s}), \quad E[\delta_1(\mathbf{s})] = 0, \quad \text{Cov}[\delta_1(\mathbf{s}), \delta_1(\mathbf{s}+\mathbf{h})] = C_1(\mathbf{h}). \qquad [9.53]$$
This model resembles the universal kriging model, but while there x(s) is a polynomial response surface in the spatial coordinates, here x(s) is a function of secondary attributes observed at locations s. It is for this reason that the multiple spatial regression model is particularly meaningful in our opinion. In many applications colorful maps of an attribute are
[Figure: field map with variety numbers arranged in rows within blocks; axes latitude and longitude.]
Figure 9.25. Layout of wheat variety trial at Alliance, Nebraska. Lines show block boundaries, numbers identify the placement of varieties within blocks. There are four blocks and 56 varieties. Drawn from data in Littell et al. (1996).
In the notation of §9.4 we are concerned with the spatial mean model
$$\mathbf{Z}(\mathbf{s}) = \boldsymbol\mu(\mathbf{s}) + \boldsymbol\delta(\mathbf{s}), \qquad \boldsymbol\mu(\mathbf{s}) = \mathbf{X}(\mathbf{s})\boldsymbol\beta. \qquad [9.54]$$
We maintain the dependency of the design/regressor matrix X(s) on the spatial location since X(s) may contain, apart from design (e.g., block) and treatment effects, other variables that depend on the spatial location of the experimental units, or the coordinates of the observations themselves, although that is not necessarily so. Zimmerman and Harville (1991) refer to [9.54] as a random field linear model. Since the spatial autocorrelation structure of δ(s) is modeled through a semivariogram or covariogram we take a direct approach to modeling spatial dependence rather than an autoregressive approach (in the vernacular of §9.3). This can be reconciled with the earlier observation that data from field experiments are typically lattice data, where autoregressive methods are more appropriate, by considering each observation as concentrated at the centroid of the experimental unit (see Ripley 1981, p. 94, for a contrasting view that utilizes block averages instead of point observations).
The model for the semivariogram/covariogram is critically important for the quality of
spatial predictions in kriging methods. In spatial random field models, where the mean func-
tion is of primary importance, it turns out that it is important to do a reasonable job at mod-
eling the second-order structure of δ(s), but as Zimmerman and Harville (1991) note, treat-
ment comparisons are relatively insensitive to the choice of covariance functions (provided
the set of functions considered is a reasonable one and that the mean function is properly
where the experimental errors /34 are uncorrelated random variables with mean ! and variance
5 # . A trend analysis changes this model to
]34 . 73 )56 /34 , [9.55]
where )56 is a polynomial in the row and column indices of the experimental units (Brownie,
Bowman, and Burton 1993). If <5 is the 5 th row and -6 the 6th column of the field layout, then
one may choose )56 "" <5 "# -6 "$ <5# "% -5# "& <5 -6 , for example, a second-order re-
sponse surface in the row and column indices. The difference to a random field linear model
is that the deterministic term )56 is assumed to account for the spatial dependency between
experimental units. It is a fixed effect. It does, however, appeal to the notion of a smooth-
scale variation in the sense that the spatial trends move smoothly across block boundaries.
The block effects have disappeared from model [9.55]. Applications of these trend analysis models can be found in Federer and Schlottfeldt (1954), Kirk, Haynes, and Monroe (1980), and Bowman (1990). Because it is assumed that the error terms e_ij remain uncorrelated, these are not spatial random field models in our sense and will not be discussed further; a brief sketch of how such a model can be fit appears below. For a comparison of trend and random field analyses see Brownie et al. (1993).
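To make [9.55] concrete, a trend analysis with treatment effects and a second-order response surface in the row and column indices could be fit with statements along the following lines. This is a minimal sketch; the data set FieldTrial and the variables yield, variety, r, and c are hypothetical.

proc glm data=FieldTrial;
   class variety;
   /* treatment effects plus a second-order polynomial in the row (r)
      and column (c) indices, the theta_kl term of [9.55] */
   model yield = variety r c r*r c*c r*c;
   lsmeans variety / pdiff;
run; quit;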
A second type of model that maintains independence of the errors is the nearest-neighbor model, which is based on differencing observations with each other or on taking differences between plot yields and cultivar averages. The Papadakis nearest-neighbor analysis (Papadakis 1937), for example, calculates residuals between plot yields and arithmetic treatment averages in the East-West and North-South directions and uses these residuals as covariates in the mean model (the θ_kl part of the trend analysis model). The Schwarzbach analysis relies on adjusted cultivar means, which are arithmetic means corrected for average responses in neighboring plots (Schwarzbach 1984).
In practical applications it may be difficult to choose between these various approaches to modeling spatial dependencies and to discriminate between different models. For example, changing the fixed-effects trend by including or eliminating terms in a trend analysis will change the autocorrelation of the model residuals. Brownie and Gumpertz (1997) conclude that it is necessary to account for major spatial trends as fixed effects in the model, but also that random field analyses are surprisingly robust to moderate misspecification of the fixed trend and retain a high degree of validity of tests and estimates of precision. The reason, in our opinion, is that a model which simultaneously models large- and small-scale stochastic trends is able, within limits, to capture omitted trends in the mean model through the spatially structured error process.
The three approaches differ in what is considered the correct model for analysis and how it is used. In (1) the correct model stems from the error-control, treatment, and observational design components. Treatment comparisons will always be unbiased under this approach, but can be inefficient if the design was not chosen carefully (as is the case in the Alliance, Nebraska example). In (2) the analyst is charged to develop a suitable model; the parameters of the model to be estimated are φ = [β′, θ′]′, where θ relates to the spatial dependency structure and β to the large-scale trend. As models for Σ(θ) we usually consider covariograms that are derived from the isotropic semivariogram models in §9.2.2, keeping the number of parameters in θ small. Because we work with covariances, it is assumed that the process is second-order stationary so that its covariogram is well-defined. Two general approaches to parameter estimation can be distinguished: likelihood and likelihood-type methods, which estimate θ and β simultaneously, and least squares methods, which estimate β given an externally obtained estimate of the spatial dependency.
Since Σ(θ) is usually unknown, we are faced with a quandary similar to that in universal kriging: 1. estimate β by ordinary least squares (OLS); 2. estimate the semivariogram, and hence θ, from the OLS residuals; 3. re-estimate β by generalized least squares (GLS) using Σ(θ̂). These steps can (and should) be iterated, replacing the OLS residuals in step 2 with GLS residuals after the first iteration. The final estimates of the mean parameters are estimated generalized least squares estimates

β̂_EGLS = (X′Σ(θ̂)⁻¹X)⁻¹ X′Σ(θ̂)⁻¹ Z(s).                  [9.57]
The same issues as in §9.4.4 must be raised here. The residuals lead to a biased estimate of the semivariogram of δ(s), and β̂_OLS is an inefficient estimator of the large-scale trend parameters. Since the emphasis in spatial random field linear models is often not on prediction but on estimation and hypothesis testing about β, these issues are not quite as critical as in the case of universal kriging. If the results of a random field linear model analysis are used to predict Z(s₀) as a function of covariates and the spatial autocorrelation structure, the issues regain importance.
Likelihood Methods
Likelihood methods circumvent these problems because the mean and covariance parameters are estimated simultaneously. On the other hand, they require distributional assumptions about Z(s) or δ(s). If δ(s) is a Gaussian random field, then twice the negative log-likelihood of Z(s) is

The profiled (negative) log-likelihood is obtained by substituting this expression back into ℓ(β, θ; z(s)), which is then only a function of θ and is minimized with respect to θ. The resulting estimate θ̂_M is the maximum likelihood estimate of θ, and the MLE of β is

β̂_M = (X′Σ(θ̂_M)⁻¹X)⁻¹ X′Σ(θ̂_M)⁻¹ Z(s).                [9.58]
Software Implementation
The three methods, GLS, ML, and REML, lead to very similar formulas for the β estimates. The mixed procedure in The SAS® System can be used to obtain any one of the three. The spatial covariance structure of δ(s) is specified through the repeated statement of the procedure. In contrast to the clustered data models in §7, all data points are potentially autocorrelated, which calls for the subject=intercept option of the repeated statement.
Assume that an analysis of OLS residuals leads to an exponential semivariogram with practical range 4.5, partial sill 10.5, and nugget 2.0. The spatial coordinates of the data points are stored in variables xloc and yloc of the SAS data set. The mean model consists of treatment effects and a linear response surface in the coordinates. The following statements obtain the EGLS estimates [9.57], preventing proc mixed from iteratively updating the covariance parameters (noiter option of the parms statement). The noprofile option prevents the profiling of an extra scale parameter from Σ(θ). The Table of Covariance Parameter Estimates will contain three rows entitled Variance, SP(EXP), and Residual. These correspond to the partial sill, the range, and the nugget effect, respectively. Notice that the parameterization of the exponential covariogram in proc mixed considers the range parameter to be one third of the practical range.
/* ----------------------------------------------------- */
/* Fit the model by EGLS for fixed covariogram estimates */
/* ----------------------------------------------------- */
proc mixed data=RFLMExample noprofile ;
class treatment;
model Z = treatment xloc yloc xloc*yloc / s;
parms /* sill */ ( 10.5 )
/* range */ ( 1.5 )
/* nugget */ ( 2.0 ) / noiter;
/* The local option of the repeated statement adds the */
/* nugget effect */
repeated /subject=intercept local type=sp(exp)(xloc yloc);
run; quit;
Restricted maximum likelihood estimates are obtained in proc mixed (REML is the procedure's default estimation method, so no method= option is needed) with the statements
proc mixed data=RFLMExample noprofile ;
class treatment;
model Z = treatment xloc yloc xloc*yloc / s;
parms /* sill */ ( 6 to 12 by 2 )
/* range */ ( 0.5 to 3 by 1.5 )
/* nugget */ ( 1 to 4 by 1.0 );
repeated /subject=intercept local type=sp(exp)(xloc yloc);
run; quit;
• Models for spatial lattice data are close relatives of time series models.
[Figure 9.26, a regular lattice of sites, appears here.]
Figure 9.27. Counties of North Carolina and possible neighborhood definitions for two counties: counties with adjoining borders (left part of map) or counties within 30 miles of the county seat (center part of map). County seats are shown as dots.
A binary weighting scheme is reasonable when sites are spaced regularly at uniform distances (Figure 9.26). If sites are arranged irregularly or represent areal units of different size and shape (Figure 9.27), weights should be chosen more carefully. Some possibilities (Haining 1990), illustrated with a short sketch after this list, are
• w_ij = ||s_i − s_j||^(−γ), γ > 0
• w_ij = exp{−||s_i − s_j||²}
• w_ij = (l_ij/l_i)^γ, where l_ij is the length of the common border between areas i and j and l_i is the perimeter of the border of area i
• w_ij = (l_ij/l_i)^τ / ||s_i − s_j||^γ.
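The first of these schemes, for instance, can be coded in a few lines of SAS/IML. This is a minimal sketch under assumed inputs: the five site coordinates and the choice γ = 1 are hypothetical.

proc iml;
/* inverse-distance weights w_ij = ||s_i - s_j||**(-gamma) */
s = {0 0, 1 0, 2 0, 0 1, 1 1};           /* hypothetical site coordinates */
n = nrow(s); gamma = 1;
W = j(n, n, 0);
do i = 1 to n;
   do j = 1 to n;
      if i ^= j then W[i,j] = sqrt(ssq(s[i,] - s[j,]))##(-gamma);
   end;
end;
print W;
quit;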
The weights are then collected in an (n × n) weight matrix W = [w_ij], and the statistical model is parameterized to incorporate the large-scale mean structure as well as the interactive spatial correlation structure. The two approaches that are common lead to the simultaneous and the conditional spatial autoregressive models. These combine the user's choice of neighbors with an appropriate model for the marginal or conditional distribution of the data that is consistent with the neighborhood structure.
The choice of the neighborhood structure is largely subjective. The fact that two sites have nonzero connectivity weights does not imply a causal relationship between the responses at the sites. It is a representation of the local variation due to extraneous conditions. Besag (1975) calls this "third-party" dependence. Imagine that a locally varying regressor variable on which Z(s) depends has not been observed. The localized neighborhood structure supplants the missing information by defining groups of sites that would have been affected similarly by the unobserved variable because they are in spatial proximity to each other.
In contrast to a spatial regression model, where secondary attributes are used as regressors and their values are considered fixed, the SSAR model regresses Z(s_i) onto the responses at neighboring sites.

It follows that E[Z(s)] = X(s)β and, if the e(s_i) are homoscedastic with variance σ_s², that

Var[Z(s)] = σ_s²(I − ρ_s W)⁻¹(I − ρ_s W′)⁻¹.
Instead of choosing a semivariogram or covariogram model for the smooth-scale variation, as in a spatial regression model, this spatial autoregressive model for lattice data requires estimation of only one parameter associated with the spatial autocorrelation, ρ_s. The structure and degree of the autocorrelation are determined jointly by the structure of W and the magnitude of ρ_s.
The second class of autoregressive models for lattice data, the conditional spatial autoregressive (CSAR) models, commences with the conditional mean and variance of a site's response given the observed values at all other sites. Denote by z(s)_−i the vector of observed values at all sites except the ith one. A CSAR model is then defined through

E[Z(s_i) | z(s)_−i] = μ(s_i) + ρ_c Σ_{j=1}^n w_ij (z(s_j) − μ(s_j)),  Var[Z(s_i) | z(s)_−i] = σ_i².    [9.61]
For spatial data the conditional and simultaneous formulations lead to different models. Assume that the conditional variances are identical, σ_i² ≡ σ_c². The marginal variance in the CSAR model is then given by Var[Z(s)] = σ_c²(I − ρ_c W)⁻¹, which is to be compared against Var[Z(s)] = σ_s²(I − ρ_s W)⁻¹(I − ρ_s W′)⁻¹ in the simultaneous scheme; a numerical comparison follows below. Even if σ_c² = σ_s² and ρ_c = ρ_s, the variance-covariance matrices will differ. Furthermore, since the variance-covariance matrix of Z(s) is symmetric, it is necessary in the CSAR model with constant conditional variance that the weight matrix W be symmetric. The SSAR model imposes no such restriction on the weights. If a lattice consists of irregularly shaped area units, asymmetric weights are often reasonable. Consider a study of urban sprawl with an irregular lattice of counties where a large county containing a metropolitan area is surrounded by smaller rural counties. It is reasonable to assume that what happens in a small county is very much determined by developments in the metropolitan area, while the development of the major city will be much less influenced by a rural county. In a regular lattice asymmetric dependency parameters may also be possible. A site located on the edge of the lattice will depend on an interior site differently from how an interior site depends on an edge site. In these cases an asymmetric neighborhood structure is called for, which rules out the CSAR model unless the conditional variances are adjusted.
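The algebraic difference between the two marginal variance matrices is easy to verify numerically. The following SAS/IML lines are a minimal sketch: the symmetric first-order weight matrix of a 2 × 2 lattice, ρ = 0.2, and unit variance are hypothetical choices.

proc iml;
/* marginal variances implied by SSAR and CSAR for the same W and rho */
W = {0 1 1 0,
     1 0 0 1,
     1 0 0 1,
     0 1 1 0};                        /* rook weights on a 2 x 2 lattice */
rho = 0.2; sigma2 = 1;
A = i(4) - rho*W;
V_ssar = sigma2 * inv(A) * inv(A`);   /* sigma^2 (I-rho W)^-1 (I-rho W')^-1 */
V_csar = sigma2 * inv(A);             /* sigma^2 (I-rho W)^-1 */
print V_ssar, V_csar;
quit;

Even with identical ρ and σ², the two matrices printed by this sketch differ, illustrating the point made above.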
SSAR models have disadvantages, too. The model disturbances e(s_i) and the responses Z(s_j) are not uncorrelated in these models, in contrast to autoregressive time series models. This causes ordinary least squares estimators to be inconsistent. In matrix/vector notation the CSAR model can be written as

Z(s) = μ(s) + ρ_c W(Z(s) − μ(s)) + ε(s),
For square lattices ρ is restricted to −0.25 ≤ ρ ≤ 0.25 as the size of the lattice increases. The range of the interaction parameter can be affected by standardization. If the rows of W are standardized to sum to one, then |ρ| < 1 (Haining 1990, p. 82). For regular lattices and the rook neighborhood definition without row standardization, the permissible ranges for ρ are shown in the next table.
Models [9.60] and [9.61] are termed first-order models since they involve only one set of neighborhood weights and a single interaction parameter. To make Z(s) a function of two interaction parameters that measure different distance effects, the SSAR model can be modified to a second-order model as

Z(s) = X(s)β + (ρ₁W₁ + ρ₂W₂)(Z(s) − X(s)β) + e(s).

For example, W₁ can be a rook neighborhood structure and W₂ a bishop neighborhood structure. The CSAR model can be similarly extended to a higher-order scheme (see Whittle 1954, Besag 1974, Haining 1990).
and one could estimate the mean parameters by least squares, minimizing

SSAR: σ_s⁻² (Z(s) − X(s)β)′(I − ρ_s W′)(I − ρ_s W)(Z(s) − X(s)β)
CSAR: σ_c⁻² (Z(s) − X(s)β)′(I − ρ_c W)(Z(s) − X(s)β).

Unfortunately, because the errors e(s_i) and data Z(s_j) in the SSAR model are not uncorrelated, the least squares estimates in the simultaneous scheme are not consistent (Whittle 1954, Mead 1967, Ord 1975). Ord (1975) devised a modified least squares procedure for the case E[Z(s)] = 0 that yields consistent estimates, but comments on its low efficiency. The CSAR model does not suffer from this shortcoming and least squares estimation is possible there.
When the spatial autoregressive model contains reactive effects (β) in addition to an autoregressive structure, it is desirable to obtain estimates of the large-scale mean structure and the interaction parameters simultaneously. The maximum likelihood method seems to be the method of choice. Unless the distribution of Z(s) is Gaussian, however, maximum likelihood estimation is numerically cumbersome. Ord (1975) adapted an iterative procedure developed by Cochrane and Orcutt (1949) for estimation in simultaneous time series models to obtain maximum likelihood estimates in the SSAR model. Haining (1990, p. 128) notes that no proof exists that this adapted algorithm converges to a local minimum in the spatial case. If the Gaussian assumption is reasonable, we prefer maximum likelihood estimation of the parameters in a simultaneous scheme with a profiling algorithm as outlined in §A9.9.7.
is a biased estimator of σ².
Misspecification of the geographic weight matrix tends to suppress statistical efficiency. If β̂_A are the estimates obtained under misspecification (using a weight matrix A instead of the correct W), then Var[c′β̂_A] ≥ Var[c′β̂_W]. Moderate to strong positive autocorrelation that is ignored nearly destroys the efficiency of the ordinary least squares estimator.
Griffith (1996) recommends the following:
1. It is better to posit some reasonable geographic weight matrix than to assume independence.
2. A relatively large number of areal units should be employed, at least 60.
3. Lower-order spatial models should be given preference over those of higher order.
4. It is better to employ a somewhat underspecified than a somewhat overspecified geographic weight matrix, as long as W ≠ 0.
The upshot is that first-order neighbor connectivity definitions are usually sufficient on
regular lattices and little is gained by extending to second-order definitions. On irregular
lattices complex neighborhood definitions are often not supported and large data sets are nec-
essary to distinguish between competing specifications of the W matrix. Less is more.
9.7.1. Introduction
The preceding discussion considered the random field {Z(s): s ∈ D ⊂ ℝ²} where D was a fixed set. D was discrete in the case of lattice data and continuous in the case of geostatistical data. A spatial point pattern (SPP) is a random field where D is a random set of locations at which certain events of interest occurred. Unless Z(s) is itself a random variable — a situation we will exclude from the discussion here (see Cressie 1993, Ch. 8.7, on marked point processes) — the focal point of statistical inquiry is the random set D itself. What kinds of statistical questions may be associated with studying the set of locations D? For example, one may ask whether the events are distributed completely at random within the study region.
Figure 9.28. Location of 514 maple trees in Lansing Woods, Clinton County, Michigan. Data described by Gerrard (1969), appear in Diggle (1983), and are included in S+SpatialStats®.
The events (locations of trees) graphed in Figure 9.28 represent a mapped point pattern, where all events within the study region have been located. A sampled point pattern, on the other hand, is one where a finite number of sample points is selected. At each point one collects either an area sample, by counting the number of events in a sampling area around the point, or a distance sample, by recording the distances between the sample points and nearby events. This chapter is concerned only with the analysis of mapped patterns. Diggle (1983) is an excellent reference for the analysis of sampled patterns.
A data set containing the results of mapping a spatial point pattern is deceptively simple. It may contain only the longitude and latitude of the recorded events. Answering even such a simple question as whether the points are distributed at random requires tools, however, that are quite different from what the reader has been exposed to so far. For example, little rigorous theory is available to derive the distribution of even simple test statistics, and testing hypotheses in spatial point patterns relies heavily on Monte Carlo (computer simulation) methods. It is our opinion that point pattern data are collected quite frequently in agronomic studies but rarely recognized and analyzed as such. This chapter is a brief introduction to spatial point pattern analysis. The interested reader is encouraged to supplement the limited discussion we provide with resources such as Ripley (1981), Diggle (1983), Ripley (1988), and our §A9.9.10 to §A9.9.13.
In [9.62] ds is an infinitesimal region (a small disk centered at location s) and |ds| is its area (volume). As the radius of the disk shrinks toward zero, the expected number of events in this area goes to zero, but so does the area |ds|. The function λ(s) obtained in the limit is the first-order intensity function of the spatial point process. Once λ(s) is known, the expected number of events in a region A can be determined by integrating the first-order intensity, E[N(A)] = ∫_A λ(s) ds.
Uniformity of events, one of the conditions of complete spatial randomness, implies that λ(s) = λ. The average number of events per unit area (λ(s)) does not depend on spatial location; it is the same everywhere. A point process with this property is termed homogeneous or first-order stationary. The expected number of events in A is then simply λ|A|, the (constant) expected number of events per unit area times the area. It is now seen that the assumption of a constant first-order intensity is the SPP equivalent of the assumption for geostatistical and lattice processes that the mean E[Z(s)] is constant. For the latter data types constancy of the mean does not imply the absence of spatial autocorrelation. By the same token, spatial point processes where λ(s) is independent of location are not necessarily CSR processes.
The first-order intensity conveys no information about the possible interaction of events, just as the means of two random variables tell us nothing about their covariance. The CSR hypothesis requires that, beyond uniformity, the numbers of events in disjoint regions be independent, Cov[N(A), N(B)] = 0 if A ∩ B = ∅. The spatial point process that embodies CSR is the homogeneous Poisson process (HPP). Testing the CSR hypothesis is thus equivalent to testing whether the observed pattern is a realization of a homogeneous Poisson process.
A deviation from CSR implies that events are either not independent or not uniformly distributed. A point pattern is called aggregated or clustered if events separated by short distances occur more frequently than is expected under CSR, and regular if they occur less frequently than in a homogeneous Poisson process (Figure 9.29).
Figure 9.29. Realizations of a CSR (A), regular (B), and clustered (C) process. Each pattern has 100 events on the unit square. SSI is the simple sequential inhibition process (Diggle et al. 1976, Diggle 1983), which does not permit events within a minimum distance of other events (see §A9.9.11).
Aggregated patterns are common and several theories have been developed to explain the
formation of clusters in biological applications. One explanation for clustering is through a
contagious process where the presence of one or more organisms increases the probability of
other organisms occurring in the same sample. In an aggregated process, contagiousness is
positive, resulting in an excess of events at small distances and fewer events separated by
large distances compared to a CSR process. Aggregation has also been explained in terms of
has an asymptotic χ² distribution with m − 1 degrees of freedom. Significantly small values of X² indicate regularity and significantly large values of X² indicate aggregation. For the χ² approximation to hold, counts in each quadrat should exceed 4 (or 5) in 80% of the cells and should be greater than 1 everywhere. This rule can be used to find a reasonable grid size to partition A.
The quadrat count statistic is closely related to the index of dispersion, which is a ratio of two variance estimates obtained with and without making any distributional assumptions. Under CSR the number of events in any one of the m subregions is a Poisson random variable whose mean and variance are estimated by n̄. Regardless of the spatial point process that generated the observed data, S² = (m − 1)⁻¹ Σ_{i=1}^m (n_i − n̄)² estimates the variance of the quadrat counts. The ratio I = S²/n̄ is called the index of dispersion and is related to [9.64] through

I = S²/n̄ = X²/(m − 1).                                  [9.65]
If the process is clustered, the quadrat counts will vary more than what is expected under CSR, and I will be large. If the process is regular, the counts will vary less, since all n_i will be similar to one another and close to the mean count n̄; the index of dispersion will be small. For a CSR process the index will be about one on average.
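The arithmetic behind I and X² is simple enough to sketch in a short SAS program. The sixteen quadrat counts below are hypothetical, chosen only to show the mechanics of [9.65].

data quadrats;                      /* hypothetical counts n_i, m = 16 */
   input count @@;
   datalines;
5 7 6 8 4 7 6 5 9 6 7 5 8 6 5 6
;
proc means data=quadrats noprint;
   var count;
   output out=qstats mean=nbar var=s2 n=m;
run;
data dispersion;
   set qstats;
   I  = s2 / nbar;                         /* index of dispersion [9.65] */
   X2 = (m - 1) * I;                       /* quadrat count statistic    */
   p_regular    = probchi(X2, m - 1);      /* small X2 -> regularity     */
   p_aggregated = 1 - probchi(X2, m - 1);  /* large X2 -> aggregation    */
run;
proc print data=dispersion; run;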
To test the CSR hypothesis with quadrat counts for the point patterns shown in Figure 9.29, k = 4 bin classes were used to partition the unit square into 4 × 4 = 16 = m quadrats. The resulting quadrat counts follow. Notice that Σ_{i=1}^{16} n_i = n = 100 for all three patterns and n̄ = 6.25. The even distribution of the counts for the sequential inhibition process (B) and the uneven distribution of the counts for the clustered process (C) are apparent.
Table 9.7. X² statistics for quadrat counts of CSR, SSI, and CLU processes

Process      X²      Pr(χ²₁₅ ≤ X²)   Pr(χ²₁₅ ≥ X²)
A: CSR      9.44        0.1466          0.8534
B: SSI      2.40        0.0001          0.9999
C: CLU     47.84        0.9998          0.0002
Using quadrat counts and the chi-square test is simple, but the method is sensitive to the choice of the subregions (grid size). Statistical tests based on distances between events, or between sampling points and events, avoid this problem.
Figure 9.30. Distance measurements used in CSR tests. Solid lines: sample point to nearest
event distances. Dashed lines: inter-event distances (also called event-to-event distances).
Dotted lines: nearest-neighbor distances (also called event-to-nearest-event distances).
As a test statistic we may choose the average nearest-neighbor distance ȳ, the average inter-event distance t̄, the estimate Ĝ₁(y₀) of the probability that the nearest-neighbor distance is at most y₀, or Ĥ₁(t₀). It is the user's choice which test statistic to use and, in the case of the empirical distribution functions, how to select y₀ and/or t₀. It is important that the test statistic constructed from event distances can be interpreted in the context of testing the CSR hypothesis. Compared to the average nearest-neighbor distance expected under CSR, ȳ will be smaller in a clustered and larger in a regular pattern. If y₀ is chosen small, Ĝ₁(y₀) will be larger than expected under CSR in a clustered pattern and smaller in a regular pattern.
The sampling distributions of distance-based test statistics are usually complicated and elusive, even under the assumption of complete spatial randomness. An exception is the quick test proposed by Ripley and Silverman (1978), which is based on one of the first ordered inter-event distances. If t₁ = min{t_ij}, then t₁² has an exponential distribution under CSR and an exact test is possible. Because of possible inaccuracies in determining locations, Ripley and Silverman recommend using the third smallest interevent distance. The asymptotic chi-squared distribution of these order statistics of interevent distances under CSR is given in their paper. Advances in computing power have made it possible to conduct tests of the CSR hypothesis by Monte Carlo (MC) procedures. Among their many advantages: they yield exact p-values (exact up to simulation variability) and they accommodate irregularly shaped regions.
Recall that the p-value of a statistical test is the probability of obtaining an outcome more extreme than the one observed if the null hypothesis is true (§1.6). If the p-value is smaller than the user-selected Type-I error level, the null hypothesis is rejected. An MC test is based on simulating s − 1 independent sets of data under the null hypothesis and calculating the test statistic for each set. Then the test statistic is obtained from the observed data and combined with the values of the test statistics from the simulations to form a set of s values. If the observed value of the test statistic is sufficiently extreme among the s values, the null hypothesis is rejected.
Formally, let u₁ be the value of a statistic U calculated from the observed data and let u₂, ..., u_s be the values of the test statistic generated by independent sampling from the distribution of U under the null hypothesis (H₀). If the null hypothesis is true, we have

Pr(u₁ = max{u_i, i = 1, ..., s}) = 1/s.

Notice that we consider u₁, the value obtained from the actual (nonsimulated) data, to be part of the sequence of all s values. If we reject H₀ when u₁ ranks kth largest or higher, this is a one-sided test of size k/s. When values of the u_i are tied, one can either randomly sort the u_i's within groups of ties or choose the least extreme rank for u₁. We prefer the latter method.
To illustrate, we perform a nearest-neighbor analysis for the maple data in Figure 9.28. For exposition we use s − 1 = 20 simulations, a number that should be increased in real applications. The observed pattern has the smallest average nearest-neighbor distance, ȳ₁ = 0.017828 (Output 9.1, sim=0). This yields a p-value for the hypothesis of complete spatial randomness of p = 0.0476 against the clustered alternative and p = 0.9524 against the regular alternative (Output 9.2).
Output 9.1.
Obs ybar sim rank
1 0.017828 0 1
2 0.021152 1 2
3 0.021242 1 3
4 0.021348 1 4
5 0.021498 1 5
6 0.021554 1 6
7 0.021596 1 7
8 0.021608 1 8
9 0.021630 1 9
10 0.021674 1 10
11 0.021860 1 11
12 0.022005 1 12
13 0.022013 1 13
14 0.022046 1 14
15 0.022048 1 15
16 0.022070 1 16
17 0.022153 1 17
18 0.022166 1 18
19 0.022172 1 19
20 0.022245 1 20
21 0.023603 1 21
Output 9.2.
One One
Test # of MC Sided Sided
Statistic runs rank Left P Right P
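The conversion from ranks to the one-sided p-values reported in Output 9.2 can be sketched in a few SAS statements. The data set mcresults with variables ybar and sim (sim=0 marking the observed pattern), and s = 21, mirror Output 9.1 but are assumptions of this sketch.

proc rank data=mcresults out=mcranked ties=low;
   var ybar;
   ranks r;                    /* ascending rank of each average distance */
run;
data mcpvalues;
   set mcranked;
   if sim = 0;                 /* keep the observed pattern only */
   s = 21;                     /* 20 simulations plus the observed value  */
   p_left  = r / s;            /* small ybar points toward clustering     */
   p_right = (s - r) / s;      /* complement, as in Output 9.2            */
run;

With the observed statistic ranking first (r = 1), this reproduces p = 1/21 = 0.0476 and p = 20/21 = 0.9524.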
Along with the ranking of the observed test statistic it is useful to prepare a graph of the simulation envelopes for the empirical distribution function of the nearest-neighbor distance. The upper and lower envelopes, U(y) and L(y), are computed from the simulated patterns and are plotted against Ḡ(y), the average empirical distribution function at y from the simulations,

Ḡ(y) = (s − 1)⁻¹ Σ_{j≠1} Ĝ_j(y).

This plot is overlaid with the observed empirical distribution function Ĝ₁(y), as shown in Figure 9.31. Clustering is evidenced by the Ĝ-function rising above the 45-degree line that corresponds to the CSR process (dashed line), regularity by a Ĝ-function below the dashed line. When the observed Ĝ-function crosses the upper or lower simulation envelope, the CSR hypothesis is rejected. For the maple data there is very strong evidence that the distribution of maple trees in the particular area is clustered.
Figure 9.31. Upper [U(y)] and lower [L(y)] simulation envelopes for 20 simulations and observed empirical distribution function (step function) for the maple data (Figure 9.28). The dashed line represents the G-function for a CSR process.
MC tests require a procedure to simulate the process under the null distribution. To test a point pattern for CSR requires an efficient method to generate a homogeneous Poisson process. The following algorithm simulates this process on the rectangle (0, a) × (0, b); a SAS rendering appears after the list.
1. Generate a random number from a Poisson(λab) distribution → n.
2. Order n independent U(0, a) random variables → X₁ ≤ X₂ ≤ ⋯ ≤ X_n.
3. Generate n independent U(0, b) random variables → Y₁, ..., Y_n.
4. Return (X₁, Y₁), ..., (X_n, Y_n) as the coordinates of the two-dimensional Poisson process on the rectangle.
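A direct data step rendering of the algorithm follows as a minimal sketch; the dimensions a = b = 1 and intensity λ = 100 are hypothetical, and the ordering in step 2 is omitted because it does not change the resulting pattern.

data hpp;
   a = 1; b = 1; lambda = 100;
   n = ranpoi(27182, lambda*a*b);  /* step 1: Poisson(lambda*a*b) count */
   do i = 1 to n;
      x = a*ranuni(0);             /* step 2: U(0,a) coordinate */
      y = b*ranuni(0);             /* step 3: U(0,b) coordinate */
      output;                      /* step 4: (x,y) is an event */
   end;
   keep x y;
run;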
If the point process is second-order stationary, then λ(s) = λ and λ₂(s₁, s₂) = λ₂(s₁ − s₂); the first-order intensity is constant and the second-order intensity depends only on the spatial separation between points. If the process is furthermore isotropic, the second-order intensity does not depend on the direction, only on the distance between pairs of points: λ₂(s₁, s₂) = λ₂(||s₁ − s₂||) = λ₂(h). Notice that any process for which the intensity depends on locations cannot be second-order stationary. The second-order intensity function depends on the expected value of the cross-product of counts in two regions, similar to the covariance between two random variables X and Y, which is a function of their expected cross-product, Cov[X, Y] = E[XY] − E[X]E[Y]. A downside of λ₂(s₁ − s₂) is its lack of physical interpretability. The remedy is to use interpretable measures that are functions of the second-order intensity or to perform the interaction analysis in the spectral domain (§A9.9.12).
K-function analysis in point patterns takes the place of semivariogram analysis in geostatistical data. The assumption of a constant mean there is replaced with the assumption of a constant intensity function here. The K-function has several advantages over the second-order intensity function [9.68].
• Its definition suggests a method of estimating K(h) from the average number of events less than distance h apart (see §A9.9.10).
9.8 Applications
The preceding sections hopefully have given the reader an appreciation of the unique stature of statistics for spatial data in the research worker's toolbox and a glimpse of the many types of spatial models and spatial analyses. In the preface to his landmark text, Cressie (1993) notes that "this may be the last time spatial Statistics will be squeezed between two covers." History proved him right. Since the publication of the revised edition of Cressie's Statistics for Spatial Data, numerous texts have appeared that each treat primarily one of the three types of spatial data (geostatistical, lattice, point patterns) at a length comparable to this entire volume. The many aspects and methods of this rapidly growing discipline that we have failed to address are not countable. The applications that follow are chosen to expose the reader to some of these topics by way of example.
§9.8.1 reiterates the importance of maintaining the spatial context in data that are georeferenced and the ensuing perils to modeling and data interpretation if this context is overlooked. Global and local versions of Moran's I statistic as measures of spatial autocorrelation are discussed there. In the analysis of geostatistical data the semivariogram or covariogram plays a central role. Kriging equations depend critically on it, and spatial regression/ANOVA models require information about the spatial dependency to estimate coefficients and treatment effects efficiently. §9.8.2 estimates empirical semivariograms and fits theoretical semivariogram models by least squares, (restricted) maximum likelihood, and composite likelihood. Point and block kriging are illustrated in §9.8.3 with an interesting application concerning the amount of lead and its spatial distribution on a shotgun range. Treatment comparisons in random field models and spatial regression models are examined in §9.8.4 and §9.8.5. Most methods for spatial data analysis we presented assume that the response variable is continuous. Many applications with georeferenced data involve discrete or non-Gaussian data. Spatial random field models can be viewed as special cases of mixed models. Extensions of generalized linear mixed models (§8) to the spatial context are discussed in §9.8.6, where the Hessian fly data are tackled with a spatially explicit model. Upon closer inspection many spatial data sets belong in the category of lattice data but are often modeled as if they were geostatistical data. Lattice models can be extremely efficient in explaining spatial variation. In §9.8.7 the spatial structure of wheat yields from a uniformity trial is examined with geostatistical and lattice models. It turns out that a simple lattice model explains the spatial structure more efficiently than the geostatistical approaches. The final application, §9.8.8, demonstrates the basic steps in analyzing a mapped point pattern: estimating its first-order intensity and Ripley's K-function, and using Monte Carlo inference based on distances to test the hypothesis of complete spatial randomness. A supplementary application concerning the spectral analysis of point patterns can be found in Appendix A1 (§A9.9.13).
While The SAS® System is our computing environment of choice for statistical analyses, its capabilities for spatial data analysis at the time of this writing did not extend to point patterns and lattice data. The variogram procedure is a powerful tool for estimating empirical semivariograms.
Figure 9.32. Two simulated lattice arrangements with identical frequency distribution of
points. Area of dots is proportional to the magnitude of the values.
The importance of retaining spatial information can be demonstrated with the following example. A rectangular 10 × 10 lattice was filled with 100 observations drawn at random from a G(0, 1) distribution. Lattice A is a completely random assignment of the observations to lattice cells.
Figure 9.33. Histogram of the 100 realizations in lattices A and B along with kernel density estimate. Both lattices produce identical sample frequencies.
[Figure 9.34 appears here: the average of Z(s + h) for ||h|| = 1 is plotted against Z(s).]
Figure 9.34. Lag-1 plots for lattices A (full circles) and B (open circles) of Figure 9.32. There is no trend between a value and the average value of its immediate neighbors in lattice A but a very strong trend in lattice B.
Figure 9.35. Row (left) and column (right) box-plots for Mercer and Hall grain yield data.
Other graphical displays and numerical summary measures were developed specifically for spatial data, for example, to describe, diagnose, and test the degree of spatial autocorrelation. With geostatistical data the empirical semivariogram provides an estimate of the spatial structure. With lattice data, join-count statistics have been developed for binary and nominal data (see, for example, Moran 1948, Cliff and Ord 1973, and Cliff and Ord 1981). Moran (1950) and Geary (1954) developed autocorrelation coefficients for continuous attributes observed on lattices. These coefficients are known as Moran's I and Geary's c and, like many other autocorrelation measures, compare an estimate of the covariation among the Z(s) to an estimate of their variation. Since the distributions of I and c tend to a Gaussian distribution with increasing sample size, these summary autocorrelation measures can also be used in confirmatory fashion to test the hypothesis of no (global) spatial autocorrelation in the data. In the remainder of this application we introduce Moran's I and estimation and inference based on it. With u_i = Z(s_i) − Z̄, Moran's I is calculated as
I = [n / Σ_{i,j} w_ij] · [Σ_{i=1}^n Σ_{j=1}^n w_ij u_i u_j / Σ_{i=1}^n u_i²].        [9.69]
In the absence of spatial autocorrelation I has expected value E[I] = −1/(n − 1); values I > E[I] indicate positive, values I < E[I] negative autocorrelation. It should be noted that the Moran test statistic bears a great resemblance to the Durbin-Watson (DW) test statistic used in linear regression analysis to test for serial dependence among residuals (Durbin and Watson 1950, 1951, 1971). The DW test replaces u with a vector of least squares residuals and considers squared lag-1 serial differences in place of u′Wu. To determine whether a deviation of I from its expectation is statistically significant, one relies on the asymptotic distribution of I, which is Gaussian with mean −1/(n − 1) and variance σ_I². The hypothesis of no spatial autocorrelation is rejected at the α × 100% significance level if

|Z_obs| = |I − E[I]| / σ_I

is more extreme than the z_{α/2} cutoff of a standard Gaussian distribution. Right-tailed (left-tailed) tests for positive (negative) autocorrelation compare Z_obs to z_α (z_{1−α}) cutoffs.
Two approaches are common to derive the variance σ_I². One can assume that the Z(s_i) are Gaussian or adopt a randomization framework. In the Gaussian approach the Z(s_i) are assumed G(μ, σ²), so that U_i ~ (0, σ²(1 − 1/n)) under the null hypothesis. In the randomization approach the Z(s_i) are considered fixed and are randomly permuted among the n lattice sites. There are n! equally likely random permutations and σ_I² is the variance of the n! Moran I values. A detailed derivation of, and formulas for, the variances under the two assumptions can be found in Cliff and Ord (1981, Ch. 2.3). If one adopts the randomization framework, an empirical p-value for the test of no spatial autocorrelation can be calculated by ranking the observed value of I among the n! − 1 possible remaining permutations. For even medium-sized lattices this is a computationally expensive procedure. The alternatives are to rely on the asymptotic Gaussian distribution to calculate p-values or to compare the observed I against only a random sample of the possible permutations.
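For readers who want to see [9.69] spelled out, the following SAS/IML lines compute the statistic and its expectation; the data vector z and the weight matrix W are hypothetical stand-ins for an actual lattice data set.

proc iml;
/* Moran's I per [9.69] with u_i = z_i - zbar */
z = {1.2, 0.4, 0.9, 1.5};             /* hypothetical attribute values    */
W = {0 1 1 0,
     1 0 0 1,
     1 0 0 1,
     0 1 1 0};                        /* hypothetical connectivity matrix */
n = nrow(z);
u = z - z[:];                         /* deviations from the mean         */
I  = (n / sum(W)) * (u` * W * u) / ssq(u);
EI = -1 / (n - 1);                    /* expected value of I under H0     */
print I EI;
quit;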
The SAS® macro %MoranI (contained on the CD-ROM) calculates the Z_obs statistics and p-values under the Gaussian and randomization assumptions. A data set containing the W matrix is passed to the macro through the w_data= option. The statements
title1 "Moran's I for Mercer and Hall Wheat Yield Data, Rook's Move";
%ContWght(rows=20,cols=25,move=rook,out=rook);
%MoranI(data=mercer,y=grain,row=row,col=col,w_data=rook);
produce Output 9.3. The observed I value of 0.4066 is clearly greater than the expected value. The standard errors σ_I based on randomization and Gaussianity differ only in the fourth decimal place in this application. In other instances the difference will be more substantial. There is overwhelming evidence that the data exhibit positive autocorrelation. Moran's I is somewhat sensitive to the choice of the neighborhood matrix W. If the rook definition (edges abut) is replaced by the bishop's move (touching corners),

title1 "Moran's I for Mercer and Hall Wheat Yield Data, Bishop's Move";
%ContWght(rows=20,cols=25,move=bishop,out=bishop);
%MoranI(data=mercer,y=grain,row=row,col=col,w_data=bishop);

the autocorrelation remains significant but the value of the test statistic is reduced by about 50% (Output 9.4). But Moran's I is even more sensitive to large-scale trends in the data. For a significant test result based on Moran's I to indicate spatial autocorrelation it is necessary that the mean of Z(s) be stationary. Otherwise subtracting Z̄ is not the appropriate shifting of the data that produces zero-mean random variables U_i. In fact, the I test may indicate significant "autocorrelation" if data are independent but have not been properly detrended.
Output 9.3.
Moran's I for Mercer and Hall Wheat Yield Data, Rook's Move
Output 9.4.
Moran's I for Mercer and Hall Wheat Yield Data, Bishop's Move
The test indicates strong positive "autocorrelation" which is an artifact of the changes in
Ec^ d rather than stochastic spatial dependency among the sites.
Output 9.5.
Moran's I for independent data with large-scale trend

_Type_    Observed I    E[I]    SE[I]    Zobs    Pr(Z > Zobs)
The data set xmat contains the regressor variables excluding the intercept. It should not contain any additional variables. This code fits a large-scale mean model with cubic column effects and no row effects (adding higher-order terms for column effects leaves the results essentially unchanged).
Output 9.6.
Moran's I for Mercer and Hall Wheat Yield Data
Calculated for Regression Residuals
OLS Estimates
Estimate StdErr Tobs Pr>|T|
Global Moran's I
I* E[I*] SE[I*] Zobs Pr > Zobs
0.32156 -0.0075 0.03202 10.2773 0
The %RegressI() and %MoranI() macros have an optional parameter local=. When set to 1 (default is local=0) the macros will not only calculate the global I (or I*) statistic but also local versions thereof. The idea of a local indicator of spatial association (LISA) is due to Anselin (1995). His notion was that although there may be no spatial autocorrelation globally, there may exist local pockets of positive or negative spatial autocorrelation in the data, so-called hot-spots. This is only one possible definition of what constitutes a hot-spot. One could also label as hot-spots sites that exceed (or fall short of) a certain threshold level. Hot-spot definitions based on autocorrelation measures designate sites as unusual if the spatial dependency is locally much different from that at other sites. The LISA version of Moran's I is

I_i = [n u_i / Σ_{i=1}^n u_i²] Σ_{j=1}^n w_ij u_j,      [9.72]

where i indexes the sites in the data set. That is, for each site s_i we calculate an I statistic based on information from neighboring sites. For a 10 × 10 lattice there are a total of 101 I statistics: the global I according to [9.69] or [9.71] and 100 local I statistics according to [9.72]. The expected value of I_i is E[I_i] = −w_i/(n − 1) with w_i = Σ_{j=1}^n w_ij. The interpretation is that if I_i < E[I_i], then sites connected to s_i have attribute values dissimilar from Z(s_i); a high (low) value at s_i is surrounded by low (high) values. If I_i > E[I_i], sites connected to s_i have attribute values similar to Z(s_i).
Figure 9.36. Local indicators of positive spatial autocorrelation (I_i > E[I_i]) calculated from regression residuals after removing column trends.
Figure 9.37. Total soil carbon data. Size of dots is proportional to soil carbon percentage.
Data kindly provided by Dr. Thomas G. Mueller, Department of Agronomy, University of
Kentucky. Used with permission.
Figure 9.38. Semivariogram cloud (upper panel) and box-plots of halved squared differences (middle panel) for median-polished residuals of total soil carbon percentage. The bottom panel shows box-plots of √|Z(s_i) − Z(s_j)|. Lag distance in feet.
There appears to be spatial structure in these data; the medians increase for small lag distances. The assumption of second-order stationarity is not unreasonable, as the medians remain relatively constant for larger lag distances. The square root difference plot is not an estimate of the semivariogram. The square root differences are, however, the basic ingredients of the robust Cressie and Hawkins semivariogram estimator, just as the halved squared differences are the basic elements of the Matheron estimator. The classical and robust empirical semivariogram estimators are obtained in The SAS® System with the code
proc variogram data=NoTillData outvar=svar1;
compute lagdistance=10 maxlags=40 robust;
coordinates xcoord=x ycoord=y;
var TC;
run;
proc print data=svar1; run;
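The second call, consistent with the twenty-nine lags of width 7 described below, would be along these lines (the output data set name svar2 matches its later use with proc nlin, but the exact statements are an assumption):

proc variogram data=NoTillData outvar=svar2;
   compute lagdistance=7 maxlags=29 robust;
   coordinates xcoord=x ycoord=y;
   var TC;
run;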
The first call to proc variogram calculates the estimator for forty lags of width 10, the second call calculates the semivariogram for twenty-nine lags of width 7. Thus, the two semivariograms extend to 400 and 200 feet, respectively (Figure 9.39). The number of pairs in a particular lag class is stored as variable count in the output data set of the variogram procedure (Output 9.7). The first observation, corresponding to LAG=-1, lists the number of observations, their sample mean (AVERAGE=0.83672), and their sample variance (COVAR=0.025998). The average distance among data pairs in the first lag class was 8.416 feet, the classical semivariogram estimate at that distance was 0.006533, and the robust semivariogram estimate was 0.005266. The estimate of the covariance function at this lag is 0.023649. Recall (i) that the estimate of the covariogram is biased and (ii) that SAS® reports the semivariogram estimates although the columns are labeled VARIOG and RVARIO. The number of pairs at each lag distance is sufficient to produce reliable estimates. Notice that the recommendation is to have at least
Output 9.7.
Obs LAG COUNT DISTANCE AVERAGE VARIOG COVAR RVARIO
The empirical semivariogram in the upper panel of Figure 9.39 shows an interesting rise at lag 200 ft. The number of pairs in lag classes 18 to 22 is sufficient to obtain a reliable estimate, so that sparseness of observations cannot be the explanation. The semivariogram appears to have a sill around 0.02 for lags less than 200 feet and a sill of 0.03 for lags greater than 200 feet. Possible explanations are nonstationarity in the mean and/or a nested stochastic process; the spatial (stochastic) variability may consist of two smooth-scale processes that differ in their range. Whether this feature of the semivariogram is important depends on the intended use of the semivariogram. If the purpose is that of spatial prediction of soil carbon at unobserved locations and kriging is performed within a local neighborhood of 150 feet, say, the behavior of the semivariogram at larger lags is of little consequence.
Figure 9.39. Classical and robust empirical semivariogram estimates for soil carbon percentage. Upper panel shows semivariances up to ||h|| = 400 ft, lower panel up to ||h|| = 200 ft.
Another important feature of the semivariogram in the upper panel of Figure 9.39 is its increasingly erratic nature for lags in excess of 350 feet. There are fewer pairs in the large distance classes, and the (approximate) variance of the empirical semivariogram increases with the square of γ(h) (see [9.13], p. 590). Finally, we note that the robust and classical semivariograms differ little. The robust semivariogram is slightly downward-biased but traces the profile of the classical estimator closely. This suggests that these data are not afflicted with outlying observations. In the case of outliers, the classical estimator often appears shifted upward from the robust estimator by a considerable amount.
In the remainder of this application we fit theoretical semivariogram models to the two semivariograms in Figure 9.39 by ordinary and weighted nonlinear least squares, by (restricted) maximum likelihood, and by composite likelihood. Recall that maximum and restricted maximum likelihood estimation operate on the raw data, not pairwise squared differences, so that one cannot restrict the lag distance. With composite likelihood estimation this is possible (see §9.2.4 for a comparison of the estimation methods). The semivariogram models investigated are the exponential, spherical, and gaussian models (§9.2.2). We need to decide which semivariogram model best describes the stochastic dependency in the data and whether a nugget effect is needed. Statements similar to those shown below for the no-nugget model were used to fit the semivariogram by weighted nonlinear least squares. The bounds statement in that program ensures that
the estimate of the nugget parameter is positive. Notice that the term (sill-nugget) is the
partial sill. The asymptotic confidence interval for the nugget parameter includes zero and
based on this fact one might be inclined to conclude that a nugget is not needed in the particu-
lar model (Output 9.8). The confidence intervals are based on the asymptotic estimated stan-
dard errors which are suspect. First, the data points are correlated and the weighted least
squares fit takes into account only the heteroscedasticity among the empirical semivariogram
values (and that only approximately), not their correlation. Second, the standard errors de-
pend on the number of data points which are the result of a user-driven grouping into lag
classes. Because the standard errors are not reliable, the confidence interval should not be
trusted. A sum of squares reduction test comparing the fit of a full and a reduced model is
also not a viable alternative. The weights depend on the semivariogram being fitted and
changing the model (i.e., dropping the nugget) changes the weights. The no-nugget model can
be fit by simply removing the nugget from the parameters statement in the previous code and
fixing its value at zero:
proc nlin data=svar2 noitprint nohalve;
parameters sill=0.02 range=80;
nugget = 0;
semivariogram = nugget + (sill-nugget)*(1-exp(-3*distance/range));
_weight_ = 0.5*count/(semivariogram**2);
model variog = semivariogram;
run;
Sums of squares from two fits with different weights are not comparable (compare Output 9.8 and Output 9.9). For example, the uncorrected total sums of squares are 734.3 and 772.7, respectively.
To settle the issue of whether to include a nugget effect with a formal test, we can call upon the likelihood ratio principle. Provided we are willing to assume a Gaussian random field, the nugget and no-nugget models can be fit with proc mixed of The SAS® System.
/* nugget model */
proc mixed data=NoTillData method=ml noprofile;
model tc = ;
repeated / subject=intercept type=sp(exp)(x y) local;
parms /* partial sill */ ( 0.025 )
/* range */ ( 32 )
/* nugget */ ( 0.005 );
run;
/* no-nugget model */
proc mixed data=NoTillData method=ml;
model tc = ;
repeated / subject=intercept type=sp(exp)(x y);
parms /* range */ ( 32 )
/* sill */ ( 0.025 ) ;
run;
To fit a model with nugget effect, the local option is added to the repeated statement and the noprofile option is added to the proc mixed statement. The latter is necessary to prevent proc mixed from estimating an extra scale parameter that it would profile out of the likelihood. The parms statement lists starting values for the covariance parameters. In the no-nugget model the local and noprofile options are removed. Also notice that the order of the covariance parameters in the parms statement changes between the nugget and no-nugget models. The correct order in which to enter starting values in the parms statement can be gleaned from the Covariance Parameter Estimates table of the proc mixed output. The subject= option of the repeated statement informs the procedure which observations are considered correlated in the data. Observations with different values of the subject variable are considered independent. By specifying subject=intercept, the variable identifying the clusters in the data is a column of ones; the spatial data are treated as if they comprise a single cluster of size n (see §2.6 on the progression of clustering from independent to spatial data).
The converged parameter estimates in the full model (containing a nugget effect) are partial sill 0.02299, nugget 0.003557, and 75.0588 for the range parameter. Since proc mixed parameterizes the exponential correlation function as exp{−||h||/α}, the estimated practical range is 3α̂ = 3 × 75.0588 = 225.26 feet, considerably larger than the estimates
Model Information
Data Set WORK.NOTILLDATA
Dependent Variable TC
Covariance Structures Spatial Exponential, Local Exponential
Subject Effect Intercept
Estimation Method ML
Residual Variance Method None
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Between-Within
Fit Statistics
-2 Log Likelihood -331.9
AIC (smaller is better) -323.9
AICC (smaller is better) -323.7
BIC (smaller is better) -310.7
Model Information
Data Set WORK.NOTILLDATA
Dependent Variable TC
Covariance Structure Spatial Exponential
Subject Effect Intercept
Estimation Method ML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Between-Within
Fit Statistics
-2 Log Likelihood -315.6
AIC (smaller is better) -309.6
AICC (smaller is better) -309.5
BIC (smaller is better) -299.7
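Assuming the two Fit Statistics tables above are the ML fits of the nugget and no-nugget models, the likelihood ratio test of the nugget effect can be read off directly:

LR = (−2 log L, no nugget) − (−2 log L, nugget) = −315.6 − (−331.9) = 16.3,

which exceeds the χ²₁ critical value of 10.83 at the 0.001 level, so the nugget effect is retained.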
Table 9.8. Results of least squares fits. θ₀ and α denote the nugget and (practical) range, respectively. SSR is the residual sum of squares in the ordinary least squares fit. Data ≤ 200 refers to the lower panel of Figure 9.39 with lags restricted to 200 feet.

                                      WLS                        OLS
Data    Model        Nugget    θ₀       Sill     α       θ₀       Sill     α      SSR†
≤ 200   Exponential  No                 0.021   91.46             0.021   94.99    149
                     Yes     0.0005     0.021   96.27
        Gaussian     No                 0.020   29.64             0.020   58.09    123
                     Yes     0.004      0.021   69.07    0.003    0.020   65.04    102
        Spherical    No                 0.020   66.09             0.020   72.23    108
                     Yes     0.002      0.020   79.11
≤ 400   Exponential  No                 0.028  202.3              0.034  353.9     907
                     Yes     0.009      0.042  840.1     0.008    0.044  867.8     737
        Gaussian     No                 0.025   36.88             0.028  100.5    1570
                     Yes     0.006      0.027  118.8     0.013    0.035  377.0     799
        Spherical    No                 0.026   97.25             0.032  290.0    1210
                     Yes     0.011      0.038  534.7     0.009    0.034  411.2     735

† SSR × 10⁻⁴
Based on the ordinary least squares fits one would select a gaussian model with nugget effect for the data in the lower panel of Figure 9.39 and a spherical semivariogram with nugget effect for the data in the upper panel. The Pseudo-R² measures for these models are 0.87 and 0.75, respectively. While the nugget and sill estimates are fairly stable, it is noteworthy that the range estimates can vary widely among the different models. Since the range essentially determines the strength of the spatial dependency, kriging predictions from spatial processes that differ greatly in their range can be very different. For the maxlag 400 data, for example, the exponential and spherical models with nugget effects have very similar residual sums of squares (737 and 735), but the range of the exponential model is more than twice that of the spherical model (OLS results). The lesser spatial continuity of the exponential model can be more than offset by a doubling of the range. The weighted least squares estimates of the range appear particularly unstable. An indication of a well-fitting model is good agreement between the ordinary and weighted least squares estimates. On these grounds the gaussian no-nugget models and the spherical no-nugget model for the second data set can be ruled out. The fitted semivariograms for the gaussian and spherical nugget models are shown in Figure 9.40.
Based on the (restricted) maximum likelihood fits, the spherical nugget model emerges as the best-fitting model (Table 9.9). Nugget and no-nugget models can be compared via likelihood ratio tests, which indicate for the exponential and gaussian models that a nugget effect is needed. To compare across semivariogram models the AIC criterion is used, since the exponential, gaussian, and spherical models are not nested. This leads to the selection of the spherical nugget model as the best-fitting model in this group. Its range is considerably less than that of the corresponding model fitted by least squares. Notice that the REML estimates
Figure 9.40. Semivariograms fitted by least squares: spherical nugget semivariogram for max lag 400 and gaussian nugget semivariogram for max lag 200. Ordinary least squares fits shown.
Finally, we conclude this application by fitting the basic semivariogram models by the composite likelihood (CL) principle. This principle is situated between genuine maximum likelihood and least squares; it has features in common with both. As the empirical Matheron

Table 9.10. Results of composite likelihood fits. θ₀ and α denote the nugget and (practical) range, respectively. Data ≤ 200 refers to the lower panel of Figure 9.39 with lags restricted to 200 feet.
The CL algorithm converged for all but one setting. The estimates are in general close to the least squares estimates in Table 9.8. In situations where the least squares fits to the empirical semivariogram did poorly (e.g., the gaussian no-nugget model), so did the composite likelihood fit. The weighted residual sums of squares are not directly comparable among the models, as the weights depend on the model. Based on their magnitude alone, however, the spherical model is ruled out for the maxlag 200 data and the no-nugget models are ruled out for the maxlag 400 data.
Figure 9.41. Semivariograms of lead (top panel) and ln{lead} (bottom panel) concentrations
before detrending.
A spherical semivariogram fitted to the empirical semivariogram of ln{lead} before detrending
(Figure 9.41, bottom panel) yields a sill estimate of 4.31 and a range of 128.8 meters. The
SvarFit() function was developed by the authors and is contained on the CD-ROM.
It is likely that even after the transformation the mean of ln{Z(s)} is not stationary.
Removing a response surface in the coordinates by ordinary least squares and fitting a
spherical semivariogram by weighted least squares to the residuals leads to Output 9.12 and
Figure 9.42. The S+SpatialStats® statements producing the output and figure are
# quadratic response surface in the coordinates, fit by ordinary least squares;
# I(x^2) is required so that ^ is interpreted arithmetically, not as a formula operator
sg.lm <- lm(lglead ~ x + y + x:y + I(x^2), data=sg)
# residuals from the trend fit
sg.lmres <- sg$lglead - predict(sg.lm)
# robust empirical semivariogram of the OLS residuals
sg.varlmres <- variogram(sg.lmres ~ loc(x,y), data=sg, method="robust",
                         lag=7, nlag=22)
# weighted least squares fit of a spherical semivariogram model
SvarFit(data=sg.varlmres, type="spherical", weighted=T,
        start=list(sill=1.5, range=60))
Compared to the nondetrended data, both the sill and the range of the semivariogram are
drastically reduced. The reduction in sill reflects the smaller variability of the model residuals;
the reduction in range indicates that spurious autocorrelation caused by the large-scale trend
has been removed.
Output 9.12.
Parameters:
           Value  Std. Error   t value
 sill     2.0647    0.124854  16.53680
 range   94.9438   11.400800   8.32783
Figure 9.42. Spherical semivariogram fit by weighted least squares to ordinary least squares
residuals.
An estimate of the total lead concentration can be obtained by integrating the surface in
Figure 9.43. A quick estimate is calculated as the average ordinate of the surface, which
yields 161.229 g/m² and a total load of 11.608 tons.
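The arithmetic behind this quick estimate can be verified in a short data step; a sketch,
using the rectangle A = (0,240) × (0,300) given below as the domain:

data quick;
   area    = 240*300;        /* area of A in square meters            */
   avg     = 161.229;        /* average ordinate of surface, g/m**2   */
   total_t = avg*area/1e6;   /* grams to metric tons: 11.608          */
run;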
Although we believe this estimate of the total lead concentration to be fairly accurate, two
problems remain to be resolved. In estimating the total on the logarithmic scale and
exponentiating the result, some transformation bias is incurred; the total amount of lead will
be underestimated. Also, no standard error is available for this estimate. In order to predict
the total amount of lead on the original scale without bias we need to consider the block
average

Z(A) = |A|⁻¹ ∫_A Z(u) du,

where A is the rectangle (0,240) × (0,300). Some details on block-kriging appear in
§A9.9.6. But there appears to be little spatial structure in the lead concentrations (Figure 9.41,
top panel). The solution is to model the large-scale trend, allowing the mean of Z(s) to
capture the two large spikes in Figure 9.43. The semivariogram can then be developed based
on the residuals of this fit. Alternatively, one can exclude the two spikes in calculating the
empirical semivariogram. With the former procedure it was determined that a spherical semi-
variogram describes the residual structure; the resulting predictions are mapped in Figure 9.44.
Figure 9.44. Predicted surface of lead in kg/m² obtained by universal kriging on the original
scale.
Figure 9.45. Sampling locations at which total soil N (%) and total soil C (%) were observed
for two tillage treatments (chisel-plow and no tillage). Treatment strips are oriented in the
East-West direction.
The target attribute for this application is the C/N ratio, and a simplistic pooled t-test
comparing the two tillage treatments leads to a p-value of 0.809, from which one would con-
clude that there are no differences in the average C/N ratios. This test does not account for
spatial autocorrelation, treating the 195 samples on chisel-plow strips and 200 samples on no-
till strips as independent. Furthermore, it does not convey whether there are differences in the
spatial structure of the treatments. Even if the means are the same, the spatial dependency
might develop differently. This, too, would be a difference between the treatments that should
be recognized by the analyst. Omnidirectional semivariograms were calculated with the
variogram procedure in The SAS® System and spherical semivariogram models were fit to
the empirical semivariograms (Figure 9.46) with proc nlin by weighted least squares:
proc sort data=CNRatio; by tillage; run;

/* robust omnidirectional empirical semivariograms, by tillage treatment */
proc variogram data=CNRatio outvar=svar;
   compute lagdistance=13.6 maxlag=19 robust;
   coordinates xcoord=x ycoord=y;
   var cn;
   by tillage;
run;

/* weighted least squares fit of spherical semivariograms with a common
   nugget but treatment-specific sills and ranges */
proc nlin data=fitthis nohalve method=newton noitprint;
   parameters sillC=0.093 sillN=0.1414 rangeC=116.6 rangeN=197.2
              nugget=0.1982;
   if tillage='ChiselPlow' then
      sphermodel = nugget + (distance <= rangeC)*sillC*(1.5*(distance/rangeC) -
                   0.5*((distance/rangeC)**3)) + (distance > rangeC)*sillC;
   else
      sphermodel = nugget + (distance <= rangeN)*sillN*(1.5*(distance/rangeN) -
                   0.5*((distance/rangeN)**3)) + (distance > rangeN)*sillN;
   model rvario = sphermodel;
   _weight_ = 0.5*count/(sphermodel**2);
run;
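The _weight_ statement implements the weights of the weighted least squares criterion of
Cressie (1985). As a sketch, with N(hⱼ) pairs contributing to the empirical semivariance
γ̂(hⱼ) at lag class hⱼ, the criterion minimized is

W(\theta) = \sum_{j} \frac{N(h_j)}{2\,\gamma(h_j;\theta)^{2}} \left[\hat{\gamma}(h_j) - \gamma(h_j;\theta)\right]^{2},

so the weight attached to lag class j is 0.5·N(hⱼ)/γ(hⱼ;θ)², the quantity computed as
0.5*count/(sphermodel**2) above.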
Figure 9.46. Omnidirectional empirical semivariograms for the C/N ratio under chisel-plow
(open circles) and no-till (full circles) treatments. Weighted least squares fits of spherical
semivariograms are shown.
Next we obtain generalized least squares estimates of the treatment effect as well as
predictions of the C/N ratio over the entire field with proc mixed of The SAS® System. A
data set containing the prediction locations for both treatments (data set filler) is created
and appended to the observed data.
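A call consistent with the options discussed below might look as follows. This is a sketch
only: the combined data set cnall is a hypothetical name, the analysis variable cn and the
coordinates x and y follow the earlier code, and the parms values are the converged
weighted least squares iterates (their order here is an assumption).

data cnall;                /* observed data plus prediction grid     */
   set CNRatio filler;     /* cn is missing at the filler locations  */
run;

proc mixed data=cnall noprofile;
   class tillage;
   model cn = tillage / outp=p;
   repeated / subject=intercept type=sp(sph)(x y) local group=tillage;
   /* order of the starting values must match the Covariance Parameter
      Estimates table; a trial run verifies the order */
   parms (0.093) (116.6) (0.1414) (197.2) (0.1982) / noiter;
run;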
The call to proc mixed has several important features. The model statement describes the
mean structure of the model. C/N ratios are assumed to depend on the tillage treatments. The
outp=p option of the model statement produces a data set (named p) containing the predicted
values. The repeated statement identifies the spatial covariance structure to be spherical
(type=sp(sph)(x y)). The subject=intercept option indicates that the data set comprises a
single subject; all observations are assumed to be correlated. The group=tillage option re-
quests that the spatial covariance parameters vary with the values of the tillage variable.
This allows modeling separate covariance structures for the chisel-plow and no-till treatments
to reflect the differences in spatial structure evident in Figure 9.46. Finally, the local option
adds a nugget effect. Since proc mixed adds only a single nugget effect, it was important in
fitting the semivariograms to ensure that the nugget effect was held the same for the two
treatments. The parms statement provides starting values for the covariance parameters. The
order in which the values are listed equals the order in which the values appear in the
Covariance Parameter Estimates table of the proc mixed output. A trial run is sometimes
necessary to determine the correct order. The starting values are set at the converged iterates
from the weighted least squares fit of the theoretical semivariogram (Output 9.13). The
noiter option of the parms statement prevents iterations of the covariance parameters and
holds them fixed at the starting values provided. To produce restricted maximum likelihood
estimates of the covariance parameters, simply remove the noiter option. The noprofile
option of the proc mixed statement prevents profiling of the nugget variance. Without this
option proc mixed would make slight adjustments to the sill and nugget even if the /noiter
option is specified.
The Dimensions table indicates that 395 observations were used in model fitting and
3162 observations were not used (Output 9.14). The latter comprise the filler data set of
prediction locations, for which the CN variable was assigned a missing value. The -2 Res Log
Likelihood of 570.3 in the table of Fit Statistics equals minus twice the residual log
likelihood in the Parameter Search table. The latter table gives the likelihood for all sets of
starting values. Here only one set of starting values was used, and the equality of the -2 Res
Log Likelihood values shows that no iterative updates of the covariance parameters took
place. The estimates shown in the Covariance Parameter Estimates table are identical to the
supplied starting values.
Output 9.14.
                     The Mixed Procedure

                         Dimensions
             Covariance Parameters             5
             Columns in X                      3
             Columns in Z                      0
             Subjects                          1
             Max Obs Per Subject             395
             Observations Used               395
             Observations Not Used          3162
             Total Observations             3557

               Type 3 Tests of Fixed Effects
                     Num     Den
           Effect     DF      DF    F Value    Pr > F
           tillage     1     393       0.06    0.7990
Figure 9.47. Predicted C/N surface under chisel-plow and no-till treatments.
The predicted surfaces in Figure 9.47 were obtained from the generalized least squares fit,
which assumed that the supplied starting values of the covariance parameters are the true
values. This is akin to the assumption in kriging methods that the semivariogram values used
in solving the kriging equations are known. Removing the noiter option of the parms state-
ment in proc mixed, the spatial covariance parameters are updated iteratively by the method of
restricted maximum likelihood. Twice the negative residual log likelihood at convergence can
be compared to the same statistic calculated from the starting values. This likelihood ratio test
indicates whether the REML estimates are a significant improvement over the starting values.
The mixed procedure displays the result of this test in the PARMS Model Likelihood Ratio
Test table.
Fit Statistics
-2 Res Log Likelihood 562.9
AIC (smaller is better) 572.9
AICC (smaller is better) 573.1
BIC (smaller is better) 592.8
Figure 9.48. Relationship between total C (%) and total N (%) of chisel-plow and no-till
areas.
Figure 9.49. Contour plots of ordinary kriging predictions of total C (%) and total N (%)
irrespective of tillage treatment.
TC(sᵢ) = β₀ + β₁TN(sᵢ) + e(sᵢ),   [9.73]

where the errors e(sᵢ) are spatially autocorrelated. We emphasize again that such a spatial re-
gression model differs conceptually from a cokriging model, in which the primary and
secondary attributes are spatially autocorrelated and models for the semivariograms (covario-
grams) of TC(sᵢ) and TN(sᵢ) and for the cross-covariogram of TC(sᵢ) and TN(sᵢ) must be
derived. TN(sᵢ) is considered fixed in [9.73] and the only semivariogram that needs to be
modeled is that of TC(sᵢ) after adjusting its mean for the dependency on TN(sᵢ). The spatial
regression model expresses the relationship between TC(sᵢ) and TN(sᵢ) not through a cross-
covariogram but as a deterministic dependency of E[TC(sᵢ)] on TN(sᵢ). Such models are
simpler to fit than cokriging models, and standard statistical procedures such as proc mixed
of The SAS® System can be employed.
The semivariogram of e(sᵢ) is modeled in two steps. First, the model is fit by ordinary
least squares and the empirical semivariogram of the OLS residuals is computed to suggest a
theoretical semivariogram model. This theoretical model is then fit to the empirical semi-
variogram to produce starting values for the covariance parameters (Output 9.16).
The final step is to submit the data fitthis to proc mixed to fit the spatial regression
model [9.73] by (restricted) maximum likelihood:
proc mixed data=fitthis noprofile;
model TC = TN / ddfm=contain s outp=p;
repeated / subject=intercept type=sp(exp)(x y) local;
parms /* sill */ 0.000864
/* range */ 72.3276
/* nugget */ 0.000946 ;
run;
The starting values for the autocorrelation parameters are chosen as the converged iter-
ates in Output 9.16. Because the parms statement does not have a /noiter option, proc mixed
will estimate these parameters iteratively, commencing at the starting values. The outp=p
option of the model statement creates a data set containing the predicted values.
The Dimensions table of the output shows that the data represent a single subject
and that 531 of the 926 observations in the data set have not been used in estimation (Output
9.17). These are the observations at the prediction locations for which the response variable
TC was set to missing. Minus twice the (residual) log likelihood evaluated to −1526.0 at the
starting values and was subsequently improved upon during five iterations. At convergence
the -2 Res Log Like is −1527.76. The difference between the initial and converged -2 Res Log
Like can be used to test whether the iterations significantly improved the model fit. The dif-
ference is not statistically significant (p = 0.6253, see PARMS Model Likelihood Ratio
Test). The iterated REML estimates of sill, range, and nugget are shown in the Covariance
Parameter Estimates table.
Output 9.17.
                         Dimensions
             Covariance Parameters             3
             Columns in X                      2
             Columns in Z                      0
             Subjects                          1
             Max Obs Per Subject             395
             Observations Used               395
             Observations Not Used           531
             Total Observations              926

                      Parameter Search
  CovP1       CovP2      CovP3    Res Log Like    -2 Res Log Like
  0.000946    49.7380    0.000784     763.0047        -1526.0093

             Convergence criteria met.

                       Fit Statistics
             -2 Res Log Likelihood          -1527.8
             AIC (smaller is better)        -1521.8
             AICC (smaller is better)       -1521.7
             BIC (smaller is better)        -1509.8
The estimates of the regression coefficients are β̂₀ = −0.01542 and β̂₁ = 11.1184,
respectively (Solution for Fixed Effects table). With every additional percent of total N
the total C percentage increases by 11.12 units. It is interesting to compare these estimates to
the ordinary least squares estimates in the model

TCᵢ = α₀ + α₁TNᵢ + eᵢ,   eᵢ ~ iid (0, σ²),

which does not incorporate spatial autocorrelation (Output 9.18). The estimates are slightly
different and their standard errors are very optimistic (too small). Furthermore, for a given
value of TN, the prediction of TC is the same under the classical regression model, regard-
less of where the TN observation is located. In the spatial regression model with auto-
correlated errors, the best linear unbiased predictions of TC(sᵢ) take the spatial correlation
structure into account. At two sites with identical TN values the predicted values in the
spatial model will differ depending on where the sites are located. Compare, for example, the
two observations in Output 9.19. Estimates of E[TC(sᵢ)] would be calculated as

Ê[TC(sᵢ)] = β̂₀ + β̂₁TN(sᵢ) = −0.01542 + 11.1184 × 0.057623 = 0.62523

regardless of the location sᵢ. The values computed by proc mixed are predictions of TC(sᵢ)
and vary by location. If estimates of the mean are desired one can add the outpm=pm option
to the model statement in the code above.
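For example, the model statement would then read:

model TC = TN / ddfm=contain s outp=p outpm=pm;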
Output 9.18.
Solution for Fixed Effects
Standard
Effect Estimate Error DF t Value Pr > |t|
The spatial predictions of the total carbon percentage are shown in the left panel of
Figure 9.50 and estimates of the mean in the right panel. Because the underlying TN surface
is spatially variable (right panel of Figure 9.49), so are the estimates of E[TC(sᵢ)], which
follow the pattern of TN very closely. The predictions of TC(sᵢ) follow the same pattern as
E[TC(sᵢ)] but exhibit more variability; the left-hand panel of Figure 9.50 is less smooth.
Figure 9.50. Predictions of TC(s) (left panel) and estimates of the mean E[TC(s)] (right
panel).
where 1" a(b . is the inverse link function, ( is the linear predictor of the form xw " and /
is a random error with mean ! and variance Varc/d 2a.b<. In the randomized block
Hessian fly experiment, if one assumes that ]34 is a Binomial proportion, one obtains (34
! 34 73 , Varc/34 d 2a.b 1" a(34 ba" 1" a(34 bbÎ834 and < ". Choosing 1ab as the
logit transform is common for such data. In vector/matrix notation the model for the complete
data is written as
Y 1" a( b e, e µ a0ß Diage2a.bfb a0ß Ha.bb.
A spatially varying process that induces autocorrelations can be accommodated in two ways.
The marginal formulation replaces the error vector e with the vector d such that

Var[d] = H(μ)^½ R H(μ)^½.

The matrix R is a spatial correlation matrix and the diagonal matrices H(μ)^½ adjust the corre-
lations to yield the correct variances and covariances. The matrix R typically corresponds to
the correlation model derived from one of the basic isotropic semivariogram models (§9.2.2).
For example, if the spatial dependency between experimental units can be described by an
exponential semivariogram, elements of R are calculated as exp{−3||sᵢ − sⱼ||/α}, where sᵢ
and sⱼ are the spatial coordinates of two units. The marginal formulation was chosen by
Gotway and Stroup (1997) in modeling the Hessian fly data.
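As a small numerical illustration, and not part of the original program, the matrix R for
four units under an exponential structure with practical range α = 20 could be formed as
follows (the coordinates are hypothetical):

proc iml;
   xy    = {0 0, 10 0, 0 10, 10 10};  /* hypothetical unit coordinates */
   alpha = 20;                        /* practical range               */
   R     = j(4,4,1);
   do k = 1 to 4;
      do l = 1 to 4;
         h      = sqrt(ssq(xy[k,] - xy[l,]));  /* Euclidean distance   */
         R[k,l] = exp(-3*h/alpha);             /* exponential model    */
      end;
   end;
   print R;
quit;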
The conditional formulation of a spatial generalized linear model assumes that, condi-
tional on the realization of the spatial process, the observations are uncorrelated. This
formulation is akin to the generalized linear mixed models of §8. In vector notation we can
put

Y = g⁻¹(η + U(s)) + e,   e ~ (0, H(μ)),   [9.74]

where, for the Hessian fly data, the linear predictor again contains the entry effects τᵢ,
i = 1,…,16, and the block effects ρⱼ, j = 1,…,4. The link function was chosen as the logit
and, consequently, log{πᵢⱼ/(1 − πᵢⱼ)} = ηᵢⱼ.
proc genmod data=HessianFly;
   class block entry;
   model z/n = block entry / link=logit dist=binomial type3;
   /* save the parameter estimates and Type 3 tests in data sets */
   ods output ParameterEstimates=GLMEst;
   ods output type3=GLMType3;
run;
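The overdispersed companion fit discussed in the next paragraph differs only in the dscale
option of the model statement; a sketch (the output data set names are placeholders):

proc genmod data=HessianFly;
   class block entry;
   model z/n = block entry / link=logit dist=binomial type3 dscale;
   ods output ParameterEstimates=OGLMEst;  /* hypothetical names */
   ods output type3=OGLMType3;
run;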
The two proc genmod calls fit the regular GLM and the overdispersed GLM (by adding
the dscale option to the model statement) and save the estimates as well as the tests for treat-
ment effects in data sets. After processing the output data sets (see code on CD-ROM), we
obtain Output 9.20. The estimates of the intercept, block, and treatment effects are the same
in both models. The standard errors of the overdispersed model are uniformly 1.66 times
larger than the standard errors in the regular GLM. This is also reflected in the test of entry
effects. The Chi-square statistic in the overdispersed model is 1.66² = 2.75 times smaller than
the corresponding statistic in the GLM. Not accounting for overdispersion overstates the
precision of parameter estimates. Test statistics are too large and p-values too small.
Output 9.20.
GLM GLM Overd. Overd.
Parameter Level Estimate StdErr Estimate StdErr
Intercept -1.2936 0.3908 -1.2936 0.6487
block 1 -0.0578 0.2332 -0.0578 0.3870
block 2 -0.1838 0.2303 -0.1838 0.3822
block 3 -0.4420 0.2328 -0.4420 0.3863
entry 1 2.9509 0.5397 2.9509 0.8958
entry 2 2.8098 0.5158 2.8098 0.8561
entry 3 2.4608 0.4956 2.4608 0.8225
entry 4 1.5404 0.4564 1.5404 0.7575
entry 5 2.7784 0.5293 2.7784 0.8785
entry 6 2.0403 0.4889 2.0403 0.8115
entry 7 2.3253 0.4966 2.3253 0.8242
entry 8 1.3006 0.4754 1.3006 0.7890
entry 9 1.5605 0.4569 1.5605 0.7582
entry 10 2.3058 0.5203 2.3058 0.8635
entry 11 1.4957 0.4710 1.4957 0.7818
entry 12 1.5068 0.4767 1.5068 0.7911
entry 13 -0.6296 0.6488 -0.6296 1.0768
entry 14 0.4460 0.5126 0.4460 0.8507
entry 15 0.8342 0.4698 0.8342 0.7798
The previous analysis maintains that observations from different experimental units are
independent; it simply allows the variance of the observations to exceed the variability dic-
tated by the Binomial law. If the data are overdispersed relative to the Binomial model be-
cause of positive spatial autocorrelation among the experimental units, the spatial process can
be modeled directly. The following code analyzes the Hessian fly experiment with model
[9.74], where U(s) has an exponential semivariogram without nugget effect (options
Covmod=E, nugget=0). The sx= and sy= parameters denote the variables of the data set
containing longitude and latitude information; the margin= parameter specifies the marginal
variance function h(μ). Starting values for the sill and range are set at 1.5 and 5, respectively,
and the range parameter is constrained to be at least 2. Setting a minimum value for the range
is recommended.
The stmts=%str( ) block of the macro call assembles statements akin to proc mixed
syntax. The s option of the model statement requests a printout of the fixed effects estimates
(solutions). For predicted values add the p option to the model statement. The algorithm con-
verged after fourteen iterations, that is, the parameters of the exponential semivariogram were
updated fourteen times following an update of the block and entry effects.
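The macro itself resides on the CD-ROM. As a sketch only, a call consistent with the
parameters named above might look as follows; the macro name %SpatialGLMM and the
coordinate variables lng and lat are hypothetical placeholders:

%SpatialGLMM(data=HessianFly,     /* hypothetical macro name          */
   Covmod=E, nugget=0,            /* exponential model, no nugget     */
   sx=lng, sy=lat,                /* coordinate variables (assumed)   */
   margin=binomial,               /* marginal variance function h(mu) */
   /* starting values (sill 1.5, range 5) and the constraint range >= 2
      are passed through additional parameters not named in the text */
   stmts=%str(
      class block entry;
      model z/n = block entry / s;
   ));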
The sill and range parameters of the exponential semivariogram (covariogram) are esti-
mated as 0.375 and 9.694 m, respectively. Notice that these estimates differ from those of
Gotway and Stroup (1997), who estimated the range at 11.6 m and the sill at 3.38. Their
model uses a marginal formulation, whereas the model fitted here is a conditional one that in-
corporates a latent random field inside the link function. Furthermore, in their marginal
formulation Gotway and Stroup (1997) settled on a spherical semivariogram. For the condi-
tional model we found an exponential model to fit the semivariogram of the transformed resi-
duals better. Finally, their method does not iterate between updates of the fixed effects and
updates of the semivariogram parameters.
The estimates of the fixed effects in the spatial model (Output 9.21) differ from the cor-
responding estimates in the generalized linear model (compare to Output 9.20), as expected.
The overall impact of different estimates is difficult to judge since the treatment effects,
which are averaged across the blocks, are of interest. If η̂ᵢⱼ is the estimated linear predictor for
entry i in block j, we need to determine η̂ᵢ., the effect of the ith entry after averaging over the
block effects. The estimates η̂ᵢ. are shown in the Least Squares Means table (Output 9.22).
The least squares mean for entry 1, for example, is calculated from the parameter estimates in
Output 9.21 as

η̂₁. = −1.4026 + (1/4)(−0.1444 − 0.1269 − 0.4431 + 0) + 3.3228 = 1.7416.

The probability that a plant of entry 1 is infected with the Hessian fly is then obtained by
applying the inverse link function,

π̂₁ = 1/(1 + exp{−1.7416}) = 0.85.
The table titled Differences of Least Squares Means in Output 9.22 can be used to assess
differences in the infestation probabilities among pairs of entries. Only part of the lengthy
table of least squares mean differences is shown.
Output 9.22.
              Least Squares Means (WORK._LSM)
                              Std.Err of   Predicted   Std.Err of
 Effect   Level    LS Mean        LSMean        Mean   Predicted Mean
To compare the results of the GLM and spatial analyses in terms of infection probabili-
ties, predicted probabilities of infestation with the Hessian fly for the 16 entries in the study
were graphed in Figure 9.51. Four methods were employed to calculate these probabilities.
The upper left-hand panel shows the probabilities calculated from the entry least squares
means in the overdispersed GLM analysis. The predictions in the other three panels are ob-
tained through different techniques of accounting for spatial correlations among experimental
units. Gotway and Stroup refers to the noniterative technique of Gotway and Stroup (1997),
and Pseudo-Likelihood to the techniques of Wolfinger and O'Connell (1993) that are coded in
the %glimmix() macro (www.sas.com). It is noteworthy that the predicted probabilities are
very similar for the spatial analyses and that the (overdispersed) GLM results in predictions
quite similar to the spatial analyses. The standard errors of the predicted probabilities are very
homogeneous across entries in the GLM analysis; the dots are of similar size. The spatial
analyses show much greater heterogeneity in the standard errors for the predicted infestation
probabilities. There is little difference in the standard errors among the three spatial analyses,
however.
Figure 9.51. Predicted probability of infection by entry for four different methods of incorpo-
rating overdispersion or spatial correlations (overdispersed GLM, Gotway and Stroup,
Composite Likelihood, Pseudo-Likelihood). The size of the dots is proportional to the stan-
dard error of the predicted probability.
Figure 9.52. Results of pairwise comparisons among entries. A dot indicates a significant dif-
ference in infestation probabilities between a pair of entries (at the 5% level).
Differences in predicted probabilities and their standard errors are reflected in multiple
comparisons of entries (Figure 9.52). The three spatial analyses produce very similar results.
The overdispersed GLM yields fewer significant differences in this application, which is due
to the large and homogeneous standard errors of the predicted probabilities (Figure 9.51).
Figure 9.53. Semivariogram of the transformed residuals against lag distance (m).
For Wiebe's wheat yield data, the spatial correlation estimate shows significant spatial auto-
correlation (Output 9.23) among plot yields, which may be caused by a nonstationary mean.
Figure 9.54. Row and column (series) sample medians for Wiebe's wheat yield data.
Output 9.23.
Spatial Correlation Estimate
Correlation = 0.2311
Variance = 3.484e-4
Std. Error = 0.01866
Before the SSAR model can be fit, the neighborhood structure must be defined. We
choose a rook definition and standardize the weights to sum to one to mirror the SSAR
analysis in Griffith and Layne (1999):

# first-order (rook) neighbors on the 12 x 125 lattice
wwy.snhbr <- neighbor.grid(nrow=12, ncol=125, neighbor.type="first.order")
# index of the last observation (= number of observations)
n <- wwy.snhbr[length(wwy.snhbr[,1]), 1]
# row-standardize: weights of each observation's neighbors sum to one
for (i in 1:n) {
   wwy.snhbr$weights[wwy.snhbr$row.id==i] <- 1/sum(wwy.snhbr$row.id == i)
}
The abbreviated output shows a large estimate of the spatial interaction parameter
(ρ̂ = 0.8469, Output 9.24) and the likelihood ratio test for H₀: ρ = 0 is soundly rejected
(p ≈ 0). There is significant spatial autocorrelation in these data beyond row and column
effects.
Output 9.24.
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 24.1736 0.4704 51.3909 0.0000
row 0.0129 0.0050 2.5732 0.0102
series -0.1243 0.0456 -2.7261 0.0065
rho = 0.8469
Note: The estimates for intercept, row and column effects and their standard errors differ from those in Griffith and
Layne (1999). These authors standardize the square root yield to have sample mean 0 and sample variance 1 and
standardize the row and series effects to have mean 0.
The ordinary least squares analysis assuming independence of the plot yields with the
statements
wwy.OLS <- lm(ryield ~ row + series, data=wwy)
summary(wwy.OLS)
yields parameter estimates that are not too different from the SSAR estimates (Output 9.25),
but their standard errors are too optimistic.
Output 9.25.
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 24.7626 0.1292 191.6848 0.0000
row 0.0138 0.0013 10.5847 0.0000
series -0.2264 0.0136 -16.6716 0.0000
Figure 9.55. Residuals in the SSAR model, OLS, and median polished residuals for Wiebe's
wheat yield data, plotted against the respective fitted values. Only row and column trends
were removed as large-scale trends.
The quality of the fit of the three models is assessed by plotting residuals against the
predicted (fitted) yields (Figure 9.55). The spread of the fitted values indicates the variability
in the predictions under the respective models. The SSAR model yields the least dispersed
fitted values, followed by the OLS fit and the median polishing. The OLS residuals exhibit a
definite trend, which shows the incomplete removal of the large-scale trend. No such trend is
apparent in the SSAR residuals. The spatial neighborhood structure has supplanted the mis-
sing quadratic and cubic trends in the large-scale model well. Perhaps surprisingly, median
polishing performs admirably in removing the trend in the data; the residuals exhibit almost
no trend. Compared to the SSAR fit the median polished residuals are considerably more dis-
persed, however. If one compares the sample variances of the various residuals, the incomplete
trend removal in the OLS fit and the superior quality of the SSAR fit are evident:

s²(OLS) = 3.29,   s²(MP) = 1.84,   s²(SSAR) = 1.16.
How well the methods accounted for the spatial variability in the plot yields can be
studied by calculating the empirical semivariograms of the respective residuals. If spatial
variability, both large-scale and small-scale, is accounted for, the empirical semivariogram
should resemble a nugget-only model. An assumption of second-order stationarity can be
made for the three residual semivariograms but not for the raw data (Figure 9.56). The empir-
ical semivariogram of the SSAR residuals shows the small variability (low sill) of these resid-
uals and the complete removal of spatial autocorrelation. Residual spatial dependency re-
mains in the median polished and OLS residuals. The sills of the residual semivariograms
agree well with the sample variances. The lattice model clearly outperforms the other two
methods of trend removal. It is left to the reader to examine the relative performance of the
three approaches if not only linear row and column trends are removed but also quadratic or
cubic trends.
Figure 9.56. Empirical semivariograms of raw data, SSAR, OLS, and median polished resi-
duals for Wiebe's wheat yield data. All semivariograms are scaled identically to highlight the
differences in the sill.
A point process is characterized by its first-order intensity λ(s), which represents the number
of events per unit area. If the intensity does not vary with spatial location, the process is
first-order stationary (= homogeneous).
Figure 9.57. Mapped point pattern of n = 180 events on the (0,400) × (0,200) rectangle.
The common intensity estimators are discussed in §A9.9.10. The naïve estimator is
simply λ̂ = n/|A|, where |A| is the area of the domain considered. Here, n = 180 and
|A| = 400 × 200; hence λ̂ = 0.00225. This estimator does not vary with spatial location; it is
appropriate only if the process is homogeneous. Location-dependent estimators can be
obtained in a variety of ways. One can grid the domain and count the number of events in a
grid cell. This process is usually followed by some type of smoothing of the raw counts.
S+SpatialStats® terms this the binning estimator. One can also apply nonparametric smooth-
ing techniques such as kernel estimation (§A4.8.7) directly. The smoothness (= spatial resolu-
tion) of binning estimators depends on the number of grid cells, and the smoothness of kernel
estimators on the choice of bandwidth (§4.7.2). The statements below calculate the binning
estimator on a 20 × 10 grid and kernel estimators with gaussian weight function for three
different bandwidths (Figure 9.58).
par(mfrow=c(2,2))
image(intensity(sppattern,method="binning",nx=20,ny=10))
title(main="20*10 Binning w/ LOESS")
image(intensity(sppattern,method="gauss2d",bw=25))
title(main="Kernel, Bandwidth=25")
image(intensity(sppattern,method="gauss2d",bw=50))
title(main="Kernel, Bandwidth=50")
image(intensity(sppattern,method="gauss2d",bw=100))
title(main="Kernel, Bandwidth=100")
With increasing bandwidth the kernel smoother approaches the naïve estimator and the
location-dependent features of the intensity can no longer be discerned. The binning estimator
as well as the kernel smoothers with bandwidths 25 and 50 show a concentration of events in
the southeast and northwest corners of the area.
Figure 9.58. Spatially explicit intensity estimators for point pattern in Figure 9.57. Lighter
colors correspond to higher intensity.
The first three panels of Figure 9.58 suggest that events tend to group in certain areas,
and that the process appears to be clustered. At this point there are three possible
explanations, and further progress depends on which is trusted.
1. The numbers of events in nonoverlapping areas are independent. There is no repulsion
or attraction of events. Instead, the first-order intensity (the mean function) λ(s) is
simply a function of spatial location. An inhomogeneous Poisson process is a reason-
able model and it remains to estimate the intensity function λ(s).
2. The first-order intensity does not depend on the spatial location, i.e., λ(s) = λ. The
grouping of events is due only to spatial interaction of events. The second-order prop-
erties of the point pattern (the spatial dependency) suffice to explain the nonhomoge-
neous distribution of events.
3. In addition to interactions among events, the first-order intensity is not constant.
The three conditions are roughly equivalent to the following scenarios for geostatistical
data: independent observations with large-scale variation in the mean (1); a constant mean
with spatial autocorrelation (2); and large-scale variation combined with spatial autocorrela-
tion (3). While random field models for geostatistical and lattice data allow the separation of
large-scale and small-scale spatial variation, less constructive theory is available for point
pattern analysis. Second-order methods for point pattern analysis require stationarity of the
intensity, just as semivariogram analysis for geostatistical data requires stationarity of the
mean function. There we can either detrend the data or rely on methods that simultaneously
estimate the mean and second-order properties (e.g., maximum likelihood). With point
patterns, this separation is not straightforward. If we consider explanation 2 we can proceed
with second-order methods, for example Ripley's K̂(h) function and the associated
L̂(h) − h graph; the latter is preferred because a graph of K̂(h) against the CSR benchmark
πh² often fails to reveal the subtle deviations from complete spatial randomness. The
L̂(h) − h versus h plot amplifies the deviation from CSR visually. The CSR process is
represented by a horizontal line at 0 in this plot. Clustering is indicated when L̂(h) − h rises
above the zero line. Because the variance of K̂(h) increases sharply with h, interpretation of
these graphs should be restricted to distances no greater than one half of the length of the
shorter side of the bounding rectangle (here h = 100). The relationship between K̂ and L̂ is
sketched below.
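The benchmark follows from the definition of the L function; a short sketch:

\hat{L}(h) = \sqrt{\hat{K}(h)/\pi}, \qquad K_{\mathrm{CSR}}(h) = \pi h^{2} \;\Longrightarrow\; L_{\mathrm{CSR}}(h) - h = 0,

so clustering, K̂(h) > πh², appears as L̂(h) − h > 0.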
After making the GAnalysis() function available to S+SpatialStats®, all of these tasks
are accomplished by the function call GAnalysis(sppattern, n=180, sims=100,
cluster=""). A descriptive string can be assigned to the cluster= argument, which will be
shown on the output (Figure 9.59; the argument was omitted here). Based on the Monte Carlo
test of nearest-neighbor distances with 100 simulations we conclude that the observed pattern
exhibits clustering. Among all 101 point patterns (one observed, one hundred simulated), the
observed pattern had the smallest nearest-neighbor distance (rank 1), leading to a p-value of
0.0099 against the clustered alternative. The observed Ĝ function is close to the upper simu-
lation envelope and crosses it repeatedly. The L̂(h) − h plot shows the elevation above the
zero line that corresponds to a CSR process. The expected number of extra events within
distance h from an arbitrary event is larger than under CSR. It is common to see a drop of the
L̂(h) − h plot below zero for larger distances in clustered processes; this occurs when
distances are larger than the cluster diameters and cover a lot of white space. Recall the
recommendation that L̂(h) − h not be interpreted for distances in excess of one half of the
length of the smaller side of the bounding rectangle. Up to h = 100, clustering of the process
is implied.
Having concluded, based on the Ĝ analysis and the L̂ function, that this is a clustered
point pattern, we would like to know whether the conclusion is correct. It is indeed. The point
pattern in Figure 9.57 was simulated with S+SpatialStats® with the statements
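A sketch of such statements, consistent with the description that follows (the seed value, the
n= argument, and the specification of the (0,400) × (0,200) window are assumptions):

set.seed(24)    # the seed value is an assumption; any fixed seed reproduces the pattern
sppattern <- make.pattern(n=180, process="cluster",
                          radius=35, cpar=25)  # 25 parents, offspring within radius 35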
Figure 9.59. Results of analyzing the point pattern in Figure 9.57 with GAnalysis(). The
Monte Carlo test of nearest-neighbor distances gave rank = 1, p(Reg) = 0.9901, and
p(Clu) = 0.0099.
The set.seed() statement fixes the seed of the random number generator at a given
value. Subsequent runs of the program with the same seed will produce identical point
patterns. The make.pattern() function simulates the realization of a particular point process.
Here, a cluster process is chosen with parameters radius=35 and cpar=25. Twenty-five
parent events are placed according to a homogeneous Poisson process. Around each parent,
offspring events are placed independently of each other within radius 35 of the parent
location. Finally, the parent events are deleted and only the offspring locations are retained.
This is known as a Poisson cluster process, special cases of which are the Neyman-Scott
processes (see §A9.9.11 and Neyman and Scott 1972). Although this is difficult to discern
from Figure 9.57, the process consists of 25 clusters. Furthermore, following explanation 2
above was the correct course of action. This Neyman-Scott process is a stationary process.
Agresti, A. (1990) Categorical Data Analysis. John Wiley & Sons, New York
Akaike, H. (1974) A new look at the statistical model identification. IEEE Transaction on Automatic
Control, AC-19:716-723
Allen, D.M. (1974) The relationship between variable selection and data augmentation and a method of
prediction. Technometrics, 16:125-127
Allender, W.J. (1997) Effect of trifluoperazine and verapamil on herbicide stimulated growth of cotton.
Journal of Plant Nutrition, 20(1):69-80
Allender, W.J., Cresswell, G.C., Kaldor, J., and Kennedy, I.R. (1997) Effect of lithium and lanthanum
on herbicide induced hormesis in hydroponically-grown cotton and corn. Journal of Plant
Nutrition, 20:81-95
Amateis, R.L. and Burkhart, H.E. (1987) Cubic-foot volume equations for loblolly pine trees in cutover,
site-prepared plantations. Southern Journal of Applied Forestry, 11:190-192
Amemiya, T. (1973) Regression analysis when the variance of the dependent variable is proportional to
the square of its expectation. Journal of the American Statistical Association, 68:928-934
Anderson, J.A. (1984) Regression and ordered categorical variables. Journal of the Royal Statistical
Society (B), 46(1):1-30
Anderson, R.L. and Nelson, L.A. (1975) A family of models involving intersecting straight lines and
concomitant experimental designs useful in evaluating response to fertilizer nutrients. Biometrics,
31:303-318
Anderson, T.W. and Darling, D.A. (1954) A test of goodness of fit. Journal of the American Statistical
Association, 49:765-769
Andrews, D.F., Bickel, P.J., Hampel, F.R., Huber, P.J., Rogers, W.H., and Tukey, J.W. (1972) Robust
Estimates of Location: Survey and Advances. Princeton University Press, Princeton, NJ
Andrews, D.F. and Herzberg, A.M. (1985) Data. A Collection of Problems from Many Fields for the
Student and Research Worker. Springer-Verlag, New York.
Anscombe, F.J. (1948) The transformation of Poisson, binomial, and negative-binomial data.
Biometrika, 35:246-254
Anscombe, F.J. (1960) Rejection of outliers. Technometrics, 2:123-147
Anselin, L. (1995) Local indicators of spatial association — LISA. Geographical Analysis, 27:93-115
Armstrong, M. and Delfiner, P. (1980) Towards a more robust variogram: A case study on coal.
Technical Report N-671. Centre de Géostatistique, Fontainebleau, France
Baddeley, A.J. and Silverman, B.W. (1984) A cautionary example on the use of second-order methods
for analyzing point patterns. Biometrics, 40:1089-1093
Bailey, R.L. (1994) A compatible volume-taper model based on the Schumacher and Hall generalized
form factor volume equation. Forest Science, 40:303-313
Barnes, R.J. and Johnson, T.B. (1984) Positive kriging. In: Geostatistics for Natural Resource
Characterization Part 1 (Verly, G., David, M., Journel, A.G., and Maréchal, A., eds.) Reidel,
Dortrecht, The Netherlands, p. 231-244
Barnett, V. and Lewis, T. (1994) Outliers in Statistical Data, 3rd ed. John Wiley & Sons, New York
Bartlett, M.S. (1937a) Properties of sufficiency and statistical tests. Proceedings of the Royal Statistical
Society, Series A, 160:268-282
Bartlett, M.S. (1937b) Some examples of statistical methods of research in agriculture and applied
biology. Journal of the Royal Statistical Society, Suppl., 4:137-183
Bartlett, M.S. (1938) The approximate recovery of information from field experiments with large
blocks. Journal of Agricultural Science, 28:418-427
Bartlett, M.S. (1978a) Nearest-neighbour models in the analysis of field experiments (with discussion).
Journal of the Royal Statistical Society (B), 40:147-174
Bartlett, M.S. (1978b) Stochastic Processes. Methods and Applications. Cambridge University Press,
London
Bates, D.M. and Watts, D.G. (1980) Relative curvature measures of nonlinearity. Journal of the Royal
Statistical Society (B), 42:1-25
Bates, D. M., and Watts, D.G. (1981) A relative offset orthogonality convergence criterion for nonlinear
least squares. Technometrics, 23:179-183
Beale, E.M.L. (1960) Confidence regions in non-linear estimation. Journal of the Royal Statistical
Society (B), 22:41-88
Beaton, A.E. and Tukey, J.W. (1974) The fitting of power series, meaning polynomials, illustrated on
band-spectroscopic data. Technometrics, 16:147-185
Beck, D.E. (1963) Cubic-foot volume tables for yellow poplar in the southern Appalachians. USDA
Forest Service, Research Note SE-16.
Becker, M.P. (1989) Square contingency tables having ordered categories and GLIM. GLIM Newsletter
No. 19. Royal Statistical Society, NAG Group
Becker, M.P. (1990a) Quasisymmetric models for the analysis of square contingency tables. Journal of
the Royal Statistical Society (B), 52:369-378
Becker, M.P. (1990b) Algorithm AS 253; Maximum likelihood estimation of the RC(M) association
model. Applied Statistics, 39:152-167
Beltrami, E. (1998) Mathematics for Dynamic Modeling. 2nd ed. Academic Press, San Diego, CA
Berkson, J. (1950) Are there two regressions? Journal of the American Statistical Association, 45:164-
180
Besag, J.E. (1974) Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal
Statistical Society (B), 36:192-236
Besag, J.E. (1975) Statistical analysis of non-lattice data. The Statistician, 24:179-195.
Besag, J. and Kempton, R. (1986) Statistical analysis of field experiments using neighboring plots.
Biometrics, 42(2):231-251
Biging, G.S. (1985) Improved estimates of site index curves using a varying parameter model. Forest
Science, 31:248-259
Binford, G.D., Blackmer, A.M., and Cerrato, M.E. (1992) Relationship between corn yield and soil
nitrate in late spring. Agronomy Journal, 84:53-59
Birch, J.B. and Agard, D.B. (1993) Robust inference in regression: a comparative study.
Communications in Statistics Simulation, 22(1):217-244
Black, C.A. (1993) Soil Fertility Evaluation and Control. Lewis Publishers, Boca Raton, FL
Blackmer, A.M., Pottker, D., Cerrato, M.E., and Webb, J. (1989) Correlations between soil nitrate
concentrations in late spring and corn yields in Iowa. Journal of Production Agriculture, 2:103-
109
Bleasdale, J.K.A. and Nelder, J.A. (1960) Plant population and crop yield. Nature, 188:342
Bleasdale, J.K.A. and Thompson, B. (1966) The effects of plant density and the pattern of plant
arrangement on the yield of parsnips. Journal of Horticultural Science 41:145-153
Bose, R.C. and Nair, K.R. (1939) Partially balanced incomplete block designs. Sankhya, 4:337-372
Bowman, D.T. (1990) Trend analysis to improve efficiency of agronomic trials in flue-cured tobacco.
Agronomy Journal, 82:499-501
Box, G.E.P. (1954a) Some theorems on quadratic forms applied in the study of analysis of variance
problems, I. Effects of inequality of variance in the one-way classification. Annals of
Mathematical Statistics, 25:290-302
Box, G.E.P. (1954b) Some theorems on quadratic forms applied in the study of analysis of variance
problems, II. Effects of inequality of variance and of correlations between errors in the two-way
classification. Annals of Mathematical Statistics, 25:484-498
Box, G.E.P. and Andersen, S.L. (1955) Permutation theory in the derivation of robust criteria and the
study of departures from assumption. Journal of the Royal Statistical Society (B), 17:1-26
Box, G.E.P. and Cox, D.R. (1964) The analysis of transformations. Journal of the Royal Statistical
Society (B), 26:211-252
Box, G.E.P. Jenkins, G.M., and Reinsel, G.C. (1994) Time Series Analysis: Forecasting and Control.
Prentice Hall, Englewood Cliffs, NJ
Bozdogan, H. (1987) Model selection and Akaike's information criterion (AIC): the general theory and
its analytical extensions. Psychometrika, 52:345-370
Brain, P. and Cousens, R. (1989) An equation to describe dose responses where there is stimulation of
growth at low doses. Weed Research, 29: 93-96
Breslow, N.E. and Clayton, D.G. (1993) Approximate inference in generalized linear mixed models.
Journal of the American Statistical Association, 88:9-25
Brown, M.B. and Forsythe, A.B. (1974) Robust tests for the equality of variances. Journal of the
American Statistical Association, 69:364-367
Brown, R.L., Durbin, J., and Evans, J.M. (1975) Techniques for testing the constancy of regression
relationships over time. Journal of the Royal Statistical Society (B), 37:149-192
Brownie, C., Bowman, D.T., and Burton, J.W. (1993) Estimating spatial variation in analysis of data
from yield trials: a comparison of methods. Agronomy Journal, 85:1244-1253
Brownie, C. and Gumpertz, M.L. (1997) Validity of spatial analysis for large field trials. Journal of
Agricultural, Biological, and Environmental Statistics, 2(1):1-23
Bunke, H. and Bunke, O. (1989) Nonlinear Regression, Functional Relationships and Robust Methods.
John Wiley & Sons, New York
Burkhart, H.E. (1977) Cubic-foot volume of loblolly pine to any merchantable top limit. Southern
Journal of Applied Forestry, 1:7-9
Carroll, R.J. and Ruppert, D. (1984) Power transformations when fitting theoretical models to data.
Journal of the American Statistical Association, 79:321-328
Carroll, R.J. Ruppert, D., and Stefanski, L.A. (1995) Measurement Error in Nonlinear Models.
Chapman and Hall, New York
Cerrato, M.E. and Blackmer, A.M. (1990) Comparison of models for describing corn yield response to
nitrogen fertilizer. Agronomy Journal, 82:138-143
Chapman, D.G. (1961) Statistical problems in population dynamics. In: Proceedings of the Fourth
Berkeley Symposium on Mathematical Statistics and Probability. University of California Press,
Berkeley
Chauvet, P. (1982) The variogram cloud. In: Proceedings of the 17th APCOM International Symposium.
Golden, CO, 757-764
Chilès, J.-P. and Delfiner, P. (1999) Geostatistics. John Wiley & Sons, New York
Cleveland, W.S. (1979) Robust locally weighted regression and smoothing scatterplots. Journal of the
American Statistical Association, 74:829-836
Cleveland, W.S., Devlin, S.J., and Grosse, E. (1988) Regression by local fitting. Journal of
Econometrics, 37:87-114
Cliff, A.D. and Ord, J.K. (1973) Spatial Autocorrelation. Pion, London
Cliff, A.D. and Ord, J.K. (1981) Spatial Processes; Models and Applications, Pion, London
Clutter, J.L., Fortson, J.C., Pienaar, L.V. Brister, G.H., and Bailey, R.L. (1992) Timber Management.
Krieger Publishing, Malabar, FL
Cochran, W.G. (1941) The distribution of the largest of a set of estimated variances as a fraction of their
total. Annals of Eugenics, 11:47-52
Cochran, W.G. (1954) Some methods for strengthening the common χ² tests. Biometrics, 10:417-451
Cochran, W.G. and Cox, G.M. (1957) Experimental Design 2nd ed. John Wiley & Sons, New York
Cochrane, D. and Orcutt, G.H. (1949) Applications of least square regression to relationships containing
autocorrelated error terms. Journal of the American Statistical Association, 44:32-61
Cole, J.W.L. and Grizzle, J.E. (1966) Applications of multivariate analysis of variance to repeated
measures experiments. Biometrics, 22:810-828
Cole, T.J. (1975) Linear and proportional regression models in the prediction of ventilatory function.
Journal of the Royal Statistics Society (A), 138:297-333
Coleman, D., Holland, P., Kaden, N., Klema, V., and Peters, S. C. (1980) A system of subroutines for
iteratively re-weighted least-squares computations. ACM Transactions on Mathematical
Software, 6:327-336.
Colwell, J.D., Suhet, A.R., and Van Raij, B. (1988) Statistical procedures for developing general soil
fertility models for variable regions. Report No. 93, CSIRO Division of Soils (Australia),
Cook, R.D. (1977) Detection of influential observations in linear regression. Technometrics, 19:15-18
Cook, R.D. and Tsai, C.-L. (1985) Residuals in nonlinear regression. Biometrika, 72:23-29
Corbeil, R.R. and Searle, S.R. (1976) A comparison of variance component estimators, Biometrics,
32:779-791
Courtis, S.A. (1937) What is a growth cycle? Growth, 1:247-254
Cousens, R. (1985) A simple model relating yield loss to weed density. Annals of Applied Biology,
107:239-252
Cox, C. (1988) Multinomial regression models based on continuation ratios. Statistics in Medicine,
7:435-441.
Cox, D.R. and Snell, E.J. (1989) The Analysis of Binary Data, 2nd ed. Chapman and Hall, London
Craig, J.R., Edwards, D., Rimstidt, J.D., Scanlon, P.F., Collins, T.K., Schabenberger, O., and Birch, J.B.
(2002) Lead distribution on a public shotgun range. Environmental Geology, 41:873-882
Craven, P. and Wahba, G. (1979) Smoothing noisy data with spline functions. Numerical Mathematics,
31:377-403
Cressie, N. (1985) Fitting variogram models by weighted least squares. Journal of the International
Association for Mathematical Geology, 17:563-586
Cressie, N.A.C. (1986) Kriging nonstationary data. Journal of the American Statistical Association,
81:625-634
Cressie, N.A.C. (1993) Statistics for Spatial Data. Revised Ed. John Wiley & Sons, New York
Cressie, N.A.C. and Hawkins, D.M. (1980) Robust estimation of the variogram, I. Journal of the
International Association for Mathematical Geology, 12:115-125
Crowder, M.J. and Hand, D.J. (1990) Analysis of Repeated Measures. Chapman and Hall, New York
Curriero, F.C. and Lele, S. (1999) A composite likelihood approach to semivariogram estimation.
Journal of Agricultural, Biological, and Environmental Statistics, 4(1):9-28
Davidian, M. and Giltinan, D.M. (1993) Some general estimation methods for nonlinear mixed-effects
models. Journal of Biopharmaceutical Statistics, 3(1):23-55
Davidian, M. and Giltinan, D.M. (1995) Nonlinear Models for Repeated Measurement Data. Chapman
and Hall, New York
Delfiner, P. (1976) Linear estimation of nonstationary spatial phenomena. In: Advanced Geostatistics in
the Mining Industry (M. Guarascio, M. David, C. Huijbregts, eds.) Reidel, Dortrecht, The
Netherlands, pp. 49-68
Delfiner, P., Renard D., and Chilès, J.P. (1978) Bluepack-3D Manual, Centre de Geostatistique,
Fontainebleau, France
Diggle, P. (1983) Statistical Analysis of Spatial Point Patterns. Academic Press, London
Diggle, P.J. (1988) An approach to the analysis of repeated measurements. Biometrics, 44:959-971
Diggle, P.J. (1990) Time Series: A Biostatistical Introduction. Clarendon Press, Oxford, UK
Diggle, P., Besag, J.E. and Gleaves, J.T. (1976) Statistical analysis of spatial patterns by means of
distance methods. Biometrics, 32:659-667
Diggle, P.J., Liang, K.-Y., and Zeger, S.L. (1994) Analysis of Longitudinal Data. Clarendon Press,
Oxford, UK
Draper, N.R. and Smith, H. (1981) Applied Regression Analysis. 2nd ed. John Wiley & Sons, New York
Dunkl, C.F. and Ramirez, D.E. (2001) Computation of the generalized F distribution. The Australian
and New Zealand Journal of Statistics, 43:21-31
Durbin, J. and Watson, G.S. (1950) Testing for serial correlation in least squares regression. I.
Biometrika, 37:409-428
Durbin, J. and Watson, G.S. (1951) Testing for serial correlation in least squares regression. II.
Biometrika, 38:159-178
Durbin, J. and Watson, G.S. (1971) Testing for serial correlation in least squares regression. III.
Biometrika, 58:1-19
Eisenhart, C. (1947) The assumptions underlying the analysis of variance. Biometrics, 3:1-21
Engel, J. (1988) Polytomous logistic regression. Statistica Neerlandica, 42(4):233-252.
Emerson, J.D. and Hoaglin, D.C. (1983) Analysis of two-way tables by medians. In: Understanding
Robust and Exploratory Data Analysis (Hoaglin D.C., Mosteller, F., and Tukey, J.W., eds.), John
Wiley & Sons, New York, pp. 166-207
Emerson, J.D. and Wong, G.Y. (1985) Resistant nonadditive fits for two-way tables. In: Exploring
Data Tables, Trends, and Shapes (Hoaglin, D.C., Mosteller, F., and Tukey, J.W., eds.), John
Wiley & Sons, New York, pp. 67-124
Engelstad, O.P. and Parks, W.L. (1971) Variability in optimum N rates for corn. Agronomy Journal,
63:21-23
Epanechnikov, V. (1969) Nonparametric estimates of a multivariate probability density. Theory of
Probability and its Applications, 14:153-158
Eubank, R.L. (1988) Spline Smoothing and Nonparametric Regression. Marcel Dekker, New York
Fahrmeir, L. and Tutz, G. (1994) Multivariate Statistical Modelling Based on Generalized Linear
Models. Springer-Verlag, New York
Federer, W.T. and Schlottfeldt, C.S. (1954) The use of covariance to control gradients in experiments.
Biometrics, 10:282-290
Fedorov, V.V. (1974) Regression problems with controllable variables subject to error. Biometrika,
61:49-56
Fieller, E.C. (1940) The biological standardization of insulin. Journal of the Royal Statistical Society
(Suppl.), 7:1-64
Fienberg, S.E. (1980) The Analysis of Cross-classified Categorical Data. MIT Press, Cambridge, MA
Finney, D.J. (1978) Statistical Methods in Biological Assay, 3rd ed. Macmillan, New York
Firth, D. (1988) Multiplicative errors: log-normal or gamma. Journal of the Royal Statistical Society
(B), 50:266-268
Fisher, R.A. (1935) The Design of Experiments. Oliver and Boyd, Edinburgh
Fisher, R.A. (1947) The Design of Experiments, 4th ed. Oliver and Boyd, Edinburgh
Folks, J.L. and Chhikara, R.S. (1978) The inverse Gaussian distribution and its statistical application: a
review. Journal of the Royal Statistical Society (B), 40:263-275
Freney, J.R. (1965) Increased growth and uptake of nutrients by corn plants treated with low levels of
simazine. Australian Journal of Agricultural Research, 16:257-263
Gabriel, K.R. (1962) Ante-dependence analysis of an ordered set of variables. Annals of Mathematical
Statistics, 33:201-212
Gallant, A.R. (1975) Nonlinear regression. The American Statistician, 29:73-81
Gallant, A.R. (1987) Nonlinear Statistical Models. John Wiley & Sons, New York
Gallant, A.R. and Fuller, W.A. (1973) Fitting segmented poynomial regression models whose join
points have to be estimated. Journal of the American Statistical Association, 68:144-147
Galpin, J.S.and Hawkins, D.M. (1984) The use of recursive residuals in checking model fit in linear
regression. The American Statistician, 38(2):94-105
Galton, F. (1886) Regression towards mediocrity in hereditary stature. Journal of the Anthropological
Institute, 15:246-263
Gayen, A.K. (1950) The distribution of the variance ratio in random samples of any size drawn from
non-normal universes. Biometrika, 37:236-255
Geary, R.C. (1947) Testing for normality. Biometrika, 34:209-242
Geary, R.C. (1954) The contiguity ratio and statistical mapping. The Incorporated Statistician, 5:115-
145
Geisser, S. and Greenhouse, S.W. (1958) An extension of Box's results on the use of the F-distribution
in multivariate analysis. Annals of Mathematical Statistics, 29:885-891
Gerrard, D.J. (1969) Competition quotient: a new measure of the competition affecting individual forest
trees. Research Bulletin No. 20, Michigan Agricultural Experiment Station, Michigan State
University
Gillis, P.R. and Ratkowsky, D.A. (1978) The behaviour of estimators of the parameters of various yield-
density relationships. Biometrics, 34:191-198
Gilmour, A.R., Cullis, B.R., and Verbyla, A.P. (1997) Accounting for natural and extraneous variation
in the analysis of field experiments. Journal of Agricultural, Biological, and Environmental
Statistics, 2(3):269-293
Godambe, V.P. (1960) An optimum property of regular maximum likelihood estimation. Annals of
Mathematical Statistics, 31:1208-1211
Golden, M.S., Knowe, S.A., and Tuttle, C.L. (1982) Cubic-foot volume for yellow-poplar in the hilly
coastal plain of Alabama. Southern Journal of Applied Forestry, 6:167-171
Goldberg, R.R. (1961) Fourier Transforms. Cambridge University Press, Cambridge
Goldberger, A.S. (1962) Best linear unbiased prediction in the generalized linear regression model,
Journal of the American Statistical Association, 57:369-375
Gompertz, B. (1825) On the nature of the function expressive of the law of human mortality, and on a
new method of determining the value of life contingencies. Phil. Trans. Roy. Soc., 513-585
Goodman, L.A. (1979a) Simple models for the analysis of association in cross-classifications having
ordered categories. Journal of the American Statistical Association, 74:537-552
Goodman, L.A. (1979b) Multiplicative models for square contingency tables with ordered categories.
Biometrika, 66:413-418
Goodman, L.A. (1985) The analysis of cross-classified data having ordered and/or unordered categories:
association models, correlation models, and asymmetry models for contingency tables with or
without missing entries. Annals of Statistics, 13:10-69
Goovaerts, P. (1997) Geostatistics for Natural Resources Evaluation. Oxford University Press, New
York
Goovaerts, P. (1998) Ordinary cokriging revisited. Journal of the International Association of
Mathematical Geology, 30:21-42
Gotway, C.A. and Stroup, W.W. (1997) A generalized linear model approach to spatial data analysis
and prediction. Journal of Agricultural, Biological, and Environmental Statistics, 2(2):157-178.
Graybill, F.A. (1969) Matrices with Applications in Statistics. 2nd ed. Wadsworth International,
Belmont, CA.
Greenhouse, S.W. and Geisser, S. (1959) On methods in the analysis of profile data. Psychometrika,
24:95-112
Greenwood, C. and Farewell, V. (1988) A comparison of regression models for ordinal data in an
analysis of transplanted-kidney function. Canadian Journal of Statistics, 16(4):325-335.
Gregoire, T.G. (1985) Generalized error structure for yield models fitted with permanent plot data.
Ph.D. dissertation, Yale University, New Haven, CT
Gregoire, T.G. (1987) Generalized error structure for forestry yield models. Forest Science, 33:423-444
Gregoire, T.G., Brillinger, D.R., Diggle, P.J., Russek-Cohen, E., Warren, W.G., and Wolfinger, R.D.
(eds). (1997) Modelling Longitudinal and Spatially Correlated Data. Springer-Verlag, New
York, 402 pp.
Gregoire, T.G., Schabenberger, O., and Barrett, J.P. (1995) Linear modelling of irregularly spaced,
unbalanced, longitudinal data from permanent plot measurements. Canadian Journal of Forest
Research, 25(1):137-156
Gregoire, T.G. and Schabenberger, O. (1996a) Nonlinear mixed-effects modeling of cumulative bole
volume with spatially correlated within-tree data. Journal of Agricultural, Biological, and
Environmental Statistics, 1(1):107-119
Gregoire, T.G. and Schabenberger, O. (1996b) A non-linear mixed-effects model to predict cumulative
bole volume of standing trees. Journal of Applied Statistics, 23(2&3):257-271
Griffith, D.A. (1996) Some guidelines for specifying the geographic weights matrix contained in spatial
statistical models. In: Practical Handbook of Spatial Statistics (S.L. Arlinghaus, ed.), CRC Press,
Boca Raton, FL, pp. 65-82
Griffith, D.A. and Layne, L.J. (1999) A Casebook for Spatial Statistical Data Analysis. A Compilation
of Analyses of Different Thematic Data Sets. Oxford University Press, New York
Grondona, M.O. and Cressie, N.A. (1991) Using spatial considerations in the analysis of experiments.
Technometrics, 33:381-392
Haining, R. (1990) Spatial Data Analysis in the Social and Environmental Sciences. Cambridge
University Press, Cambridge
Hampel, F.R. (1974) The influence curve and its role in robust estimation. Journal of the American
Statistical Association, 69:383-393
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., and Stahel, W.A. (1986) Robust Statistics, The
Approach Based on Influence Functions. John Wiley & Sons, New York
Härdle, W. (1990) Applied Nonparametric Regression. Cambridge University Press, Cambridge
Hanks, R.J., Sisson, D.V., Hurst, R.L., and Hubbard, K.G. (1980) Statistical analysis of results from
irrigation experiments using the line source sprinkler system. Journal of the American Soil
Science Society, 44:886-888
Harris, T.R. and Johnson, D.E. (1996) A regression model with spatially correlated errors for comparing
remote sensing and in-situ measurements of a grassland site. Journal of Agricultural, Biological,
and Environmental Statistics, 1:190-204
Hart, L.P. and Schabenberger, O. (1998) Variability of vomitoxin in a wheat scab epidemic. Plant
Disease, 82:625-630.
Hartley, H.O. (1950) The maximum F-ratio as a short-cut test for heterogeneity of variance.
Biometrika, 37:308-312
Hartley, H.O. (1961) The modified Gauss-Newton method for the fitting of nonlinear regression
functions by least squares. Technometrics, 3:269-280
Hartley, H.O. (1964) Exact confidence regions for the parameters in nonlinear regression laws.
Biometrika, 51:347-353
Hartley, H.O. and Booker, A. (1965) Nonlinear least square estimation. Annals of Mathematical
Statistics, 36(2):638-650
Harville, D.A. (1974) Bayesian inference for variance components using only error contrasts.
Biometrika, 61:383-385
Harville, D.A. (1976a) Extension of the Gauss-Markov theorem to include the estimation of random
effects. The Annals of Statistics, 4:384-395
Harville, D.A. (1976b) Confidence intervals and sets for linear combinations of fixed and random
effects. Biometrics, 32:320-395
Harville, D.A. (1977) Maximum-likelihood approaches to variance component estimation and to related
problems. Journal of the American Statistical Association, 72:320-340
Harville, D.A. and Jeske, D.R. (1992) Mean squared error of estimation or prediction under a general
linear model. Journal of the American Statistical Association, 87:724-731
Haseman, J.K. and Kupper, L.L. (1979) Analysis of dichotomous response data from certain toxicolo-
gical experiments. Biometrics, 35:281-293
Hastie, T.J. and Tibshirani, R.J. (1990) Generalized Additive Models. Chapman and Hall, New York
Hayes, W.L. (1973) Statistics for the Social Sciences. Holt, Rinehart and Winston, New York
Heagerty, P.J. and Lele, S.R. (1998) A composite likelihood approach to binary spatial data. Journal of
the American Statistical Association, 93:1099-1111
Healy, M.J.R. (1986) Matrices for Statistics. Clarendon Press, Oxford, UK
Hearn, A.B. (1972) Cotton spacing experiments in Uganda. Journal of Agricultural Science, 48:19-28
Hedeker, D. and Gibbons, R.D. (1994) A random effects ordinal regression model for multilevel analy-
sis. Biometrics, 50:933-944
Henderson, C.R. (1950) The estimation of genetic parameters. The Annals of Mathematical Statistics,
21:309-310
Henderson, C.R. (1963) Selection index and expected genetic advance. In: Statistical Genetics and
Plant Breeding (NRC Publication 982), Washington, D.C. National Academy of Sciences, pp.
141-163
Henderson, C.R. (1973) Sire evaluation and genetic trends. In: Proceedings of the Animal Breeding and
Genetics Symposium in Honor of Dr. J.L. Lush, Champaign, IL: ASAS and ADSA, pp. 10-41
Heyde, C.C. (1997) Quasi-likelihood and Its Application: A General Approach to Optimal Parameter
Estimation. Springer-Verlag, New York
Himmelblau, D.M. (1972) A uniform evaluation of unconstrained optimization techniques. In:
Numerical Methods for Nonlinear Optimization (F.A. Lootsma, ed.), Academic Press, London
Hinkelmann, K. and Kempthorne, O. (1994) Design and Analysis of Experiments. Volume I.
Introduction to Experimental Design. John Wiley & Sons, New York
Hoerl, A.E. and Kennard, R.W. (1970a) Ridge regression: biased estimation for nonorthogonal
problems. Technometrics, 12:55-67
Hoerl, A.E. and Kennard, R.W. (1970b) Ridge regression: applications to nonorthogonal problems.
Technometrics, 12:69-82
Holland, P.W. and Welsch, R.E. (1977) Robust regression using iteratively reweighted least squares.
Communications in Statistics A, 6:813-888
Holliday, R. (1960) Plant population and crop yield: Part I. Field Crop Abstracts, 13:159-167
Hoshmand, A.R. (1994) Experimental Research Design and Analysis. CRC Press, Boca Raton, FL
Hsiao, A.I., Liu, S.H., and Quick, W.A. (1996) Effect of ammonium sulfate on the phytotoxicity, foliar
uptake, and translocation of imazamethabenz in wild oat. Journal of Plant Growth Regulation,
15:115-120
Huber, P.J. (1964) Robust estimation of a location parameter. Annals of Mathematical Statistics, 35:
73-101
Huber, P.J. (1973) Robust regression: asymptotics, conjectures, and Monte Carlo. Annals of Statistics,
1:799-821
Huber, P.J. (1981) Robust Statistics. John Wiley & Sons, New York
Hurvich, C.M. and Simonoff, J.S. (1998) Smoothing parameter selection in nonparametric regression
using an improved Akaike information criterion. Journal of the Royal Statistical Society (B),
60:271-293
Huxley, J.S. (1932) Problems of Relative Growth. Dial Press, New York
Huynh, H. and Feldt, L.S. (1970) Conditions under which mean square ratios in repeated measurements
designs have exact F-distributions. Journal of the American Statistical Association, 65:1582-1589
Huynh, H. and Feldt, L.S. (1976) Estimation of the Box correction for degrees of freedom from sample
data in the randomized block and split plot designs. Journal of Educational Statistics, 1:69-82
Isaaks, E. and Srivastava, R. (1989) An Introduction to Applied Geostatistics. Oxford University Press,
New York
Jansen, J. (1990) On the statistical analysis of ordinal data when extravariation is present. Applied
Statistics, 39:75-84
Jennrich, R.J. and Schluchter, M.D. (1986) Unbalanced repeated-measures models with structured
covariance matrices. Biometrics, 42:805-820
Jensen, D.R. and Ramirez, D.E. (1998) Some exact properties of Cook's D_I. In: Handbook of Statistics,
Vol. 16 (Balakrishnan, N. and Rao, C.R., eds.), Elsevier Science Publishers, Amsterdam, pp.
387-402
Jensen, D.R. and Ramirez, D.E. (1999) Recovered errors and normal diagnostics in regression. Metrika,
49:107-119
Johnson, N.L., Kotz, S., and Kemp, A.W. (1992) Univariate Discrete Distributions, 2nd. ed., John
Wiley & Sons, New York
Johnson, N.L., Kotz, S., and Balakrishnan, N. (1995) Continuous Univariate Distributions, Vol. 2, 2nd
ed. John Wiley & Sons, New York
Jones, R.H. (1993) Longitudinal Data with Serial Correlation: A State-space Approach. Chapman and
Hall, New York
Jones, R.H. and Boadi-Boateng, F. (1991) Unequally spaced longitudinal data with AR(1) serial corre-
lation. Biometrics, 47:161-176
Journel, A.G. and Huijbregts, C.J. (1978) Mining Geostatistics. Academic Press, London
Kackar, R.N. and Harville, D.A. (1984) Approximations for standard errors of fixed and random effects
in mixed linear models. Journal of the American Statistical Association, 79:853-862
Kaluzny, S.P., Vega, S.C., Cardoso, T.P., and Shelly, A.A. (1998) S+ SpatialStats. User's Manual for
Windows® and Unix. Springer-Verlag, New York
Kempthorne, O. (1952) Design and Analysis of Experiments. John Wiley & Sons, New York
Kempthorne, O. (1955) The randomization theory of experimental inference. Journal of the American
Statistical Association, 50:946-967
Kempthorne, O. (1975) Fixed and mixed model analysis of variance. Biometrics, 31:473-486
Kempthorne, O. and Doerfler, T.E. (1969) The behaviour of some significance tests under randomiza-
tion. Biometrika, 56:231-248
Kendall, M.G. and Stuart, A. (1961) The Advanced Theory of Statistics, Vol 2. Griffin, London
Kenward, M.G. (1987) A method for comparing profiles of repeated measurements. Applied Statistics,
36:296-308
Kenward, M.G. and Roger, J.H. (1997) Small sample inference for fixed effects from restricted
maximum likelihood. Biometrics, 53:983-997
Kianifard, F. and Swallow, W. H. (1996) A review of the development and application of recursive
residuals in linear models. Journal of the American Statistical Association, 91:391-400
Kirby, E.J.M. (1974) Ear development in spring wheat. Journal of Agricultural Science, 82:437-447
Kirk, H.J., Haynes, F.L., and Monroe, R.J. (1980) Application of trend analysis to horticultural field
trials. Journal of the American Society of Horticultural Science, 105:189-193
Kirk, R.E. (1995) Experimental Design: Procedures for the Behavioral Sciences, 3rd ed., Duxbury
Press, Belmont, CA
Kitanidis, P.K. (1983) Statistical estimation of polynomial generalized covariance functions and hydro-
logical applications. Water Resources Research, 19:909-921
Kitanidis, P.K. and Lane, R.W. (1985) Maximum likelihood parameter estimation of hydrological spa-
tial processes by the Gauss-Newton method. Journal of Hydrology, 79:53-71
Kitanidis, P.K. and Vomvoris, E.G. (1983) A geostatistical approach to the inverse problem in ground-
water modeling (steady state) and one-dimensional simulations. Water Resources Research,
19:677-690
Knoebel, B.R., Burkhart, H.E., and Beck, D.E. (1984) Stem volume and taper functions for yellow-
poplar in the southern Appalachians. Southern Journal of Applied Forestry, 8:185-188
Korn, E.L. and Whittemore, A.S. (1979) Methods for analyzing panel studies of acute health effects of
air pollution. Biometrics, 35:795-802
Kvålseth, T.O. (1985) Cautionary note about R². The American Statistician, 39(4):279-285
Läärä, E. and Matthews, J.N.S. (1985) The equivalence of two models for ordinal data. Biometrika,
72:206-207.
Lærke, P.E. and Streibig, J.C. (1995) Foliar absorption of some glyphosate formulations and their
efficacy on plants. Pesticide Science, 44:107-116
Laird, A.K. (1965) Dynamics of relative growth. Growth, 29:249-263
Laird, N.M. (1988) Missing data in longitudinal studies. Statistics in Medicine, 7:305-315
Laird, N.M. and Louis, T.A. (1982) Approximate posterior distributions for incomplete data problems.
Journal of the Royal Statistical Society (B), 44:190-200
Laird, N.M. and Ware, J.H. (1982) Random-effects models for longitudinal data. Biometrics, 38:963-
974
Lee, K.R. and Kapadia, C.H. (1984) Variance component estimators for the balanced two-way mixed
model. Biometrics, 40:507-512
Lele, S. (1997) Estimating functions for semivariogram estimation. In: Selected Proceedings of the
Symposium on Estimating Functions (I.V. Basawa, V.P. Godambe, and R.L. Taylor, eds.),
Hayward, CA: Institute of Mathematical Statistics, pp. 381-396.
Lerman, P.M. (1980) Fitting segmented regression models by grid search. Applied Statistics, 29:77-84
Levenberg, K. (1944) A method for the solution of certain problems in least squares. Quarterly of
Applied Mathematics, 2:164-168
Levene, H. (1960) Robust test for equality of variances. In Contributions to Probability and Statistics,
I. Olkin (ed.). pp. 278-292. Stanford University Press, Stanford, CA
Lewis, P.A.W. and Shedler, G.S. (1979) Simulation of non-homogeneous Poisson processes by
thinning. Naval Research Logistics Quarterly, 26:403-413
Liang, K.-Y. and Zeger, S.L. (1986) Longitudinal data analysis using generalized linear models.
Biometrika, 73:13-22
Liang, K.-Y., Zeger, S.L., and Qaqish, B. (1992) Multivariate regression analysis for categorical data.
Journal of the Royal Statistical Society (B), 54:3-40
Lindsay, B.G. (1988) Composite likelihood methods. Contemporary Mathematics, 80:221-239
Lindstrom, M.J. and Bates, D.M. (1988) Newton-Raphson and EM algorithms for linear mixed-effects
models for repeated measures data. Journal of the American Statistical Association, 83:1014-1022
Lindstrom, M.J. and Bates, D.M. (1990) Nonlinear mixed effects models for repeated measures data.
Biometrics, 46:673-687
Littell, R.C., Milliken, G.A., Stroup, W.W., and Wolfinger, R.D. (1996) SAS® System for Mixed
Models. SAS Institute Inc., Cary, NC
Little, R.J. and Rubin, D.B. (1987) Statistical Analysis with Missing Data. John Wiley & Sons, New
York
Longford, N.T. (1993) Random Coefficient Models. Clarendon Press, Oxford, UK
Lumer, H. (1937) The consequences of sigmoid growth for relative growth functions. Growth, 1:140-
154
Machiavelli, R.E. and Arnold, S.F. (1994) Variable order antedependence models. Communications in
Statistics - Theory and Methods, 23:2683-2699
Maddala, G.S. (1983) Limited-Dependent and Qualitative Variables in Econometrics. Cambridge
University Press, Cambridge
Magee, L. (1990) R² measures based on Wald and likelihood ratio joint significance tests. The Ameri-
can Statistician, 44:250-253
Magnus, J.R. (1988) Matrix Differential Calculus with Applications in Statistics and Econometrics.
John Wiley & Sons, New York
Mallows, C.L. (1973) Some comments on Cp. Technometrics, 15:661-675
Marquardt, D.W. (1963) An algorithm for least squares estimation of nonlinear parameters. Journal of
the Society for Industrial and Applied Mathematics, 11:431-441
Matheron, G. (1962) Traité de Géostatistique Appliquée, Tome I. Mémoires du Bureau de Recherches
Géologiques et Minières, No. 14. Editions Technip, Paris
Matheron, G. (1963) Principles of geostatistics. Economic Geology, 58:1246-1266
Matheron, G. (1971) The theory of regionalized variables and its applications. Cahiers du Centre de
Morphologie Mathématique, No. 5. Fontainebleau, France
Mays, J., Birch, J.B., and Starnes, B. (2001) Model robust regression: combining parametric, nonpara-
metric, and semiparametric methods. Journal of Nonparametric Statistics, 13:245-277
McCullagh, P. (1980) Regression models for ordinal data. Journal of the Royal Statistical Society (B),
42:109-142.
McCullagh, P. (1983) Quasi-likelihood functions. The Annals of Statistics, 11:59-67
McCullagh, P. (1984) On the elimination of nuisance parameters in the proportional odds model. Jour-
nal of the Royal Statistical Society (B), 46:250-256.
McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models. 2nd ed. Chapman and Hall, New
York
McKean, J.W. and Schrader, R.M. (1987) Least absolute errors analysis of variance. In: Statistical Data
Analysis Based on the L1-Norm and Related Methods (Dodge, Y., ed.), North-Holland, New York
McLean, R.A., Sanders, W.L., and Stroup, W.W. (1991) A unified approach to mixed linear models.
The American Statistician, 45:54-64
McPherson, G. (1990) Statistics in Scientific Investigation. Springer-Verlag, New York
McShane, L.M., Albert, P.S., and Palmatier, M.A. (1997) A latent process regression model for spatially
correlated count data. Biometrics, 53:698-706
Mead, R. (1967) A mathematical model for the estimation of inter-plant competition. Biometrics,
23:189-205
Mead, R. (1970) Plant density and crop yield. Applied Statistics, 19:64-81
Mead, R. (1979) Competition experiments. Biometrics, 35:41-54
Mead, R., Curnow, R.N., and Hasted, A.M. (1993) Statistical Methods in Agriculture and Experimental
Biology, 2nd ed. Chapman and Hall/CRC Press LLC, New York and Boca Raton, FL
Mercer, W.B. and Hall, A.D. (1911) The experimental error of field trials. Journal of Agricultural
Science, 4:107-132
Miller, M.D., Mikkelsen, D.S., and Huffaker, R.C. (1962) Effects of stimulatory and inhibitory levels of
2,4-D, iron, and chelate supplements on juvenile growth of field beans. Crop Science, 2:111-114
Milliken, G.A. and Johnson, D.E. (1992) Analysis of Messy Data. Volume 1: Designed Experiments.
Chapman and Hall, New York
Minot, C.S. (1908) The Problem of Age, Growth and Death: A Study of Cytomorphosis. Knickerbocker
Press, New York
Mitscherlich, E.A. (1909) Das Gesetz des Minimums und das Gesetz des Abnehmenden Bodenertrags.
Zeitschrift für Pflanzenernährung, Düngung und Bodenkunde, 12:273-282
Moore, E.H. (1920) On the reciprocal of the general algebraic matrix. Bulletin of the American Mathe-
matical Society, 26:394-395
Moran, P.A.P. (1948) The interpretation of statistical maps. Journal of the Royal Statistical Society (B),
10:243-251
Moran, P.A.P. (1950) Notes on continuous stochastic phenomena. Biometrika, 37:17-23
Moran, P.A.P. (1971) Estimating structural and functional relationships. Journal of Multivariate
Analysis, 1:232-255
Morgan, P.H., Mercer, L.P., and Flodin, N.W. (1975) General model for nutritional responses of higher
organisms. Proceedings of the National Academy of Science, USA, 72:4327-4331
Morris, G.L. and Odell, P.L. (1968) A characterization for generalized inverses of matrices. SIAM
Review, 10(2):208-211
Mueller, T.G. (1998) Accuracy of soil property maps for site-specific management. Ph.D. dissertation,
Michigan State University, East Lansing, MI (Diss Abstr. 99-22353, Diss Abstr. Int. 60B:0901)
Mueller, T.G., Pierce, F.J., Schabenberger, O., and Warncke, D.D. (2001) Map quality for site-specific
fertility management. Journal of the Soil Science Society of America, 65: 1547-1558
Myers, R.H. (1990) Classical and Modern Regression with Applications, 2nd ed. Duxbury Press,
Boston
Nadaraya, E.A. (1964) On estimating regression. Theory of Probability and its Applications, 10:186-
190
Nagelkerke, N.J.D. (1991) A note on a general definition of the coefficient of determination.
Biometrika, 78:691-692
Nelder, J.A. and Wedderburn, R.W.M. (1972) Generalized linear models. Journal of the Royal
Statistical Society (A), 135:370-384
Neter, J., Wasserman, W., and Kutner, M.H. (1990) Applied Linear Statistical Models. 3rd ed., Irwin,
Boston, MA
Neuman, S.P. and Jacobson, E.A. (1984) Analysis of nonintrinsic spatial variability by residual kriging
with applications to regional groundwater levels. Journal of the International Association of
Mathematical Geology, 16:499-521
Newberry, J.D. and Burk, T.E. (1985) SB distribution-based models for individual tree merchantable
volume-total volume ratios. Forest Science, 31:389-398
Neyman, J. and Scott, E.L. (1972) Processes of clustering and applications. In: Stochastic Point
Processes (P.A.W. Lewis, ed.). John Wiley & Sons, New York, pp. 646-681
Nichols, M.A. (1974a) Effect of sowing rate and fertilizer application on the yield of dwarf beans. New
Zealand Journal of Experimental Agriculture, 2:155-158
Nichols, M.A. (1974b) A plant spacing study with sweet corn. New Zealand Journal of Experimental
Agriculture, 2:377-379
Nichols, M.A. and Nonnecke, I.L. (1974) Plant spacing studies with processing peas in Ontario, Canada.
Scientia Horticulturae, 2:112-122
Nichols, M.A., Nonnecke, I.L., and Pathak, S.C. (1973) Plant density studies with direct seeded
tomatoes in Ontario, Canada. Scientia Horticulturae, 1:309-320
Olkin, I., Gleser, L.J., and Derman, C. (1978) Probability Models and Applications. Macmillan
Publishing, New York
Ord, J.K. (1975) Estimation methods for models of spatial interaction. Journal of the American Statisti-
cal Association, 70:120-126
Papadakis, J.S. (1937) Méthode statistique pour des expériences sur champ. Bull. Inst. Amelior. Plant.
Thessalonique, 23
Patterson, H.D. and Thompson, R. (1971) Recovery of inter-block information when block sizes are un-
equal. Biometrika, 58:545-554
Pázman, A. (1993) Nonlinear Statistical Models. Kluwer Academic Publishers, London
Pearl, R. and Reed, L.J. (1924) The probable error of certain constants of the population growth curve.
American Journal of Hygiene, 4(3):237-240
Pearson, E.S. (1931) The analysis of variance in case of non-normal variation. Biometrika, 23:114-133
Penrose, R. (1955) A generalized inverse for matrices. Proceedings of the Cambridge Philosophical
Society, 51:406-413
Petersen, R.G. (1994) Agricultural Field Experiments. Design and Analysis. Marcel Dekker, New York.
Pierce, F.J., Fortin, M.-C., and Staton, M.J. (1994) Periodic plowing effects on soil properties in a no-till
farming system. Journal of the American Soil Science Society, 58:1782-1787
Pierce, F.J. and Warncke, D.D. (2000) Soil and crop response to variable-rate liming in two Michigan
fields. Journal of the Soil Science Society of America, 64:774-780
Pinheiro, J.C. and Bates, D.M. (1995) Approximations to the log-likelihood function in the nonlinear
mixed-effects model. Journal of Computational and Graphical Statistics, 4:12-35.
Potthoff, R.F. and Roy, S.N. (1964) A generalized multivariate analysis of variance model useful
especially for growth curve problems. Biometrika, 51:313-326
Prasad, N.G.N. and Rao, J.N.K. (1990) The estimation of the mean squared error of small-area
estimators. Journal of the American Statistical Association, 85:163-171
Prentice, R.L. (1988) Correlated binary regression with covariates specific to each binary observation.
Biometrics, 44:1044-1048
Press, W.H., Teukolsky, S.A., Vetterling, W.T., and Flannery, B.P. (1992) Numerical Recipes. The Art
of Scientific Computing. 2nd ed. Cambridge University Press, New York
Priebe, D.L. and Blackmer, A.M. (1989) Preferential movement of oxygen-18-labeled water and nitro-
gen-15-labeled urea through macropores in a Nicollet soil. Journal of Environmental Quality,
18:66-72
Quiring, D.P. (1941) The scale of being according to the power formula. Growth, 2:335-346
Radosevich, S.R. and Holt, J.S. (1984) Weed Ecology. John Wiley & Sons, New York
Ralston, M.L. and Jennrich, R.I. (1978) DUD, a derivative-free algorithm for nonlinear least squares.
Technometrics, 20:7-14
Rao, C.R. (1965) The theory of least squares when the parameters are stochastic and its application to
the analysis of growth curves. Biometrika, 52:447-458
Rao, C.R. and Mitra, S.K. (1971) Generalized Inverse of Matrices and its Applications. John Wiley &
Sons, New York
Rasse, D.P., Smucker, A.J.M., and Schabenberger, O. (1999) Modifications of soil nitrogen pools in res-
ponse to alfalfa root systems and shoot mulch. Agronomy Journal, 91:471-477
Ratkowsky, D.A. (1983) Nonlinear Regression Modeling. Marcel Dekker, New York
Ratkowsky, D.A. (1990) Handbook of Nonlinear Regression Models. Marcel Dekker, New York
Reed, R.R. (2000) Factors influencing biotite weathering. M.S. Thesis, Department of Crop and Soil
Environmental Sciences, Virginia Polytechnic Institute and State University (Available at
http://scholar.lib.vt.edu/theses)
Rennolls, K. (1993) Forest height growth modeling. In: Proceedings from the IUFRO Conference,
Copenhagen, June 14-17, 1993. Forskningsserien Nr. 3, 231-238
Richards, F.J. (1959) A flexible growth function for empirical use. Journal of Experimental Botany,
10:290-300
Rigas, A.G. (1991) Spectral analysis of stationary point processes using the fast Fourier transform
algorithm. Journal of Time Series Analysis. 13:441-450
Ripley, B.D. (1976) The second-order analysis of stationary point processes. Journal of Applied
Probability, 13:255-266
Ripley, B.D. (1977) Modeling spatial patterns. Journal of the Royal Statistical Society (B), 39:172-192
(with discussion, 192-212)
Ripley, B.D. (1981) Spatial Statistics. John Wiley & Sons, New York
Ripley, B.D. (1988) Statistical Inference for Spatial Processes. Cambridge University Press, Cambridge
Ripley, B.D. and Silverman, B.W. (1978) Quick tests for spatial interaction. Biometrika, 65:641-642
Roberts, H.A., Chancellor, R.J., and Hill, T.A. (1982) The biology of weeds. In: Weed Control
Handbook: Principles, 7th ed. (H.A. Roberts, ed.). Blackwell Scientific, Oxford, pp. 1-36
Robertson, T.B. (1923) The Chemical Basis of Growth and Senescence. J.B. Lippincott Co., Phila-
delphia and London
Robinson, G.K. (1991) That BLUP is a good thing: the estimation of random effects. Statistical Science,
6(1):15-51
Rohde, C.A. (1966) Some results on generalized inverses. SIAM Review, 8(2):201-205
Rubin, D.B. (1976) Inference and missing data. Biometrika, 63:581-592
Rubinstein, R.Y. (1981) Simulation and the Monte Carlo Method. John Wiley & Sons, New York
Russo, D. (1984) Design of an optimal sampling network for estimating the variogram. Journal of the
Soil Science Society of America, 48:708-716
Russo, D. and Bresler, E. (1981) Soil hydraulic properties as stochastic processes, 1. An analysis of
field spatial variability. Journal of the Soil Science Society of America, 45:682-687
Russo, D. and Jury, W.A. (1987a) A theoretical study of the estimation of the correlation scale in
spatially variable fields. 1. Stationary fields. Water Resources Research, 7:1257-1268
Russo, D. and Jury, W.A. (1987b) A theoretical study of the estimation of the correlation scale in
spatially variable fields. 2. Nonstationary fields. Water Resources Research, 7:1269-1279
Sahai, H. and Ageel, M.I. (2000) The Analysis of Variance. Fixed, Random and Mixed Models. Birk-
häuser, Boston
Sandland, R.L. (1983) Mathematics and the growth of organisms — some historical impressions.
Mathematical Scientist, 8:11-30
Sandland, R.L. and McGilchrist, C.A. (1979) Stochastic growth curve analysis. Biometrics, 35:255-272
Sandral, G.A., Dear, B.S., Pratley, J.E., and Cullis, B.R. (1997) Herbicide dose rate response curves in
subterranean clover determined by a bioassay. Australian Journal of Experimental Agriculture,
37:67-74
Satterthwaite, F.E. (1946) An approximate distribution of estimates of variance components. Biometrics,
2:110-114
Schabenberger, O. (1994) Nonlinear mixed effects growth models for repeated measures in ecology. In:
Proceedings of the Section on Statistics and the Environment, Annual Joint Statistical Meetings,
Toronto, Canada, Aug. 13-18, 1994, pp. 156-161
Schabenberger, O. (1995) The use of ordinal response methodology in forestry. Forest Science,
41(2):321-336.
Schabenberger, O. and Birch, J.B. (2001) Statistical dose-response models with hormetic effects.
International Journal of Human and Ecological Risk Assessment, 7(4):891-908
Schabenberger, O. and Gregoire, T.G. (1995) A conspectus on estimating function theory and its
applicability to recurrent modeling issues in forest biometry. Silva Fennica, 29(1):49-70
Schabenberger, O. and Gregoire, T.G. (1996) Population-averaged and subject-specific approaches for
clustered categorical data. Journal of Statistical Computation and Simulation, 54:231-253
Schabenberger, O., Gregoire, T.G., and Burkhart, H.E. (1995) Commentary: Multi-state models for
monitoring individual trees in permanent observation plots by Urfer, W., Schwarzenbach, F.H.
Kütting, J., and Müller, P. Journal of Environmental and Ecological Statistics, 1(3):171-199
Schabenberger, O., Gregoire, T.G., and Kong, F. (2000) Collections of simple effects and their relation-
ship to main effects and interactions in factorials. The American Statistician, 54:210-214
Schabenberger, O., Tharp, B.E., Kells, J.J., and Penner, D. (1999) Statistical tests for hormesis and
effective dosages in herbicide dose response. Agronomy Journal, 91:713-721
Schnute, J. and Fournier, D. (1980) A new approach to length-frequency analysis: growth structure.
Canadian Journal of Fisheries and Aquatic Science, 37:1337-1351
Schrader, R.M. and Hettmansberger, T.P. (1980) Robust analysis of variance based on a likelihood
criterion. Biometrika, 67:93-101
Schrader, R.M. and McKean, J.W. (1977) Robust analysis of variance. Communications in Statistics A,
6:879-894
Schulz, H. (1888) Über Hefegifte. Pflügers Archiv der Gesellschaft für Physiologie, 42:517-541
Schumacher, F.X. (1939) A new growth curve and its application to timber yield studies. Journal of
Forestry, 37:819-820
Schwarz, G. (1978) Estimating the dimension of a model. Annals of Statistics, 6:461-464
Schwarzbach, W. (1984) A new approach in the evaluation of field trials: The determination of the most
likely genetic ranking of varieties. Proceedings EUCARPIA Cer. Sect. Meet., Vortr. Pflanzen-
zucht, 6:249-259
Schwertman, N.C. (1996) A connection between quadratic-type confidence limits and fiducial limits.
The American Statistician, 50(3):242-243
Searle, S.R. (1971) Linear Models. John Wiley & Sons, New York
Searle, S.R. (1982) Matrix Algebra Useful for Statisticians. John Wiley & Sons, New York
Searle, S.R. (1987) Linear Models for Unbalanced Data. John Wiley & Sons, New York
Searle, S.R., Casella, G., and McCulloch, C.E. (1992) Variance Components. John Wiley & Sons, New
York
Seber, G.A.F. and Wild, C.J. (1989) Nonlinear Regression. John Wiley & Sons, New York
Seefeldt, S.S., Jensen, J.E., and Fuerst, P. (1995) Log-logistic analysis of herbicide dose-response
relationships. Weed Technology, 9:218-227
Shapiro, S.S. and Wilk, M.B. (1965) An analysis of variance test for normality (complete samples).
Biometrika, 52:591-612
Sharples, K. and Breslow, N. (1992) Regression analysis of correlated binary data: some small sample
results for the estimating equation approach. Journal of Statistical Computation and Simulation,
42:1-20
Sheiner, L.B. and Beal, S.L. (1980) Evaluation of methods for estimating population pharmacokinetic
parameters. I. Michaelis-Menten model: routine clinical pharmacokinetic data. Journal of
Pharmacokinetics and Biopharmaceutics, 8:553-571
Sheiner, L.B. and Beal, S.L. (1985) Pharmacokinetic parameter estimates from several least squares
procedures: Superiority of extended least squares. Journal of Pharmacokinetics and Biopharma-
ceutics, 13:185-201
Shinozaki, K. and Kira, T. (1956) Intraspecific competition among higher plants. VII. Logistic theory of
the C-D effect. J. Inst. Polytech. Osaka City University, D7:35-72
Snedecor, G.W. and Cochran, W.G. (1989) Statistical Methods, 8th ed. Iowa State University Press,
Ames, Iowa.
Solie, J.B., Raun, W.R., and Stone, M.L. (1999) Submeter spatial variability of selected soil and ber-
mudagrass production variables. Journal of the Soil Science Society of America, 63:1724-1733
Steel, R.G.D., Torrie, J.H., and Dickey, D.A. (1997) Principles and Procedures of Statistics. A Biomet-
rical Approach. McGraw-Hill, New York.
Stein, M.L. (1999) Interpolation of Spatial Data. Some Theory of Kriging. Springer-Verlag, New York
Stevens, W.L. (1951) Asymptotic regression. Biometrics, 7:247-267
Streibig, J.C. (1980) Models for curve-fitting herbicide dose response data. Acta Agriculturæ Scandina-
vica, 30:59-63
Streibig, J.C. (1981) A method for determining the biological effect of herbicide mixtures. Weed
Science, 29:469-473
Stroup, W.W., Baenziger, P.S., and Mulitze, D.K. (1994) Removing spatial variation from wheat yield
trials: a comparison of methods. Crop Science, 86:62-66.
Sweeting, T.J. (1980) Uniform asymptotic normality of the maximum likelihood estimator. Annals of
Statistics, 8:1375-1381
Swinton, S.M. and Lyford, C.P. (1996) A test for choice between hyperbolic and sigmoidal models of
crop yield response to weed density. Journal of Agricultural, Biological, and Environmental
Statistics, 1:97-106
Tanner, M.A. and Young, M.A. (1985) Modeling ordinal scale disagreement. Psychological Bulletin,
98:408-415
Tharp, B.E., Schabenberger, O., and Kells, J.J. (1999) Response of annual weed species to glufosinate
and glyphosate. Weed Technology, 13:542-547
Theil, H. (1971) Principles of Econometrics. John Wiley & Sons, New York
Thimann, K.V. (1956) Promotion and inhibition: twin themes of physiology. The American Naturalist,
90:145-162
Thompson, R. and Baker, R.J. (1981) Composite link functions in generalized linear models. Applied
Statistics, 30:125-131
Thornley, J.H.M. and Johnson, I.R. (1990) Plant and Crop Models. Clarendon Press, Oxford, UK
Tobler, W. (1970) A computer movie simulating urban growth in the Detroit region. Economic Geogra-
phy, 46:234-240
Tukey, J.W. (1949) One degree of freedom for nonadditivity. Biometrics, 5:232-242
Tukey, J.W. (1977) Exploratory Data Analysis. Addison-Wesley, Reading, MA
Tweedie, M.C.K. (1945) Inverse statistical variates. Nature, 155:453
Tweedie, M.C.K. (1957a) Statistical properties of inverse Gaussian distributions I. Annals of Mathemat-
ical Statistics, 28:362-377
Tweedie, M.C.K. (1957b) Statistical properties of inverse Gaussian distributions II. Annals of Mathe-
matical Statistics, 28:696-705
UNSCEAR (1958) Report of the United Nations Scientific Committee on the Effects of Atomic Radia-
tion. Official Records of the General Assembly, 13th Session, Supplement No. 17.
Upton, G.J.G. and Fingleton, B. (1985) Spatial Data Analysis by Example, Vol.1: Point Pattern and
Quantitative Data. John Wiley & Sons, New York
Urquhart, N.S. (1968) Computation of generalized inverse matrices which satisfy specified conditions.
SIAM Review, 10(2):216-218
Utomo, I.H. (1981) Weed competition in upland rice. In: Proceedings of the 8th Asian-Pacific Weed
Science Society Conference, Vol II: 101-107
Valentine, H.T. and Gregoire, T.G. (2001) A switching model of bole taper. Canadian Journal of Forest
Research. To appear
Van Deusen, P.C., Sullivan, A.D., and Matney, T.G. (1981) A prediction system for cubic foot volume
of loblolly pine applicable through much of its range. Southern Journal of Applied Forestry,
5:186-189
Verbeke, G. and Molenberghs, G. (1997) Linear Mixed Models in Practice: A SAS-oriented Approach.
Springer-Verlag, New York
Verbyla, A.P., Cullis, B.R., Kenward, M.G., and Welham S.J. (1999) The analysis of designed experi-
ments and longitudinal data by using smoothing splines. Applied Statistics, 48:269-311
Vitosh, M.L., Johnson, J.W., and Mengel, D.B. (1995) Tri-state fertilizer recommendations for corn,
soybeans, wheat and alfalfa. Michigan State University Extension Bulletin E-2567.
Von Bertalanffy, L. (1957) Quantitative laws in metabolism and growth. Quarterly Reviews in Biology,
32:217-231
Vonesh, E.F. and Carter, R.L. (1992) Mixed-effects nonlinear regression for unbalanced repeated meas-
ures. Biometrics, 48:1-17
Vonesh, E.F. and Chinchilli, V.M. (1997) Linear and Nonlinear Models for the Analysis of Repeated
Measurements. Marcel Dekker, New York
Wakeley, J.T. (1949) Annual Report of the Soils-Weather Project, 1948. University of North Carolina
(Raleigh) Institute of Statistics Mimeo Series, 19
Wallsten, T.S. and Budescu, D.V. (1981) Adaptivity and nonadditivity in judging MMPI profiles.
Journal of Experimental Psychology: Human Perception and Performance, 7:1096-1109
Walters, K.J., Hosfield, G.L., Uebersax, M.A., and Kelly, J.D. (1997) Navy bean canning quality:
correlations, heritability estimates, and randomly amplified polymorphic DNA markers associa-
ted with component traits. Journal of the American Society for Horticultural Science, 122(3):
338-343
Wang, Y.H. (2000) Fiducial intervals: what are they? The American Statistician, 54(2):105-111
Warrick, A.W. and Myers, D.E. (1987) Optimization of sampling locations for variogram calculations.
Water Resources Research, 23:496-500
Watson, G.S. (1964) Smooth regression analysis. Sankhya (A), 26:359-372
Watts, D.G. and Bacon, D.W. (1974) Using an hyperbola as a transition model to fit two-regime
straight-line data. Technometrics, 16:369-373
Waugh, D.L., Cate Jr., R.B., and Nelson, L.A. (1973) Discontinuous models for rapid correlation, inter-
pretation, and utilization of soil analysis and fertilizer response data. International Soil Fertility
Evaluation and Improvement Program, Technical Bulletin No. 7, North Carolina State Univer-
sity, Raleigh, NC
Webster, R. and Oliver, M.A. (1992) Sample adequately to estimate variograms for soil properties.
Journal of Soil Science, 43:177-192
Wedderburn, R.W.M. (1974) Quasilikelihood functions, generalized linear models and the Gauss-New-
ton method. Biometrika, 61:439-447
Welch, B.L. (1937) The significance of the difference between two means when the population varianc-
es are unequal. Biometrika, 29:350-362
White, H. (1980) A heteroskedasticity-consistent covariance matrix estimator and a direct test for
heteroskedasticity. Econometrica, 48:817-838
White, H. (1982) Maximum likelihood estimation of misspecified models. Econometrica, 50:1-25
Whittle, P. (1954) On stationary processes in the plane. Biometrika, 41:434-449
Wiebe, G.A. (1935) Variation and correlation among 1500 wheat nursery plots. Journal of Agricultural
Research, 50:331-357
Wiedman, S.J. and Appleby, A.P. (1972) Plant growth stimulation by sublethal concentrations of herbi-
cides. Weed Research, 12:65-74
Wilkinson, G.N., Eckert, S.R., Hancock, T.W., and Mayo, O. (1983) Nearest neighbor (NN) analysis of
field experiments (with discussion). Journal of the Royal Statistical Society (B), 45:152-212
Winer, B.J. (1971) Statistical Principles in Experimental Design. McGraw-Hill, New York
Wishart, J. (1938) Growth rate determinations in nutrition studies with the bacon pig, and their analysis.
Biometrika, 30:16-28
Wolfinger, R. (1993a) Covariance structure selection in general mixed models. Communications in
Statistics, Simulation and Computation, 22(4):1079-1106
Wolfinger, R. (1993b) Laplace's approximation for nonlinear mixed models. Biometrika, 80:791-795
Wolfinger, R. and O'Connell, M. (1993) Generalized linear mixed models: a pseudo-likelihood
approach. Journal of Statistical Computation and Simulation, 48:233-243
Wolfinger, R., Tobias, R., and Sall, J. (1994) Computing Gaussian likelihoods and their derivatives for
general linear mixed models. SIAM Journal on Scientific and Statistical Computing, 15:1294-
1310
Xu, W., Tran, T., Srivastava, R., and Journel, A.G. (1992) Integrating seismic data in reservoir model-
ing: the collocated cokriging alternative. SPE Paper 24742, 67th Annual Technical Conference
and Exhibition.
Yandell, B.S. (1997) Practical Data Analysis for Designed Experiments. Chapman and Hall, New York
Yates, F. (1936) Incomplete randomized blocks. Annals of Eugenics, 7:121-140
Yates, F. (1940) The recovery of inter-block information in balanced incomplete block designs. Annals
of Eugenics, 10:317-325
Zeger, S.L. and Harlow, S.D. (1987) Mathematical models from laws of growth to tools for biological
analysis: fifty years of Growth. Growth, 51:1-21
Zeger, S.L. and Liang, K.-Y. (1986) Longitudinal data analysis for discrete and continuous outcomes.
Biometrics, 42:121-130
Zeger, S.L. and Liang, K.-Y. (1992) An overview of methods for the analysis of longitudinal data. Sta-
tistics in Medicine, 11:1825-1839
Zeger, S.L., Liang, K.-Y., and Albert, P.S. (1988) Models for longitudinal data: a generalized estimating
equation approach. Biometrics, 44:1049-1060
Zhao, L.P. and Prentice, R.L. (1990) Correlated binary regression using a quadratic exponential model.
Biometrika, 77:642-648
Zheng, L. and Silliman, S.E. (2000) Estimating the theoretical semivariogram from finite numbers of
measurements. Water Resources Research, 36:361-366
Zimdahl, R.L. (1980) Weed-Crop Competition: A Review. International Plant Protection Center, USA
Zimmerman, D.L. (1989) Computationally exploitable structure of covariance matrices and generalized
covariance matrices in spatial models. Journal of Statistical Computation and Simulation, 32:
1-15
Zimmerman, D.L. and Harville, D.A. (1991) A random field approach to the analysis of field-plot
experiments and other spatial experiments. Biometrics, 47:223-239.
Zimmerman, D.L. and Núñez-Antón, V. (1997) Structured antedependence models for longitudinal
data. In: Modelling Longitudinal and Spatially Correlated Data (Gregoire, T.G., Brillinger, D.R.,
Diggle, P.J., Russek-Cohen, E., Warren, W.G., and Wolfinger, R.D., eds). Springer-Verlag, New
York, pp. 63-76
Zimmerman, D.L. and Zimmerman, M.B. (1991) A comparison of spatial semivariogram estimators and
corresponding kriging predictors. Technometrics, 33:77-91