


CONTEMPORARY
STATISTICAL MODELS
for the Plant and Soil Sciences

Oliver Schabenberger
Francis J. Pierce

CRC PR E S S
Boca Raton London New York Washington, D.C.

Library of Congress Cataloging-in-Publication Data

Schabenberger, Oliver.
Contemporary statistical models for the plant and soil sciences / Oliver Schabenberger
and Francis J. Pierce.
p. cm.
Includes bibliographical references (p. ).
ISBN 1-58488-111-9 (alk. paper)
1. Plants, Cultivated—Statistical methods. 2. Soil science—Statistical methods. I.
Pierce, F. J. (Francis J.) II. Title.

SB91 .S36 2001


630′.727—dc21 2001043254

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with
permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish
reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials
or for the consequences of their use.

Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical,
including photocopying, microfilming, and recording, or by any information storage or retrieval system, without prior
permission in writing from the publisher.

The consent of CRC Press LLC does not extend to copying for general distribution, for promotion, for creating new works,
or for resale. Specific permission must be obtained in writing from CRC Press LLC for such copying.

Direct all inquiries to CRC Press LLC, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation, without intent to infringe.

Visit the CRC Press Web site at www.crcpress.com

© 2002 by CRC Press LLC

No claim to original U.S. Government works


International Standard Book Number 1-58488-111-9
Library of Congress Card Number 2001043254
Printed in the United States of America 1 2 3 4 5 6 7 8 9 0
Printed on acid-free paper



To Lisa and Linda
for enduring support and patience



Contents

Preface
About the Authors

1 Statistical Models
1.1 Mathematical and Statistical Models
1.2 Functional Aspects of Models
1.3 The Inferential Steps — Estimation and Testing
1.4 t-Tests in Terms of Statistical Models
1.5 Embedding Hypotheses
1.6 Hypothesis and Significance Testing — Interpretation of the p-Value
1.7 Classes of Statistical Models
1.7.1 The Basic Component Equation
1.7.2 Linear and Nonlinear Models
1.7.3 Regression and Analysis of Variance Models
1.7.4 Univariate and Multivariate Models
1.7.5 Fixed, Random, and Mixed Effects Models
1.7.6 Generalized Linear Models
1.7.7 Errors in Variable Models

2 Data Structures
2.1 Introduction
2.2 Classification by Response Type
2.3 Classification by Study Type
2.4 Clustered Data
2.4.1 Clustering through Hierarchical Random Processes
2.4.2 Clustering through Repeated Measurements
2.5 Autocorrelated Data
2.5.1 The Autocorrelation Function
2.5.2 Consequences of Ignoring Autocorrelation
2.5.3 Autocorrelation in Designed Experiments
2.6 From Independent to Spatial Data — a Progression of Clustering



3 Linear Algebra Tools
3.1 Introduction
3.2 Matrices and Vectors
3.3 Basic Matrix Operations
3.4 Matrix Inversion — Regular and Generalized Inverse
3.5 Mean, Variance, and Covariance of Random Vectors
3.6 The Trace and Expectation of Quadratic Forms
3.7 The Multivariate Gaussian Distribution
3.8 Matrix and Vector Differentiation
3.9 Using Matrix Algebra to Specify Models
3.9.1 Linear Models
3.9.2 Nonlinear Models
3.9.3 Variance-Covariance Matrices and Clustering

4 The Classical Linear Model: Least Squares and Alternatives


4.1 Introduction
4.2 Least Squares Estimation and Partitioning of Variation
4.2.1 The Principle
4.2.2 Partitioning Variability through Sums of Squares
4.2.3 Sequential and Partial Sums of Squares and the
Sum of Squares Reduction Test
4.3 Factorial Classification
4.3.1 The Means and Effects Model
4.3.2 Effect Types in Factorial Designs
4.3.3 Sum of Squares Partitioning through Contrasts
4.3.4 Effects and Contrasts in The SAS® System
4.4 Diagnosing Regression Models
4.4.1 Residual Analysis
4.4.2 Recursive and Linearly Recovered Errors
4.4.3 Case Deletion Diagnostics
4.4.4 Collinearity Diagnostics
4.4.5 Ridge Regression to Combat Collinearity
4.5 Diagnosing Classification Models
4.5.1 What Matters?
4.5.2 Diagnosing and Combating Heteroscedasticity
4.5.3 Median Polishing of Two-Way Layouts
4.6 Robust Estimation
4.6.1 L1-Estimation
4.6.2 M-Estimation
4.6.3 Robust Regression for Prediction Efficiency Data
4.6.4 M-Estimation in Classification Models
4.7 Nonparametric Regression
4.7.1 Local Averaging and Local Regression
4.7.2 Choosing the Smoothing Parameter



Appendix A on CD-ROM
A4.8 Mathematical Details
A4.8.1 Least Squares
A4.8.2 Hypothesis Testing in the Classical Linear Model
A4.8.3 Diagnostics in Regression Models
A4.8.4 Ridge Regression
A4.8.5 L1-Estimation
A4.8.6 M-Estimation
A4.8.7 Nonparametric Regression

5 Nonlinear Models
5.1 Introduction
5.2 Models as Laws or Tools
5.3 Linear Polynomials Approximate Nonlinear Models
5.4 Fitting a Nonlinear Model to Data
5.4.1 Estimating the Parameters
5.4.2 Tracking Convergence
5.4.3 Starting Values
5.4.4 Goodness-of-Fit
5.5 Hypothesis Tests and Confidence Intervals
5.5.1 Testing the Linear Hypothesis
5.5.2 Confidence and Prediction Intervals
5.6 Transformations
5.6.1 Transformation to Linearity
5.6.2 Transformation to Stabilize the Variance
5.7 Parameterization of Nonlinear Models
5.7.1 Intrinsic and Parameter-Effects Curvature
5.7.2 Reparameterization through Defining Relationships
5.8 Applications
5.8.1 Basic Nonlinear Analysis with The SAS® System — Mitscherlich's
Yield Equation
5.8.2 The Sampling Distribution of Nonlinear Estimators —
the Mitscherlich Equation Revisited
5.8.3 Linear-Plateau Models and Their Relatives — a Study of Corn
Yields from Tennessee
5.8.4 Critical NO3 Concentrations as a Function of Sampling Depth —
Comparing Join-Points in Plateau Models
5.8.5 Factorial Treatment Structure with Nonlinear Response
5.8.6 Modeling Hormetic Dose Response through Switching Functions
5.8.7 Modeling a Yield-Density Relationship
5.8.8 Weighted Nonlinear Least Squares Analysis with
Heteroscedastic Errors



Appendix A on CD-ROM
A5.9 Forms of Nonlinear Models
A5.9.1 Concave and Convex Models, Yield-Density Models
A5.9.2 Models with Sigmoidal Shape, Growth Models
A5.10 Mathematical Details
A5.10.1 Taylor Series Involving Vectors
A5.10.2 Nonlinear Least Squares and the Gauss-Newton Algorithm
A5.10.3 Nonlinear Generalized Least Squares
A5.10.4 The Newton-Raphson Algorithm
A5.10.5 Convergence Criteria
A5.10.6 Hypothesis Testing, Confidence and Prediction Intervals

6 Generalized Linear Models


6.1 Introduction
6.2 Components of a Generalized Linear Model
6.2.1 Random Component
6.2.2 Systematic Component and Link Function
6.2.3 Generalized Linear Models in The SAS® System
6.3 Grouped and Ungrouped Data
6.4 Parameter Estimation and Inference
6.4.1 Solving the Likelihood Problem
6.4.2 Testing Hypotheses about Parameters and Their Functions
6.4.3 Deviance and Pearson's χ² Statistic
6.4.4 Testing Hypotheses through Deviance Partitioning
6.4.5 Generalized R² Measures of Goodness-of-Fit
6.5 Modeling an Ordinal Response
6.5.1 Cumulative Link Models
6.5.2 Software Implementation and Example
6.6 Overdispersion
6.7 Applications
6.7.1 Dose-Response and LD50 Estimation in a Logistic Regression
Model
6.7.2 Binomial Proportions in a Randomized Block Design —
the Hessian Fly Experiment
6.7.3 Gamma Regression and Yield Density Models
6.7.4 Effects of Judges' Experience on Bean Canning Quality Ratings
6.7.5 Ordinal Ratings in a Designed Experiment with Factorial
Treatment Structure and Repeated Measures
6.7.6 Log-Linear Modeling of Rater Agreement
6.7.7 Modeling the Sample Variance of Scab Infection
6.7.8 A Poisson/Gamma Mixing Model for Overdispersed Poppy Counts
Appendix A on CD-ROM
A6.8 Mathematical Details and Special Topics
A6.8.1 Exponential Family of Distributions
A6.8.2 Maximum Likelihood Estimation
A6.8.3 Iteratively Reweighted Least Squares
A6.8.4 Hypothesis Testing



A6.8.5 Fieller's Theorem and the Variance of a Ratio
A6.8.6 Overdispersion Mechanisms for Counts

7 Linear Mixed Models for Clustered Data


7.1 Introduction
7.2 The Laird-Ware Model
7.2.1 Rationale
7.2.2 The Two-Stage Concept
7.2.3 Fixed or Random Effects
7.3 Choosing the Inference Space
7.4 Estimation and Inference
7.4.1 Maximum and Restricted Maximum Likelihood
7.4.2 Estimated Generalized Least Squares
7.4.3 Hypothesis Testing
7.5 Correlations in Mixed Models
7.5.1 Induced Correlations and the Direct Approach
7.5.2 Within-Cluster Correlation Models
7.5.3 Split-Plots, Repeated Measures, and the Huynh-Feldt Conditions
7.6 Applications
7.6.1 Two-Stage Modeling of Apple Growth over Time
7.6.2 On-Farm Experimentation with Randomly Selected Farms
7.6.3 Nested Errors through Subsampling
7.6.4 Recovery of Inter-Block Information in Incomplete Block Designs
7.6.5 A Split-Strip-Plot Experiment for Soybean Yield
7.6.6 Repeated Measures in a Completely Randomized Design
7.6.7 A Longitudinal Study of Water Usage in Horticultural Trees
7.6.8 Cumulative Growth of Muskmelons in Subsampling Design
Appendix A on CD-ROM
A7.7 Mathematical Details and Special Topics
A7.7.1 Henderson's Mixed Model Equations
A7.7.2 Solutions to the Mixed Model Equations
A7.7.3 Likelihood Based Estimation
A7.7.4 Estimated Generalized Least Squares Estimation
A7.7.5 Hypothesis Testing
A7.7.6 The First-Order Autoregressive Model

8 Nonlinear Models for Clustered Data


8.1 Introduction
8.2 Nonlinear and Generalized Linear Mixed Models
8.3 Toward an Approximate Objective Function
8.3.1 Three Linearizations
8.3.2 Linearization in Generalized Linear Mixed Models
8.3.3 Integral Approximation Methods
8.4 Applications
8.4.1 A Nonlinear Mixed Model for Cumulative Tree Bole Volume
8.4.2 Poppy Counts Revisited — a Generalized Linear Mixed Model
for Overdispersed Count Data



8.4.3 Repeated Measures with an Ordinal Response
Appendix A on CD-ROM
A8.5 Mathematical Details and Special Topics
A8.5.1 PA and SS Linearizations
A8.5.2 Generalized Estimating Equations
A8.5.3 Linearization in Generalized Linear Mixed Models
A8.5.4 Gaussian Quadrature

9 Statistical Models for Spatial Data


9.1 Changing the Mindset
9.1.1 Samples of Size One
9.1.2 Random Functions and Random Fields
9.1.3 Types of Spatial Data
9.1.4 Stationarity and Isotropy — the Built-in Replication Mechanism
of Random Fields
9.2 Semivariogram Analysis and Estimation
9.2.1 Elements of the Semivariogram
9.2.2 Parametric Isotropic Semivariogram Models
9.2.3 The Degree of Spatial Continuity (Structure)
9.2.4 Semivariogram Estimation and Fitting
9.3 The Spatial Model
9.4 Spatial Prediction and the Kriging Paradigm
9.4.1 Motivation of the Prediction Problem
9.4.2 The Concept of Optimal Prediction
9.4.3 Ordinary and Universal Kriging
9.4.4 Some Notes on Kriging
9.4.5 Extensions to Multiple Attributes
9.5 Spatial Regression and Classification Models
9.5.1 Random Field Linear Models
9.5.2 Some Philosophical Considerations
9.5.3 Parameter Estimation
9.6 Autoregressive Models for Lattice Data
9.6.1 The Neighborhood Structure
9.6.2 First-Order Simultaneous and Conditional Models
9.6.3 Parameter Estimation
9.6.4 Choosing the Neighborhood Structure
9.7 Analyzing Mapped Spatial Point Patterns
9.7.1 Introduction
9.7.2 Random, Aggregated, and Regular Patterns — the Notion of
Complete Spatial Randomness
9.7.3 Testing the CSR Hypothesis in Mapped Point Patterns
9.7.4 Second-Order Properties of Point Patterns
9.8 Applications
9.8.1 Exploratory Tools for Spatial Data —
Diagnosing Spatial Autocorrelation with Moran's I
9.8.2 Modeling the Semivariogram of Soil Carbon
9.8.3 Spatial Prediction — Kriging of Lead Concentrations



9.8.4 Spatial Random Field Models — Comparing C/N Ratios among
Tillage Treatments
9.8.5 Spatial Random Field Models — Spatial Regression of
Soil Carbon on Soil N
9.8.6 Spatial Generalized Linear Models — Spatial Trends in the
Hessian Fly Experiment
9.8.7 Simultaneous Spatial Autoregression — Modeling Wiebe's
Wheat Yield Data
9.8.8 Point Patterns — First- and Second-Order Properties of a
Mapped Pattern
Bibliography



Preface

To the Reader
Statistics is essentially a discipline of the twentieth century, and for several decades it was
keenly involved with problems of interpreting and analyzing empirical data that originate in
agronomic investigations. The vernacular of experimental design in use today bears evidence
of the agricultural connection and origin of this body of theory. Omnipresent terms, such as
block or split-plot, emanated from descriptions of blocks of land and experimental plots in
agronomic field designs. The theory of randomization in experimental work was developed
by Fisher in particular to neutralize the spatial effects among experimental units that he realized
existed among field plots. Despite its many origins in agronomic problems, statistics today is
often unrecognizable in this context. Numerous recent methodological approaches and
advances originated in other subject-matter areas and agronomists frequently find it difficult
to see their immediate relation to questions that their disciplines raise. On the other hand,
statisticians often fail to recognize the riches of challenging data analytical problems
contemporary plant and soil science provides. One could gain the impression that
• statistical methods of concern to plant and soil scientists are completely developed and
understood;
• the analytical tools of classical statistical analysis learned in a one- or two-semester
course for non-statistics majors are sufficient to cope with data analytical problems;
• recent methodological work in statistics applies to other disciplines such as human
health, sociology, or economics, and has no bearing on the work of the agronomist;
• there is no need to consider contemporary statistical methods and no gain in doing so.

These impressions are incorrect. Data collected in many investigations and the circum-
stances under which they are accrued often bear little resemblance to classically designed ex-
periments. Much of the data analysis in the plant and soil sciences is nevertheless viewed in
the experimental design framework. Ground and remote sensing technology, yield monito-
ring, and geographic information systems are but a few examples where analysis cannot
necessarily be cast, nor should it be coerced, into a standard analysis of variance framework.
As our understanding of the biological/physical/environmental/ecological mechanisms in-
creases, we are more and more interested in what some have termed the space/time dynamics
of the processes we observe or set into motion by experimentation. It is one thing to collect
data in space and/or over time, it is another matter to apply the appropriate statistical tools to
infer what the data are trying to tell us. While many of the advances in statistical
methodologies in past decades have not explicitly focused on agronomic applications, it
would be incorrect to assume that these methods are not fruitfully applied there. Geostatistical
methods, mixed models for repeated measures and longitudinal data, generalized linear
models for non-normal (= non-Gaussian) data, and nonlinear models are cases in point.



The dedication of time, funds, labor, and technology to study design and data accrual
often outstrip the efforts devoted to the analysis of the data. Does it not behoove us to make
the most of the data, extract the most information, and apply the most appropriate techniques?
Data sets are becoming richer and richer and there is no end in sight to the opportunities for
data collection. Continuous time monitoring of experimental conditions is already a reality in
biomedical studies where wristwatch-like devices report patient responses in a continuous
stream. Through sensing technologies, variables that would have been observed only occasionally and on a whole-field level can now be observed routinely and in a spatially explicit manner. As one
colleague put it: “What do you do the day you receive your first five million observations?”
We do not have (all) the answers for data analysis needs in the information technology age.
We subscribe wholeheartedly, however, to its emerging philosophy: Do not be afraid to
get started, do not be afraid to stop, and apply the best available methods along the way.
In the course of many consulting sessions with students and researchers from the life
sciences, we realized that the statistical tools covered in a one- or two-semester statistical me-
thods course are insufficient to cope successfully with the complexity of empirical research
data. Correlated, clustered, and spatial data, non-Gaussian (non-Normal) data and nonlinear
responses are common in practice. The complexity of these data structures tends to outpace
the basic curriculum. Most studies do not collect just one data structure, however. Remotely
sensed leaf area index, repeated measures of plant yield, ordinal responses of plant injury, the
presence/absence of disease and random sampling of soil properties, for example, may all be
part of one study and comprise the threads from which scientific conclusions must be woven.
Diverse data structures call for diverse tools. This text is an attempt to squeeze between two
covers many statistical methods pertinent to research in the life sciences. Any one of the main
chapters (§4 to 9) could have easily been expanded to the size of the entire text, and there are
several excellent textbooks and monographs that do so. Invariably, we are guilty of omission.

To the User
Contemporary statistical models cannot be appreciated to their full potential without a
good understanding of theory. Hence, we place emphasis on that. They also cannot be applied
to their full potential without the aid of statistical software. Hence, we place emphasis on that.
The main chapters are roughly equally divided between coverage of essential theory and
applications. Additional theoretical derivations and mathematical details needed to develop a
deeper understanding of the models can be found on the companion CD-ROM. The choice to
focus on The SAS® System for calculations was simple. It is, in our opinion, the most
powerful statistical computing platform and the most widely available and accepted com-
puting environment for statistical problems in academia, industry, and government. In rare
cases when procedures in SAS® were not available and macros too cumbersome we
employed the S-PLUS® package, in particular the S+SpatialStats® module. The important
portions of the executed computer code are shown in the text along with the output. All data
sets and SAS® or S-PLUS® codes are contained on the CD-ROM.

To the Instructor
This text is both a reference and textbook and was developed with a reader in mind who
has had a first course in statistics covering simple and multiple linear regression and analysis of
variance, who is familiar with the principles of experimental design, and who is willing to absorb a



few concepts from linear algebra necessary to discuss the theory. A graduate-level course in
statistics may focus on the theory in the main text and the mathematical details appendix. A
graduate-level service course in statistical methods may focus on the theory and applications
in the main chapters. A graduate-level course in the life sciences can focus on the applications
and through them develop an appreciation of the theory. Chapters 1 and 2 introduce statistical
models and the key data structures covered in the text. The notion of clustering in data is a
recurring theme of the text. Chapter 3 discusses requisite linear algebra tools, which are
indispensable to the discussion of statistical models beyond simple analysis of variance and
regression. Depending on the audience's previous exposure to basic linear algebra, this chapter can be skipped. Several course concentrations are possible. For example:
1. A course on linear models beyond the basic stats-methods course: §1, 2, (3), 4
2. A course on modeling nonlinear response: §1, 2, (3), 5, 6, parts of 8
3. A course on correlated data: §1, 2, (3), 7, 8, parts of 9
4. A course on mixed models: §1, 2, (3), parts of 4, 7, 8
5. A course on spatial data analysis: §1, 2, (3), 9

In a statistics curriculum the coverage of §4 to 9 should include the mathematical details and special topics sections §A4 to A9.
We did not include exercises in this text; the book can be used in various types of courses at different levels of technical difficulty. We did not want to suggest a particular type or level through exercises. Although the applications (case studies) in §5 to 9 are lengthy, they do not constitute the final word on any particular data. Some data sets, such as the Hessian fly experiment or the poppy count data, are visited repeatedly in different chapters and can be tackled with different tools. We encourage comparative analyses for other data sets. If the applications leave the reader wanting to try out a different approach, to tackle the data from a new angle, and to improve upon our analysis, we erred enough to get that right.
This text would not have been possible without the help and support of others. Data were
kindly made available by A.M. Blackmer, R.E. Byers, R. Calhoun, J.R. Craig, D. Gilstrap,
C.A. Gotway Crawford, J.R. Harris, L.P. Hart, D. Holshouser, D.E. Karcher, J.J. Kells, J.
Kelly, D. Loftis, R. Mead, G. A. Milliken, P. Mou, T.G. Mueller, N.L. Powell, R. Reed, J. D.
Rimstidt, R. Witmer, J. Walters, and L.W. Zelazny. Dr. J.R. Davenport (Washington State
University-IRAEC) kindly provided the aerial photo of the potato circle for the cover. Several
graduate students at Virginia Tech reviewed the manuscript in various stages and provided
valuable insights and corrections. We are grateful in particular to S.K. Clark, S. Dorai-Raj,
and M.J. Waterman. Our thanks to C.E. Watson (Mississippi State University) for a detailed
review and to Simon L. Smith for EXP® . Without drawing on the statistical expertise of J.B.
Birch (Virginia Tech) and T.G. Gregoire (Yale University), this text would have been more
difficult to finalize. Without the loving support of our families it would have been impossible.
Finally, the fine editorial staff at CRC Press LLC, and in particular our editor, Mr. John
Sulzycki, brought their skills to bear to make this project a reality. We thank all of these indi-
viduals for contributing to the parts of the book that are right. Its flaws are our responsibility.

Oliver Schabenberger Francis J. Pierce


Virginia Polytechnic Institute and State University Washington State University



About the Authors
This text was produced while Oliver Schabenberger was assistant and associate professor in
the department of statistics at Virginia Polytechnic Institute and
State University. He conducts research on parametric and non-
parametric statistical methods for nonlinear response, non-
normal data, longitudinal and spatial data. His interests are in the
application of statistics to agronomy and natural resource
disciplines. He taught statistical methods courses for non-majors
and courses for majors in applied statistics, biological statistics,
and spatial statistics. Dr. Schabenberger served as statistical
consultant to faculty and graduate students at Virginia Tech and
from 1996 to 1999 in the department of crop and soil sciences at
Michigan State University. He holds degrees in forest
engineering (Dipl.-Ing. F.H.) from the Fachhochschule für
Forstwirtschaft in Rottenburg, Germany, forest science (Diplom) from the Albert-Ludwigs
University in Freiburg, Germany, statistics (M.S.) and forestry (Ph.D.) from Virginia
Polytechnic Institute and State University. He has a research affiliation with the School of
Forestry and Environmental Studies at Yale University and is a member of the American
Statistical Association, the International Biometric Society (Eastern North American Region),
the American Society of Agronomy, and the Crop Science Society of America. Dr.
Schabenberger is a member of the Applications Staff in the Analytical Solutions Division of
SAS Institute Inc., Cary, NC.

Francis J. Pierce is the director of the center for precision agricultural systems at Washington
State University, located at the WSU Irrigated Agriculture
Research & Extension Center (IRAEC) in Prosser, Washington.
He is also a professor in the departments of crop and soil sciences
and biological systems engineering and directs the WSU Public
Agricultural Weather System. Dr. Pierce received his M.S. and
Ph.D. degrees in soil science from the University of Minnesota in
1980 and 1984. He spent the next 16 years at Michigan State
University, where he served as professor of soil science in the
department of crop and soil sciences from 1995. His expertise is
in soil management and he has been involved in the development
and evaluation of precision agriculture since 1991. The Center for
Precision Agricultural Systems was funded by the Washington
Legislature as part of the University's Advanced Technology Initiative in 1999. As center
director, Dr. Pierce's mission is to advance the science and practice of precision agriculture in
Washington. The center's efforts will support the competitive production of Washington's
agricultural commodities, stimulate the state's economic development, and protect the region's
environmental and natural resources. Dr. Pierce has edited three other books, Soil
Management for Sustainability, Advances in Soil Conservation, and The State of Site-Specific
Management for Agriculture.
Chapter 1

Statistical Models

“A theory has only the alternative of being right or wrong. A model has a
third possibility: it may be right, but irrelevant.” Manfred Eigen. In the
Physicist's Conception of Nature (Jagdish Mehra, Ed.) 1973.

1.1 Mathematical and Statistical Models


1.2 Functional Aspects of Models
1.3 The Inferential Steps — Estimation and Testing
1.4 t-Tests in Terms of Statistical Models
1.5 Embedding Hypotheses
1.6 Hypothesis and Significance Testing — Interpretation of the p-Value
1.7 Classes of Statistical Models
1.7.1 The Basic Component Equation
1.7.2 Linear and Nonlinear Models
1.7.3 Regression and Analysis of Variance Models
1.7.4 Univariate and Multivariate Models
1.7.5 Fixed, Random, and Mixed Effects Models
1.7.6 Generalized Linear Models
1.7.7 Errors in Variable Models





1.1 Mathematical and Statistical Models


Box 1.1 Statistical Models

• A scientific model is the abstraction of a real phenomenon or process that


isolates those aspects relevant to a particular inquiry.

• Inclusion of stochastic (random) elements in a mathematical model leads to


more parsimonious and often more accurate abstractions than complex
deterministic models.

• A special case of a stochastic model is the statistical model which contains


unknown constants to be estimated from empirical data.

The ability to represent phenomena and processes of the biological, physical, chemical, and
social world through models is one of the great scientific achievements of humankind. Scien-
tific models isolate and abstract the elementary facts and relationships of interest and provide
the logical structure in which a system is studied and from which inferences are drawn.
Identifying the important components of a system and isolating the facts of primary interest is
necessary to focus on those aspects relevant to a particular inquiry. Abstraction is necessary
to cast the facts in a logical system that is concise, deepens our insight, and is understood by
others to foster communication, critique, and technology transfer. Mathematics is the most
universal and powerful logical system, and it comes as no surprise that most scientific models
in the life sciences or elsewhere are either developed as mathematical abstractions of real phe-
nomena or can be expressed as such. A purely mathematical model is a mechanistic
(= deterministic) device in that for a given set of inputs, it predicts the output with absolute
certainty. It leaves nothing to chance. Beltrami (1998, p. 86), for example, develops the
following mathematical model for the concentration α of a pollutant in a river at point s and
time t:

α(s,t) = α₀(s − ct)exp{−μt}. [1.1]

In this equation μ is a proportionality constant, measuring the efficiency of bacterial decomposition of the pollutant, c is the water velocity, and α₀(s) is the initial pollutant concentration at site s. Given the inputs c, μ, and α₀, the pollutant concentration at site s and time t is predicted with certainty. This would be appropriate if the model were correct, all its assumptions met, and the inputs measured or ascertained with certainty. Important assumptions of [1.1] are (i) a homogeneous pollutant concentration in all directions except for downstream flow, (ii) the absence of diffusive effects due to contour irregularities and turbulence, (iii) the decay of the pollutant due to bacterial action, (iv) the constancy of the bacterial efficiency, and (v) thorough mixing of the pollutant in the water. These assumptions are reasonable but not necessarily true. By ignoring diffusive effects, for example, it is really assumed that the positive and negative effects due to contour irregularities and turbulence will average out. The uncertainty of the effects at a particular location along the river and point in time can be incorporated by casting [1.1] as a stochastic model,



α(s,t) = α₀(s − ct)exp{−μt} + e, [1.2]

where e is a random variable with mean 0, variance σ², and some probability distribution.
Allowing for the random deviation e, model [1.2] now states explicitly that α(s,t) is a random variable and the expression α₀(s − ct)exp{−μt} is the expected value or average pollutant concentration,

E[α(s,t)] = α₀(s − ct)exp{−μt}.
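To make the distinction between [1.1] and [1.2] concrete, the following SAS® DATA step sketches how realizations could be generated under the stochastic model. All inputs are hypothetical assumptions for illustration (c = 0.5, μ = 0.1, a constant α₀ = 10, σ = 0.5, and a Gaussian choice for e); none of these values appear in the text.

/* minimal simulation sketch of model [1.2]; all inputs are assumed values */
data pollutant;
   c      = 0.5;     /* water velocity (assumed)                              */
   mu     = 0.1;     /* bacterial decomposition efficiency (assumed)         */
   alpha0 = 10;      /* initial concentration alpha0(s - ct), taken constant */
   sigma  = 0.5;     /* std. deviation of the random deviation e (assumed)   */
   do t = 1 to 20;
      mean = alpha0 * exp(-mu * t);        /* deterministic part, as in [1.1] */
      conc = mean + sigma * rannor(123);   /* stochastic version, as in [1.2] */
      output;
   end;
run;

proc print data=pollutant;
   var t mean conc;
run;

Averaging many such realizations at a given t recovers the expected concentration E[α(s,t)]; any single realization scatters about it.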

Olkin et al. (1978, p. 4) conclude: “The assumption that chance phenomena exist and can
be described, whether true or not, has proved valuable in almost every discipline.” Of the
many reasons for incorporating stochastic elements in scientific models, an incomplete list
includes the following.
• The model is not correct for a particular observation, but correct on average.
• Omissions and assumptions are typically necessary to abstract a phenomenon.
• Even if the nature of all influences were known, it may be impossible to measure or
even observe all the variables.
• Scientists do not develop models without validation and calibration with real data. The
innate variability (nonconstancy) of empirical data stems from systematic and random
effects. Random measurement errors, observational (sampling) errors due to sampling
a population rather than measuring its entirety, experimental errors due to lack of
homogeneity in the experimental material or the application of treatments, account for
stochastic variation in the data even if all systematic effects are accounted for.
• Randomness is often introduced deliberately because it yields representative samples
from which unbiased inferences can be drawn. A random sample from a population
will represent the population (on average), regardless of the sample size. Treatments
are assigned to experimental units by a random mechanism to neutralize the effects of
unaccounted sources of variation which enables unbiased estimates of treatment
means and their differences (Fisher 1935). Replication of treatments guarantees that
experimental error variation can be estimated. Only in combination with
randomization will this estimate be free of bias.
• Stochastic models are often more parsimonious than deterministic models and easier to
study. A deterministic model for the germination of seeds from a large lot, for
example, would incorporate a plethora of factors, their actions and interactions. The
plant species and variety, storage conditions, the germination environment, amount of
non-seed material in the lot, seed-to-seed differences in nutrient content, plant-to-plant
interactions, competition, soil conditions, etc. must be accounted for. Alternatively, we
can think of the germination of a particular seed from the lot as a Bernoulli random
variable with success (germination) probability π. That is, if Yᵢ takes on the value 1 if
seed i germinates and the value 0 otherwise, then the probability distribution of Yᵢ is
simply

p(yᵢ) = { π       yᵢ = 1
        { 1 − π   yᵢ = 0.





If seeds germinate independently and the germination probability is constant
throughout the seed lot, this simple model permits important conclusions about the
nature of the seed lot based on a sample of seeds. If n seeds are gathered from the lot
for a germination test and π̂ = Σᵢ₌₁ⁿ Yᵢ/n is the sample proportion of germinated
seeds, the stochastic behavior of the estimator π̂ is known, provided the seeds were
selected at random. For all practical purposes it is not necessary to know the precise
germination percentage in the lot. This would require either germinating all seeds or a
deterministic model that can be applied to all seeds in the lot. It is entirely sufficient to
be able to state with a desired level of confidence that the germination percentage is
within certain narrow bounds. A small sketch with The SAS® System follows.
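The twenty 0/1 germination records below are hypothetical, invented for illustration; the mean of the binary variable is the sample proportion π̂.

data seedtest;
   input germ @@;     /* 1 = germinated, 0 = did not germinate */
   datalines;
1 1 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1
;

proc means data=seedtest n mean stderr;
   var germ;   /* mean of the 0/1 data = sample proportion pi-hat */
run;

Here π̂ = 16/20 = 0.8, and the reported standard error approximates √(π̂(1 − π̂)/n), from which confidence bounds for the germination percentage can be formed.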

Statistical models, in terminology that we adopt for this text, are stochastic models that contain unknown constants (parameters). In the river pollution example, the model

α(s,t) = α₀(s − ct)exp{−μt} + e,  E[e] = 0, Var[e] = σ²

is a stochastic model if all parameters (α₀, c, μ, σ²) are known. (Note that e is not a constant but a random variable. Its mean and variance are constants, however.) Otherwise it is a statistical model and those constants that are unknown must be estimated from data. In the seed germination example, the germination probability π is unknown, hence the model

p(yᵢ) = { π       yᵢ = 1
        { 1 − π   yᵢ = 0

is a statistical one. The parameter π is estimated based on a sample of n seeds from the lot. This usage of the term parameter is consistent with statistical theory but not necessarily with modeling practice. Any quantity that drives a model is often termed a parameter of the model. We will refer to parameters only if they are unknown constants. Variables that can be measured, such as plant density in the model of a yield-density relationship, are not parameters. The rate of change of plant yield as a function of plant density is a parameter.

1.2 Functional Aspects of Models


Box 1.2 What a Model Does

• Statistical models describe the distributional properties of one or more


response variables, thereby decomposing variability into known and unknown
sources.

• Statistical models represent a mechanism from which data with the same
statistical properties as the observed data can be generated.

• Statistical models are assumed to be correct on average. The quality of a


model is not necessarily a function of its complexity or size, but is
determined by its utility in a particular study or experiment to answer the
questions of interest.



A statistical model describes completely or incompletely the distributional properties of one or more variables, which we shall call the response(s). If the description is complete and values for all parameters are given, the distributional properties of the response are known. A simple linear regression model for a random sample of i = 1, …, n observations on response Yᵢ and associated regressor xᵢ, for example, can be written as

Yᵢ = β₀ + β₁xᵢ + eᵢ,  eᵢ ~ iid G(0, σ²). [1.3]

The model errors eᵢ are assumed to be independent and identically distributed (iid) according to a Gaussian distribution (we use this denomination instead of Normal distribution throughout) with mean 0 and variance σ². As a consequence, Yᵢ is also distributed as a Gaussian random variable with mean E[Yᵢ] = β₀ + β₁xᵢ and variance σ²,

Yᵢ ~ G(β₀ + β₁xᵢ, σ²).

The Yᵢ are not identically distributed because their means are different, but they remain independent (a result of drawing a random sample). Since a Gaussian distribution is completely specified by its mean and variance, the distribution of the Yᵢ is completely known, once values for the parameters β₀, β₁, and σ² are known. For many statistical purposes the assumption of Gaussian errors is more than what is required. To derive unbiased estimators of the intercept β₀ and slope β₁, it is sufficient that the errors have zero mean. A simple linear regression model with lesser assumptions than [1.3] would be, for example,

Yᵢ = β₀ + β₁xᵢ + eᵢ,  eᵢ ~ iid (0, σ²).

The errors are assumed independent zero mean random variables with equal variance (homoscedastic), but their distribution is otherwise not specified. This is sometimes referred to as the first-two-moments specification of the model. Only the mean and variance of the Yᵢ can be inferred:

E[Yᵢ] = β₀ + β₁xᵢ
Var[Yᵢ] = σ².

If the parameters β₀, β₁, and σ² were known, this model would be an incomplete description of the distributional properties of the response Yᵢ. Implicit in the description of distributional properties is a separation of variability into known sources, e.g., the dependency of Y on x, and unknown sources (error), and a description of the form of the dependency. Here, Y is assumed to depend linearly on the regressor. Expressing which regressors Y depends on individually and simultaneously and how this dependency can be crafted mathematically is one important aspect of statistical modeling.
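The completeness of specification [1.3] can be demonstrated by simulation: once β₀, β₁, and σ² are fixed, the distribution of the Yᵢ is fully known, so data can be generated under the model and the parameters recovered. The following SAS® sketch uses assumed values β₀ = 10, β₁ = 0.5, σ = 2; these are illustrative only.

/* generate one realization of n = 50 observations under model [1.3] */
data simreg;
   beta0 = 10; beta1 = 0.5; sigma = 2;   /* assumed parameter values */
   do i = 1 to 50;
      x = i;                                     /* fixed regressor values  */
      y = beta0 + beta1*x + sigma*rannor(456);   /* e_i ~ iid G(0, sigma^2) */
      output;
   end;
   keep x y;
run;

proc reg data=simreg;
   model y = x;   /* estimates beta0 and beta1; the mean square error estimates sigma^2 */
run;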
To conceptualize what constitutes a useful statistical model, we appeal to what we con-
sider the most important functional aspect. A statistical model provides a mechanism to gene-
rate the essence of the data such that the properties of the data generated under the model are
statistically equivalent to the observed data. In other words, the observed data can be con-
sidered as one particular realization of the stochastic process that is implied by the model. If
the relevant features of the data cannot be realized under the assumed model, it is not useful.
The upper left panel in Figure 1.1 shows n = 21 yield observations as a function of the
amount of nitrogen fertilization. Various candidate models exist to model the relationship
between plant yield and fertilizer input. One class of models, the linear-plateau models





(§5.8.3, §5.8.4), are segmented models connecting a linear regression with a flat plateau yield. The upper right panel of Figure 1.1 shows the distributional specification of a linear-plateau model. If α denotes the nitrogen concentration at which the two segments connect, the model for the average plant yield at concentration N can be written as

E[Yield] = { β₀ + β₁N   N ≤ α
           { β₀ + β₁α   N > α.

If I(x) is the indicator function returning value 1 if the condition x holds and 0 otherwise, the statistical model can also be expressed as

Yield = (β₀ + β₁N)I(N ≤ α) + (β₀ + β₁α)I(N > α) + e,  e ~ G(0, σ²). [1.4]

[Figure 1.1 panels: Observed Data; Select Statistical Model, showing Yield = (β₀ + β₁N)I(N ≤ α) + (β₀ + β₁α)I(N > α) + e, e ~ G(0, σ²); Discriminate Against Other Models (Mitscherlich, Linear, Quadratic); Fit Model to Data, with β̂₀ = 38.69, β̂₁ = 0.205, α̂ = 198, σ̂² = 4.79, pseudo-R² = 0.97. Each panel plots Yield against N (kg/ha).]

Figure 1.1. Yield data as a function of N input (upper left panel), linear-plateau model as an assumed data-generating mechanism (upper right panel). Fit of linear-plateau model and competing models are shown in the lower panels.

The constant variance assumption is shown in Figure 1.1 as box-plots of constant width at selected amounts of N. If [1.4] is viewed as the data-generating mechanism for the data plotted in the first panel, then for certain values of β₀, β₁, α, and σ², the statistical properties of the observed data should not be unusual compared to the data generated under the model. Rarely is there only a single model that could have generated the data, and statistical modeling invariably involves the comparison of competing models. The lower left panel of Figure 1.1 shows three alternatives to the linear-plateau model: a simple linear regression model without plateau, a quadratic polynomial response model, and a nonlinear model known as Mitscherlich's equation (§5.8.1, §5.8.2).
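A model such as [1.4] is fit by nonlinear least squares (§5). A minimal sketch with PROC NLIN of The SAS® System follows; it assumes the data reside in a set called yields with variables yield and n, and the starting values are rough guesses read off the scatterplot, not the estimates reported later in this chapter.

proc nlin data=yields;
   parms beta0=40 beta1=0.2 alpha=200;    /* starting values (guesses)   */
   if n <= alpha then mean = beta0 + beta1*n;       /* linear segment    */
   else               mean = beta0 + beta1*alpha;   /* plateau segment   */
   model yield = mean;
run;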



Choosing among competing models involves many factors and criteria, not all of which are statistical. Of the formal procedures, hypothesis tests for nested models and summary measures of model performance are the most frequently used. Two models are nested if the smaller (reduced) model can be obtained from the larger (full) model by placing restrictions on the parameters of the latter. A hypothesis test stating the restriction as the null hypothesis is performed. If the null hypothesis is rejected, the smaller model is deemed of lesser quality than the large model. This, of course, does not imply that the full model fits the data sufficiently well to be useful. Consider as an example the simple linear regression model discussed earlier and three competitors:

①: Yᵢ = β₀ + eᵢ
②: Yᵢ = β₀ + β₁xᵢ + eᵢ
③: Yᵢ = β₀ + β₁xᵢ + β₂xᵢ² + eᵢ
④: Yᵢ = β₀ + β₁xᵢ + β₂zᵢ + eᵢ.

Models ① through ③ are nested models; so are ①, ②, and ④. The restriction β₂ = 0 in model ③ produces model ②, the restriction β₁ = 0 in model ② produces the intercept-only model ①, and β₁ = β₂ = 0 yields ① from ③. To decide among these three models, one can commence by fitting model ③ and perform hypothesis tests for the respective restrictions. Based on the results of these tests we are led to the best model among ①, ②, ③. Should we have started with model ④ instead? Model ④ is a two-regressor multiple linear regression model, and a comparison between models ③ and ④ by means of a hypothesis test is not possible; the models are not nested. Other criteria must be employed to discriminate between them. The appropriate criteria will depend on the intended use of the statistical model. If it is important that the model fits well to the data at hand, one may rely on the coefficient of determination (R²). To guard against overfitting, Mallows' Cp statistic or likelihood-based statistics such as Akaike's information criterion (AIC) can be used. If precise predictions are required, then one can compare models based on cross-validation criteria or the PRESS statistic. Variance inflation factors and other collinearity diagnostics come to the fore if statistical properties of the parameter estimates are important (see, e.g., Myers 1990, and our §§4.4, A4.8.3). Depending on which criteria are chosen, different models might emerge as best.
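As a sketch of how such restriction tests are carried out with The SAS® System, the following PROC REG step fits model ③ and tests the restrictions producing models ② and ①. The input data set mydata and its variables y and x are hypothetical placeholders.

data quad;                  /* hypothetical input data with y and x */
   set mydata;
   x2 = x*x;                /* quadratic term needed for model (3)  */
run;

proc reg data=quad;
   model y = x x2;               /* full model (3)                  */
   reduce2: test x2 = 0;         /* restriction producing model (2) */
   reduce1: test x = 0, x2 = 0;  /* restriction producing model (1) */
run;

Each TEST statement performs the corresponding sum of squares reduction test (§1.3, §4.2.3).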
Among the informal procedures of model critique are various graphical displays, such as
plots of residuals against predicted values and regressors, partial residual plots, normal proba-
bility and Q-Q plots, and so forth. These are indispensable tools of statistical analysis but they
are often overused and misused. As we will see in §4.4 the standard collection of residual
measures in linear models are ill-suited to pass judgment about whether the unobservable
model errors are Gaussian-distributed, or not. For most applications, it is not the Gaussian
assumption whose violation is most damaging, but the homogeneous variance and the inde-
pendence assumptions (§4.5). In nonlinear models the behavior of the residuals in a correctly
chosen model can be very different from the textbook behavior of fitted residuals in a linear
model (§5.7.1). Plotting studentized residuals against fitted values in a linear model, one
expects a band of random scatter about zero. In a nonlinear model where intrinsic curvature is
large, one should look for a negative trend between the residuals and fitted values as a sign of
a well-fitting model.
Besides model discrimination based on statistical procedures or displays, the subject
matter hopefully plays a substantive role in choosing among competing models. Interpreta-

© 2002 by CRC Press LLC




bility and parsimony are critical assets of a useful statistical model. Nothing is gained by
building models that are so large and complex that they are no longer interpretable as a whole
or involve factors that are impossible to observe in practice. Adding variables to a regression
model will necessarily increase R², but can also create conditions where the relationships
among the regressor variables render estimation unstable, predictions imprecise, and
interpretation increasingly difficult (see §§4.4.4, 4.4.5 on collinearity, its impact, diagnosis,
and remedy). Medieval Franciscan monk William of Ockham (1285-1349) is credited with
coining pluralitas non est ponenda sine necessitate or plurality should not be assumed
(posited) without necessity. Also known as Ockham's Razor, this tenet is often loosely
phrased as “among competing explanations, pick the simplest one.” When choosing among
competing statistical models, simple does not just imply the smallest possible model. The
selected model should be simple to fit, simple to interpret, simple to justify, and simple to
apply. Nonlinear models, for example, have long been considered difficult to fit to data. Even
recently, Black (1993, p. 65) refers to the “drudgery connected with the actual fitting” of non-
linear models. Although there are issues to be considered when modeling nonlinear relation-
ships that do not come to bear with linear models, the actual process of fitting a nonlinear
model with today's computing support is hardly more difficult than fitting a linear regression
model (see §5.4).
Returning to the yield-response example in Figure 1.1, of the four models plotted in the
lower panels, the straight-line regression model is ruled out because of poor fit. The other
three models, however, have very similar goodness-of-fit statistics and we consider them
competitors. Table 1.1 gives the formulas for the mean yield under these models. Each is a
three-parameter model; the first two are nonlinear, the quadratic polynomial is a linear model.
All three models are easy to fit with statistical software. The selection thus boils down to their
interpretability and justifiability.
Each model contains one parameter measuring the average yield if no N is applied. The linear-plateau model achieves the yield maximum of β₀ + β₁α at precisely N = α; the Mitscherlich equation approaches the yield maximum λ asymptotically (as N → ∞). The quadratic polynomial does not have a yield plateau or asymptote, but achieves a yield maximum at N = −γ₁/(2γ₂). Increasing yields are recorded for N < −γ₁/(2γ₂) and decreasing yields for N > −γ₁/(2γ₂). Since the yield increase is linear in the plateau model (up to N = α), a single parameter describes the rate of change in yield. In the Mitscherlich model, where the transition between yield minimum δ and upper asymptote λ is smooth, no single parameter measures the rate of change, but one parameter (κ) governs it:

∂E[Yield]/∂N = κ(λ − δ)exp{−κN}.

The standard interpretation of regression coefficients in a multiple linear regression equation is to measure the change in the mean response if the associated regressor increases by one unit while all other regressors are held constant. In the quadratic polynomial the linear and quadratic coefficients (γ₁, γ₂) cannot be interpreted this way. Changing N while holding N² constant is not possible. The quadratic polynomial has a linear rate of change,

∂E[Yield]/∂N = γ₁ + 2γ₂N,



so that γ₁ can be interpreted as the rate of change at N = 0 and 2γ₂ as the rate of change of
the rate of change.

Table 1.1. Three-parameter yield response models and the interpretation of their parameters

Model              E[Yield]                 No. of params.  Interpretation
Linear-plateau     β₀ + β₁N  for N ≤ α            3         β₀: E[Yield] at N = 0
                   β₀ + β₁α  for N > α                      β₁: change in E[Yield] per kg/ha
                                                                additional N prior to reaching
                                                                the plateau
                                                            α:  N amount where the plateau
                                                                is reached
Mitscherlich       λ + (δ − λ)exp{−κN}            3         λ:  upper E[Yield] asymptote
                                                            δ:  E[Yield] at N = 0
                                                            κ:  governs rate of change
Quadr. polynomial  γ₀ + γ₁N + γ₂N²                3         γ₀: E[Yield] at N = 0
                                                            γ₁: ∂E[Yield]/∂N at N = 0
                                                            γ₂: ½ ∂²E[Yield]/∂N²

Interpretability of the model parameters clearly favors the nonlinear models. Biological relationships rarely exhibit sharp transitions and kinks. Smooth, gradual transitions are more likely. The Mitscherlich model may be more easily justifiable than the linear-plateau model. If no decline of yields was observed over the range of N applied, resorting to a model that will invariably have a maximum at some N amount, be it within the observed range or outside of it, is tenuous.

One appeal of the linear-plateau model is to estimate the amount of N at which the two segments connect, a special case of what is known in dose-response studies as an effective (or critical) dosage. To compute an effective dosage in the Mitscherlich model the user must specify the response the dosage is supposed to achieve. For example, the nitrogen fertilizer amount N_K that produces K% of the asymptotic yield in the Mitscherlich model is obtained by solving λK/100 = λ + (δ − λ)exp{−κN_K} for N_K,

N_K = −(1/κ) ln{ (λ/(λ − δ)) ((100 − K)/100) }.

In the linear-plateau model, the plateau yield is estimated as β̂₀ + β̂₁α̂ = 38.69 + 0.205 × 198 = 79.28. The parameter estimates obtained by fitting the Mitscherlich equation to the data are λ̂ = 82.953, δ̂ = 35.6771, and κ̂ = 0.00857. The estimated plateau yield is 95.57% of the estimated asymptotic yield, and the N amount that achieves the plateau yield in the Mitscherlich model is

N̂_K = −(1/0.00857) ln{ (82.953/(82.953 − 35.6771)) ((100 − 95.57)/100) } = 298.0 kg/ha,

considerably larger than the critical dosage in the linear-plateau model.
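The arithmetic of N̂_K is easily verified, for example in a short SAS® DATA step using the estimates reported above (K = 95.57 is the plateau yield expressed as a percentage of the asymptote):

data nk;
   lambda = 82.953; delta = 35.6771; kappa = 0.00857;  /* estimates from the text      */
   K  = 95.57;                    /* target percentage of the asymptotic yield         */
   NK = -(1/kappa) * log( (lambda/(lambda - delta)) * (100 - K)/100 );
   put NK= 5.1;                   /* prints NK=298.0, matching the text                */
run;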




1.3 The Inferential Steps — Estimation and Testing


Box 1.3 Inference

• Inference in a statistical model involves the estimation of model parameters,


the determination of the precision of the estimates, predictions and their
precision, and the drawing of conclusions about the true values of the
parameters.

• The two most important statistical estimation principles are the principles of
least squares and the maximum likelihood principle. Each appears in many
different flavors.

After selecting a statistical model its parameters must be estimated. If the fitted model is
accepted as a useful abstraction of the phenomenon under study, further inferential steps in-
volve the calculation of confidence bounds for the parameters, the testing of hypotheses about
the parameters, and the calculation of predicted values. Of the many estimation principles at
our disposal, the most important ones are the least squares and the maximum likelihood prin-
ciples. Almost all of the estimation methods that are discussed and applied in subsequent
chapters are applications of these basic principles. Least squares (LS) was advanced and
maximum likelihood (ML) proposed by Carl Friedrich Gauss (1777-1855) in the early
nineteenth century. R.A. Fisher is usually credited with the (re-)discovery of the likelihood
principle.

Least Squares
Assume that a statistical model for the observed data Yᵢ can be expressed as

Yᵢ = fᵢ(θ₁, …, θₚ) + eᵢ, [1.5]

where the θⱼ (j = 1, …, p) are parameters of the mean function fᵢ(), and the eᵢ are zero mean random variables. The distribution of the eᵢ can depend on other parameters, but not on the θⱼ. LS is a semi-parametric principle, in that only the mean and variance of the eᵢ as well as their covariances (correlations) are needed to derive estimates. The distribution of the eᵢ can be otherwise unspecified. In fact, least squares can be motivated as a geometric rather than a statistical principle (§4.2.1). The assertion found in many places that the eᵢ are Gaussian random variables is not needed to derive the estimators of θ₁, …, θₚ.

Different flavors of the LS principle are distinguished according to the variances and covariances of the error terms. In ordinary least squares (OLS) estimation the eᵢ are uncorrelated and homoscedastic (Var[eᵢ] = σ²). Weighted least squares (WLS) assumes uncorrelated errors but allows their variances to differ (Var[eᵢ] = σᵢ²). Generalized least squares (GLS) accommodates correlations among the errors, and estimated generalized least squares (EGLS) allows these correlations to be unknown. There are other varieties of the least squares principle, but these four are of primary concern in this text. If the mean function fᵢ(θ₁, …, θₚ) is nonlinear in the parameters θ₁, …, θₚ, the respective methods are referred to as nonlinear OLS, nonlinear WLS, and so forth (see §5).




The general philosophy of least squares estimation is most easily demonstrated for the case of OLS. The principle seeks to find those values θ̂₁, …, θ̂ₚ that minimize the sum of squares

S(θ₁, …, θₚ) = Σᵢ₌₁ⁿ (yᵢ − fᵢ(θ₁, …, θₚ))² = Σᵢ₌₁ⁿ eᵢ².

This is typically accomplished by taking partial derivatives of S(θ₁, …, θₚ) with respect to the θⱼ and setting them to zero. The system of equations

∂S(θ₁, …, θₚ)/∂θ₁ = 0
⋮
∂S(θ₁, …, θₚ)/∂θₚ = 0

is known as the normal equations. For linear and nonlinear models a solution is best described in terms of matrices and vectors; it is deferred until necessary linear algebra tools have been discussed in §3. The solutions θ̂₁, …, θ̂ₚ to this minimization problem are called the ordinary least squares estimators (OLSE). The residual sum of squares SSR is obtained by evaluating S(θ₁, …, θₚ) at the least squares estimate.
Least squares estimators have many appealing properties. In linear models, for example,
• the θ̂ⱼ are linear functions of the observations y₁, …, yₙ, which makes it easy to establish statistical properties of the θ̂ⱼ and to test hypotheses about the unknown θⱼ.
• The linear combination a₁θ̂₁ + ⋯ + aₚθ̂ₚ is the best linear unbiased estimator (BLUE) of a₁θ₁ + ⋯ + aₚθₚ (Gauss-Markov theorem). If, in addition, the eᵢ are Gaussian-distributed, then a₁θ̂₁ + ⋯ + aₚθ̂ₚ is the minimum variance unbiased estimator of a₁θ₁ + ⋯ + aₚθₚ. No other unbiased estimator can beat its performance, linear or not.
• If the eᵢ are Gaussian, then two nested models can be compared with a sum of squares reduction test. If SSR_f and SSR_r denote the residual sums of squares in the full and reduced model, respectively, and MSR_f the residual mean square in the full model, then the statistic

F_obs = ((SSR_r − SSR_f)/q) / MSR_f

has an F distribution with q numerator and dfR_f denominator degrees of freedom under the null hypothesis. Here, q denotes the number of restrictions imposed (usually the number of parameters eliminated from the full model) and dfR_f denotes the residual degrees of freedom in the full model. F_obs is compared against the F_{α,q,dfR_f} cutoff and the null hypothesis is rejected if F_obs exceeds the cutoff. The distributional properties of F_obs are established in §4.2.3 and §A4.8.2. The sum of squares reduction test can be shown to be equivalent to many standard procedures. For the one-sample and pooled t-tests, we demonstrate the equivalency in §1.4. A numerical sketch of the test follows.
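The sums of squares below are hypothetical stand-ins for SSR_r and SSR_f; the computation is the one given in the bullet above.

data ssred;
   SSRr = 150.9;  SSRf = 120.4;   /* hypothetical residual sums of squares */
   q    = 1;      dfRf = 18;      /* restrictions and residual df (full)   */
   MSRf = SSRf/dfRf;              /* residual mean square, full model      */
   Fobs = ((SSRr - SSRf)/q) / MSRf;
   pval = 1 - probf(Fobs, q, dfRf);   /* p-value of the F statistic        */
   put Fobs= 6.2 pval= 6.4;
run;

Here F_obs = 4.56 exceeds the cutoff F_{0.05,1,18} ≈ 4.41, so this particular restriction would be rejected at the 5% level.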

A downside of least squares estimation is its focus on parameters of the mean function fᵢ(). The principle does not lend itself to estimation of parameters associated with the distribution of the errors eᵢ, for example, the variance of the model errors. In least squares estimation, these parameters must be obtained by other principles.

Maximum Likelihood
Maximum likelihood is a parametric principle; it requires that the joint distribution of the
observations ]" ß âß ]8 is known except for the parameters to be estimated. For example, one
may assume that ]" ß âß ]8 follow a multivariate Gaussian distribution (§3.7) and base ML
estimation on this fact. If the observations are statistically independent the joint density (C
continuous) or mass (C discrete) function is the product of the marginal distributions of the ]3 ,
and the likelihood is calculated as the product of individual contributions, one for each
sample. Consider the case where
" if a seed germinates
]3 œ œ
! otherwise,

a binary response variable. If a random sample of 8 seeds is obtained from a seed lot then the
probability mass function of C3 is
:aC3 à 1b œ 1C3 a"  1b"C3 ,

where the parameter 1 denotes the probability of germination in the seed lot. The joint mass
function of the random sample becomes
8
:aC" ß âß C8 à 1b œ $1C3 a"  1b"C3 œ 1< a"  1b8< , [1.6]
3œ"

with < œ !83œ" C3 , the number of germinated seeds. For any given value µ 1 , the probability
:aC" ß âß C8 à µ
1 b can be thought of as the probability of observing the sample C" ß âß C8 if the
germination probability is µ 1 . The maximum likelihood principle estimates 1 by that value
which maximizes :aC" ß âß C8 à 1b, because this is the value most likely to have generated the
data.
Since :aC" ß âß C8 à 1b is now considered a function of 1 for a given sample C" ß âß C8 , we
write ¿a1à C" ß âß C8 b for the function to be maximized and call it the likelihood function.
Whatever technical device is necessary, maximum likelihood estimators (MLEs) are found as
those values that maximize ¿a1à C" ß âß C8 b or, equivalently, maximize the log-likelihood
function lne¿a1à C" ß âß C8 bf œ 6a1à C" ß âß C8 b. Direct maximization is often possible. Such
is the case in the seed germination example. From [1.6] the log-likelihood is computed as
6a1à C" ß âß C8 b œ <lne1f € a8  <blne"  1f, and taking the derivative with respect to 1
yields
`6a1à C" ß âß C8 bÎ` 1 œ <Î1  a8  <bÎa"  1b.

The MLE 1 s of 1 is the solution of <Î1


s œ a8  <bÎa"  1
sb, or 1
s œ <Î8 œ C , the sample
proportion of germinated seeds.
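As a quick numerical check of this derivation, the following sketch (Python; the counts are hypothetical) maximizes the log-likelihood directly and confirms that the maximizer equals the sample proportion r/n.

```python
import numpy as np
from scipy.optimize import minimize_scalar

n, r = 50, 41  # hypothetical: 41 of 50 seeds germinated

def negloglik(pi):
    # negative of l(pi) = r*ln(pi) + (n - r)*ln(1 - pi)
    return -(r * np.log(pi) + (n - r) * np.log(1.0 - pi))

res = minimize_scalar(negloglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, r / n)  # the numerical maximizer matches the closed form r/n
```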
The maximum likelihood principle is intuitive and maximum likelihood estimators have
many appealing properties. For example,

• ML produces estimates for all parameters, not only for those in the mean function;

• if the data are Gaussian, MLEs of mean parameters are identical to least squares estimates;

• MLEs are functionally invariant. If π̂ = ȳ is the MLE of π, then ln{π̂/(1 − π̂)} is the MLE of ln{π/(1 − π)}, the logit of π, for example;

• MLEs are usually asymptotically efficient. With increasing sample size their distribution tends to a Gaussian distribution, and they are asymptotically unbiased and the most efficient estimators.

On the downside we note that MLEs do not necessarily exist, are not necessarily unique,
and are often biased estimators. Variations of the likelihood idea of particular importance for
the discussion in this text are restricted maximum likelihood (REML, §7), quasi-likelihood
(QL, §8), and composite likelihood (CL, §9).

To compare two nested models in the least squares framework, the sum of squares
reduction test is a convenient and powerful device. It is intuitive in that a restriction imposed
on a statistical model necessarily results in an increase of the residual sum of squares.
Whether that increase is statistically significant can be assessed by comparing the F_obs statistic against appropriate cutoff values or by calculating the p-value of the F_obs statistic under
the null distribution (see §1.6). The analogous device in the likelihood framework is the
likelihood ratio. If L_f is the likelihood in a statistical model and L_r is the likelihood if the
model is reduced according to a restriction imposed on the full model, then L_r cannot exceed
L_f. In the discrete case, where likelihoods have the interpretation of true probabilities, the
ratio L_f/L_r expresses how much more likely it is that the full model generated the data
compared to the reduced model. A similar interpretation applies in the case where Y is
continuous, although the likelihood ratio then does not measure a ratio of probabilities but a
ratio of densities. If the ratio is sufficiently large, then the reduced model should be rejected.
For many important cases, e.g., when the data are Gaussian-distributed, the distribution of
L_f/L_r or a function thereof is known. In general, the likelihood ratio statistic

Λ = 2 ln{L_f/L_r} = 2{l_f − l_r}   [1.7]

has an asymptotic Chi-squared distribution with q degrees of freedom, where q equals the
number of restrictions imposed on the full model. In other words, q is equal to the number of
parameters in the full model minus the number of parameters in the reduced model. The reduced model is rejected in favor of the full model at the α × 100% significance level if Λ
exceeds χ²_{α,q}, the α right-tail probability cutoff of a χ²_q distribution. In cases where an exact
likelihood-ratio test is possible, it is preferred over the asymptotic test, which is exact only as
the sample size tends to infinity.
As in the sum of squares reduction test, the likelihood ratio test requires that the models
being compared are nested. In the seed germination example the full model

p(y_i; π) = π^{y_i}(1 − π)^{1−y_i}

leaves unspecified the germination probability π. To test whether the germination probability
in the seed lot takes on a given value, π_0, the model reduces to

p(y_i; π_0) = π_0^{y_i}(1 − π_0)^{1−y_i}

under the hypothesis H_0: π = π_0. The log-likelihood in the full model is evaluated at the
maximum likelihood estimate π̂ = r/n. There are no unknowns in the reduced model and its
log-likelihood is evaluated at the hypothesized value π = π_0. The two log-likelihoods become

l_f = l(π̂; y_1, …, y_n) = r ln{r/n} + (n − r)ln{1 − r/n}
l_r = l(π_0; y_1, …, y_n) = r ln{π_0} + (n − r)ln{1 − π_0}.

After some minor manipulations the likelihood-ratio test statistic can be written as

Λ = 2{l_f − l_r} = 2r ln{r/(nπ_0)} + 2(n − r)ln{(n − r)/(n − nπ_0)}.

Note that nπ_0 is the expected number of seeds germinating if the null hypothesis is true.
Similarly, n − nπ_0 is the expected number of seeds that fail to germinate under H_0. The
quantities r and n − r are the observed numbers of seeds in the two categories. The likelihood-ratio statistic in this case takes on the familiar form (Agresti 1990),

Λ = 2 Σ_{all categories} observed count × ln{observed count / expected count}.
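Continuing the hypothetical seed counts from above, a short sketch of this test computes Λ for H_0: π = π_0 together with its asymptotic χ²_1 p-value (one restriction, so q = 1).

```python
import numpy as np
from scipy.stats import chi2

n, r, pi0 = 50, 41, 0.90  # hypothetical counts and hypothesized probability

observed = np.array([r, n - r])                # germinated, not germinated
expected = np.array([n * pi0, n * (1 - pi0)])  # expected counts under H0
lam = 2.0 * np.sum(observed * np.log(observed / expected))

print(lam, chi2.sf(lam, df=1))  # statistic and asymptotic p-value
```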

1.4 t-Tests in Terms of Statistical Models


The idea of testing hypotheses by comparing nested models is extremely powerful and
general. As noted previously, we encounter such tests as sum of squares reduction tests,
which compare the residual sums of squares of a full and a reduced model, and likelihood-ratio
tests, which rest on the difference in the log-likelihoods. In linear regression models with
Gaussian errors, the test statistic

t_obs = β̂_j / ese(β̂_j),

which is the ratio of a parameter estimate and its estimated standard error, is appropriate to
test the hypothesis H_0: β_j = 0. This test turns out to be equivalent to a comparison of two
models: the full model containing the regressor associated with β_j, and a reduced model from
which the regressor has been removed. The comparison of nested models lurks in other
procedures, too, which on the surface do not appear to have much in common with statistical
models. In this section we formulate the well-known one- and two-sample (pooled) t-tests in
terms of statistical models and show how the comparison of two nested models is equivalent
to the standard tests.

One-Sample t-Test
The one-sample t-test of the hypothesis that the mean μ of a population takes on a particular
value μ_0, H_0: μ = μ_0, is appropriate if the data are a random sample from a Gaussian population with unknown mean μ and unknown variance σ². The general setup of the test as discussed in an introductory statistics course is as follows. Let Y_1, …, Y_n denote a random
sample from a G(μ, σ²) distribution. To test H_0: μ = μ_0 against the alternative H_1: μ ≠ μ_0,
compare the test statistic

t_obs = |ȳ − μ_0| / (s/√n),

where s is the sample standard deviation, against the α/2 (right-tailed) cutoff of a t_{n−1} distribution. If t_obs > t_{α/2,n−1}, reject H_0 at the α significance level. We note in passing that all
cutoff values in this text are understood as cutoffs for right-tailed probabilities, e.g.,
Pr(t_n > t_{α,n}) = α.
First notice that a two-sided t-test is equivalent to a one-sided F-test, where the critical
value is obtained as the α cutoff from an F distribution with one numerator and n − 1 denominator degrees of freedom and the test statistic is the square of t_obs. An equivalent test of
H_0: μ = μ_0 against H_1: μ ≠ μ_0 thus rejects H_0 at the α × 100% significance level if

F_obs = t_obs² > F_{α,1,n−1} = t²_{α/2,n−1}.

The statistical models reflecting the null and alternative hypotheses are

H_0 true: (A): Y_i = μ_0 + e_i,  e_i ~ G(0, σ²)
H_1 true: (B): Y_i = μ + e_i,  e_i ~ G(0, σ²).

Model (B) is the full model because μ is not specified under the alternative, and by imposing
the constraint μ = μ_0, model (A) is obtained from model (B). The two models are thus nested
and we can compare how well they fit the data by calculating their respective residual sums of
squares SSR = Σ_{i=1}^n (y_i − Ê[Y_i])². Here, Ê[Y_i] denotes the mean of Y_i evaluated at the least
squares estimates. Under the null hypothesis there are no parameters since μ_0 is a known
constant, so that SSR_A = Σ_{i=1}^n (y_i − μ_0)². Under the alternative the least squares estimate
of the unknown mean μ is the sample mean ȳ. The residual sum of squares thus takes on the
familiar form SSR_B = Σ_{i=1}^n (y_i − ȳ)². Some simple manipulations yield

SSR_A − SSR_B = n(ȳ − μ_0)².

The residual mean square in the full model, MSR_f, is the sample variance
s² = (n − 1)^{−1} Σ_{i=1}^n (y_i − ȳ)², and the sum of squares reduction test statistic becomes

F_obs = n(ȳ − μ_0)²/s² = t_obs².

The critical value for an α × 100% level test is F_{α,1,n−1}, and the sum of squares reduction test
is thereby shown to be equivalent to the standard one-sample t-test.
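The equivalence is easy to verify numerically. The sketch below (Python; the data are hypothetical) computes the residual sums of squares of models (A) and (B) and compares the resulting F_obs with the squared statistic of a standard one-sample t-test.

```python
import numpy as np
from scipy import stats

y = np.array([5.1, 4.8, 5.6, 5.3, 4.9, 5.4, 5.2, 5.0])  # hypothetical sample
mu0 = 5.0
n = len(y)

ssr_A = np.sum((y - mu0) ** 2)       # reduced model (A): mean fixed at mu0
ssr_B = np.sum((y - y.mean()) ** 2)  # full model (B): mean estimated by ybar
f_obs = (ssr_A - ssr_B) / (ssr_B / (n - 1))

t_obs, _ = stats.ttest_1samp(y, mu0)
print(f_obs, t_obs**2)  # identical up to rounding
```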
A likelihood-ratio test comparing models (A) and (B) can also be developed. The
probability density function of a G(μ, σ²) random variable is

f(y; μ, σ²) = (2πσ²)^{−1/2} exp{−(y − μ)²/(2σ²)}

and the log-likelihood function in a random sample of size n is

l(μ, σ²; y_1, …, y_n) = −(n/2)ln{2π} − (n/2)ln{σ²} − (1/(2σ²)) Σ_{i=1}^n (y_i − μ)².   [1.8]

The derivatives with respect to μ and σ² are

∂l(μ, σ²; y_1, …, y_n)/∂μ = (1/σ²) Σ_{i=1}^n (y_i − μ) ≡ 0
∂l(μ, σ²; y_1, …, y_n)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (y_i − μ)² ≡ 0.

In the full model, where both μ and σ² are unknown, the respective MLEs are the solutions to
these equations, namely μ̂ = ȳ and σ̂² = n^{−1} Σ_{i=1}^n (y_i − ȳ)². Notice that the MLE of the error
variance is not the sample variance s². It is a biased estimator related to the sample variance
by σ̂² = s²(n − 1)/n. In the reduced model the mean is fixed at μ_0 and only σ² is a
parameter of the model. The MLE in model (A) becomes σ̂_0² = n^{−1} Σ_{i=1}^n (y_i − μ_0)². The
likelihood ratio test statistic is obtained by evaluating the log-likelihoods at μ̂, σ̂² in model (B)
and at μ_0, σ̂_0² in model (A). Perhaps surprisingly, Λ reduces to

Λ = n ln{σ̂_0²/σ̂²} = n ln{(σ̂² + (ȳ − μ_0)²)/σ̂²}.

The second expression uses the fact that σ̂_0² = σ̂² + (ȳ − μ_0)². If the sample mean is far from
the hypothesized value, the variance estimate in the reduced model will be considerably larger
than that in the full model. That is the case if μ is far removed from μ_0, because ȳ is an
unbiased estimator of the true mean μ. Consequently, we reject H_0: μ = μ_0 for large values
of Λ. Based on the fact that Λ has an approximate χ²_1 distribution, the decision rule can be
formulated to reject H_0 if Λ > χ²_{α,1}. However, we may be able to determine a function of the
data in which Λ increases monotonically. If this function has a known rather than an approximate distribution, an exact test is possible. It is sufficient to concentrate on σ̂_0²/σ̂² to this end,
since Λ is increasing in σ̂_0²/σ̂². Writing

σ̂_0²/σ̂² = 1 + (ȳ − μ_0)²/σ̂²

and using the fact that σ̂² = s²(n − 1)/n, we obtain

σ̂_0²/σ̂² = 1 + (ȳ − μ_0)²/σ̂² = 1 + {n(ȳ − μ_0)²/s²}{1/(n − 1)} = 1 + F_obs/(n − 1).

Instead of rejecting for large values of Λ we can also reject for large values of F_obs. Since the
distribution of F_obs under the null hypothesis is F_{1,n−1}, an exact test is possible, and this test
is the same as the sum of squares reduction test.

Pooled t-Test
In the two-sample case the hypothesis that two populations have the same mean, H_0: μ_1 = μ_2,
can be tested with the pooled t-test under the following assumptions. The Y_1j, j = 1, …, n_1, are a
random sample from a G(μ_1, σ²) distribution and the Y_2j, j = 1, …, n_2, are a random sample
from a G(μ_2, σ²) distribution, drawn independently of the first sample. The common variance
σ² is unknown and can be estimated as the pooled sample variance, from which the procedure
derives its name:

s_p² = {(n_1 − 1)s_1² + (n_2 − 1)s_2²}/(n_1 + n_2 − 2).

The procedure for testing H_0: μ_1 = μ_2 against H_1: μ_1 ≠ μ_2 is to compare the value of the test
statistic

t_obs = |ȳ_1 − ȳ_2| / √{s_p²(1/n_1 + 1/n_2)}

against the t_{α/2,n_1+n_2−2} cutoff.


The distributional properties of the data under the null and alternative hypotheses can
again be formulated as nested statistical models. Of the various mathematical constructions
that can accomplish this, we prefer the one relying on a dummy regressor variable. Let z_ij
denote a binary variable that takes on the value 1 for all observations sampled from population 2,
and the value 0 for all observations from population 1. That is,

z_ij = 0 if i = 1 (group 1); z_ij = 1 if i = 2 (group 2).

The statistical model for the two-group data can be written as

Y_ij = μ_1 + z_ij β + e_ij,  e_ij ~ iid G(0, σ²).

The distributional assumptions for the errors reflect the independence of observations within
and among the groups and the equal variance assumption.
For the two possible values of z_ij, the model can be expressed as

Y_ij = μ_1 + e_ij for i = 1 (group 1); Y_ij = μ_1 + β + e_ij for i = 2 (group 2).

The parameter β measures the difference between the means in the two groups,
E[Y_2j] − E[Y_1j] = μ_2 − μ_1 = β. The hypothesis H_0: μ_1 − μ_2 = 0 is therefore the same as H_0: β = 0.
The reduced and full models to be compared are

H_0 true: (A): Y_ij = μ_1 + e_ij,  e_ij ~ G(0, σ²)
H_1 true: (B): Y_ij = μ_1 + z_ij β + e_ij,  e_ij ~ G(0, σ²).

It is a nice exercise to derive the least squares estimators of μ_1 and β in the full and reduced
models and to calculate the residual sums of squares from them. Briefly, for the full model one
obtains μ̂_1 = ȳ_1, β̂ = ȳ_2 − ȳ_1, SSR_B = (n_1 − 1)s_1² + (n_2 − 1)s_2², and MSR_B = s_p²; the
quantity s_p²(1/n_1 + 1/n_2) is the estimated variance of β̂, the square of the denominator of
t_obs. In the reduced model the least squares estimate of μ_1 becomes μ̂_1 =
(n_1ȳ_1 + n_2ȳ_2)/(n_1 + n_2), and

SSR_A − SSR_B = {n_1n_2/(n_1 + n_2)}(ȳ_1 − ȳ_2)².

The test statistic for the sum of squares reduction test,

F_obs = {(SSR_A − SSR_B)/1} / MSR_B = (ȳ_1 − ȳ_2)² / {s_p²(1/n_1 + 1/n_2)},

is again the square of the t_obs statistic.
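A numerical sketch of this equivalence (Python; the data are hypothetical): the dummy-regressor model is fit by ordinary least squares and the resulting F statistic is compared with the squared pooled-t statistic.

```python
import numpy as np
from scipy import stats

y1 = np.array([12.1, 11.4, 12.8, 11.9, 12.3])        # hypothetical group 1
y2 = np.array([13.0, 12.6, 13.4, 12.9, 13.3, 12.7])  # hypothetical group 2
y = np.concatenate([y1, y2])
z = np.concatenate([np.zeros(len(y1)), np.ones(len(y2))])  # dummy regressor

X = np.column_stack([np.ones_like(z), z])  # full model: y = mu1 + beta*z + e
beta = np.linalg.lstsq(X, y, rcond=None)[0]
ssr_B = np.sum((y - X @ beta) ** 2)
ssr_A = np.sum((y - y.mean()) ** 2)        # reduced model: common mean

f_obs = (ssr_A - ssr_B) / (ssr_B / (len(y) - 2))
t_obs, _ = stats.ttest_ind(y1, y2)  # pooled two-sample t-test
print(f_obs, t_obs**2)
```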

1.5 Embedding Hypotheses


Box 1.4 Embedded Hypothesis

• A hypothesis is said to be embedded in a model if invoking the hypothesis creates a reduced model that can be compared to the full model with a sum of squares reduction or likelihood ratio test.

• Reparameterization is often necessary to reformulate a model so that embedding is possible.

The idea of the sum of squares reduction test is intuitive. Impose a restriction on a model and
determine whether the resulting increase in an uncertainty measure is statistically significant.
If the change in the residual sum of squares is significant, we conclude that the restriction
does not hold. One could call the procedure a sum of squares increment test, but we prefer to
view it in terms of the reduction that is observed when the restriction is lifted from the
reduced model. The restriction is the null hypothesis, and its rejection leads to the rejection of
the reduced model. It is advantageous to formulate statistical models so that hypotheses of
interest can be tested through comparisons of nested models. We then say that the hypotheses
of interest can be embedded in the model.
As an example, consider the comparison of two simple linear regression lines, one for
each of two groups (control and treated group, for example). The possible scenarios are (i) the
same trend in both groups, (ii) different intercepts but the same slopes, (iii) same intercepts
but different slopes, and (iv) different intercepts and different slopes (Figure 1.2). Which of
the four scenarios best describes the mechanism that generated the observed data can be determined by specifying a full model representing case (iv) in which the other three scenarios
are nested. We choose case (iv) as the full model because it has the most unknowns. Let Y_ij
denote the jth observation from group i (i = 1, 2) and define a dummy variable

z_ij = 1 if the observation is from group 1; z_ij = 0 if the observation is from group 2.

The model representing case (iv) can be written as

Y_ij = β_0 + β_1z_ij + β_2x_ij + β_3x_ij z_ij + e_ij,


where x_ij is the value of the continuous regressor for observation j from group i. To see how
the dummy variable z_ij creates two separate trends in the groups, consider

E[Y_ij | z_ij = 1] = β_0 + β_1 + (β_2 + β_3)x_ij
E[Y_ij | z_ij = 0] = β_0 + β_2x_ij.

The intercepts are (β_0 + β_1) in group 1 and β_0 in group 2. Similarly, the slopes are (β_2 + β_3)
in group 1 and β_2 in group 2. The restrictions (null hypotheses) that reduce the full model to
the other three cases are

(i): H_0: β_1 = β_3 = 0;  (ii): H_0: β_3 = 0;  (iii): H_0: β_1 = 0.

Notice that the term x_ij z_ij has the form of an interaction between the regressor and the
dummy variable that identifies group membership. If β_3 = 0, the lines are parallel. This is the
very meaning of the absence of an interaction: the comparison of groups no longer depends
on the value of x.

The hypotheses can be tested by fitting the full and the three reduced models and performing the sum of squares reduction tests. For linear models, the results of reduction tests involving only one regressor or effect are given by standard regression packages, and a good
package is capable of testing more complicated constraints such as (i) based on a fit of the full
model only (§4.2.3).
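A sketch of one such comparison (Python; the data are hypothetical): the full model (iv) and the reduced model under (ii), H_0: β_3 = 0, are fit by least squares and compared with the sum of squares reduction test.

```python
import numpy as np
from scipy.stats import f as f_dist

# hypothetical data: regressor x, dummy z (1 = group 1, 0 = group 2), response y
x = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5], dtype=float)
z = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=float)
y = np.array([3.2, 4.1, 5.3, 6.0, 7.2, 2.0, 3.5, 4.8, 6.4, 7.9])

def ssr(X, y):
    # residual sum of squares of a least squares fit
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - X @ beta) ** 2)

one = np.ones_like(x)
X_full = np.column_stack([one, z, x, x * z])  # case (iv), the full model
X_red = np.column_stack([one, z, x])          # case (ii): beta3 = 0

q = 1
df_full = len(y) - X_full.shape[1]
f_obs = ((ssr(X_red, y) - ssr(X_full, y)) / q) / (ssr(X_full, y) / df_full)
print(f_obs, f_dist.sf(f_obs, q, df_full))  # F statistic and its p-value
```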

[Figure 1.2 here: four panels, (i) through (iv), plotting E[Y_ij] against x for Groups 1 and 2 under the four scenarios described above.]

Figure 1.2. Comparison of simple linear regressions among two groups.

Many statistical models can be expressed in alternative ways, and this can change the
formulation of the hypothesis. Consider a completely randomized experiment with r replications of t treatments. The linear statistical model for this experiment can be written in at least
two ways, known as the means and effects models (§4.3.1):

Means model: Y_ij = μ_i + e_ij,  e_ij ~ iid (0, σ²)
Effects model: Y_ij = μ + τ_i + e_ij,  e_ij ~ iid (0, σ²)
i = 1, …, t; j = 1, …, r.

The treatment effects τ_i are simply μ_i − μ, and μ is the average of the treatment means μ_i.
Under the hypothesis of equal treatment means, H_0: μ_1 = μ_2 = ⋯ = μ_t, the means model reduces to Y_ij = μ + e_ij, where μ is the unknown mean common to all treatments. The equivalent hypothesis in the effects model is H_0: τ_1 = τ_2 = ⋯ = τ_t. Since Σ_{i=1}^t τ_i = 0 by construction, one can also state the hypothesis as H_0: all τ_i = 0. Notice that the two-sample t-test
problem in §1.4 is a special case of this problem with t = 2, r = n_1 = n_2.
In particular for nonlinear models, it may not be obvious how to embed a hypothesis in a
model. This is the case when the model is not expressed in terms of the quantities of interest.
Recall the Mitscherlich yield equation

E[Y] = λ + (ξ − λ)exp{−κx},   [1.9]

where λ is the upper yield asymptote, ξ is the yield at x = 0, and κ governs the rate of
change. Imagine that x is the amount of a nutrient applied and we are interested in estimating
and testing hypotheses about the amount of the nutrient already in the soil. Call this parameter
α. Black (1993, p. 273) terms −1 × α the availability index of the nutrient in the soil. It
turns out that α is related to the three parameters in [1.9],

α = ln{(λ − ξ)/λ}/κ.

Once estimates of λ, ξ, and κ have been obtained, this quantity can be estimated by plugging
in the estimates. The standard error of this estimate of α will be very difficult to obtain owing
to the nonlinearity of the relationship. Furthermore, to test the restriction that α = −20, for
example, requires fitting a reduced model in which ln{(λ − ξ)/λ}/κ = −20. This is not a
well-defined problem.

To enable estimation and testing of hypotheses in nonlinear models, the model should be
rewritten to contain the quantities of interest. This process, termed reparameterization
(§5.7), yields for the Mitscherlich equation

E[Y] = λ(1 − exp{−κ(x − α)}).   [1.10]

The parameter ξ in model [1.9] was replaced by λ(1 − exp{κα}), and after collecting terms,
one arrives at [1.10]. The sum of squares decomposition, residuals, and fit statistics are
identical when the two models are fit to data. Testing the hypothesis α = −20 is now
straightforward. Obtain the estimate of α and its estimated standard error from a statistical
package (see §5.4, §5.8.1) and calculate a confidence interval for α. If it does not contain the
value −20, reject H_0: α = −20. Alternatively, fit the reduced model E[Y] =
λ(1 − exp{−κ(x + 20)}) and perform a sum of squares reduction test.

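To illustrate, a sketch of fitting the reparameterized form [1.10] directly by nonlinear least squares (Python's scipy.optimize.curve_fit; the data and starting values are hypothetical), which yields an estimate and an approximate standard error for α itself:

```python
import numpy as np
from scipy.optimize import curve_fit

# hypothetical rate trial: applied nutrient amount x and observed yield y
x = np.array([0.0, 20.0, 40.0, 60.0, 80.0, 120.0])
y = np.array([24.0, 46.0, 60.0, 69.0, 74.0, 78.0])

def mitscherlich(x, lam, kappa, alpha):
    # reparameterized form [1.10]
    return lam * (1.0 - np.exp(-kappa * (x - alpha)))

est, cov = curve_fit(mitscherlich, x, y, p0=[80.0, 0.02, -20.0])
se = np.sqrt(np.diag(cov))
print("alpha estimate:", est[2], "approx. standard error:", se[2])
```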

1.6 Hypothesis and Significance Testing — Interpretation of the p-Value
Box 1.5 p-Value

• p-values are probabilities calculated under the assumption that a null hypothesis is true. They measure the probability of observing an experimental outcome at least as extreme as the observed outcome.

• p-values are frequently misinterpreted as an error probability for rejecting the null hypothesis or, even worse, as the probability that the null hypothesis is true.

A distinction is made in statistical theory between hypothesis and significance testing. The
former relies on comparing the observed value of a test statistic with a critical value and
rejecting the null hypothesis if the observed value is more extreme than the critical value. Most
statistical computing packages apply the significance testing approach because it does not
involve critical values. In order to derive a critical value, one must first decide on the Type-I
error probability α of rejecting a null hypothesis that is true. The significance approach relies on
calculating the p-value of a test statistic, the probability of obtaining a value of the test statistic at
least as extreme as the observed one, provided that the null hypothesis is true. The connection
between the two approaches lies in the Type-I error rate α. If one rejects the null hypothesis
when the p-value is less than α, and fails to reject otherwise, significance and hypothesis
testing lead to the same decisions. Statistical tests done by hand are almost always performed
as hypothesis tests, and the results of tests carried out with computers are usually reported as
p-values. We will not make a formal distinction between the two approaches here and note
that p-values are more informative than decisions based on critical values. To attach *, **,
***, or some other notation to the results of tests that are significant at the α = 0.05, 0.01, and
0.001 levels is commonplace but arbitrary. When the p-value is reported, each reader can draw
his or her own conclusion about the fate of the null hypothesis.

Even if results are reported with notations such as *, **, and *** or by attaching lettering to
an ordered list of treatment means, these displays are often obtained by converting p-values
from statistical output. The ubiquitous p-values are probably the most misunderstood and
misinterpreted quantities in applied statistical work. To draw correct conclusions from the output
of statistical packages it is imperative to interpret them properly. Common misconceptions are
that (i) the p-value measures an error probability for the rejection of the hypothesis, (ii) the p-value measures the probability that the null hypothesis is true, and (iii) small p-values imply that
the alternative hypothesis is correct. To rectify these misconceptions we briefly discuss the
rationale of hypothesis testing from a probably unfamiliar angle, Monte-Carlo testing, and
demonstrate the calculation of p-values with a spatial point pattern example.
comparing a model against different data sets. Assume a particular model holds (is true) for
the time being. In the test of two nested models we assume that the restriction imposed on the
full model holds and we accept the reduced model unless we can find evidence to the

© 2002 by CRC Press LLC


22 Chapter 1  Statistical Models

contrary. That is, we are working under the assumption that the null hypothesis holds until it
is rejected. Because there is uncertainty in the outcome of the experiment we do not expect
the observed data and the postulated model to agree perfectly. But if chance is the only expla-
nation for the disagreement between data and model, there is no reason to reject the model
from a statistical point of view. It may fit the data poorly because of large variability in the
data, but it remains correct on average.
The problem, of course, is that we observe only one experimental outcome and do not see
other sets of data that have been generated by the model under investigation. If that were the
case we could devise the following test procedure. Calculate a test statistic from the observed
data. Generate all possible data sets consistent with the null hypothesis if this number is finite
or generate a sufficiently large number of data sets if there are infinitely many experimental
outcomes. Denote the number of data sets so generated by 5 . Calculate the test statistic in
each of the 5 realizations. Since we assume the null hypothesis to be true, the value of the test
statistic calculated from the observed data is added to the test statistics calculated from the
generated data sets and the 5 € " values are ranked. If the data were generated by a
mechanism that does not agree with the model under investigation, the observed value of the
test statistic should be unusual, and its rank should be extreme. At this point we need to
invoke a decision rule according to which values of the test statistic are deemed sufficiently
rare or unusual to reject L! . If the observed value is among those values considered rare
enough to reject L! , this is the decision that follows. The critical rank is a measure of the
acceptability of the model (McPherson 1990). The decision rule cannot alone be the attained
rank of a test statistic, for example, "reject L! if the observed test statistic ranks fifth." If 5 is
large the probability of a particular value can be very small. As 5 tends to infinity, the
probability to observe a particular rank tends to zero. Instead we define cases deemed
inconsistent with L! by a range of ranks. Outcomes at least as extreme as the critical rank
lead to the rejection of L! . This approach of testing hypotheses is known under several
names. In the design of experiment it is termed the randomization approach. If the number
of possible data sets under L! is finite it is also known as permutation testing. If a random
sample of the possible data sets is drawn it is referred to as Monte-Carlo testing (see, e.g.
Kempthorne 1952, 1955; Kempthorne and Doerfler 1969; Rubinstein 1981; Diggle 1983,
Hinkelmann and Kempthorne 1994, our §A4.8.2 and §9.7.3)
The reader is most likely familiar with procedures that calculate an observed value of a
test statistic and then (i) compare the value against a cutoff from a tabulated probability distribution or (ii) calculate the p-value of the test statistic. The procedure based on generating data
sets under the null hypothesis as outlined above is no different from this classical approach.
The distribution table from which cutoffs are obtained, or the distribution from which p-values
are calculated, reflects the probability distribution of the test statistic if the null hypothesis is
true. The list of k test statistics obtained from data sets generated under the null hypothesis
also reflects the distribution of the test statistic under H_0. The critical value (cutoff) in a test
corresponds to the critical rank in the permutation/Monte-Carlo/randomization procedure.

To illustrate the calculation of p-values and the Monte-Carlo approach, we consider the
data shown in Figure 1.3, which might represent the locations in a field at which 150 weeds
emerged. Chapter 9.7 provides an in-depth discussion of spatial point patterns and their
analysis. We wish to test whether a process that places weeds completely at random
(uniformly and independently) in the field could give rise to the distribution shown in Figure
1.3 or whether the data-generating process is clustered. In a clustered process, events exhibit
more grouping than in a spatially completely random process. If we calculate the average
distance between an event and its nearest neighbor, we expect this distance to be smaller in a
clustered process. For the observed pattern (Figure 1.3) the average nearest-neighbor distance
is ȳ = 0.03964.

[Figure 1.3 here: map of the plant locations; axes are the X- and Y-coordinates on the unit square.]

Figure 1.3. Distribution of 150 weed plants on a simulated field.

[Figure 1.4 here: histogram of the average nearest-neighbor distances with a density estimate; the observed average nearest-neighbor distance splits the simulated values into n = 27 below it and n = 173 above it.]

Figure 1.4. Distribution of average nearest neighbor distances in k = 200 simulations of a spatial process that places 150 plants completely at random on the (0,1) × (0,1) square.

Two hundred point patterns were then generated from a process that is spatially
completely neutral: events are placed uniformly and independently of each other. This
process represents the stochastic model consistent with the null hypothesis. Each of the
k = 200 simulated patterns has the same boundary box as the observed pattern and the same
number of points (150). Figure 1.4 shows the histogram of the k + 1 = 201 average nearest-neighbor distances. Note that the observed distance (ȳ = 0.03964) is part of the histogram.


The line in the figure is a nonparametric estimate of the probability density of the average
nearest-neighbor statistic. This density is only an estimate of the true distribution of the test
statistic, since infinitely many realizations are possible under the null hypothesis.

The observed statistic is the 28th smallest among the 201 values. Under the alternative
hypothesis of a clustered process, the average nearest-neighbor distance should be smaller
than under the null hypothesis. Plants that appear in clusters are closer to each other on
average than plants distributed completely at random. Small average nearest-neighbor distances are thus extreme under the null hypothesis. For a 5% significance test the critical rank
would be 201 × 0.05 = 10.05: if the observed value ranks 10th or lower, H_0 is rejected. This is not the case,
and we fail to reject the hypothesis that the plants are distributed completely at random. There
is insufficient evidence at the 5% significance level to conclude that a clustered process gave
rise to the spatial distribution in Figure 1.3. The p-value is calculated as the proportion of
values at least as extreme as the observed value, hence p = 28/201 = 0.139. Had we tested
whether the observed point distribution is more regular than a completely random pattern,
large average nearest-neighbor distances would be consistent with the alternative hypothesis
and the p-value would be 174/201 = 0.866.

Because the null hypothesis is true when the k = 200 patterns are generated, and because
the observed value is judged as if the null hypothesis were true, the p-value is not a measure
of the probability that H_0 is wrong. The p-value is a conditional probability computed under
the assumption that H_0 is correct. As this probability becomes
smaller and smaller, we will eventually distrust the condition.
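The entire procedure condenses into a short Monte-Carlo sketch (Python; the "observed" pattern is itself simulated here, so the ranks and the p-value will differ from the values reported above):

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_nn_distance(pts):
    # average distance from each event to its nearest neighbor
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1).mean()

n_pts, k = 150, 200
observed = rng.uniform(size=(n_pts, 2))  # stand-in for the observed pattern
t_obs = mean_nn_distance(observed)

# k data sets generated under H0: complete spatial randomness on the unit square
t_sim = np.array([mean_nn_distance(rng.uniform(size=(n_pts, 2))) for _ in range(k)])

# one-sided p-value against clustering: small distances are the extreme ones
p = (1 + np.less_equal(t_sim, t_obs).sum()) / (k + 1)
print(t_obs, p)
```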
In a test that is not based on Monte-Carlo arguments, the estimated density in Figure 1.4
is replaced by the exact probability density function of the test statistic. This distribution can
be obtained by complete enumeration in randomization or permutation tests, by deriving the
distribution of the test statistic from first principles, or by approximation. In the t-test examples
of §1.4, the distribution of F_obs under H_0 is known to be that of an F random variable, and
t_obs is known to be distributed as a t random variable. The likelihood ratio test statistic [1.7] is
approximately distributed as a Chi-squared random variable.

In practice, rejection of the null hypothesis is tantamount to acceptance of the alternative hypothesis. Implicit in this step is the assumption that if an outcome is extreme under the
null hypothesis, it is not so under the alternative. Consider the case of simple null and
alternative hypotheses, e.g., H_0: μ = μ_0, H_1: μ = μ_1. If μ_1 and μ_0 are close, an experimental
outcome extreme under H_0 is also extreme under H_1. Rejection of H_0 should then not prompt
the acceptance of H_1. But such a test most likely has a large probability of a Type-II error,
the failure to reject an incorrect null hypothesis (low power). These situations can be avoided by
controlling the Type-II error of the test, which ultimately implies collecting samples of
sufficient size (sufficient to achieve a desired power for a stipulated difference μ_1 − μ_0).
Finally, we remind the reader of the difference between statistical and practical significance.
Just because statistical significance is attached to a result does not imply that it is a meaningful result from a practical point of view. If it takes n = 5,000 samples to detect a significant
difference between two treatments, their actual difference is probably so small that hardly
anyone will be interested in knowing it.


1.7 Classes of Statistical Models

1.7.1 The Basic Component Equation


Box 1.6 Some Terminology

A statistical model consists of at least the following components: response, error, systematic part, parameters.

• Response: The outcome of interest being measured, counted, or classified. Notation: Y.
• Parameter: Any unknown constant in the mean function or distribution of random variables.
• Systematic part: The mean function of a model.
• Model errors: The difference between observations and the mean function.
• Prediction: Evaluation of the mean function at the estimated values of the parameters.
• Fitted residual: The difference between observed and fitted values: ê_i = Y_i − Ŷ_i.

Most statistical models are supported by the decomposition

response = structure + error

of observations into a component associated with identifiable sources of variability and an
error component, which McPherson (1990) calls the component equation. The equation is of
course not a panacea for all statistical problems; errors may be multiplicative, for example.
The attribute being modeled is termed the response or outcome. An observation in the
narrow sense is a recorded value of the response. In a broader sense, observations also
incorporate information about the variables in the structural part to which the response is
related.

The component equation is eventually expressed in mathematical terms. A very general
expression for the systematic part is

f(x_0i, x_1i, x_2i, …, x_ki, θ_0, θ_1, …, θ_p),

where x_0i, …, x_ki are measured variables and θ_0, θ_1, …, θ_p are parameters. The response is
typically denoted Y, and an appropriate number of subscripts must be added to associate a
single response with the parts of the model structure. If a single subscript is sufficient, the
basic component equation of a statistical model becomes

Y_i = f(x_0i, x_1i, x_2i, …, x_ki, θ_0, θ_1, …, θ_p) + e_i.   [1.11]

The specification of the component equation is not complete without the means,
variances, and covariances of all random variables involved and, if possible, their distribution
laws. If these are unknown, they add additional parameters to the model. The assumption that
the user's model is correct is reflected in the zero mean assumption of the errors (E[e_i] = 0).
Since then E[Y_i] = f(x_0i, x_1i, x_2i, …, x_ki, θ_0, θ_1, …, θ_p), the function f(·) is often called the
mean function of the model.

The process of fitting the model to the observed responses involves estimation of the unknown quantities in the systematic part and the parameters of the error distribution. Once the
parameters are estimated, the fitted values can be calculated:

Ŷ_i = f(x_0i, x_1i, x_2i, …, x_ki, θ̂_0, θ̂_1, …, θ̂_p) = f̂_i,

and a second decomposition of the observations has been attained:

response = fitted value + fitted residual = f̂_i + ê_i.

A caret placed over a symbol denotes an estimated quantity. Fitted values are calculated for
the observed values of x_0i, …, x_ki. Values calculated for any combination of the x variables,
whether part of the data set or not, are termed predicted values. It is usually assumed that the
fitted residual ê_i is an estimate of the unobservable model error e_i, which justifies model
diagnostics based on residuals. But unless the fitted values estimate the systematic part of the
model without bias and the model is correct (E[e_i] = 0), the fitted residuals will not even
have a zero mean. And the fitted values f̂_i may be biased estimators of f_i, even if the model
is correctly specified. This is common when f is a nonlinear function of the parameters.

The measured variables contributing to the systematic part of the model are termed here
covariates. In regression applications, they are also referred to as regressors or independent
variables, while in analysis of variance models the term covariate is sometimes reserved for
those variables that are measured on a continuous scale. The term independent variable
should be avoided, since it is not clear what the variable is independent of. The label is popular, though, since the response is often referred to as the dependent variable. In many
regression models the covariates are in fact very highly dependent on each other; therefore,
the term independent variable is misleading. Covariates that can take on only two values, 0 or
1, are also called design variables, dummy variables, or binary variables. They are typical of
analysis of variance models. In observational studies (see §2.3) covariates are also called explanatory variables. We prefer the term covariate to encompass all of the above. The precise
nature of a covariate will be clear from context.

1.7.2 Linear and Nonlinear Models


Box 1.7 Nonlinearity

• Nonlinearity does not refer to curvature of the mean function as a function of covariates.

• A model is nonlinear if at least one derivative of the mean function with respect to the parameters depends on at least one parameter.


The distinction between linear and nonlinear models is often obstructed by references to
graphs of the predicted values. If a graph of the predicted values appears to have curvature,
the underlying statistical model may still be linear. The polynomial

Y_i = β_0 + β_1x_i + β_2x_i² + e_i

is a linear model, but when ŷ is graphed vs. x, the predicted values exhibit curvature. The
acid test for linearity is as follows: if the derivatives of the model's systematic part with
respect to the parameters do not depend on any of the parameters, the model is linear. Otherwise, the model is nonlinear. For example,

Y_i = β_0 + β_1x_i + e_i

has a linear mean function, since

∂(β_0 + β_1x_i)/∂β_0 = 1
∂(β_0 + β_1x_i)/∂β_1 = x_i

and neither of the derivatives depends on any parameters. The quadratic polynomial
Y_i = β_0 + β_1x_i + β_2x_i² + e_i is also a linear model, since

∂(β_0 + β_1x_i + β_2x_i²)/∂β_0 = 1
∂(β_0 + β_1x_i + β_2x_i²)/∂β_1 = x_i
∂(β_0 + β_1x_i + β_2x_i²)/∂β_2 = x_i².

Linear models with curved mean functions are termed curvilinear. The model

Y_i = β_0(1 + e^{β_1x_i}) + e_i,

on the other hand, is nonlinear, since the derivatives

∂(β_0(1 + e^{β_1x_i}))/∂β_0 = 1 + e^{β_1x_i}
∂(β_0(1 + e^{β_1x_i}))/∂β_1 = β_0x_ie^{β_1x_i}

depend on the model parameters.

A model can be linear in some and nonlinear in other parameters. E[Y_i] = β_0 + e^{β_1x_i}, for
example, is linear in β_0 and nonlinear in β_1, since

∂(β_0 + e^{β_1x_i})/∂β_0 = 1
∂(β_0 + e^{β_1x_i})/∂β_1 = x_ie^{β_1x_i}.

If a model is nonlinear in at least one parameter, the entire model is considered nonlinear.
Linearity refers to linearity in the parameters, not the covariates. Transformations of the
covariates such as e^x, ln(x), 1/x, √x do not change the linearity of the model, although they
will affect the degree of curvature seen in a plot of y against x. Polynomial models, which
raise a covariate to successively increasing powers, are always linear models.
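The acid test can be automated with symbolic differentiation. The sketch below (Python with sympy; the helper is_linear is ours, not a library function) checks whether any parameter derivative still contains a parameter.

```python
import sympy as sp

x, b0, b1 = sp.symbols("x beta0 beta1")

def is_linear(mean_fn, params):
    # linear if no derivative with respect to a parameter contains a parameter
    return all(sp.diff(mean_fn, p).free_symbols.isdisjoint(params) for p in params)

print(is_linear(b0 + b1 * x, {b0, b1}))                # True: simple linear regression
print(is_linear(b0 * (1 + sp.exp(b1 * x)), {b0, b1}))  # False: nonlinear model
print(is_linear(b0 + sp.exp(b1 * x), {b0, b1}))        # False: nonlinear in beta1 only
```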


1.7.3 Regression and Analysis of Variance Models


Box 1.8 Regression vs. ANOVA

• Regression model: Covariates are continuous

• ANOVA model: Covariates are classification variables

• ANCOVA model: Covariates are a mixture of continuous and classification variables.

A parametric regression model in general is a linear or nonlinear statistical model in which
the covariates are continuous, while in an analysis of variance model the covariates represent
classification variables. Assume that a plant growth regulator is applied at four different
rates: 0.5, 1.0, 1.3, and 2.4 kg × ha⁻¹. A simple linear regression model relating plant yield
on a per-plot basis to the amount of growth regulator applied associates the ith plot's response
to the applied rates directly:

Y_i = θ_0 + θ_1x_i + e_i.

If the first four observations received 0.5, 1.0, 1.3, and 1.3 kg × ha⁻¹, respectively, the model
for these observations becomes

Y_1 = θ_0 + 0.5θ_1 + e_1
Y_2 = θ_0 + 1.0θ_1 + e_2
Y_3 = θ_0 + 1.3θ_1 + e_3
Y_4 = θ_0 + 1.3θ_1 + e_4.

If rate of application is a classification variable, information about the actual amounts of
growth regulator applied is not taken into account. Only information about which of the four
levels of application rate an observation is associated with is considered. The corresponding
ANOVA (classification) model becomes

Y_ij = μ_i + e_ij,

where μ_i is the mean yield if the ith level of the growth regulator is applied. The double subscript is used to emphasize that multiple observations can share the same growth regulator
level (replications). This model can be expanded using a series of dummy covariates,
z_1j, …, z_4j, say. Let z_ij take on the value 1 if the jth observation received the ith level of the
growth regulator, and 0 otherwise. The expanded ANOVA model then becomes

Y_ij = μ_i + e_ij = μ_1z_1j + μ_2z_2j + μ_3z_3j + μ_4z_4j + e_ij.

In this form, the ANOVA model is a multiple regression model with four covariates and no
intercept.
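To see the equivalence concretely, here is a small sketch (Python; the levels and yields are hypothetical) that builds the no-intercept dummy design matrix and recovers the group means as regression coefficients.

```python
import numpy as np

# hypothetical data: growth regulator level (1..4) of each plot and plot yield
level = np.array([1, 1, 2, 2, 3, 3, 4, 4])
y = np.array([8.2, 8.6, 9.5, 9.1, 10.2, 9.8, 11.0, 11.4])

# dummy matrix Z: column i equals 1 where an observation received level i
Z = (level[:, None] == np.arange(1, 5)[None, :]).astype(float)

# the no-intercept least squares fit returns the level means mu_1, ..., mu_4
mu_hat = np.linalg.lstsq(Z, y, rcond=None)[0]
print(mu_hat)
print([y[level == i].mean() for i in range(1, 5)])  # identical group means
```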
The role of the dummy covariates in ANOVA models is to select the parameters (effects)
associated with a particular response. The relationship between plot yield and growth regulation is described more parsimoniously in the regression model, which contains only two parameters, the intercept θ_0 and the slope θ_1. The ANOVA model allots four parameters
(μ_1, …, μ_4) to describe the systematic part of the model. The downside of the regression
model is that if the relationship between y and x is not linear, the model will not apply and
inferences based on the model may be incorrect.

ANOVA and linear regression models can be cast in the same framework; they are both
linear statistical models. Classification models may contain continuous covariates in addition
to design variables, and regression models may contain binary covariates in addition to
continuous ones. An example of the first type of model arises when adjustments are made for
known systematic differences in initial conditions among the experimental units to which the
treatments are applied. Assume, for example, that the soils of the plots on which growth
regulators were applied had different lime requirements. Let u_ij be the lime requirement of
the jth plot receiving the ith rate of the regulator; then the systematic effect of lime requirement on plot yield can be accounted for by incorporating u_ij as a continuous covariate in the
classification model:

Y_ij = μ_i + θu_ij + e_ij.   [1.12]

This is often termed an analysis of covariance (ANCOVA) model. The parameter θ
measures the change in plant yield if lime requirement changes by one unit. This change is
the same for all levels of growth regulator. If the lime requirement effect depends on the
particular growth regulator, interactions can be incorporated:

Y_ij = μ_i + θ_iu_ij + e_ij.

The presence of these interactions is easily tested with a sum of squares reduction test, since
the previous two models are nested (H_0: θ_1 = θ_2 = θ_3 = θ_4).

The same problem can be approached from a regression standpoint. Consider the initial
model, Y_ij = θ_0 + θu_ij + e_ij, linking plot yield to lime requirement. Because a distinct
number of rates was applied on the plots, the simple linear regression can be extended to
accommodate separate intercepts for the growth regulators. Replace the common intercept θ_0
by

μ_i = μ_1z_1j + μ_2z_2j + μ_3z_3j + μ_4z_4j

and model [1.12] results. Whether a regression model is enlarged to accommodate a classification variable or a classification model is enlarged to accommodate a continuous covariate,
the same models result.

1.7.4 Univariate and Multivariate Models


Box 1.9 Univariate vs. Multivariate Models

• Univariate statistical models analyze one response at a time while multivariate models analyze several responses simultaneously.

• Multivariate outcomes can be measurements of different responses, e.g., canning quality, biomass, yield, maturity, or measurements of the same response at multiple points in time, space, or time and space.

Most experiments produce more than just a single response. Statistical models that model one
response independently of other experimental outcomes are called univariate models, where-
as multivariate models simultaneously model several response variables. Models with more
than one covariate are sometimes incorrectly termed multivariate models. The multiple linear
regression model Y_i = β_0 + β_1x_1i + β_2x_2i + e_i is a univariate model.
The advantage of multivariate over univariate models is that multivariate models incorpo-
rate the relationships between experimental outcomes into the analysis. This is particularly
meaningful if the multivariate responses are observations of the same attribute at different
locations or time points. When data are collected as longitudinal, repeated measures, or
spatial data, the temporal and spatial dependencies among the observations must be taken into
account (§2.5). In a repeated measures study, for example, this requires modeling the obser-
vations jointly, rather than through separate analyses by time points. By separately analyzing
the outcomes by year in a multi-year study, little insight is gained into the time-dependency of
the system.
Multivariate responses in this text are confined to the special case where the same
response variable is measured repeatedly, that is, longitudinal, repeated measures, and spatial
data. Developing statistical models for such data (§§ 7, 8, 9) requires a good understanding of
the notion and consequences of clustering in data (discussed in §2.4 and §7.1).

1.7.5 Fixed, Random, and Mixed Effects Models


Box 1.10 Fixed and Random Effects

• The distinction of fixed and random effects applies to the unknown model
components:
— a fixed effect is an unknown constant (does not vary),
— a random effect is a random variable.

• Random effects arise from subsampling, random selection of treatment levels, and hierarchical random processes, e.g., in clustered data (§2.4).

• Fixed effects model: All effects are fixed (apart from the error)

• Random effects model: All effects are random (apart from intercept)

• Mixed effects model: Some effects are fixed, others are random (not count-
ing an intercept and the model error)

The distinction between fixed, random, and mixed effects models is not related to the nature
of the covariates, but to the unknown quantities of the statistical model. In this text we assume
that covariate values are not associated with error. A fixed effects model contains only
constants in its systematic part and one random variable (the error term). The variance of the
error term measures residual variability. Most traditional regression models are of this type.
In designed experiments, fixed effects models arise when the levels of the treatments are
chosen deliberately by the researcher as the only levels of interest. A fixed effects model for a
randomized complete block design with one treatment factor implies that the blocks are predetermined as well as the factor levels.

A random effects model consists of random variables only, apart from a possible grand
mean. Such models arise when multiple random processes are in operation. Consider sampling two
hundred bags of seeds from a large seed lot. Fifty laboratories are randomly selected from a
list of laboratories, and each receives four bags for analysis of germination percentage and seed
purity. Upon repetition of this experiment a different set of laboratories would be
selected to receive different bags of seeds. Two random processes are at work. One source of
variability is due to selecting laboratories at random from the population of all possible
laboratories that could have performed the analysis. A second source of variability stems
from randomly determining which particular four bags of seeds are sent to a laboratory. This
variability is a measure of the heterogeneity within the seed lot, and the first source represents variability among laboratories. If the two random processes are independent, the
variance of a single germination test result is the sum of two variance components,

Var[Y_ij] = σ_l² + σ_b².

Here, σ_l² measures lab-to-lab variability and σ_b² the variability of test results within a lab (seed lot
heterogeneity). A statistical model for this experiment is

Y_ij = μ + α_i + e_ij,  i = 1, …, 50; j = 1, …, 4,

where α_i is a random variable with mean 0 and variance σ_l², e_ij is a random variable (independent of the α_i) with mean 0 and variance σ_b², and Y_ij is the germination percentage reported by the ith lab for the jth bag. The grand mean is expressed by μ, the true germination percentage of the lot. A fixed grand mean should always be included in random effects models
unless the response has zero average.

Mixed effects models arise when some of the model components are fixed while others
are random. A mixed model contains at least two random variables (counting the model errors
e) and two unknown constants in the systematic part (counting the grand mean). Mixed
models can be found in multifactor experiments where levels of some factors are predetermined while levels of other factors are chosen at random. If two levels of water stress (irrigated, not irrigated) are combined with six genotypes selected from a list of 30 possible genotypes, a two-factor mixed model results:

Y_ijk = μ + α_i + γ_j + (αγ)_ij + e_ijk.

Here, Y_ijk is the response of genotype j under water stress level i in replicate k. The α_i
denote the fixed effects of water stress, the γ_j the random effects of genotype with mean 0
and variance σ_γ². Interaction terms such as (αγ)_ij are random effects if at least one of the
factors involved in the interaction is a random factor. Here, (αγ)_ij is a random effect with
mean 0 and variance σ_αγ². The e_ijk finally denote the experimental errors. Mixed model structures also result when treatments are allocated to experimental units by separate randomizations. A split-plot design randomly allocates levels of the whole-plot factor to large experimental units (whole-plots) and, independently thereof, randomly allocates levels of one or
more other factors within the whole-plots. The two randomizations generate two types of
experimental errors, one associated with the whole-plots and one associated with the sub-plots. If
the levels of the whole- and sub-plot treatment factors are selected at random, the resulting
model is a random model. If the levels of at least one of the factors were predetermined, a
mixed model results.

In observational studies (see §2.3), mixed effects models have gained considerable popularity for longitudinal data structures (Jones 1993; Longford 1993; Diggle, Liang, and Zeger
1994; Vonesh and Chinchilli 1997; Gregoire et al. 1997; Verbeke and Molenberghs 1997;
Littell et al. 1996). Longitudinal data are measurements taken repeatedly on observational
units without the creation of experimental conditions by the experimenter. These units are
often termed subjects or clusters. In the absence of randomization, mixed effects arise
because some "parameters" of the model (slopes, intercepts) are assumed to vary at random
from subject to subject while other parameters remain constant across subjects. Chapter 7
discusses mixed models for longitudinal and repeated measures data in great detail. The
distinction between fixed and random effects and its bearing on data analysis and data
interpretation are also discussed there.
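A simulation sketch of the seed-testing example (Python; the variance values are hypothetical) illustrates both the variance sum Var[Y_ij] = σ_l² + σ_b² and the covariance that the shared lab effect induces between two bags tested by the same lab:

```python
import numpy as np

rng = np.random.default_rng(7)
mu, s2_lab, s2_bag = 90.0, 4.0, 9.0  # hypothetical variance components
n = 100_000                          # independent replicates of one lab with two bags

alpha = rng.normal(0.0, np.sqrt(s2_lab), n)  # lab effects alpha_i
e1 = rng.normal(0.0, np.sqrt(s2_bag), n)     # bag-to-bag error, bag 1
e2 = rng.normal(0.0, np.sqrt(s2_bag), n)     # bag-to-bag error, bag 2
y1 = mu + alpha + e1                         # two results from the same lab
y2 = mu + alpha + e2

print(np.var(y1, ddof=1), s2_lab + s2_bag)  # marginal variance: sigma_l^2 + sigma_b^2
print(np.cov(y1, y2)[0, 1], s2_lab)         # within-lab covariance: sigma_l^2
```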

1.7.6 Generalized Linear Models


Box 1.11 Generalized Linear Models (GLM)

• GLMs provide a unified framework in which Gaussian and non-Gaussian data can be modeled. They combine aspects of linear and nonlinear models. Covariate effects are linear on a transformed scale of the mean response. The transformations are usually nonlinear.

• Classical linear regression and classification models with Gaussian errors are special cases of generalized linear models.

Generalized linear models (GLMs) are statistical models for a large family of probability
distributions known as the exponential family (§6.2.1). This family includes such important
distributions as the Gaussian, Gamma, Chi-squared, Beta, Bernoulli, Binomial, and Poisson
distributions. We consider generalized linear models among the most important statistical
models today. The frequency (or lack thereof) with which they are applied to problems in the
plant and soil sciences belies their importance. They are based on work by Nelder and
Wedderburn (1972) and Wedderburn (1974), subsequently popularized in the monograph by
McCullagh and Nelder (1989). If responses are continuous, modelers typically resort to linear
or nonlinear statistical models of the kind

Y_i = f(x_0i, x_1i, x_2i, …, x_ki, θ_0, θ_1, …, θ_p) + e_i,

where it is assumed that the model residuals have zero mean, are independent, and have some
common variance Var[e_i] = σ². For purposes of parameter estimation, these assumptions are
usually sufficient. For purposes of statistical inference, such as the test of hypotheses or the
calculation of confidence intervals, distributional assumptions about the model residuals are
added. All too often, the errors are assumed to follow a Gaussian distribution. There are many
instances in which the Gaussian assumption is not tenable, for example, if the response is not
a continuous characteristic but a frequency count, or when the error distribution is clearly
skewed. Generalized linear models allow the modeling of such data when the response distribution is a member of the exponential family (§6.2.1). Since the Gaussian distribution is a
member of the exponential family, linear regression and analysis of variance methods are special cases of generalized linear models.
Besides non-Gaussian error distributions, generalized linear models utilize a model com-
ponent known as the link function. This is a transformation which maps the expected values
of the response onto a scale where covariate effects are additive. For a simple linear
regression model with Gaussian error,

$$Y_i = \beta_0 + \beta_1 x_i + e_i; \quad e_i \sim G(0, \sigma^2),$$

the expectation of the response is already linear. The applicable link function is the identity
function. Assume that we are concerned with a binary response, for example, whether a
particular plant disease is present or absent. The mean (expected value) of the response is the
probability $\pi$ that the disease occurs, and a model is sought that relates the mean to some
environmental factor $x$. It would be unreasonable to model this probability as a linear
function of $x$, $\pi = \beta_0 + \beta_1 x$. There is no guarantee that the predicted values are between $0$
and $1$, the only acceptable range for probabilities. A monotone function that maps values
between $0$ and $1$ onto the real line is the logit function

$$\eta = \ln\left\{\frac{\pi}{1-\pi}\right\}.$$

Rather than modeling $\pi$ as a linear function of $x$, it is the transformed value $\eta$ that is modeled
as a linear function of $x$,

$$\eta = \ln\left\{\frac{\pi}{1-\pi}\right\} = \beta_0 + \beta_1 x.$$

Since the logit function links the mean $\pi$ to the covariate, it is called the link function of the
model. For any given value of $\eta$, the mean response is calculated by inverting the
relationship,

$$\pi = \frac{1}{1+\exp\{-\eta\}} = \frac{1}{1+\exp\{-\beta_0 - \beta_1 x\}}.$$

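To make the link and inverse link concrete, here is a minimal sketch in Python. The coefficient values $\beta_0 = -2$ and $\beta_1 = 0.8$ are hypothetical, chosen only to illustrate how a linear predictor on the logit scale translates into probabilities that necessarily fall between 0 and 1.

```python
import numpy as np

def logit(pi):
    """Link function: maps a mean (probability) in (0, 1) onto the real line."""
    return np.log(pi / (1 - pi))

def inv_logit(eta):
    """Inverse link: maps a linear predictor back into (0, 1)."""
    return 1 / (1 + np.exp(-eta))

# hypothetical coefficients for the linear predictor eta = beta0 + beta1*x
beta0, beta1 = -2.0, 0.8
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

eta = beta0 + beta1 * x              # linear on the transformed (logit) scale
pi = inv_logit(eta)                  # probabilities, guaranteed to lie in (0, 1)
print(np.round(pi, 3))               # [0.119 0.231 0.401 0.599 0.769 0.881]
print(np.allclose(logit(pi), eta))   # True: the link recovers the linear predictor
```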
In a generalized linear model, a linear function of covariates is selected in the same way
as in a regression or classification model. Under a distributional assumption for the responses
and after selecting a link function, the unknown parameters can be estimated. There are
important differences between applying a link function and applying transformations such as
the arcsine, square root, or logarithmic transformations that are frequently used in statistical
work. The latter are applied to the individual responses $Y$ in order to achieve greater variance
homogeneity and/or symmetry, usually followed by a standard linear model analysis on the
transformed scale assuming Gaussian errors. In a generalized linear model the link function
transforms the mean response $E[Y]$, and the distributional properties of the response


are not changed. A Binomial random variable is analyzed as a Binomial random variable, a
Poisson random variable as a Poisson random variable.
Because $\eta$ is a linear function of the covariates, statistical inference about the model
parameters is straightforward. Tests for treatment main effects and interactions, for example,
are simple if the outcomes of an experiment with factorial treatment structure are analyzed as
a generalized linear model. They are much more involved if the model is a general nonlinear
model. Chapter 6 provides a thorough discussion of generalized linear models and numerous
applications. We mention in passing that statistical models appropriate for ordinal outcomes,
such as visual ratings of plant quality or injury, can be derived as extensions of generalized
linear models (see §6.5).

1.7.7 Errors in Variable Models


The distinction between response variable and covariates applied in this text implies that the
response is subject to variability and the covariates are not. When measuring the height and
biomass of a plant it is obvious, however, that both variables are subject to variability, even if
uncertainty stems from measurement error alone. If the focus of the investigation is to de-
scribe the association or strength of dependency between the two variables, methods of corre-
lation analysis can be employed. If, however, one wishes to describe or model biomass as a
function of height, treating one variable as fixed (the covariate) and the other as random (the
response) ignores the variability in the covariate.
We will treat covariates as fixed throughout, consistent with classical statistical models.
If covariate values are random, this approach remains intact in the following scenarios (Seber
and Wild 1989):

• Response $Y$ and covariate $X$ are both random and linked by a relationship
$f(x, \theta_1, \ldots, \theta_p)$. If the values of $X$ can be measured accurately and the measured value
is a realization of $X$, the systematic part of the model is interpreted conditionally on
the observed values of $X$. We can write this as

$$E[Y \,|\, X = x] = f(x, \theta_1, \ldots, \theta_p),$$

read as: conditional on observing covariate value $x$, the mean of $Y$ is $f(x, \theta_1, \ldots, \theta_p)$.

• Response $Y$ and covariate $X$ are both random and $f(\cdot)$ is an empirical relationship
developed by the modeler. Even if $X$ can only be measured with error, modeling
proceeds conditional on the observed values of $X$ and the same conditioning argument
applies: $E[Y \,|\, X = x] = f(x, \theta_1, \ldots, \theta_p)$.

To shorten notation we usually drop the conditioning on $X$ in expectation operations. If
the conditioning argument is not applied, an errors-in-variable model is called for. These
models are beyond the scope of this text. The interested reader is directed to the literature:
Berkson (1950); Kendall and Stuart (1961); Moran (1971); Fedorov (1974); Seber and Wild
(1989); Bunke and Bunke (1989); Longford (1993); Carroll, Ruppert, and Stefanski (1995).



Chapter 2

Data Structures

“Modern statisticians are familiar with the notions that any finite body of
data contains only a limited amount of information, on any point under
examination; that this limit is set by the nature of the data themselves, and
cannot be increased by any amount of ingenuity expended in their
statistical examination: that the statistician's task, in fact, is limited to the
extraction of the whole of the available information on any particular
issue.” (Fisher, R.A., The Design of Experiments, 4th ed., Edinburgh:
Oliver and Boyd, 1947, p. 39)

2.1 Introduction
2.2 Classification by Response Type
2.3 Classification by Study Type
2.4 Clustered Data
2.4.1 Clustering through Hierarchical Random Processes
2.4.2 Clustering through Repeated Measurements
2.5 Autocorrelated Data
2.5.1 The Autocorrelation Function
2.5.2 Consequences of Ignoring Autocorrelation
2.5.3 Autocorrelation in Designed Experiments
2.6 From Independent to Spatial Data — a Progression of Clustering



2.1 Introduction
The statistical model is an abstraction of a data-generating mechanism that captures those
features of the experimental process that are pertinent to a particular inquiry. The data
structure is part of the generating mechanism that a model must recognize. The pooled $t$-test
of §1.4, for example, is appropriate when independent random samples are obtained from two
groups (populations) and the response variable is Gaussian distributed. Two data structures fit
this scenario: (i) two treatments are assigned completely at random to $n_1$ and $n_2$ experimental
units, respectively, and a Gaussian distributed response is measured once on each unit; (ii)
two existing populations are identified which have a Gaussian distribution with the same
variance; $n_1$ individuals are randomly selected from the first and, independent thereof, $n_2$
individuals are randomly selected from the second population. Situation (i) describes a
designed experiment and situation (ii) an observational study. Either situation can be analyzed
with the same statistical model

$$Y_{ij} = \begin{cases} \mu_1 + e_{ij} & i = 1 \text{ (group 1)} \\ \mu_1 + \delta + e_{ij} & i = 2 \text{ (group 2)} \end{cases}, \qquad e_{ij} \sim \text{iid } G(0, \sigma^2).$$

In other instances the statistical model must be altered to accommodate a different data
structure. Consider the Mitscherlich or linear-plateau model of §1.2. We tacitly assumed there
that observations corresponding to different levels of N input were independent. If the 21 N
levels are randomly assigned to some experimental units, this assumption is reasonable. Even
if there are replications of each N level the models fitted in §1.2 still apply if the replicate
observations for a particular N level are averaged. Now imagine the following alteration of
the experimental protocol. Each experimental unit receives 0 kg/ha N at the beginning of the
study, and every few days 5 kg/ha are added until eventually all experimental units have
received all of the 21 N levels. Whether the statistical model remains valid and, if not, how it
needs to be altered depends on the changes in the data structure that have been incurred by
the protocol alteration.
A data structure comprises three key aspects. The response type (e.g., continuous or
discrete, §2.2), the study type (e.g., designed experiment vs. observational study, §2.3), and
the degree of data clustering (the hierarchical nature of the data, §2.4). Agronomists are
mostly familiar with statistical models for continuous response from designed experiments.
The statistical models underpinning the analyses are directed by the treatment, error control,
and observational design and require comparatively little interaction between user and model.
The temptation to apply the same types of analyses, i.e., the same types of statistical models,
in other situations, is understandable and may explain why analysis of proportions or ordinal
data by analysis of variance methods is common. But if the statistical model reflects a
mechanism that cannot generate data with the same pertinent features as the data at hand, if it
generates a different kind of data, how can inferences based on these models be reliable?
The analytical task is to construct models from the classes in §1.7 that represent appropri-
ate generating mechanisms. Discrete response data, for example, will lead to generalized
linear models, continuous responses with nonlinear mean function will lead to nonlinear
models. Hierarchical structures in the data, for example, from splitting experimental units,
often call for mixed model structures. The powerful array of tools with which many (most?)
data analytic problems in the plant and soil sciences can be tackled is attained by combining



the modeling elements. A discrete outcome such as the number of infected plants observed in
a split-plot design will automatically lead to generalized linear models of the mixed variety
(§8). A continuous outcome for georeferenced data will lead to a spatial random field model
(§9).
The main chapters of this text are organized according to response types and levels of
clustering. Applications within each chapter reflect different study types. Models appropriate
for continuous responses are discussed in §4, §5, §7 to 9. Models for discrete and other non-
normal responses appear in §§6, 8. The statistical models in §§4 to 6 apply to uncorrelated
(nonclustered) data. Chapters 7 and 8 apply to single- or multiple-level hierarchical data.
Chapter 9 applies to a very special case of clustered data structures, spatial data.

2.2 Classification by Response Type


There are numerous classifications of random variables and almost any introductory statistics
book presents its own variation. The variation we find most useful is given below. For the
purpose of this text, outcomes (random variables) are classified as discrete or continuous,
depending on whether the possible values of the variable (its support) are countable or not
(Figure 2.1). The support of a continuous random variable $Y$ is not countable, which implies
that if $a$ and $b$ ($a < b$) are two points in the support of $Y$, an infinite number of possible
values can be placed between them.
The support of a discrete random variable is always countable; it can be enumerated,
even if it is infinitely large. The number of earthworms under a square meter of soil does not
have a theoretical upper limit, although one could claim that worms cannot be packed
arbitrarily dense. The support can nevertheless be enumerated: $\{0, 1, 2, \ldots\}$. A discrete
variable is observed (measured) by counting the number of times a particular value in the
support occurs. If the support itself consists of counts, the discrete variable is called a count
variable. The number of earthworms per square meter or the number of seeds germinating
out of 100 seeds are examples of count variables. Among count variables a further distinction
is helpful. Some counts can be converted into proportions because they have a natural
denominator. An outcome in the seed count experiment can be reported as “$Y$ seeds out of $n$
germinated” or as a proportion: “The proportion of germinated seeds is $Y/n$.” We term such
count variables counts with a natural denominator. Counts without a natural denominator
cannot be converted into proportions. The number of poppies per square meter or the number
of chocolate chips in a cookie are examples. The distinction between the two types of count
variables is helpful because a reasonable probability model for counts with a natural
denominator is often the Binomial distribution, and counts without a natural denominator are
often modeled as Poisson variables.
If the discrete support consists of labels that indicate membership in a category, the
variable is termed categorical. The presence or absence of a disease, the names of nematode
species, and plant quality ratings are categorical outcomes. The support of a categorical
variable may consist of numbers but continuity of the variable is not implied. The visual
assessment of turfgrass quality on a scale from 1 to 9 creates an ordinal rather than a
continuous variable. Instead of the numbers 1 through 9, the labels “a” through “i” may be
used. The two labeling systems represent the same ordering. Using numbers to denote



category labels, so-called scoring, is common and gives a false sense of continuity that may
result in the application of statistical models to ordered data which are designed for
continuous data. A case in point is the fitting of analysis of variance models to ordinal scores.

[Figure 2.1 is a classification tree:]

Variable (Response) Types
— Continuous: number of possible values not countable; distance between values well-defined.
— Discrete: number of possible values countable.
    — Counts: support consists of true counts.
    — Categorical: support consists of labels, possibly numbers.
        — Nominal: unordered labels.
        — Ordinal: labels are ordered.
        — Binary: two labels.

Figure 2.1. Classification of response types.

There is a transition from discrete to continuous variables, which is best illustrated using
proportions. Consider counting the number of plants $X$ out of a total of $k$ plants that die after
application of an herbicide. Since both $X$ and $k$ are integers, the support of $Y$, the proportion
of dead plants, is discrete:

$$\left\{0, \frac{1}{k}, \frac{2}{k}, \ldots, \frac{k-1}{k}, 1\right\}.$$

As $k$ increases so does the number of elements in the support, and provided $k$ is sufficiently
large, it can be justified to consider the support infinitely large and no longer countable. The
discrete proportion is then treated for analytic purposes as a continuous variable.



2.3 Classification by Study Type

Box 2.1 Study Types

• Designed experiment: Conditions (treatments) are applied by the experimenter and the
principles of experimental design (replication, randomization, blocking) are observed.

• Comparative experiment: A designed experiment where changes in the conditions
(treatments) are examined as the cause of a change in the response.

• Observational study: Values of the covariates (conditions) are merely observed, not
applied. If conditions are applied, design principles are not or cannot be followed.
Inferences are associative rather than causal because of the absence of experimental
control.

• Comparative study: An observational study examining whether changes in the
conditions can be associated with changes in the response.

• Validity of inference is derived from random allocation of treatments to experimental
units in designed experiments and random sampling of the population in observational
studies.

The two fundamental situations in which data are gathered are the designed experiment and
the observational study. Control over experimental conditions and deliberate varying of these
conditions is occasionally cited as the defining feature of a designed experiment (e.g.,
McPherson 1990, Neter et al. 1990). Observational studies then are experiments where condi-
tions are beyond the control of the experimenter and covariates are merely observed. We do
not fully agree with this delineation of designed and observational experiments. Application
of treatments is an insufficient criterion for design and the existence of factors not controlled
by the investigator does not rule out a designed experiment, provided uncontrolled effects can
be properly neutralized via randomization. Unless the principles of experimental design, ran-
domization, replication, and across-unit homogeneity (blocking) are observed, data should
not be considered generated by a designed experiment. This narrow definition is necessary
since designed experiments are understood to lead to cause-and-effect conclusions rather than
associative interpretations. Experiments are usually designed as comparative experiments
where a change in treatment levels is to be shown to be the cause of changes in the response.
Experimental control must be exercised properly, which implies that (i) treatments are
randomly allocated to experimental units to neutralize the effects of uncontrolled factors; (ii)
treatments are replicated to allow the estimation of experimental error variance; and (iii)
experimental units are grouped into homogeneous blocks prior to treatment application to
eliminate controllable factors that are related to the response. The only negotiable of the three
principles is that of blocking. If a variable by which the experimental units should be blocked
is not taken into account, the experimental design will lead to unbiased estimates of treatment



differences and error variances provided randomization is applied. The resulting design may
be inefficient with a large experimental error component and statistical tests may be lacking
power. Inferences remain valid, however. The completely randomized design (CRD) which
does not involve any blocking factors can indeed be more efficient than a randomized
complete block design if the experimental material is homogeneous. The other principles are
not negotiable. If treatments are replicated but not randomly assigned to experimental units,
the basis for causal inferences has been withdrawn and the data must not be viewed as if they
had come from a designed experiment. The data should be treated as observational.
An observational study in the narrow sense produces data where the values of covariates
are merely observed, not assigned. Typical examples are studies where a population is
sampled and along with a response, covariates of immediate or potential interest are observed.
In the broader sense we include experiments where the factors of interest have not been ran-
domized and experiments without proper replication. The validity of inferences in obser-
vational studies derives from random sampling of the population, not random assignment of
treatments. Conclusions in observational studies are narrower than in designed experiments,
since a dependence of the response on covariates does not imply that a change in the value of
the covariate will cause a corresponding change in the value of the response. The variables
are simply associated with each other in the particular data set. It cannot be ruled out that
other effects caused the changes in the response and confound inferences.

[Figure 2.2 shows three field layouts (north at the top); each panel is a 4 × 2 grid of plots:]

        (a)          (b)          (c)
      A1  A2       A2  A1       A1  A2
      A1  A2       A1  A2       A2  A2
      A1  A2       A2  A1       A1  A1
      A1  A2       A2  A1       A2  A1

Figure 2.2. Three randomizations of two treatments $(A_1, A_2)$ in four replications. (a) and (b)
are complete random assignments, whereas (c) restricts each treatment to appear exactly twice
in the east and twice in the west strip of the experimental field.

In a designed experiment, a single randomization is obtained, based on which the experi-
ment is performed. It is sometimes remarked that in the particular randomization obtained,
treatment differences may be confounded with nontreatment effects. For example, in a
completely randomized design it is possible, although unlikely, that the $r$ replications of
treatment $A_1$ come to lie in the westernmost part of the experimental area and the replications
of treatment $A_2$ are found in the easternmost areas (Figure 2.2a). If wind or snowdrift induce
an east-west effect on the outcome, treatment differences will be confounded with the effects
of snow accumulation. One may be more comfortable with a more balanced arrangement
(Figure 2.2b). If the experimenter feels uncomfortable with the result of a randomization, then
most likely the chosen experimental design is inappropriate. If the experimental units are truly
homogeneous, treatments can be assigned completely at random and the actual allocation
obtained should not matter. If one's discomfort is the reflection of an anticipated east-west
effect, the complete random assignment is not appropriate; the east-west effect can be
controlled by blocking (Figure 2.2c).
The effect of randomization cannot be gleaned from the experimental layout obtained.
One must envision the design under repetition of the assignment procedure. The neutralizing
effect of randomization does not apply to a single experiment, but is an average effect across
all possible arrangements of treatments to experimental units under the particular design.
While blocking eliminates the effects of systematic factors in any given design, randomiza-
tion neutralizes the unknown effects in the population of all possible designs. Randomization
is the means to estimate treatment differences and variance components without bias. Replica-
tion, the independent assignment of treatments, enables estimation of experimental error
variance. Replication alone does not lead to unbiased estimates of treatment effects or experi-
mental error, a fact that is often overlooked.

2.4 Clustered Data

Box 2.2 Clusters

• A cluster is a collection of observations that share a stochastic, temporal, spatial, or
other association that suggests treating them as a group.

• While dependencies and interrelations may exist among the units within a cluster, it is
often reasonable to treat observations from different clusters as independent.

• Clusters commonly arise from
— hierarchical random sampling (nesting of sampling units),
— hierarchical random assignment (splitting),
— repeated observations taken on experimental or sampling units.

Clustering of data refers to the hierarchical structure in data. It is an important feature of the
data-generating mechanism and as such it plays a critical role in formulating statistical
models, in particular mixed models and models for spatial data. In general, a cluster repre-
sents a collection of observations that are somehow stochastically related, whereas observa-
tions from different clusters are typically independent (stochastically unrelated). The two pri-
mary situations that lead to clustered data structures are (i) hierarchical random processes and
(ii) repeated measurements (Figure 2.3). Grouping of observations into sequences of repeated
observations collected on the same entity or subject has long been recognized as a clustered
data structure. Clustering through hierarchical random processes such as subsampling or split-
ting of experimental units also gives rise to hierarchical data structures.



[Figure 2.3 is a classification tree:]

Clustered Data
— Hierarchical Random Processes (units within a cluster are uncorrelated):
    — subsampling,
    — splitting,
    — random selection of blocks or treatments.
— Repeated Measurements / Longitudinal Data (units within a cluster are correlated), ordered
    — in time,
    — in space,
    — in time and space,
    — or along some other metric.

Figure 2.3. Frequently encountered situations that give rise to clustered data.

2.4.1 Clustering through Hierarchical Random Processes


Observational studies as well as designed experiments can involve subsampling. In the for-
mer, the level-one units are sampled from a population and contain multiple elements. A ran-
dom sample of these elements, the subsamples, constitute the observational data. For
example, forest stands are selected at random from a list of stands and within each stand a
given number of trees are selected. The forest stand is the cluster and the trees selected within
are the cluster elements. These share a commonality, they belong to the same stand (cluster).
Double-tubed steel probes are inserted in a truckload of wheat kernels and the kernels trapped
in the probe are extracted. The contents of the probe are well-mixed and two 100-g
subsamples are selected and submitted for assay to determine deoxynivalenol concentration.
The probes are the clusters and the two 100-g samples are the within-cluster observational
units. A subsampling structure can also be cast as a nested structure, in which the subsamples
are nested within the level-one sampling units. In designed experiments subsampling is
common when experimental units cannot be measured in their entirety. An experimental unit
may contain eight rows of crop but only a subsample thereof is harvested and analyzed. The
experimental unit is the cluster. In contrast to the observational study, the first random pro-
cess is not the random selection of the experimental unit, but the random allocation of the
treatments to the experimental unit.
Splitting, the process of assigning levels of one factor within experimental units of
another factor, also gives rise to clustered data structures. Consider two treatment factors $A$
and $B$ with $a$ and $b$ levels, respectively. The $a$ levels of factor $A$ are assigned to experimental
units according to some standard experimental design, e.g., a completely randomized design,
a randomized complete block design, etc. Each experimental unit for factor $A$, the whole-
plots, is then split (divided) into $b$ experimental units (the sub-plots) to which the levels of
factor $B$ are randomly assigned. This process of splitting and assigning the $B$ levels is carried
out independently for each whole-plot. Each whole-plot constitutes a cluster and the sub-plot
observations for factor $B$ are the within-cluster units. This process can be carried out to
another level of clustering when the sub-plots are split to accommodate a third factor, i.e., as
a split-split-plot design.
The third type of clustering through hierarchical random processes, the random selection
of block or treatment levels, is not as obvious, but it shares important features with the
subsampling and splitting procedures. When treatment levels are randomly selected from a
list of possible treatments, the treatment effects in the statistical model become random
variables. The experimental errors, also random variables, then are nested within another
random effect. To compare the three hierarchical approaches, consider the case of a designed
experiment. Experiment ① is a completely randomized design (CRD) with subsampling,
experiment ② is a split-plot design with the whole-plot factor arranged in a CRD, and experi-
ment ③ is a CRD with the treatments selected at random. The statistical models for the three
designs are

① $Y_{ijk} = \mu + \tau_i + e_{ij} + d_{ijk}$, $\quad e_{ij} \sim \text{iid}(0, \sigma^2_e)$; $\; d_{ijk} \sim \text{iid}(0, \sigma^2_d)$

② $Y_{ijk} = \mu + \alpha_i + e_{ij} + \beta_k + (\alpha\beta)_{ik} + d_{ijk}$, $\quad e_{ij} \sim \text{iid}(0, \sigma^2_e)$; $\; d_{ijk} \sim \text{iid}(0, \sigma^2_d)$

③ $Y_{ij} = \mu + \tau_i + e_{ij}$, $\quad \tau_i \sim \text{iid}(0, \sigma^2_\tau)$; $\; e_{ij} \sim \text{iid}(0, \sigma^2_e)$.

In all models the $e_{ij}$ denote experimental errors. In ② the $e_{ij}$ are the whole-plot experimental
errors and the $d_{ijk}$ are the sub-plot experimental errors. In ① the $d_{ijk}$ are subsampling (obser-
vational) errors. The $\tau_i$ in ③ are the random treatment effects. Regardless of the type of
design, the error terms in all three models are independent by virtue of the random selection
of observational units or the random assignment of treatments. Every model contains two ran-
dom effects where the second effect has one more subscript. The clusters are formed by the
$e_{ij}$ in ① and ② and by the $\tau_i$ in ③. While the within-cluster units are uncorrelated, it turns
out that the responses $Y_{ijk}$ and $Y_{ij}$ are not necessarily independent of each other. For two sub-
samples from the same experimental unit in ① and two observations from the same whole-
plot in ②, we have $\text{Cov}[Y_{ijk}, Y_{ijk'}] = \sigma^2_e$. For two replicates of the same treatment in ③, we
obtain $\text{Cov}[Y_{ij}, Y_{ij'}] = \sigma^2_\tau$. This is an important feature of statistical models with multiple,
nested random effects. They induce correlations of the responses (§7.5.1).
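A small simulation can make the induced correlation visible. The following sketch (Python) generates data from a structure like model ①; the variance components $\sigma^2_e = 4$ and $\sigma^2_d = 1$ and the number of simulated units are illustrative assumptions only. The empirical covariance between two subsamples of the same unit should approach $\sigma^2_e$.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2_e, sigma2_d = 4.0, 1.0        # assumed variance components (illustration)
n_units, n_sub = 100_000, 2          # experimental units, subsamples per unit

# model (1): Y_ijk = mu + tau_i + e_ij + d_ijk; mu and the fixed tau_i cancel
# in the within-unit covariance, so only e_ij and d_ijk need to be simulated
e = rng.normal(0.0, np.sqrt(sigma2_e), size=(n_units, 1))    # shared within unit
d = rng.normal(0.0, np.sqrt(sigma2_d), size=(n_units, n_sub))
y = e + d                            # broadcasting repeats e_ij for each subsample

# covariance between two subsamples of the same unit: ~ sigma2_e = 4
print(np.cov(y[:, 0], y[:, 1])[0, 1])
# correlation: ~ sigma2_e / (sigma2_e + sigma2_d) = 0.8 (see §2.5.1)
print(np.corrcoef(y[:, 0], y[:, 1])[0, 1])
```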

2.4.2 Clustering through Repeated Measurements


A different type of clustered data structure arises if the same observational or experimental
unit is measured repeatedly. An experiment where one randomization of treatments to experi-
mental units occurs at the beginning of the study and the units are observed repeatedly
through time without re-randomization is termed a repeated measures study. If the data are
observational in nature, we prefer the term longitudinal data. The cluster is then formed by
the collection of repeated measurements. Characteristic of both is the absence of a random
assignment or selection within the cluster. Furthermore, the repeat observations from a given
cluster usually are autocorrelated (§2.5). It is not necessary that the repeated measurements
are collected over time, although this is by far the most frequent case. It is only required that
the cluster elements in longitudinal and repeated measures data be ordered along some metric.
This metric may be temporal, spatial, or spatiotemporal. The cluster elements in the following
example are arranged spatially along transects.

Example 2.1. The phosphate load of pastures under rotational grazing is investigated
for two forage species, alfalfa and birdsfoot trefoil. Since each pasture contains a
watering hole where heifers may concentrate, Bray-1 soil P (Bray-P1) is measured at
various distances from the watering hole. The layout of the observational units for one
replicate of the alfalfa treatment is shown in Figure 2.4.

[Figure 2.4 sketches Replicate II of the alfalfa treatment: three rays (Ray 1, Ray 2, Ray 3)
radiate from a watering hole, with soil sampling points spaced 10 ft apart along each ray.]

Figure 2.4. Alfalfa replicate in soil P study. The replicate serves as the cluster for the
three rays along which soil samples are collected in 10-ft spacing. Each ray is a cluster
of seven soil samples.

The measurements along each ray are ordered along a spatial metric. Similarly, the
distances between measurements on different rays within a replicate are defined by the
Euclidean distance between any two points of soil sampling.

A critical difference between this design and a subsampling or split-plot design is the
systematic arrangement of elements within a cluster. This lack of randomization is particu-
larly apparent in longitudinal or repeated measures structures in time. A common practice is
to analyze repeated measures data as if it arises from a split-plot experiment. The repeated
measurements made on the experimental units of a basic design such as a randomized block
design are assumed to constitute a split of these units. This practice is not appropriate unless



the temporal measurements are as exchangeable (permutable) as treatment randomizations.
Since the flow of time is unidirectional, this is typically not the case.
Longitudinal studies where the cluster elements are arranged by time are frequent in the
study of growth processes. For example, plants are randomly selected and their growth is re-
corded repeatedly over time. An alternative data collection scheme would be to randomly
select plants at different stages of development and record their current yield, a so-called
cross-sectional study. If subjects of different ages are observed at one point in time, the dif-
ferences in yield between these subjects do not necessarily represent true growth due to the
entanglement of age and cohort effect. The term cohort was coined in the social sciences and
describes a group of individuals which differs from other individuals in a systematic fashion.
For example, the cohort of people in their twenties today is not necessarily comparable to the
cohort of twenty year olds in 1950. The problem of confounding age and cohort effects
applies to data in the plant and soil sciences similarly.
In a long-term ecological research project one may be interested in the long-term effects
of manure management according to guidelines on soil phosphorus levels. The hypothesis is
that management according to guidelines will reduce the Bray-P1 levels over time. A cross-
sectional study would identify $n$ sites that have been managed according to guidelines for a
varied number of years and analyze (model) the Bray-P1 levels as a function of time in
management. A longitudinal study identifies $n$ sites managed according to guidelines and
follows them for several years. The sequence of observations over time on a given site in the
longitudinal study shows directly how that site changes in Bray-P1 phosphorus. In order to
conclude that changes over time in Bray-P1 are due to adherence to management guidelines
in the cross-sectional study, cohort effects must be eliminated or assumed absent. That is, the
cohort of sites that were in continued manure management for five years at the beginning of
the study are assumed to develop in the next seven years to the Bray-P1 levels exhibited by
those sites that were in continued manure management for twelve years at the onset of the
study.
The most meaningful approach to measuring the growth of a subject is to observe the
subject at different points in time. All things being equal (ceteris paribus), the error-free
change will represent true growth. The power of longitudinal and repeated measures data lies
in the efficient estimation of growth without confounding of cohort effects. In the vernacular
of longitudinal data analysis this is phrased as “each cluster [subject] serves as its own
control.” Deviations are not measured relative to the overall trend across all clusters, the
population average, but relative to the trend specific to each cluster.
The within-cluster elements of longitudinal and repeated measures data can be ordered in
a nontemporal fashion, for example spatially (e.g., Example 2.1). On occasion, spatial and
temporal metrics are intertwined. Following Gregoire and Schabenberger (1996a, 1996b) we
analyze data from tree stem analysis where the diameter of a tree is recorded in three-foot
intervals in §8.4.1. Tree volume is modeled as a function of the diameter to which the tree has
tapered at a given measurement interval. The diameter can be viewed as a proxy for a spatial
metric, the distance above ground, or a temporal metric, the age of the tree at which a certain
measurement height was achieved.
A repeated measured data structure with spatiotemporal metric is discussed in the follow-
ing example.



Example 2.2. Pierce et al. (1994) discuss a tillage experiment where four tillage strate-
gies (Moldboard plowing; Plowing Spring 1987 and Fall 1996; Plowing Spring 1986,
Spring 1991, Fall 1995; No-Tillage) were arranged in a randomized complete block
design with four blocks. Soil characteristics were obtained April 10, 1987, April 22,
1991, July 26, 1996, and November 20, 1997. At each time point, soil samples were
extracted from four depths: 0 to 2, 2 to 4, 4 to 6, and 6 to 8 inches. At a given time
point the data have a repeated measures structure with spatial metric. For a given
sampling depth, the data have a repeated measures structure with temporal metric. The
combined data have a repeated measures structure with spatiotemporal metric. Figure
2.5 depicts the sample mean pH across the spatiotemporal metric for the No-Tillage
treatment.
[Figure 2.5 plots mean pH (vertical axis, roughly 4.5 to 6.5) against the spatiotemporal
metric for sampling dates 10Apr87, 22Apr91, 26Jul96, and 20Nov97, with separate
profiles for depths 0-2", 2-4", 4-6", and 6-8" and the mean across depths, No-Tillage
treatment.]

Figure 2.5. Depth × Time sample mean pH for the No-Tillage treatment. Cross-hairs
depict sample mean pH at a given depth.

While in longitudinal studies, focus is on modeling the population-average or cluster-spe-


cific trends as a function of time, space, etc., in repeated measures studies, interest shifts to
examining whether treatment comparisons depend on time and/or space. If the trends in treat-
ment response over time are complex, or the number of re-measurements large, modeling the
trends over time similar to a longitudinal study and comparing those among treatments may
be a viable approach to analysis (see Rasse, Smucker, and Schabenberger 1999 for an
example).
In studies where data are collected over long periods of time, it invariably happens that
data become unbalanced. At certain measurement times not all sampling or experimental units
can be observed. Data are considered missing if observations intended for collection are
inaccessible, lost, or unavailable. Experimental units are lost to real estate sales, animals die,
patients drop out of a study, a lysimeter is destroyed, etc. Missing data implies unplanned im-



balance but the reverse is not necessarily true. Unbalanced data contain missing values only if
one intended to measure the absent observations. If the process that produced the missing
observation is related to the unobserved outcome, conclusions based on data that ignore the
missing value process are biased. Consider the study of soil properties under bare soil and a
cover crop in a repeated measures design. Due to drought the bare soil cannot be drilled at
certain sampling times. The absence of the observations is caused by soil properties we
intended to measure. The missing values mask the effect we are hoping to detect, and ig-
noring them will bias the comparison between cover crop and bare soil treatments. Now
assume that a technician applying pesticides on an adjacent area accidentally drives through
one of the bare soil replicates. Investigators are forced to ignore the data from the particular
replicate because soil properties have been affected by trafficking. In this scenario the unavai-
lability of the data from the trafficked replicate is not related to the characteristic we intended
to record. Ignoring data from the replicate will not bias conclusions about the treatments. It
will only lower the power of the comparisons due to reduced sample size.
Little and Rubin (1987) classify missing value mechanisms into three categories. If the
probability of a value being missing is unrelated to both the observed and unobserved data,
missingness is termed completely at random. If missingness does not relate to the unobserved
data (but possibly to the observed data) it is termed random, and if it is dependent on the
unobserved data as in the first scenario above, it is informative. Informative missingness is
troublesome because it biases the results. Laird (1988) calls random missingness ignorable
missingness because it does not negatively affect certain statistical estimation methods (see
also Rubin 1976, Diggle et al. 1994). When analyzing repeated measures or longitudinal data
containing missing values we hope that the missing value mechanism is at least random.
Missingness is then ignorable.
The underlying metric of a longitudinal or repeated measures data set can be continuous
or discrete and measurements can be equally or unequally spaced. Some processes are
inherently discrete when measurements cannot be taken at arbitrary points in time. Jones
(1993) cites the recording of the daily maximum temperature which yields only one observa-
tion per day. If plant leaf temperatures are measured daily for four weeks at 12:00 p.m., how-
ever, the underlying time scale is continuous since the choice of measurement time was
arbitrary. One could have observed the temperatures at any other time of day. The measure-
ments are also equally spaced. Equal or unequal spacing of measurements does not imply
discreteness or continuity of the underlying metric. If unequal spacing is deliberate (planned),
the metric is usually continuous. There is some ambiguity, however. In Example 2.2 the
temporal metric has four entries. One could treat this as unequally spaced yearly measure-
ments or equally spaced daily measurements with many missing values (3,874 missing values
between April 10, 1987 and November 20, 1997). The former representation clearly is more
useful. In the study of plant growth, measurements will be denser during the growing season
compared to periods of dormancy or lesser biological activity. Unequal spacing is neither
good nor bad. Measurements should be collected when it is most meaningful. The statistical
models for analyzing repeated measures and longitudinal data should follow suit. The mixed
models for clustered data discussed in §§7 and 8 are models of this type. They can
accommodate unequal spacing without difficulty as well as (random) missingness of obser-
vations. This is in sharp contrast to more traditional models for repeated measures analysis,
e.g., repeated measures analysis of variance, where the absence of a single temporal
observation for an experimental unit results in the loss of all repeated measurements for that
unit, a wasteful proposition.



2.5 Autocorrelated Data

Box 2.3 Autocorrelation

The correlation $\rho_{xy}$ between two random variables measures the strength (and
direction) of the (linear) dependency between $X$ and $Y$.

• If $X$ and $Y$ are two different variables, $\rho_{xy}$ is called the product-moment correlation
coefficient. If $X$ and $Y$ are observations of the same variable at different times or
locations, $\rho_{xy}$ is called the autocorrelation coefficient.

• Autocorrelation is the correlation of a variable in time or space with itself. While
estimating the product-moment correlation requires $n$ pairs of observations
$(x_1, y_1), \ldots, (x_n, y_n)$, autocorrelation is determined from a single (time) sequence
$(x_1, x_2, \ldots, x_t)$.

• Ignoring autocorrelation (treating data as if they were independent) distorts statistical
inferences.

• Autocorrelations are usually positive, which implies that an above-average value is
likely to be followed by another above-average value.

2.5.1 The Autocorrelation Function

Unless elements of a cluster are selected or assigned at random, the responses within a cluster
are likely to be correlated. This type of correlation does not measure the (linear) dependency
between two different attributes, but between different values of the same attribute. Hence the
name autocorrelation. When measurements are collected over time it is also referred to as
serial correlation, a term that originated in the study of time series data (Box, Jenkins, and
Reinsel 1994). The (product-moment) correlation coefficient between random variables $X$
and $Y$ is defined as

$$\text{Corr}[X, Y] = \rho_{xy} = \frac{\text{Cov}[X, Y]}{\sqrt{\text{Var}[X]\text{Var}[Y]}}, \qquad [2.1]$$

where $\text{Cov}[X, Y]$ denotes the covariance between the two random variables. The coefficient
ranges from $-1$ to $1$ and measures the strength of the linear dependency between $X$ and $Y$. It
is related to the coefficient of determination ($R^2$) in a linear regression of $Y$ on $X$ (or $X$ on
$Y$) by $R^2 = \rho_{xy}^2$. A positive value of $\rho_{xy}$ implies that an above-average value of $X$ is likely to
be paired with an above-average value of $Y$. A negative correlation coefficient implies that
above-average values of $X$ are paired with below-average values of $Y$. Autocorrelation coef-
ficients are defined in the same fashion, a covariance divided by the square root of a variance
product. Instead of two different variables $X$ and $Y$, the covariance and variances pertain to



the same attribute measured at two different points in time, two different points in space, and
so forth.

Focus on the temporal case for the time being and let $Y(t_1), \ldots, Y(t_n)$ denote a sequence
of observations collected at times $t_1$ through $t_n$. Denote the mean at time $t_i$ as $E[Y(t_i)] = \mu(t_i)$.
The covariance between two values in the sequence is given by

$$\text{Cov}[Y(t_i), Y(t_j)] = E[(Y(t_i) - \mu(t_i))(Y(t_j) - \mu(t_j))] = E[Y(t_i)Y(t_j)] - \mu(t_i)\mu(t_j).$$

The variance of the sequence at time $t_i$ is $\text{Var}[Y(t_i)] = E\!\left[(Y(t_i) - \mu(t_i))^2\right]$ and the auto-
correlation between observations at times $t_i$ and $t_j$ is measured by the correlation

$$\text{Corr}[Y(t_i), Y(t_j)] = \frac{\text{Cov}[Y(t_i), Y(t_j)]}{\sqrt{\text{Var}[Y(t_i)]\text{Var}[Y(t_j)]}}. \qquad [2.2]$$

If data measured repeatedly are uncorrelated, they should scatter around the trend over
time in an unpredictable, nonsystematic fashion (open circles in Figure 2.6). Data with posi-
tive autocorrelation show long runs of positive or negative residuals. If an observation is
likely to be below (above) average at some time point, it was likely to be below (above)
average in the immediate past. While negative product-moment correlations between random
variables are common, negative autocorrelation is fairly rare. It would imply that above
(below) average values are likely to be preceded by below (above) average values. This is
usually an indication of an incorrectly specified mean function; for example, a circadian
rhythm or seasonal fluctuation was omitted.

[Figure 2.6 plots residuals (vertical axis, $-1.2$ to $1.2$) against time (0 to 20) for the two
series described in the caption.]

Figure 2.6. Autocorrelated and independent data with the same dispersion, $\sigma^2 = 0.3$. Open
circles depict independent observations, closed circles observations with positive autocorrela-
tion. Data are shown as deviations from the mean (residuals). A run of negative residuals for
the autocorrelated data occurs between times 6 and 14.



In a temporal sequence with $n$ observations there are possibly $n$ different variances and
$n(n-1)/2$ different covariances. The number of unknowns in the autocorrelation structure is
larger than the number of observations. To facilitate estimation of the autocorrelation struc-
ture, stationarity assumptions are common in time series. Similar assumptions are made for
longitudinal and repeated measures data (§7) and for spatial data (§9). Details of stationarity
properties will be discussed at length in the pertinent chapters; here we note only that a
(second-order) stationary time series has reached a form of equilibrium. The variances of the
observations are constant and do not depend on the time point. The strength of the autocorre-
lation does not depend on the origin of time, only on elapsed time. In a stationary time series
the correlation between observations two days apart is the same, whether the first day is a
Monday or a Thursday. Under these conditions the correlation [2.2] can be expressed as

$$\text{Corr}[Y(t_i), Y(t_j)] = \frac{C(|t_i - t_j|)}{\sigma^2}, \qquad [2.3]$$

where $\sigma^2 = \text{Var}[Y(t_i)]$. The function $C(\cdot)$ is called the (auto-)covariance function of the
process and $h_{ij} = |t_i - t_j|$ is the time lag between $Y(t_i)$ and $Y(t_j)$. Since $\text{Var}[Y(t_i)] =
\text{Cov}[Y(t_i), Y(t_i)] = C(0)$, the (auto-)correlation function in a series can be expressed for
any time lag $h$ as

$$R(h) = \frac{C(h)}{C(0)}. \qquad [2.4]$$
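Under stationarity, $R(h)$ can be estimated from a single series by the usual moment estimator. The sketch below (Python) does so for a simulated first-order autoregressive series; the autoregression coefficient 0.6 and the series length are assumptions chosen for illustration.

```python
import numpy as np

def acf(y, max_lag):
    """Moment estimator of R(h) = C(h)/C(0) for a second-order stationary series."""
    y = np.asarray(y, dtype=float)
    n, ybar = len(y), y.mean()
    c0 = np.sum((y - ybar) ** 2) / n
    return np.array([np.sum((y[: n - h] - ybar) * (y[h:] - ybar)) / (n * c0)
                     for h in range(max_lag + 1)])

# simulate an AR(1)-type series with positive autocorrelation (illustrative)
rng = np.random.default_rng(2)
y = np.zeros(500)
for t in range(1, len(y)):
    y[t] = 0.6 * y[t - 1] + rng.normal()

print(np.round(acf(y, 5), 2))   # roughly 1.00, 0.60, 0.36, 0.22, ... for rho = 0.6
```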

The autocorrelation function is an important device in analyzing clustered data. In some
applications it is modeled implicitly (indirectly), that is, the formulation of the model implies
a particular correlation function. In other applications it is modeled explicitly (directly), that
is, the analyst develops a model for $R(h)$ (or $C(h)$) in the same way that a model for the
mean structure is built. Implicit specification of the autocorrelation function is typical for
clustered data structures that arise through hierarchical random processes (Figure 2.3). In the
case of the split-plot design of §2.4 we have for observations from the same whole-plot

$$\text{Cov}[Y_{ijk}, Y_{ijk'}] = \sigma^2_e, \qquad \text{Var}[Y_{ijk}] = \sigma^2_e + \sigma^2_d,$$

so that the autocorrelation function is

$$R(h) = \frac{\sigma^2_e}{\sigma^2_e + \sigma^2_d} = \rho.$$

The correlations among observations from the same whole-plot are constant. The function
$R(\cdot)$ does not depend on which particular two sub-plot observations are considered. This
structure is known as the compound-symmetric or exchangeable correlation structure. The
process of randomization makes the ordering of the sub-plot units exchangeable. If repeated
measures experiments are analyzed as split-plot designs it is implicitly assumed that the auto-
correlation structure of the repeated measure is exchangeable.
With temporal data it is usually more appropriate to assume that correlations (covarian-
ces) decrease with increasing temporal separation. The autocorrelation function $R(h)$
approaches $0$ as the time lag $h$ increases. When modeling $R(h)$ or $C(h)$ directly, we rely on
models for the covariance function that behave in a way the user deems reasonable. The
following models are some popular candidates of covariance functions where correlations
decrease with lag ($\rho > 0$, $\alpha > 0$):

$$C(h) = C(0)\rho^{|i-j|}$$
$$C(h) = C(0)\rho^{|t_i - t_j|}$$
$$C(h) = C(0)\exp\{-|t_i - t_j|^2/\alpha^2\}.$$
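The following sketch implements these candidate covariance functions in Python. The parameter values ($C(0) = 1$, $\rho = 0.5$, $\alpha = 3$) are assumptions for illustration; they show the geometric decay of the first two models and the slower initial decay of the third.

```python
import numpy as np

def cov_geometric(h, c0, rho):
    """C(h) = C(0) * rho**|i-j| for integer lags (rho > 0)."""
    return c0 * rho ** np.abs(h)

def cov_exponential(dt, c0, rho):
    """C(h) = C(0) * rho**|t_i - t_j| for a continuous time lag."""
    return c0 * rho ** np.abs(dt)

def cov_gaussian(dt, c0, alpha):
    """C(h) = C(0) * exp(-|t_i - t_j|**2 / alpha**2) (alpha > 0)."""
    return c0 * np.exp(-np.abs(dt) ** 2 / alpha ** 2)

lags = np.arange(6)
print(np.round(cov_exponential(lags, c0=1.0, rho=0.5), 3))  # halves at every lag
print(np.round(cov_gaussian(lags, c0=1.0, alpha=3.0), 3))   # slow decay near h = 0
```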

2.5.2 Consequences of Ignoring Autocorrelation

Analyzing data that exhibit positive autocorrelation as if they were uncorrelated has se-
rious implications. In short, estimators are typically inefficient although they may remain un-
biased. Estimates of the precision of the estimators (the estimated standard errors) can be
severely biased. The precision of the estimators is usually overstated, resulting in test statis-
tics that are too large and $p$-values that are too small. These are ramifications of “pretending
to have more information than there really is.” We demonstrate this effect with a simple
example. Assume that two treatments (groups) are to be compared. For each group a series of
positively autocorrelated observations is collected, for example, a repeated measure. The
observations are thus correlated within a group but are uncorrelated between the groups. For
simplicity we assume that both groups are observed at the same points in time $t_1, \ldots, t_n$, that
both groups have the same (known) variances, and that the observations within a group are
equicorrelated and Gaussian-distributed. In essence this setup is a generalization of the two-
sample $z$-test for Gaussian data with correlations among the observations from a group. In
parlance of §2.4 there are two clusters with $n$ observations each.

Let $Y_1(t_1), \ldots, Y_1(t_n)$ denote the $n$ observations from group 1 and $Y_2(t_1), \ldots, Y_2(t_n)$ the
observations from group 2. The distributional assumptions can be expressed as

$$Y_1(t_i) \sim G(\mu_1, \sigma^2), \qquad Y_2(t_i) \sim G(\mu_2, \sigma^2)$$
$$\text{Cov}[Y_1(t_i), Y_1(t_j)] = \begin{cases} \sigma^2\rho & i \neq j \\ \sigma^2 & i = j \end{cases}$$
$$\text{Cov}[Y_1(t_i), Y_2(t_j)] = 0.$$

We are interested in comparing the means of the two groups, $H_0\!: \mu_1 - \mu_2 = 0$. First
assume that the correlations are ignored. Then one would estimate $\mu_1$ and $\mu_2$ by the respec-
tive sample means

$$\overline{Y}_1 = \frac{1}{n}\sum_{i=1}^{n} Y_1(t_i), \qquad \overline{Y}_2 = \frac{1}{n}\sum_{i=1}^{n} Y_2(t_i)$$

and determine their variances to be $\text{Var}[\overline{Y}_1] = \text{Var}[\overline{Y}_2] = \sigma^2/n$. The test statistic (assuming
$\sigma^2$ known) for $H_0\!: \mu_1 - \mu_2 = 0$ is

$$Z^*_{obs} = \frac{\overline{Y}_1 - \overline{Y}_2}{\sigma\sqrt{2/n}}.$$



If the within-group correlations are taken into account, $\overline{Y}_1$ and $\overline{Y}_2$ remain unbiased estimators
of $\mu_1$ and $\mu_2$, but their variance is no longer $\sigma^2/n$. Instead,

$$\text{Var}[\overline{Y}_1] = \frac{1}{n^2}\text{Var}\!\left[\sum_{i=1}^{n} Y_1(t_i)\right] = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\text{Cov}[Y_1(t_i), Y_1(t_j)]$$
$$= \frac{1}{n^2}\left\{\sum_{i=1}^{n}\text{Var}[Y_1(t_i)] + \sum_{i=1}^{n}\sum_{\substack{j=1 \\ j \neq i}}^{n}\text{Cov}[Y_1(t_i), Y_1(t_j)]\right\}$$
$$= \frac{1}{n^2}\left\{n\sigma^2 + n(n-1)\sigma^2\rho\right\} = \frac{\sigma^2}{n}\{1 + (n-1)\rho\}.$$

The same expression is derived for $\text{Var}[\overline{Y}_2]$. The variance of the sample means is larger (if
$\rho > 0$) than $\sigma^2/n$, the variance of the sample mean in the absence of correlations. The preci-
sion of the estimate of the group difference, $\overline{Y}_1 - \overline{Y}_2$, is overestimated by the multiplicative
factor $1 + (n-1)\rho$. The correct test statistic is

$$Z_{obs} = \frac{\overline{y}_1 - \overline{y}_2}{\sigma\sqrt{\frac{2}{n}\{1 + (n-1)\rho\}}} < Z^*_{obs}.$$

The incorrect test statistic will be too large, and the $p$-value of that test will be too small. One
may declare the two groups as significantly different, whereas the correct test may fail to find
significant differences in the group means. The evidence against the null hypothesis has been
overstated by ignoring the correlation.
Another way of approaching this issue is in terms of the effective sample size. One can
ask: “How many samples of the uncorrelated kind provide the same precision as a sample of
correlated observations?” Cressie (1993, p. 15) calls this the equivalent number of indepen-
dent observations. Let $n$ denote the number of samples of the correlated kind and $n'$ the equi-
valent number of independent observations. The effective sample size is calculated as

$$n' = n\,\frac{\text{Var}[\overline{Y} \text{ assuming independence}]}{\text{Var}[\overline{Y} \text{ under autocorrelation}]} = \frac{n}{1 + (n-1)\rho}.$$

Ten observations equicorrelated with $\rho = 0.3$ provide as much information as $2.7$ inde-
pendent observations. This seems like a hefty penalty for having correlated data. Recall that
the group mean difference was estimated as the difference of the arithmetic sample means,
$\overline{Y}_1 - \overline{Y}_2$. Although this difference is an unbiased estimate of the group mean difference
$\mu_1 - \mu_2$, it is an efficient estimate only in the case of uncorrelated data. If the correlations are
taken into account, more efficient estimators of $\mu_1 - \mu_2$ are available and the increase in
efficiency works to offset the smaller effective sample size.
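The arithmetic of this example is easy to reproduce. The sketch below (Python) uses the assumed values $\sigma^2 = 1$, $n = 10$, $\rho = 0.3$ and a hypothetical observed mean difference of $0.9$ to contrast the naive and corrected test statistics and to compute the effective sample size.

```python
import numpy as np

def var_mean_equicorr(sigma2, n, rho):
    """Var of the mean of n equicorrelated observations: (sigma2/n)*(1 + (n-1)*rho)."""
    return sigma2 / n * (1 + (n - 1) * rho)

def effective_n(n, rho):
    """Equivalent number of independent observations."""
    return n / (1 + (n - 1) * rho)

sigma2, n, rho = 1.0, 10, 0.3        # assumed values for illustration
diff = 0.9                           # hypothetical observed group mean difference

se_naive = np.sqrt(2 * sigma2 / n)                           # ignores the correlation
se_correct = np.sqrt(2 * var_mean_equicorr(sigma2, n, rho))

print(diff / se_naive)      # Z*_obs ~ 2.01: nominally "significant" (|z| > 1.96)
print(diff / se_correct)    # Z_obs  ~ 1.05: not significant
print(effective_n(n, rho))  # ~ 2.7 equivalent independent observations
```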
When predicting new observations based on a statistical model with correlated errors, it
turns out that the correlations enhance the predictive ability. Geostatistical kriging methods,
for example, utilize the spatial autocorrelation among observations to predict an attribute of
interest at a location where no observation has been collected (§9.4). The autocorrelation
between the unobserved attribute and the observed values allows one to glean information
about the unknown value from surrounding values. If data were uncorrelated, surrounding
values would not carry any stochastic information.

2.5.3 Autocorrelation in Designed Experiments


Serial, spatial, or other autocorrelation in designed experiments is typically due to an absence
of randomization. Time can be a randomized treatment factor in an experiment, for example,
when only parts of an experimental unit are harvested at certain points in time and the order
in which the parts are harvested is randomized. In this case time is a true sub-plot treatment
factor in a split-plot design. More often, however, time effects refer to repeated measurements
of the same experimental or observational units and cannot be randomized. The second
measurement is collected after the first and before the third measurement. The obstacle to ran-
domization is the absence of a conveyance for time-travel. Randomization restrictions can
also stem from mechanical or technical limitations in the experimental setup. In strip-planter
trials the order of the strips is often not randomized. In line sprinkler experiments, water is
emitted from an immobile source through regularly spaced nozzles and treatments are
arranged in strips perpendicular to the line source. Hanks et al. (1980), for example, describe
an experiment where three cultivars are arranged in strips on either side of the sprinkler. The
irrigation level, a factor of interest in the study, decreases with increasing distance from the
sprinkler and cannot be randomized. Along the strip of a particular cultivar one would expect
spatial autocorrelations.
A contemporary approach to analyzing designed experiments derives validity of infer-
ence not from randomization alone, but from information about the spatial, temporal, and
other dependencies that exist among the experimental units. Randomization neutralizes
spatial dependencies at the scale of the experimental unit, but not on smaller or larger scales.
Fertility trends that run across an experimental area and are not eliminated by blocking are
neutralized by randomization if one considers averages across all possible experimental
arrangements. In a particular experiment, these spatial effects may still be present and
increase error variance. Local trends in soil fertility, competition among adjacent plots, and
imprecision in treatment application (drifting spray, subsurface movement of nutrients, etc.)
create effects which increase variability and can bias unadjusted treatment contrasts. Stroup et
al. (1994) note that combining adjacent experimental units into blocks in agricultural variety
trials can be at odds with an assumption of homogeneity within blocks when more than eight
to twelve experimental units are grouped. Spatial trends will then be removed only incom-
pletely. An analysis which explicitly takes into account the spatial dependencies of experimental
units may be preferable to an analysis which relies on conceptual random assignments
that did not actually occur. A simple approach to account for spatial within-block effects is
based on analyzing differences of nearest-neighbors in the field layout. These methods origi-
nated with Papadakis (1937) and were subsequently modified and improved upon (see, for
example, Wilkinson et al. 1983, Besag and Kempton 1986). Gilmour et al. (1997) discuss
some of the difficulties and pitfalls when differencing data to remove effects of spatial varia-
tion. Grondona and Cressie (1991) developed spatial analysis of variance derived from ex-
pected mean squares where expectations are taken over the distribution induced by treatment
randomization and a spatial process. We prefer modeling techniques rooted in random field
theory (§9.5) to analyze experimental data in which spatial dependencies were not fully
removed or neutralized (Zimmerman and Harville 1991, Brownie et al. 1993, Stroup et al.
1994, Brownie and Gumpertz 1997, Gilmour et al. 1997, Gotway and Stroup 1997, Verbyla
et al. 1999). This is a modeling approach that considers the spatial process as the driving
force behind the data. Whether a spatial model will provide a more efficient analysis than a
traditional analysis of variance depends on the extent to which the components of the spatial
process are conducive to modeling. We agree with Besag and Kempton (1986) that many
agronomic experiments are not carried out in a sophisticated manner. The reasons may be
convenience, unfamiliarity of the experimenter with more complex design choices, or tradition.
Bartlett (1938, 1978a) views analyses that emphasize the spatial context over the design
context as ancillary devices to salvage efficiency in experiments that could have been designed
more appropriately. We do not agree with this view of spatial analyses as mere salvage tools.
Neither, however, should one conclude that randomization and replication are superfluous
because treatment comparisons can be recovered through spatial analysis tools. By switching
from the classical design-based analysis to an analysis based on spatial modeling, inferences
become conditional on the correctness of the model, and the ability to draw causal conclusions
is lost. There is no free lunch.
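
As a highly simplified illustration of the nearest-neighbor (Papadakis) idea discussed above, the following Python sketch adjusts plot values by the mean residual of adjacent plots. Everything here (the one-row layout, the fabricated data, the single-pass adjustment) is an assumption of this sketch; an actual analysis would embed the covariate in a proper analysis of covariance and possibly iterate.

```python
import numpy as np

rng = np.random.default_rng(1)
treat = np.tile([0, 1, 2, 3], 4)                 # 16 plots in one row, 4 treatments
trend = np.cumsum(rng.normal(0.0, 0.3, 16))      # smooth spatial trend along the row
y = 10.0 + 0.5 * treat + trend + rng.normal(0.0, 0.2, 16)

# Residuals from the unadjusted treatment means
means = np.array([y[treat == t].mean() for t in range(4)])
res = y - means[treat]

# Papadakis covariate: mean residual of the immediate field neighbors
x = np.array([np.mean([res[j] for j in (i - 1, i + 1) if 0 <= j < 16])
              for i in range(16)])

# One-step covariance adjustment: regress the residuals on the covariate
b = np.sum((x - x.mean()) * res) / np.sum((x - x.mean()) ** 2)
adj_means = np.array([(y - b * x)[treat == t].mean() for t in range(4)])
print(adj_means)   # treatment means after removing local spatial effects
```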

2.6 From Independent to Spatial Data — a Progression of Clustering

Box 2.4 Progression of Clustering

• Clustered data can be decomposed into $k$ clusters of size $n_k$, where
$n = \sum_{i=1}^{k} n_i$ is the total number of observations and $n_i$ is the size of the $i$th
cluster.

• Spatial data is a special case of clustered data where the entire data set
comprises a single cluster.
— Unclustered data: $k = n$, $n_i = 1$
— Longitudinal data: $k < n$, $n_i > 1$
— Spatial data: $k = 1$, $n_1 = n$.

As far as correlation in the data is concerned, this text considers three prominent types of data
structures. Models for independent (uncorrelated) data that do not exhibit clustering are
covered in Chapters 4 to 6; statistical models for clustered data, where observations from
different clusters are uncorrelated but correlations within a cluster are possible, are considered
in Chapters 7 and 8. Models for spatial data are discussed in §9. Although the statistical tools
can differ greatly from chapter to chapter, there is a natural progression in these three data
structures. Before we can make this progression more precise, a few introductory comments
about spatial data are in order. More detailed coverage of spatial data types and their
underlying stochastic processes is deferred until §9.1.
A data set is termed spatial if, along with the attribute of interest, $Y$, the spatial locations
of the attributes are recorded. Let $\mathbf{s}$ denote the vector of coordinates at which $Y$ was
observed. If we restrict discussion for the time being to observations collected in the plane,
then $\mathbf{s}$ is a two-dimensional vector containing the longitude and latitude. In a time series we
can represent the observation collected at time $t$ as $Y(t)$. Similarly, for spatial data we
represent the observation as $Y(\mathbf{s})$. A univariate data set of $n$ spatial observations thus consists
of $Y(\mathbf{s}_1), Y(\mathbf{s}_2), \ldots, Y(\mathbf{s}_n)$. The three main types of spatial data, geostatistical data, lattice
data, and point pattern data, are distinguished according to characteristics of the set of
locations. If the underlying space is continuous and samples can be gathered at arbitrary
points, the data are termed geostatistical. Yield monitoring a field or collecting soil samples
produces geostatistical data. The soil samples are collected at a finite number of sites but
could have been gathered anywhere within the field. If the underlying space is made up of
discrete units, the spatial data type is termed lattice data. Recording events by city block,
county, region, soil type, or experimental unit in a field trial, yields lattice data. Lattices can
be regular or irregular. Satellite pixel images also give rise to spatial lattice data as pixels are
discrete units. No observations can be gathered between two pixels. Finally, if the set of
locations is itself random, the data are termed a point pattern. For geostatistical and lattice
data the locations at which data are collected are not considered random, even if they are
randomly placed. Whether data are collected on a systematic grid or by random sampling has
no bearing on the nature of the spatial data type. With point patterns we consider the locations
at which events occur as random; for example, when the location of disease-resistant plants is
recorded.
Modeling techniques for spatial data are concerned with separating the variability in $Y(\mathbf{s})$
into components due to large-scale trend, smooth-scale variation, microscale variation, and
measurement error (see §9.3). The smooth- and microscale variation in spatial data is stochastic
variation, hence random. Furthermore, we expect observations at two spatial locations $\mathbf{s}_i$ and $\mathbf{s}_j$ to
exhibit more similarity if the locations are close and less similarity if $\mathbf{s}_i$ is far from $\mathbf{s}_j$. This
phenomenon is often cited as Tobler's first law of geography: “Everything is related to every-
thing else, but near things are more related than distant things” (Tobler 1970). Spatial data are
autocorrelated and the degree of spatial dependency typically decreases with increasing
separation. For a set of spatial data $Y(\mathbf{s}_1), Y(\mathbf{s}_2), \ldots, Y(\mathbf{s}_n)$, Tobler's law implies that all
covariances $\mathrm{Cov}[Y(\mathbf{s}_i), Y(\mathbf{s}_j)]$ are potentially nonzero. In contrast to clustered data one cannot
identify a group of observations where members of the group are correlated but are independent
of members of other groups. This is where the progression of clustering comes in. Let $k$
denote the number of clusters in a data set and $n_i$ the number of observations for the $i$th
cluster. The total size of the data is then

$$n = \sum_{i=1}^{k} n_i.$$

Consider a square field with sixteen equally spaced grid points. Soil samples are collected at
the sixteen points and analyzed for soil organic matter (SOM) content (Figure 2.7a). Al-
though the grid points are regularly spaced, the observations are a realization of geostatistical
data, not lattice data. Soil samples could have been collected anywhere in the field. If, how-
ever, a grid point is considered the centroid of a rectangular area represented by the point,
these would be lattice data. Because the data are spatial, every SOM measurement might be
correlated with every other measurement. If a cluster is the collection of observations that are
potentially correlated, the SOM data comprise a single cluster ($k = 1$) of size $n = 16$.
Panels (b) and (c) of Figure 2.7 show two experimental design choices to compare four
treatments. A completely randomized design with four replications is arranged in panel (b)
and a completely randomized design with two replications and two subsamples per
experimental unit in panel (c). The data in panel (b) represent uncorrelated observations or $k = 16$
clusters of size $n_i = 1$ each. The subsampling design is a clustered design in the narrow
sense, because it involves two hierarchical random processes. Treatments are assigned
completely at random to the experimental units such that each treatment is replicated twice.
Then two subsamples are selected from each unit. These data comprise $k = 8$ clusters of size
$n_i = 2$. There will be correlations among the observations from the same experimental unit
because they share the same experimental error term (see §2.4.1).
[Figure 2.7 appears here: three square field layouts plotted on coordinates S1 and S2 (0 to 16).
Panel (a): $k = 1$, $n_1 = 16$; panel (b): $k = 16$, $n_i = 1$, $i = 1, \ldots, 16$;
panel (c): $k = 8$, $n_i = 2$, $i = 1, \ldots, 8$.]
Figure 2.7. Relationship between clustered and spatial data. Panel a) shows the grid points at
which a spatial data set for soil organic matter is collected. A completely randomized design
(CRD) with four treatments and four replicates is shown in (b), and a CRD with two repli-
cates of four treatments and two subsamples per experimental unit is shown in (c).

Statistical models that can accommodate all three levels of clustering are particularly
appealing (to us). The mixed models of §7 and §8 have this property. They reduce to standard
models for uncorrelated data when each observation represents a cluster by itself, and to
(certain) models for spatial data when the entire data set is considered a single cluster. It is
reasonable to assume that the SOM data in Figure 2.7a are correlated while the CRD data in
Figure 2.7b are independent, because they appeal to different data generating mechanisms:
randomization in Figure 2.7b and a stochastic process (a random field) in Figure 2.7a.
Increasingly, agronomists are faced with more than a single data structure in a given experiment.
Some responses may be continuous, others discrete. Some data have a design context, others a
spatial context. Some data are longitudinal, some are cross-sectional. The organization of the
remainder of this book by response type and level of clustering is shown in Figure 2.8.

[Figure 2.8 appears here; its content is equivalent to the following table.]

                                             Level of Clustering
  Type of Response                           Independent     Clustered   Spatial
                                             (unclustered)
  Continuous, Gaussian assumption
  reasonable, linear                         §4              §7          §9
  Continuous, Gaussian assumption
  reasonable, nonlinear                      §5              §8          §9
  Discrete, or continuous but
  non-Gaussian                               §6              §8          §9

Figure 2.8. Organization of chapters by response type and level of clustering in data.



Chapter 3

Linear Algebra Tools

“The bottom line for mathematicians is that the architecture has to be


right. In all the mathematics that I did, the essential point was to find the
right architecture. It's like building a bridge. Once the main lines of the
structure are right, then the details miraculously fit. The problem is the
overall design.” Freeman Dyson. In Interview with Donald J. Albers (
“Freeman Dyson: Mathematician, Physicist, and Writer”). The College
Mathematics Journal, vol. 25, no. 1, January 1994.

“It is easier to square the circle than to get round a mathematician.”


Augustus De Morgan. In Eves, H., Mathematical Circles, Boston:
Prindle, Weber and Schmidt, 1969.

3.1 Introduction
3.2 Matrices and Vectors
3.3 Basic Matrix Operations
3.4 Matrix Inversion — Regular and Generalized Inverse
3.5 Mean, Variance, and Covariance of Random Vectors
3.6 The Trace and Expectation of Quadratic Forms
3.7 The Multivariate Gaussian Distribution
3.8 Matrix and Vector Differentiation
3.9 Using Matrix Algebra to Specify Models
3.9.1 Linear Models
3.9.2 Nonlinear Models
3.9.3 Variance-Covariance Matrices and Clustering


3.1 Introduction
The discussion of the various statistical models in the following chapters requires linear
algebra tools. The reader familiar with the basic concepts and operations such as matrix addi-
tion, subtraction, multiplication, transposition, inversion and the expectation and variance of a
random vector may skip this chapter. For the reader unfamiliar with these concepts or in need
of a refresher, this chapter provides the necessary background and tools. We have compiled
the most important rules and results needed to follow the notation and mathematical opera-
tions in subsequent chapters. Texts by Graybill (1969), Rao and Mitra (1971), Searle (1982),
Healy (1986), Magnus (1988), and others provide many more details about matrix algebra
useful in statistics. Specific techniques such as Taylor series expansions of vector-valued
functions are introduced as needed.
Without using matrices and vectors, mathematical expressions in statistical model inference
quickly become unwieldy. Consider a two-way $(a \times b)$ factorial treatment structure in a
randomized complete block design with $r$ blocks. The experimental error (residual) sum of
squares in terms of scalar quantities is obtained as

$$SSR = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{r}\left(y_{ijk} - \bar{y}_{i..} - \bar{y}_{.j.} - \bar{y}_{..k} + \bar{y}_{ij.} + \bar{y}_{i.k} + \bar{y}_{.jk} - \bar{y}_{...}\right)^2,$$

where $y_{ijk}$ denotes an observation for level $i$ of factor $A$, level $j$ of factor $B$ in block $k$, and $\bar{y}_{i..}$ is
the sample mean of all observations for level $i$ of $A$, and so forth. If only one treatment factor
is involved, the experimental error sum of squares formula becomes

$$SSR = \sum_{i=1}^{t}\sum_{j=1}^{r}\left(y_{ij} - \bar{y}_{i.} - \bar{y}_{.j} + \bar{y}_{..}\right)^2.$$

Using matrices and vectors, the residual sum of squares can be written in either case as
$SSR = y'(I - H)y$ for a properly defined vector $y$ and matrix $H$.

Consider the following three linear regression models:

$$Y_i = \beta_0 + \beta_1 x_i + e_i$$
$$Y_i = \beta_1 x_i + e_i$$
$$Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + e_i,$$

a simple linear regression, a straight-line regression through the origin, and a quadratic
polynomial. If the least squares estimates are expressed in terms of scalar quantities, we get for the
simple linear regression model the formulas

$$\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x}, \qquad \hat\beta_1 = \sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x}) \Big/ \sum_{i=1}^{n}(x_i - \bar{x})^2,$$

in the case of the straight line through the origin,

$$\hat\beta_1 = \sum_{i=1}^{n}y_i x_i \Big/ \sum_{i=1}^{n}x_i^2,$$

and

$$\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x} - \hat\beta_2\bar{z}$$

$$\hat\beta_1 = \frac{\sum_{i=1}^{n}(z_i - \bar{z})y_i \sum_{i=1}^{n}(x_i - \bar{x})(z_i - \bar{z}) \;-\; \sum_{i=1}^{n}(x_i - \bar{x})y_i \sum_{i=1}^{n}(z_i - \bar{z})^2}{\left(\sum_{i=1}^{n}(x_i - \bar{x})(z_i - \bar{z})\right)^2 \;-\; \sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(z_i - \bar{z})^2}$$

for the quadratic polynomial $(z_i = x_i^2)$. By properly defining matrices $X$, $Y$, $e$, and $\beta$, we can
write any of these models as $Y = X\beta + e$, and the least squares estimates are

$$\hat\beta = (X'X)^{-1}X'y.$$

Matrix algebra allows us to efficiently develop and discuss the theory and methods of statistical
models.
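
To make the economy of the matrix formulation concrete, here is a minimal numerical sketch (Python with NumPy assumed; the data values are fabricated). The scalar formulas above and the matrix formula $(X'X)^{-1}X'y$ return identical estimates:

```python
import numpy as np

# Hypothetical simple linear regression data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(y)

# Scalar formulas for the simple linear regression
b1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Matrix formula: beta-hat = (X'X)^{-1} X'y
X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)

print(b0, b1)    # scalar estimates
print(beta)      # identical estimates from the one-line matrix formula
```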

3.2 Matrices and Vectors


Box 3.1 Matrices and Vectors

• A matrix is a two-dimensional array of real numbers.

• The size of a matrix (its order) is specified as (rows × columns).

• A vector is a one-dimensional array of real numbers. Vectors are special


cases of matrices with either a single row (row vector) or a single column
(column vector).

• Matrices are denoted by uppercase bold-faced letters, vectors with


lowercase bold-faced letters. Vectors of random variables are an exception.
The random vectors are denoted by boldface uppercase letters and the
realized values by boldface lowercase letters.

In this text matrices are rectangular arrays of real numbers (we do not consider complex
numbers). The size of a matrix is called its order and written as (row × column). A $(3 \times 4)$
matrix has three rows and four columns. When referring to individual elements of a matrix, a
double subscript denotes the row and column of the matrix in which the element is located.
For example, if $A$ is a $(n \times k)$ matrix, $a_{35}$ is the element positioned in the third row, fifth
column. We sometimes write $A = [a_{ij}]$ to show that the individual elements of $A$ are the $a_{ij}$.
If it is necessary to explicitly identify the order of $A$, a subscript is used, for example, $A_{(n \times k)}$.
Matrices (and vectors) are distinguished from scalars with bold-face lettering. Matrices
with a single row or column dimension are called vectors. A vector is usually denoted by a
lowercase boldface letter, e.g., z, y. If uppercase lettering such as Z or Y is used for vectors, it
is implied that the elements of the vector are random variables. This can cause a little con-
fusion. Is X a vector of random variables or a matrix of constants? To reduce confusion as


much as possible the following rules (and exceptions) are adopted:
• X always refers to a matrix of constants, never to a random vector.
• Z also refers to a matrix of constants, except when we consider spatial data, where the
vector Z has a special meaning in terms of $\mathbf{Z}(\mathbf{s})$. If Z appears by itself without the
argument $(\mathbf{s})$, it denotes a matrix of constants.
• An exception from the uppercase/lowercase notation is made for the vector b, which is
a vector of random variables in mixed models (§7).
• If Y denotes a vector of random variables, then y denotes its realized values. y is thus a
vector of constants.
• Boldface Greek lettering is used to denote vectors of parameters, e.g., $\theta$ or $\beta$. When
placing a caret ($\hat{\ }$) or tilde ($\tilde{\ }$) over the Greek letter, we refer to an estimator or
estimate of the parameter vector. An estimator is a random variable; for example,
$(X'X)^{-1}X'Y$ is random since it depends on the random vector $Y$. An estimate is a
constant; the estimate $(X'X)^{-1}X'y$ is the realized value of the estimator $(X'X)^{-1}X'Y$.

An $(n \times 1)$ vector is also called a column vector; $(1 \times k)$ vectors are referred to as row
vectors. By convention all vectors in this text are column vectors unless explicitly stated
otherwise. Any matrix can be partitioned into a series of column vectors. $A_{(n \times k)}$ can be
thought of as the horizontal concatenation of $k$ $(n \times 1)$ column vectors: $A = [a_1, a_2, \ldots, a_k]$.
For example,

$$A_{(5 \times 4)} = \begin{bmatrix} 1 & 2.3 & 0 & 1 \\ 1 & 9.0 & 1 & 0 \\ 1 & 3.1 & 0 & 1 \\ 1 & 4.9 & 1 & 0 \\ 1 & 3.2 & 1 & 0 \end{bmatrix} \qquad [3.1]$$

consists of 4 column vectors

$$a_1 = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}, \quad a_2 = \begin{bmatrix} 2.3 \\ 9.0 \\ 3.1 \\ 4.9 \\ 3.2 \end{bmatrix}, \quad a_3 = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 1 \\ 1 \end{bmatrix}, \quad a_4 = \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}.$$

$A$ can also be viewed as a vertical concatenation of row vectors,

$$A = \begin{bmatrix} \alpha_1' \\ \alpha_2' \\ \alpha_3' \\ \alpha_4' \\ \alpha_5' \end{bmatrix},$$

where $\alpha_1' = [1, 2.3, 0, 1]$.
We now define some special matrices encountered frequently throughout the text.

• $1_n$: the unit vector; an $(n \times 1)$ vector whose elements are 1.

• $J_n$: the unit matrix of size $n$; an $(n \times n)$ matrix consisting of ones everywhere, e.g.,

$$J_5 = [1_5, 1_5, 1_5, 1_5, 1_5] = \begin{bmatrix} 1&1&1&1&1 \\ 1&1&1&1&1 \\ 1&1&1&1&1 \\ 1&1&1&1&1 \\ 1&1&1&1&1 \end{bmatrix}.$$

• $I_n$: the identity matrix of size $n$; an $(n \times n)$ matrix with ones on the diagonal and
zeros elsewhere,

$$I_5 = \begin{bmatrix} 1&0&0&0&0 \\ 0&1&0&0&0 \\ 0&0&1&0&0 \\ 0&0&0&1&0 \\ 0&0&0&0&1 \end{bmatrix}.$$

• $0_n$: the $(n \times 1)$ zero vector; a column vector consisting of zeros only.

• $0_{(n \times k)}$: an $(n \times k)$ matrix of zeros.

If the order of these matrices is obvious from the context, the subscripts are omitted.
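
These special matrices are trivial to construct in software. A minimal NumPy sketch (the library choice is an assumption of this illustration):

```python
import numpy as np

n = 5
ones_vec = np.ones((n, 1))     # 1_n, the unit vector
J = np.ones((n, n))            # J_n, the unit matrix (all ones)
I = np.eye(n)                  # I_n, the identity matrix
zero_vec = np.zeros((n, 1))    # 0_n, the zero vector

# J_n can also be built from unit vectors: J_n = 1_n 1_n'
assert np.array_equal(J, ones_vec @ ones_vec.T)
```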

3.3 Basic Matrix Operations


Box 3.2 Basic Operations

• Basic matrix operations are addition, subtraction, multiplication,
transposition, and inversion.

• The transpose of a matrix $A_{(n \times k)}$ is obtained by exchanging its rows and
columns (row 1 becomes column 1, etc.). It is denoted with a single quote
(the transpose of $A$ is denoted $A'$) and has order $(k \times n)$.

• The sum of two matrices is the elementwise sum of their elements. The
difference between two matrices is the elementwise difference between their
elements.

• Two matrices $A$ and $B$ are conformable for addition and subtraction if they
have the same order, and for multiplication A*B if the number of columns in
$A$ equals the number of rows in $B$.

Basic matrix operations are addition, subtraction, multiplication, transposition, and inversion.
Matrix inversion is discussed in §3.4, since it requires some additional comments. The
transpose of a matrix A is obtained by exchanging its rows and columns. The first row
becomes the first column, the second row the second column, and so forth. It is denoted by
attaching a single quote (′) to the matrix symbol. Symbolically, $A' = [a_{ji}]$. The $(5 \times 4)$ matrix
$A$ in [3.1] has transpose

$$A' = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 2.3 & 9.0 & 3.1 & 4.9 & 3.2 \\ 0 & 1 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 & 0 \end{bmatrix}.$$

The transpose of a column vector $a$ is a row vector, and vice versa:

$$a' = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}' = [a_1, a_2, \ldots, a_n].$$

Transposing a matrix changes its order from $(n \times k)$ to $(k \times n)$, and transposing a transpose
produces the original matrix, $(A')' = A$.

The sum of two matrices is the elementwise sum of their elements. The difference between
two matrices is the elementwise difference of their elements. For these operations to succeed,
the matrices being added or subtracted must be of identical order; they must be
conformable for addition. Symbolically,

$$A_{(n \times k)} + B_{(n \times k)} = [a_{ij} + b_{ij}], \qquad A_{(n \times k)} - B_{(n \times k)} = [a_{ij} - b_{ij}]. \qquad [3.2]$$

Define the following matrices

$$A = \begin{bmatrix} 1 & 2 & 0 \\ 3 & 1 & -3 \\ 4 & 1 & 2 \end{bmatrix},\quad B = \begin{bmatrix} -2 & 1 & 1 \\ 4 & 1 & 8 \\ 1 & 3 & 4 \end{bmatrix},\quad C = \begin{bmatrix} 1 & 0 \\ 2 & 3 \\ 2 & 1 \end{bmatrix},\quad D = \begin{bmatrix} 3 & 9 \\ 0 & 1 \\ -2 & 3 \end{bmatrix}.$$

$A$ and $B$ are both of order $(3 \times 3)$, while $C$ and $D$ are of order $(3 \times 2)$. The sum

$$A + B = \begin{bmatrix} 1-2 & 2+1 & 0+1 \\ 3+4 & 1+1 & -3+8 \\ 4+1 & 1+3 & 2+4 \end{bmatrix} = \begin{bmatrix} -1 & 3 & 1 \\ 7 & 2 & 5 \\ 5 & 4 & 6 \end{bmatrix}$$

is defined, as is the difference

$$C - D = \begin{bmatrix} 1-3 & 0-9 \\ 2-0 & 3-1 \\ 2-(-2) & 1-3 \end{bmatrix} = \begin{bmatrix} -2 & -9 \\ 2 & 2 \\ 4 & -2 \end{bmatrix}.$$

The operations $A + C$, $A - C$, $B + D$, for example, are not possible since the matrices do not
conform. Addition and subtraction are commutative, i.e., $A + B = B + A$, and can be
combined with transposition,

$$(A + B)' = A' + B'. \qquad [3.3]$$

Multiplication of two matrices requires a different kind of conformity. The product A*B
is possible if the number of columns in A equals the number of rows in B. For example, A*C
is defined if $A$ has order $(3 \times 3)$ and $C$ has order $(3 \times 2)$. C*A is not possible, however. The
order of the matrix product equals the number of rows of the first and the number of columns
of the second matrix. The product of an $(n \times k)$ and a $(k \times p)$ matrix is an $(n \times p)$ matrix.
The product of a row vector $a'$ and a column vector $b$ is termed the inner product of $a$ and $b$.
Because a $(1 \times k)$ matrix is multiplied with a $(k \times 1)$ matrix, the inner product is a scalar:

$$a'b = [a_1, \ldots, a_k]\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_k \end{bmatrix} = a_1 b_1 + a_2 b_2 + \cdots + a_k b_k = \sum_{i=1}^{k} a_i b_i. \qquad [3.4]$$

The square root of the inner product of a vector with itself is the (Euclidean) norm of the
vector, denoted $\|a\|$:

$$\|a\| = \sqrt{a'a} = \sqrt{\sum_{i=1}^{k} a_i^2}. \qquad [3.5]$$

The norm of the difference of two vectors $a$ and $b$ measures their Euclidean distance:

$$\|a - b\| = \sqrt{(a - b)'(a - b)} = \sqrt{\sum_{i=1}^{k}(a_i - b_i)^2}. \qquad [3.6]$$

It plays an important role in statistics for spatial data (§9), where $a$ and $b$ are vectors of spatial
coordinates (longitude and latitude).

Multiplication of the matrices $A_{(n \times k)}$ and $B_{(k \times p)}$ can be expressed as a series of inner
products. Partition $A'$ as $A' = [\alpha_1, \alpha_2, \ldots, \alpha_n]$ and $B_{(k \times p)} = [b_1, b_2, \ldots, b_p]$. Here, $\alpha_i$ is the $i$th
row of $A$ and $b_i$ is the $i$th column of $B$. The elements of the matrix product A*B can be
written as a matrix whose typical elements are the inner products of the rows of $A$ with the
columns of $B$:

$$A_{(n \times k)} B_{(k \times p)} = [\alpha_i' b_j]_{(n \times p)}. \qquad [3.7]$$

For example, define two matrices

$$A = \begin{bmatrix} 1 & 2 & 0 \\ 3 & 1 & -3 \\ 4 & 1 & 2 \end{bmatrix}, \qquad B = \begin{bmatrix} 1 & 0 \\ 2 & 3 \\ 2 & 1 \end{bmatrix}.$$

Then the matrices can be partitioned into vectors corresponding to the rows of $A$ and the columns
of $B$ as

$$\alpha_1 = \begin{bmatrix} 1 \\ 2 \\ 0 \end{bmatrix},\; \alpha_2 = \begin{bmatrix} 3 \\ 1 \\ -3 \end{bmatrix},\; \alpha_3 = \begin{bmatrix} 4 \\ 1 \\ 2 \end{bmatrix},\; b_1 = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix},\; b_2 = \begin{bmatrix} 0 \\ 3 \\ 1 \end{bmatrix}.$$

The product AB can now be written as

$$A_{(n \times k)}B_{(k \times p)} = [\alpha_i' b_j] = \begin{bmatrix} \alpha_1'b_1 & \alpha_1'b_2 \\ \alpha_2'b_1 & \alpha_2'b_2 \\ \alpha_3'b_1 & \alpha_3'b_2 \end{bmatrix} = \begin{bmatrix} 1{\cdot}1 + 2{\cdot}2 + 0{\cdot}2 & 1{\cdot}0 + 2{\cdot}3 + 0{\cdot}1 \\ 3{\cdot}1 + 1{\cdot}2 - 3{\cdot}2 & 3{\cdot}0 + 1{\cdot}3 - 3{\cdot}1 \\ 4{\cdot}1 + 1{\cdot}2 + 2{\cdot}2 & 4{\cdot}0 + 1{\cdot}3 + 2{\cdot}1 \end{bmatrix} = \begin{bmatrix} 5 & 6 \\ -1 & 0 \\ 10 & 5 \end{bmatrix}.$$
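
The hand computation above is easy to verify in software. A minimal sketch (NumPy assumed):

```python
import numpy as np

A = np.array([[1, 2, 0],
              [3, 1, -3],
              [4, 1, 2]])
B = np.array([[1, 0],
              [2, 3],
              [2, 1]])

print(A @ B)               # [[ 5  6] [-1  0] [10  5]], matching the text

# Each element of the product is an inner product of a row of A
# with a column of B, e.g., alpha_2' b_1:
print(A[1, :] @ B[:, 0])   # -1

# Note: B @ A is not even defined here (B is 3x2, A is 3x3), illustrating
# the conformability requirement discussed below.
```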

Matrix multiplication is not a symmetric operation; AB does not in general yield the same product
as BA. First, for both products to be defined and to have the same order, $A$ and $B$ must be
square matrices of the same order. Even then, the outcomes of the two multiplications can differ. The
operation AB is called postmultiplication of $A$ by $B$; BA is called premultiplication of $A$ by $B$.

To multiply a matrix by a scalar, simply multiply every element of the matrix by the scalar:

$$cA = [c\,a_{ij}]. \qquad [3.8]$$

Here are more basic results about matrix multiplication:

$$C(A + B) = CA + CB$$
$$(AB)C = A(BC)$$
$$c(A + B) = cA + cB \qquad [3.9]$$
$$(A + B)(C + D) = AC + AD + BC + BD.$$

The following results combine matrix multiplication and transposition:

$$(AB)' = B'A'$$
$$(ABC)' = C'B'A' \qquad [3.10]$$
$$(cA)' = cA'$$
$$(aA + bB)' = aA' + bB'.$$

Matrices are called square when their row and column dimensions are identical. If,
furthermore, $a_{ij} = a_{ji}$ for all $j \neq i$, the matrix is called symmetric. Symmetric matrices are
encountered frequently in the form of variance-covariance or correlation matrices (see §3.5). A
symmetric matrix can be constructed from any matrix $A$ with the operations $A'A$ or $AA'$. If
all off-diagonal cells of a matrix are zero, i.e., $a_{ij} = 0$ if $j \neq i$, the matrix is called diagonal
and is obviously symmetric. The identity matrix is a diagonal matrix with 1's on the diagonal.
On occasion, diagonal matrices are written as $\mathrm{Diag}(a)$, where $a$ is a vector of diagonal
elements. If, for example, $a' = [1, 2, 3, 4, 5]$, then

$$\mathrm{Diag}(a) = \begin{bmatrix} 1&0&0&0&0 \\ 0&2&0&0&0 \\ 0&0&3&0&0 \\ 0&0&0&4&0 \\ 0&0&0&0&5 \end{bmatrix}.$$


3.4 Matrix Inversion — Regular and Generalized Inverse

Box 3.3 Matrix Inversion

• The matrix $B$ for which $AB = I$ is called the inverse of $A$. It is denoted as $A^{-1}$.

• The rank of a matrix equals the number of its linearly independent columns.

• Only square matrices of full rank have an inverse. These are called
nonsingular matrices. If the inverse of $A$ exists, it is unique.

• A singular matrix does not have a regular inverse. However, a generalized
inverse $A^-$, for which $AA^-A = A$ holds, can be found for any matrix.

Multiplying a scalar with its reciprocal produces the multiplicative identity, $c \times (1/c) = 1$,
provided $c \neq 0$. For matrices such a simple division operation is not as straightforward.
The multiplicative identity for matrices is the identity matrix $I$. The matrix $B$
which yields $AB = I$ is called the inverse of $A$, denoted $A^{-1}$. Unfortunately, an inverse does
not exist for all matrices. Matrices which are not square do not have regular inverses. If $A^{-1}$
exists, $A$ is called nonsingular and we have

$$A^{-1}A = AA^{-1} = I.$$

To see how important it is to have access to the inverse of a matrix, consider the following
equation, which expresses the vector $y$ as a linear function of $c$:

$$y_{(n \times 1)} = X_{(n \times k)}\,c_{(k \times 1)}.$$

To solve for $c$ we need to eliminate the matrix $X$ from the right-hand side of the equation. But
$X$ is not square and thus cannot be inverted. However, $X'X$ is a square, symmetric matrix.
Premultiply both sides of the equation with $X'$:

$$X'y = X'Xc. \qquad [3.11]$$

If the inverse of $X'X$ exists, premultiply both sides of the equation with it to isolate $c$:

$$(X'X)^{-1}X'y = (X'X)^{-1}X'Xc = Ic = c.$$

For the inverse of a matrix to exist, the matrix must be of full rank. The rank of a matrix
$A$, denoted $r(A)$, is equal to the number of its linearly independent columns. For example,

$$A = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix}$$

has three columns, $a_1$, $a_2$, and $a_3$. But because $a_1 = a_2 + a_3$, the columns of $A$ are not
linearly independent. In general, let $a_1, a_2, \ldots, a_k$ be a set of column vectors. If a set of scalars
$c_1, \ldots, c_k$ can be found, not all of which are zero, such that

$$c_1 a_1 + c_2 a_2 + \cdots + c_k a_k = 0, \qquad [3.12]$$

the $k$ column vectors are said to be linearly dependent. If the only set of constants for which
[3.12] holds is $c_1 = c_2 = \cdots = c_k = 0$, the vectors are linearly independent. An $(n \times k)$
matrix $X$ whose rank is less than $k$ is called rank-deficient (or singular) and $X'X$ does not
have a regular inverse. A few important results about the rank of a matrix and the rank of
matrix products follow:

$$r(A) = r(A') = r(A'A) = r(AA')$$
$$r(AB) \le \min\{r(A), r(B)\} \qquad [3.13]$$
$$r(A + B) \le r(A) + r(B).$$

If $X$ is rank-deficient, $X'X$ will still be symmetric, but its inverse does not exist. How can we
then isolate $c$ in [3.11]? It can be shown that for any matrix $A$, a matrix $A^-$ can be found
which satisfies

$$AA^-A = A. \qquad [3.14]$$

$A^-$ is called the generalized inverse or pseudo-inverse of $A$. The terms g-inverse and
conditional inverse are also in use. It can be shown that a solution of $X'y = X'Xc$ is obtained with
a generalized inverse as $c = (X'X)^-X'y$. Apparently, if $X$ is not of full rank, all we have to
do is substitute the generalized inverse for the regular inverse. Unfortunately, generalized
inverses are not unique. The condition [3.14] is satisfied by (infinitely) many matrices. The
solution $c$ is hence not unique either. Assume that $G$ is a generalized inverse of $X'X$. Then
any vector

$$c = GX'y + (GX'X - I)d$$

is a solution, where $d$ is a conformable but otherwise arbitrary vector (Searle 1971, p. 9; Rao
and Mitra 1971, p. 44). In analysis of variance models, $X$ contains dummy variables coding
the treatment and design effects and is typically rank-deficient (see §3.9.1). Statistical
packages that use different generalized inverses will return different estimates of these
effects. This would pose a considerable problem, but fortunately, several important properties
of generalized inverses come to our rescue in statistical inference. If $G$ is a generalized
inverse of $X'X$, then
(i) $G'$ is also a generalized inverse of $X'X$;
(ii) $GX'$ is a generalized inverse of $X$;
(iii) $XGX'$ is invariant to the choice of $G$;
(iv) $X'XGX' = X'$ and $XGX'X = X$;


(v) $GX'X$ and $X'XG$ are idempotent (this follows from (iv)).

Consider the third result, for example. If $XGX'$ is invariant to the choice of the particular
generalized inverse, then $XGX'y$ is also invariant. But $GX'y = (X'X)^-X'y$ is the solution
derived above, and so $Xc$ is invariant. In statistical models, $c$ is often the least squares estimate
of the model parameters and $X$ a regressor or design matrix. While the estimates $c$ will
depend on the choice of the generalized inverse, the predicted values will not: two
statistical packages using different generalized inverses should report the same fitted values.
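
This invariance is easy to demonstrate numerically. In the sketch below (NumPy assumed; the design and data are fabricated), two different generalized inverses of $X'X$ yield different solutions $c$ but identical fitted values $Xc$:

```python
import numpy as np

# One-way classification with two treatments, two replicates each.
# The design matrix is rank-deficient (column 1 = column 2 + column 3).
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
y = np.array([5., 6., 8., 9.])
A = X.T @ X

# Two different generalized inverses of X'X
G1 = np.linalg.pinv(A)                 # Moore-Penrose inverse
G2 = np.zeros((3, 3))                  # "set the first effect to zero" g-inverse
G2[1:, 1:] = np.linalg.inv(A[1:, 1:])
assert np.allclose(A @ G1 @ A, A) and np.allclose(A @ G2 @ A, A)

print(G1 @ X.T @ y, G2 @ X.T @ y)      # the solutions c differ ...
print(X @ G1 @ X.T @ y)                # ... but X G X'y is [5.5 5.5 8.5 8.5]
print(X @ G2 @ X.T @ y)                # for either choice of G
```
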
For any matrix $A$ there is one unique matrix $B$ that satisfies the following conditions:

$$\text{(i) } ABA = A \quad \text{(ii) } BAB = B \quad \text{(iii) } (BA)' = BA \quad \text{(iv) } (AB)' = AB. \qquad [3.15]$$

Because of (i), $B$ is a generalized inverse of $A$. The matrix satisfying (i) to (iv) of [3.15]
is called the Moore-Penrose inverse, named after work by Penrose (1955) and Moore (1920).
Different classes of generalized inverses have been defined depending on subsets of the four
conditions. A matrix $B$ satisfying (i) is the standard generalized inverse. If $B$ satisfies (i) and
(ii), it is termed the reflexive generalized inverse according to Urquhart (1968). Special cases
of generalized inverses satisfying (i), (ii), (iii) or (i), (ii), (iv) are the left and right inverses.
Let $A_{(p \times k)}$ be of rank $k$. Then $A'A$ is a $(k \times k)$ matrix of rank $k$ by [3.13] and its inverse
exists. The matrix $(A'A)^{-1}A'$ is called the left inverse of $A$, since left multiplication of $A$ by
$(A'A)^{-1}A'$ produces the identity matrix:

$$(A'A)^{-1}A'A = I.$$

Similarly, if $A_{(p \times k)}$ is of rank $p$, then $A'(AA')^{-1}$ is its right inverse, since
$AA'(AA')^{-1} = I$. Left and right inverses are not regular inverses, but generalized inverses.
Let $C_{(k \times p)} = A'(AA')^{-1}$; then $AC = I_p$, but $CA$ does not equal $I_k$ unless $A$ is square and
nonsingular. It is easy to verify, however, that $ACA = A$, hence
$C$ is a generalized inverse of $A$ by [3.14]. Left and right inverses are sometimes called
normalized generalized inverse matrices (Rohde 1966, Morris and Odell 1968, Urquhart
1968). Searle (1971, pp. 1-3) explains how to construct a generalized inverse that satisfies
[3.15]. Given a matrix $A$ and arbitrary generalized inverses $B$ of $(AA')$ and $C$ of $(A'A)$, the
unique Moore-Penrose inverse can be constructed as $A'BACA'$.
For inverses and all generalized inverses, we note the following results (here $B$ is rank-deficient
and $A$ is of full rank):

$$(B^-)' = (B')^-$$
$$(A^{-1})' = (A')^{-1}$$
$$(A^{-1})^{-1} = A$$
$$(AB)^{-1} = B^{-1}A^{-1} \qquad [3.16]$$
$$(B^-)^- = B$$
$$r(B^-) = r(B).$$

Finding the inverse of a matrix is simple for certain patterned matrices such as full-rank
$(2 \times 2)$ matrices, diagonal matrices, and block-diagonal matrices. The inverse of a full-rank
$(2 \times 2)$ matrix

$$A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}$$

is

$$A^{-1} = \frac{1}{a_{22}a_{11} - a_{12}a_{21}}\begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix}. \qquad [3.17]$$

The inverse of a full-rank diagonal matrix $D_{(k \times k)} = \mathrm{Diag}(a)$ is obtained by replacing the
diagonal elements by their reciprocals:

$$D^{-1} = \mathrm{Diag}\left(\left[\frac{1}{a_1}, \frac{1}{a_2}, \ldots, \frac{1}{a_k}\right]\right). \qquad [3.18]$$

A block-diagonal matrix is akin to a diagonal matrix, where matrices instead of scalars form
the diagonal. For example,

$$B = \begin{bmatrix} B_1 & 0 & 0 & 0 \\ 0 & B_2 & 0 & 0 \\ 0 & 0 & B_3 & 0 \\ 0 & 0 & 0 & B_4 \end{bmatrix}$$

is a block-diagonal matrix where the matrices $B_1, \ldots, B_4$ form the blocks. The inverse of a
block-diagonal matrix is obtained by separately inverting the matrices on the diagonal, provided
these inverses exist:

$$B^{-1} = \begin{bmatrix} B_1^{-1} & 0 & 0 & 0 \\ 0 & B_2^{-1} & 0 & 0 \\ 0 & 0 & B_3^{-1} & 0 \\ 0 & 0 & 0 & B_4^{-1} \end{bmatrix}.$$
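
The patterned results [3.17] and [3.18] and the block-diagonal rule can be checked numerically. A minimal sketch (NumPy and SciPy assumed, with made-up matrices):

```python
import numpy as np
from scipy.linalg import block_diag

# 2x2 inverse via [3.17]
A = np.array([[2., 1.], [4., 3.]])
det = A[1, 1] * A[0, 0] - A[0, 1] * A[1, 0]
A_inv = np.array([[A[1, 1], -A[0, 1]], [-A[1, 0], A[0, 0]]]) / det
assert np.allclose(A_inv, np.linalg.inv(A))

# Diagonal inverse via reciprocals, [3.18]
D = np.diag([1., 2., 4.])
assert np.allclose(np.linalg.inv(D), np.diag([1., 0.5, 0.25]))

# Block-diagonal inverse = block diagonal of the inverted blocks
B1, B2 = np.array([[2., 1.], [1., 2.]]), np.array([[3.]])
B = block_diag(B1, B2)
assert np.allclose(np.linalg.inv(B),
                   block_diag(np.linalg.inv(B1), np.linalg.inv(B2)))
```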

3.5 Mean, Variance, and Covariance of Random Vectors

Box 3.4 Moments

• A random vector is a vector whose elements are random variables.

• The expectation of a random vector is the vector of the expected values of its
elements.

• The variance-covariance matrix of a random vector $Y_{(n \times 1)}$ is a square,
symmetric matrix of order $(n \times n)$. It contains the variances of the
elements of $Y$ on the diagonal and their covariances in the off-diagonal cells.


If the elements of a vector are random variables, it is called a random vector. If $Y_i$
$(i = 1, \ldots, n)$ denotes the $i$th element of the random vector $Y$, then $Y_i$ has mean (expected
value) $E[Y_i]$ and variance $\mathrm{Var}[Y_i] = E[Y_i^2] - E[Y_i]^2$. The expected value of a random vector is
the vector of the expected values of its elements,

$$E[Y] = [E[Y_i]] = \begin{bmatrix} E[Y_1] \\ E[Y_2] \\ E[Y_3] \\ \vdots \\ E[Y_n] \end{bmatrix}. \qquad [3.19]$$

In the following, let $A$, $B$, and $C$ be matrices of constants (not containing random
variables), $a$, $b$, and $c$ vectors of constants, and $a$, $b$, and $c$ scalar constants. Then

$$E[A] = A$$
$$E[AYB + C] = A\,E[Y]\,B + C \qquad [3.20]$$
$$E[AY + c] = A\,E[Y] + c.$$

If $Y$ and $U$ are random vectors, then

$$E[AY + BU] = A\,E[Y] + B\,E[U]$$
$$E[aY + bU] = a\,E[Y] + b\,E[U]. \qquad [3.21]$$

The covariance matrix between two random vectors $Y_{(k \times 1)}$ and $U_{(p \times 1)}$ is a $(k \times p)$
matrix; its $ij$th element is the covariance between $Y_i$ and $U_j$:

$$\mathrm{Cov}[Y, U] = [\mathrm{Cov}[Y_i, U_j]].$$

It can be expressed in terms of expectation operations as

$$\mathrm{Cov}[Y, U] = E\left[(Y - E[Y])(U - E[U])'\right] = E[YU'] - E[Y]E[U]'. \qquad [3.22]$$

The covariances of linear combinations of random vectors are evaluated similarly to the scalar
case (assuming $W$ and $V$ are also random vectors):

$$\mathrm{Cov}[AY, U] = A\,\mathrm{Cov}[Y, U]$$
$$\mathrm{Cov}[Y, BU] = \mathrm{Cov}[Y, U]\,B'$$
$$\mathrm{Cov}[AY, BU] = A\,\mathrm{Cov}[Y, U]\,B' \qquad [3.23]$$
$$\mathrm{Cov}[aY + bU,\, cW + dV] = ac\,\mathrm{Cov}[Y, W] + bc\,\mathrm{Cov}[U, W] + ad\,\mathrm{Cov}[Y, V] + bd\,\mathrm{Cov}[U, V].$$

The variance-covariance matrix is the covariance of a random vector with itself:

$$\mathrm{Var}[Y] = \mathrm{Cov}[Y, Y] = [\mathrm{Cov}[Y_i, Y_j]]$$
$$= E\left[(Y - E[Y])(Y - E[Y])'\right]$$
$$= E[YY'] - E[Y]E[Y]'.$$

The variance-covariance matrix contains on its diagonal the variances of the observations and
the covariances among the random vector's elements in the off-diagonal cells. Variance-covariance
matrices are square and symmetric, since $\mathrm{Cov}[Y_i, Y_j] = \mathrm{Cov}[Y_j, Y_i]$:

$$\mathrm{Var}\!\left[Y_{(k \times 1)}\right] = \begin{bmatrix} \mathrm{Var}[Y_1] & \mathrm{Cov}[Y_1, Y_2] & \mathrm{Cov}[Y_1, Y_3] & \cdots & \mathrm{Cov}[Y_1, Y_k] \\ \mathrm{Cov}[Y_2, Y_1] & \mathrm{Var}[Y_2] & \mathrm{Cov}[Y_2, Y_3] & \cdots & \mathrm{Cov}[Y_2, Y_k] \\ \mathrm{Cov}[Y_3, Y_1] & \mathrm{Cov}[Y_3, Y_2] & \mathrm{Var}[Y_3] & \cdots & \mathrm{Cov}[Y_3, Y_k] \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}[Y_k, Y_1] & \mathrm{Cov}[Y_k, Y_2] & \mathrm{Cov}[Y_k, Y_3] & \cdots & \mathrm{Var}[Y_k] \end{bmatrix}.$$

To designate the mean and variance of a random vector $Y$, we use the notation
$Y \sim (E[Y], \mathrm{Var}[Y])$. For example, homoscedastic zero mean errors $e$ are designated as
$e \sim (0, \sigma^2 I)$.

The elements of a random vector $Y$ are said to be uncorrelated if the variance-covariance
matrix of $Y$ is a diagonal matrix:

$$\mathrm{Var}\!\left[Y_{(k \times 1)}\right] = \begin{bmatrix} \mathrm{Var}[Y_1] & 0 & 0 & \cdots & 0 \\ 0 & \mathrm{Var}[Y_2] & 0 & \cdots & 0 \\ 0 & 0 & \mathrm{Var}[Y_3] & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \mathrm{Var}[Y_k] \end{bmatrix}.$$

Two random vectors $Y_1$ and $Y_2$ are said to be uncorrelated if their variance-covariance matrix
is block-diagonal:

$$\mathrm{Var}\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} = \begin{bmatrix} \mathrm{Var}[Y_1] & 0 \\ 0 & \mathrm{Var}[Y_2] \end{bmatrix}.$$

The rules above for working with the covariance of two (or more) random vectors can be
readily extended to variance-covariance matrices:

$$\mathrm{Var}[AY] = A\,\mathrm{Var}[Y]\,A'$$
$$\mathrm{Var}[Y + a] = \mathrm{Var}[Y]$$
$$\mathrm{Var}[a'Y] = a'\,\mathrm{Var}[Y]\,a \qquad [3.24]$$
$$\mathrm{Var}[aY] = a^2\,\mathrm{Var}[Y]$$
$$\mathrm{Var}[aY + bU] = a^2\,\mathrm{Var}[Y] + b^2\,\mathrm{Var}[U] + 2ab\,\mathrm{Cov}[Y, U].$$
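
The first rule in [3.24], $\mathrm{Var}[AY] = A\,\mathrm{Var}[Y]\,A'$, is easy to verify by simulation. A minimal sketch (NumPy assumed; $V$ and $A$ are made-up values):

```python
import numpy as np

rng = np.random.default_rng(42)

# True variance-covariance matrix of Y and a fixed matrix of constants A
V = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])
A = np.array([[1.0, -1.0, 0.0],
              [0.0,  1.0, 2.0]])

# Simulate many draws of Y ~ (0, V) and compare Var[AY] with A V A'
Y = rng.multivariate_normal(np.zeros(3), V, size=200_000)
U = Y @ A.T
print(np.cov(U, rowvar=False))   # empirical variance-covariance of AY
print(A @ V @ A.T)               # theoretical A V A' from rule [3.24]
```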

3.6 The Trace and Expectation of Quadratic Forms


Box 3.5 Trace and Quadratic Forms

• The trace of a matrix A, denoted tr(A), equals the sum of its diagonal
elements.

• The expression $y'Ay$ is called a quadratic form. It is a scalar quantity. Sums
of squares are quadratic forms. If $Y$ is a random vector and $A$ a matrix of
constants, the expected value of $Y'AY$ is

$$E[Y'AY] = \mathrm{tr}(A\,\mathrm{Var}[Y]) + E[Y]'\,A\,E[Y].$$

The trace of a matrix $A$ is the sum of its diagonal elements, denoted $\mathrm{tr}(A)$, and plays an
important role in determining the expected value of quadratic forms in random vectors. If $y$ is an
$(n \times 1)$ vector and $A$ an $(n \times n)$ matrix, $y'Ay$ is a quadratic form in $y$. Notice that quadratic
forms are scalars. Consider a regression model $Y = X\beta + e$, where the elements of $e$ have 0
mean and constant variance $\sigma^2$, $e \sim (0, \sigma^2 I)$. The total sum of squares corrected for the mean,

$$SST_m = Y'(I - 11'/n)Y,$$

is a quadratic form, as are the regression (model) and residual sums of squares:

$$SSM_m = Y'(H - 11'/n)Y$$
$$SSR = Y'(I - H)Y$$
$$H = X(X'X)^{-1}X'.$$

To determine distributional properties of these sums of squares and expected mean squares,
we need to know, for example, $E[SSR]$.

If $Y$ has mean $\mu$ and variance-covariance matrix $V$, the quadratic form $Y'AY$ has
expected value

$$E[Y'AY] = \mathrm{tr}(AV) + \mu'A\mu. \qquad [3.25]$$

In evaluating such expectations, several properties of the trace operator $\mathrm{tr}(\cdot)$ are helpful:

$$\text{(i)}\quad \mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB)$$
$$\text{(ii)}\quad \mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B)$$
$$\text{(iii)}\quad y'Ay = \mathrm{tr}(y'Ay) \qquad [3.26]$$
$$\text{(iv)}\quad \mathrm{tr}(A) = \mathrm{tr}(A')$$
$$\text{(v)}\quad \mathrm{tr}(cA) = c\,\mathrm{tr}(A)$$
$$\text{(vi)}\quad \mathrm{tr}(A) = r(A) \text{ if } AA = A \text{ and } A' = A.$$

Property (i) states that the trace is invariant under cyclic permutations and (ii) that the trace of
the sum of two matrices is identical to the sum of their traces. Property (iii) emphasizes that
quadratic forms are scalars and any scalar is of course equal to its trace (a scalar is a $(1 \times 1)$
matrix). We can now apply [3.25] with $A = (I - H)$, $V = \sigma^2 I$, $\mu = X\beta$ to find the expected
value of $SSR$ in the linear regression model:

$$E[SSR] = E[Y'(I - H)Y]$$
$$= \mathrm{tr}\left((I - H)\sigma^2 I\right) + \beta'X'(I - H)X\beta$$
$$= \sigma^2\,\mathrm{tr}(I - H) + \beta'X'X\beta - \beta'X'HX\beta.$$

At this point we notice that $(I - H)$ is symmetric and that $(I - H)(I - H) = (I - H)$. We can
apply rule (vi) and find $\mathrm{tr}(I - H) = n - r(H)$. Furthermore,

$$\beta'X'HX\beta = \beta'X'X(X'X)^{-1}X'X\beta = \beta'X'X\beta,$$

which leads to $E[SSR] = \sigma^2\,\mathrm{tr}(I - H) = \sigma^2(n - r(H))$. We have established that $S^2 =
SSR/(n - r(H))$ is an unbiased estimator of the residual variability in the classical linear
model. Applying the rules about the rank of a matrix in [3.13] we see that the rank of the
matrix $H$ is identical to the rank of the regressor matrix $X$ and so $S^2 = SSR/[n - r(X)]$.
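
The unbiasedness result $E[SSR] = \sigma^2(n - r(X))$ can be illustrated by simulation. A minimal sketch (NumPy assumed; the regressor values, parameters, and number of replications are arbitrary choices for this illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 30, 4.0
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
beta = np.array([1.0, 0.5])
H = X @ np.linalg.solve(X.T @ X, X.T)       # the "hat" matrix

# Average SSR = y'(I - H)y over many simulated data sets
ssr = []
for _ in range(5000):
    y = X @ beta + rng.normal(0, np.sqrt(sigma2), n)
    r = y - H @ y
    ssr.append(r @ r)
print(np.mean(ssr))                              # close to 4 * (30 - 2) = 112
print(sigma2 * (n - np.linalg.matrix_rank(X)))   # the theoretical value
```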

3.7 The Multivariate Gaussian Distribution


Box 3.6 Multivariate Gaussian

• A random vector $Y$ with mean $E[Y] = \mu$ and variance-covariance matrix
$\mathrm{Var}[Y] = \Sigma$ is said to be multivariate Gaussian-distributed if its probability
density function is given by [3.28]. Notation: $Y \sim G(\mu, \Sigma)$.

• The elements of a multivariate Gaussian random vector have (univariate)


Gaussian distributions.

• Independent random variables (vectors) are also uncorrelated. The reverse


is not necessarily true, but holds for Gaussian random variables (vectors).
That is, Gaussian random variables (vectors) that are uncorrelated are
necessarily stochastically independent.

The Gaussian (Normal) distributions are arguably the most important family of distributions
in all of statistics. This does not stem from the fact that many attributes are Gaussian-distributed;
most outcomes are not Gaussian, which is one reason we prefer the label Gaussian
distribution over Normal distribution. Rather, the importance has two sources. First, statistical
methods are usually simpler and mathematically more straightforward if data are
Gaussian-distributed. Second, the Central Limit Theorem (CLT) permits approximating the
distribution of averages in random samples by a Gaussian distribution regardless of the
distribution from which the sample was drawn, provided the sample size is sufficiently large.
A scalar random variable $Y$ is said to be (univariate) Gaussian-distributed with mean $\mu$
and variance $\sigma^2$ if its probability density function is

$$f(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}(y - \mu)^2\right\}, \qquad [3.27]$$

a fact denoted by $Y \sim G(\mu, \sigma^2)$. The distribution of a random vector $Y_{(n \times 1)}$ is said to be
multivariate Gaussian with mean $E[Y] = \mu$ and variance-covariance matrix $\mathrm{Var}[Y] = \Sigma$ if its
density is given by

$$f(y) = \frac{|\Sigma|^{-1/2}}{(2\pi)^{n/2}}\exp\left\{-\frac{1}{2}(y - \mu)'\Sigma^{-1}(y - \mu)\right\}. \qquad [3.28]$$

The term $|\Sigma|$ is called the determinant of $\Sigma$. We express the fact that $Y$ is multivariate
Gaussian with the shortcut notation

$$Y \sim G_n(\mu, \Sigma).$$

The subscript $n$ identifies the dimensionality of the distribution and thereby the order of $\mu$
and $\Sigma$. It can be omitted if the dimension is clear from the context. Some important properties
of Gaussian-distributed random variables follow:
(i) $E[Y] = \mu$, $\mathrm{Var}[Y] = \Sigma$.
(ii) If $Y \sim G_n(\mu, \Sigma)$, then $Y - \mu \sim G_n(0, \Sigma)$.
(iii) $(Y - \mu)'\Sigma^{-1}(Y - \mu) \sim \chi^2_n$, a Chi-squared variable with $n$ degrees of freedom.
(iv) If $\mu = 0$ and $\Sigma = \sigma^2 I$, then $G(0, \sigma^2 I)$ is called the standard multivariate Gaussian
distribution.
(v) If $Y \sim G_n(\mu, \Sigma)$, then $U = A_{(k \times n)}Y + b_{(k \times 1)}$ is Gaussian-distributed with mean
$A\mu + b$ and variance-covariance matrix $A\Sigma A'$ (linear combinations of Gaussian
variables are also Gaussian).

A direct relationship between absence of correlation and stochastic independence exists
only for Gaussian-distributed random variables. In general, $\mathrm{Cov}[Y_1, Y_2] = 0$ implies only that
$Y_1$ and $Y_2$ are not correlated. It is not a sufficient condition for their stochastic independence,
which requires that the joint density of $Y_1$ and $Y_2$ factors into a product of marginal densities.
In the case of Gaussian-distributed random variables zero covariance does imply stochastic
independence. Let $Y \sim G_n(\mu, \Sigma)$ and partition $Y$ into two vectors

$$Y_{(n \times 1)} = \begin{bmatrix} Y_{1(s \times 1)} \\ Y_{2(t \times 1)} \end{bmatrix} \quad \text{with } n = s + t, \text{ and partition further}$$

$$E[Y] = \mu = \begin{bmatrix} E[Y_1] = \mu_1 \\ E[Y_2] = \mu_2 \end{bmatrix}, \qquad \mathrm{Var}[Y] = \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}.$$

If $\Sigma_{12(s \times t)} = 0$, then $Y_1$ and $Y_2$ are uncorrelated and independent. As a corollary, note that the
elements of $Y$ are mutually independent if and only if $\Sigma$ is a diagonal matrix. We can learn
more from this partitioning of $Y$. The distribution of any subset of $Y$ is itself Gaussian;
for example,

$$Y_1 \sim G_s(\mu_1, \Sigma_{11})$$
$$Y_i \sim G(\mu_i, \sigma_{ii}).$$

Another important result pertaining to the multivariate Gaussian distribution tells us that if
$Y = [Y_1', Y_2']'$ is Gaussian-distributed, the conditional distribution of $Y_1$ given $Y_2 = y_2$ is also
Gaussian. Furthermore, the conditional mean $E[Y_1|y_2]$ is a linear function of $y_2$. The general
result is as follows (see, e.g., Searle 1971, Ch. 2.4). If

$$\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} \sim G\left(\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} V_{11} & V_{12} \\ V_{12}' & V_{22} \end{bmatrix}\right),$$

then $Y_1|y_2 \sim G\!\left(\mu_1 + V_{12}V_{22}^{-1}(y_2 - \mu_2),\; V_{11} - V_{12}V_{22}^{-1}V_{12}'\right)$. This result is important when
predictors for random variables are derived, for example in linear mixed models (§7), and to
evaluate the optimality of kriging methods for geostatistical data (§9).
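
The conditional-distribution formula is easy to check by Monte Carlo. A minimal sketch with scalar $Y_1$ and $Y_2$ (NumPy assumed; all numbers are made up):

```python
import numpy as np

# Partitioned mean and covariance of (Y1, Y2), both scalar for clarity
mu1, mu2 = 1.0, 2.0
V11, V12, V22 = 2.0, 1.2, 1.5

# Conditional distribution of Y1 given Y2 = y2
y2 = 3.0
cond_mean = mu1 + V12 / V22 * (y2 - mu2)     # linear in y2
cond_var = V11 - V12**2 / V22
print(cond_mean, cond_var)                   # 1.8 and 1.04

# Monte Carlo check: keep draws whose Y2 falls near y2
rng = np.random.default_rng(7)
S = np.array([[V11, V12], [V12, V22]])
draws = rng.multivariate_normal([mu1, mu2], S, size=200_000)
sel = np.abs(draws[:, 1] - y2) < 0.02
print(draws[sel, 0].mean(), draws[sel, 0].var())
```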

3.8 Matrix and Vector Differentiation


Estimation of the parameters of a statistical model commonly involves the minimization or
maximization of a function. Since the parameters are elements of vectors and matrices, we
need tools to perform basic calculus with vectors and matrices. It is sufficient to discuss
matrix differentiation. Vector differentiation then follows immediately as a special case.
Assume that the elements of matrix A depend on a scalar parameter ), i.e.,
A œ c+34 a)bd.

The derivative of A with respect to ) is defined as the matrix of derivatives of its elements,
`+34 a)b
` AÎ` ) œ ” •. [3.29]
`)

Many statistical applications involve derivatives of the logarithm of the determinant of a
matrix, $\ln(|A|)$. Although we have not discussed the determinant of a matrix in detail, we note
that

$$\partial\ln(|A|)/\partial\theta = \frac{1}{|A|}\frac{\partial|A|}{\partial\theta} = \mathrm{tr}\!\left(A^{-1}\,\partial A/\partial\theta\right). \qquad [3.30]$$

The derivatives of the inverse of $A$ with respect to $\theta$ are calculated as

$$\partial A^{-1}/\partial\theta = -A^{-1}\left(\partial A/\partial\theta\right)A^{-1}. \qquad [3.31]$$

Occasionally the derivative of a function with respect to an entire vector is needed. For
example, let

$$f(x) = x_1 + 3x_2 - 6x_2 x_3 + 4x_3^2.$$

The derivative of $f(x)$ with respect to $x$ is the vector of partial derivatives

$$\partial f(x)/\partial x = \begin{bmatrix} \partial f(x)/\partial x_1 \\ \partial f(x)/\partial x_2 \\ \partial f(x)/\partial x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 3 - 6x_3 \\ -6x_2 + 8x_3 \end{bmatrix}.$$

Some additional results for vector and matrix calculus follow. Again, $A$ is a matrix whose
elements depend on $\theta$. Then

$$\text{(i)}\quad \partial\,\mathrm{tr}(AB)/\partial\theta = \mathrm{tr}\!\left((\partial A/\partial\theta)B\right) + \mathrm{tr}\!\left(A\,\partial B/\partial\theta\right)$$
$$\text{(ii)}\quad \partial\,x'A^{-1}x/\partial\theta = -x'A^{-1}(\partial A/\partial\theta)A^{-1}x$$
$$\text{(iii)}\quad \partial\,x'a/\partial x = \partial\,a'x/\partial x = a$$
$$\text{(iv)}\quad \partial\,x'Ax/\partial x = 2Ax \quad (A \text{ symmetric}). \qquad [3.32]$$

We can put these rules to the test to find the maximum likelihood estimator of $\beta$ in the
Gaussian linear model $Y = X\beta + e$, $e \sim G_n(0, \Sigma)$. To this end we need to find the solution
$\hat\beta$ which maximizes the likelihood function. From [3.28] the density function is given by

$$f(y; \beta) = \frac{|\Sigma|^{-1/2}}{(2\pi)^{n/2}}\exp\left\{-\frac{1}{2}(y - X\beta)'\Sigma^{-1}(y - X\beta)\right\}.$$

Consider this a function of $\beta$ for a given set of data $y$, and call it the likelihood function
$L(\beta; y)$. Maximizing $L(\beta; y)$ is equivalent to maximizing its logarithm. The log-likelihood
function in the Gaussian linear model becomes

$$\ln\{L(\beta; y)\} = l(\beta; y) = -\frac{1}{2}\ln(|\Sigma|) - \frac{n}{2}\ln(2\pi) - \frac{1}{2}(y - X\beta)'\Sigma^{-1}(y - X\beta). \qquad [3.33]$$

To find $\hat\beta$, find the solution to $\partial l(\beta; y)/\partial\beta = 0$. First we derive the derivative of the
log-likelihood:

$$\partial l(\beta; y)/\partial\beta = -\frac{1}{2}\,\frac{\partial (y - X\beta)'\Sigma^{-1}(y - X\beta)}{\partial\beta}$$
$$= -\frac{1}{2}\,\frac{\partial}{\partial\beta}\left\{y'\Sigma^{-1}y - y'\Sigma^{-1}X\beta - \beta'X'\Sigma^{-1}y + \beta'X'\Sigma^{-1}X\beta\right\}$$
$$= -\frac{1}{2}\,\frac{\partial}{\partial\beta}\left\{y'\Sigma^{-1}y - 2y'\Sigma^{-1}X\beta + \beta'X'\Sigma^{-1}X\beta\right\}$$
$$= -\frac{1}{2}\left\{-2X'\Sigma^{-1}y + 2X'\Sigma^{-1}X\beta\right\}.$$

Setting the derivative to zero yields the maximum likelihood equations for $\beta$:

$$X'\Sigma^{-1}y = X'\Sigma^{-1}X\beta.$$

If $X$ is of full rank, the maximum likelihood estimator becomes $\hat\beta = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y$,
which is also the generalized least squares estimator of $\beta$ (see §4.2 and §A4.8.1).
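
The likelihood equations are a linear system and can be solved directly. A minimal sketch for the special case of a known diagonal $\Sigma$ (heteroscedastic, uncorrelated errors); NumPy and all data values are assumptions of this illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
X = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])
beta = np.array([2.0, -1.0])

# Known, unequal error variances: Sigma = Diag(sig2)
sig2 = rng.uniform(0.5, 3.0, n)
Sigma_inv = np.diag(1.0 / sig2)
y = X @ beta + rng.normal(0, np.sqrt(sig2))

# Solve X' Sigma^{-1} X beta = X' Sigma^{-1} y
beta_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
print(beta_gls)    # the ML / generalized least squares estimate of beta
```
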

3.9 Using Matrix Algebra to Specify Models

3.9.1 Linear Models


Box 3.7 Linear Model

• The linear regression or classification model in matrix/vector notation is

$$Y = X\beta + e,$$

where $Y$ is the $(n \times 1)$ vector of observations, $X$ is an $(n \times k)$ matrix of
regressor or dummy variables, and $e$ is a vector of random disturbances
(errors). The first column of $X$ usually consists of ones, modeling an
intercept in regression or a grand mean in classification models. The vector
$\beta$ contains the parameters of the mean structure.

The notation for the parameters is arbitrary; we only require that parameters are denoted with
Greek letters. In regression models the symbols $\beta_1, \beta_2, \ldots, \beta_k$ are traditionally used, while $\alpha$,
$\rho$, $\tau$, etc., are common in analysis of variance models. When parameters are combined into a
parameter vector, the generic symbols $\beta$ or $\theta$ are entertained. As an example, consider the
linear regression model

$$Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + e_i, \quad i = 1, \ldots, n. \qquad [3.34]$$

Here, $x_{ki}$ denotes the value of the $k$th covariate (regressor) for the $i$th observation. The model
for $n$ observations is

$$Y_1 = \beta_0 + \beta_1 x_{11} + \beta_2 x_{21} + \cdots + \beta_k x_{k1} + e_1$$
$$Y_2 = \beta_0 + \beta_1 x_{12} + \beta_2 x_{22} + \cdots + \beta_k x_{k2} + e_2$$
$$\vdots$$
$$Y_n = \beta_0 + \beta_1 x_{1n} + \beta_2 x_{2n} + \cdots + \beta_k x_{kn} + e_n.$$

To express the linear regression model for this set of data in matrix/vector notation, define

$$Y_{(n \times 1)} = \begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{bmatrix}, \quad \beta_{((k+1) \times 1)} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix}, \quad e_{(n \times 1)} = \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ \vdots \\ e_n \end{bmatrix},$$

and finally collect the regressors in the matrix

$$X_{(n \times (k+1))} = \begin{bmatrix} 1 & x_{11} & x_{21} & \cdots & x_{k1} \\ 1 & x_{12} & x_{22} & \cdots & x_{k2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{kn} \end{bmatrix}.$$

Combining terms, model [3.34] can be written as

$$Y = X\beta + e. \qquad [3.35]$$

Notice that $X\beta$ is evaluated as

$$X\beta = \begin{bmatrix} 1 & x_{11} & x_{21} & \cdots & x_{k1} \\ 1 & x_{12} & x_{22} & \cdots & x_{k2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{kn} \end{bmatrix}\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} = \begin{bmatrix} \beta_0 + \beta_1 x_{11} + \beta_2 x_{21} + \cdots + \beta_k x_{k1} \\ \beta_0 + \beta_1 x_{12} + \beta_2 x_{22} + \cdots + \beta_k x_{k2} \\ \vdots \\ \beta_0 + \beta_1 x_{1n} + \beta_2 x_{2n} + \cdots + \beta_k x_{kn} \end{bmatrix}.$$

$X$ is often called the regressor matrix of the model.


The specification of the model is not complete without specifying means, variances, and
covariances of all random components. The assumption that the model is correct leads
naturally to a zero mean assumption for the errors, $E[e] = 0$. In the model $Y = X\beta + e$,
where $\beta$ contains fixed effects only, $\mathrm{Var}[e] = \mathrm{Var}[Y]$. Denote this variance-covariance matrix
as $V$. If the residuals are uncorrelated, $V$ is a diagonal matrix and can be written as

$$V = \mathrm{Diag}\!\left(\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2\right) = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{bmatrix}.$$

If the variances of the residuals are in addition homogeneous,

$$V = \sigma^2\begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} = \sigma^2 I_n,$$

where $\sigma^2$ is the common variance of the model disturbances. The classical linear model with
homoscedastic, uncorrelated errors is finally

$$Y = X\beta + e, \quad e \sim (0, \sigma^2 I). \qquad [3.36]$$

The assumption of a Gaussian error distribution is deliberately omitted. Estimating the
parameter vector $\beta$ by least squares does not require a Gaussian distribution. Only if hypotheses
about $\beta$ are tested does a distributional assumption for the errors come into play.
Model [3.34] is a regression model consisting of fixed coefficients only. How would the
notation change if the model incorporates effects (classification variables) rather than
coefficients? In §1.7.3 it was shown that ANOVA models can be expressed as regression
models by constructing appropriate dummy regressors, which associate an observation with
elements of the parameter vector. Consider a randomized complete block design with four
treatments in three blocks. Written as an effects model (§4.3.1), the linear model for the block
design is

$$Y_{ij} = \mu + \rho_j + \tau_i + e_{ij}, \qquad [3.37]$$

where $Y_{ij}$ is the response of the experimental unit receiving treatment $i$ in block $j$, $\rho_j$ is the
effect of block $j = 1, \ldots, 3$, $\tau_i$ is the effect of treatment $i = 1, \ldots, 4$, and the $e_{ij}$ are the
experimental errors, assumed uncorrelated with mean 0 and variance $\sigma^2$. Define the column vector
of parameters

$$\theta = [\mu, \rho_1, \rho_2, \rho_3, \tau_1, \tau_2, \tau_3, \tau_4]',$$


and the response vector $Y$, design matrix $P$, and vector of experimental errors $e$ as

$$Y = \begin{bmatrix} Y_{11} \\ Y_{21} \\ Y_{31} \\ Y_{41} \\ Y_{12} \\ Y_{22} \\ Y_{32} \\ Y_{42} \\ Y_{13} \\ Y_{23} \\ Y_{33} \\ Y_{43} \end{bmatrix}, \quad P = \begin{bmatrix} 1&1&0&0&1&0&0&0 \\ 1&1&0&0&0&1&0&0 \\ 1&1&0&0&0&0&1&0 \\ 1&1&0&0&0&0&0&1 \\ 1&0&1&0&1&0&0&0 \\ 1&0&1&0&0&1&0&0 \\ 1&0&1&0&0&0&1&0 \\ 1&0&1&0&0&0&0&1 \\ 1&0&0&1&1&0&0&0 \\ 1&0&0&1&0&1&0&0 \\ 1&0&0&1&0&0&1&0 \\ 1&0&0&1&0&0&0&1 \end{bmatrix}, \quad e = \begin{bmatrix} e_{11} \\ e_{21} \\ e_{31} \\ e_{41} \\ e_{12} \\ e_{22} \\ e_{32} \\ e_{42} \\ e_{13} \\ e_{23} \\ e_{33} \\ e_{43} \end{bmatrix},$$

and model [3.37] in matrix/vector notation becomes

$$Y = P\theta + e. \qquad [3.38]$$
The first four rows of $Y$, $P$, and $e$ correspond to the four observations from block 1,
observations five through eight are associated with block 2, and so forth. Comparing [3.35] and
[3.38], the models look remarkably similar. The matrix $P$ in [3.38] contains only dummy
variables, however, and is termed a design matrix, whereas $X$ in [3.35] contains continuous
regressors. The parameter vector $\theta$ in [3.38] contains the block and treatment effects; $\beta$ in
[3.35] contains the slopes (gradients) of the response with respect to the covariates. It can easily be
verified that the design matrix $P$ is rank-deficient: the sum of columns two through four is
identical to the first column, and the sum of the last four columns also equals the first column
(see the sketch below). Generalized inverses must be used to calculate parameter estimates in
analysis of variance models. The ramifications with regard to the uniqueness of the estimates
have been discussed in §3.4. The regressor matrix $X$ in regression models is usually of full
(column) rank. A rank deficiency in regression models occurs only if one covariate is a linear
combination of one or more of the other covariates, for example, if $Y$ is regressed on the
height of a plant in inches and in meters simultaneously. While the inverse of $X'X$ usually
exists in linear regression models, the problem of near-linear dependencies among the columns
of $X$ can arise if covariates are closely interrelated. This condition is known as multicollinearity.
A popular method for combating collinearity is ridge regression (§4.4.5).
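
The rank deficiency of $P$ is easy to confirm. A minimal sketch (NumPy assumed) that builds the design matrix of [3.38] from block and treatment indicators:

```python
import numpy as np

# RCBD of [3.38]: 3 blocks x 4 treatments, rows ordered block by block
blocks = np.repeat(np.arange(3), 4)
treats = np.tile(np.arange(4), 3)
P = np.column_stack([np.ones(12),
                     (blocks[:, None] == np.arange(3)).astype(float),
                     (treats[:, None] == np.arange(4)).astype(float)])

print(P.shape)                    # (12, 8): intercept + 3 blocks + 4 treatments
print(np.linalg.matrix_rank(P))   # 6: the block dummies and the treatment
                                  # dummies each sum to the intercept column,
                                  # so P'P is singular (cf. section 3.4)
```
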
The variance-covariance matrix of $e$ in [3.38] is a diagonal matrix by virtue of the random
assignment of treatments to experimental units within blocks and independent randomizations
between blocks. Heterogeneity of the variances across blocks could still exist. If, for
example, the homogeneity of experimental units differs between blocks 1 and 2, but not
between blocks 2 and 3, $\mathrm{Var}[e]$ would become

$$\mathrm{Var}[e] = \mathrm{Diag}\!\left(\sigma_1^2, \sigma_1^2, \sigma_1^2, \sigma_1^2, \sigma_2^2, \sigma_2^2, \sigma_2^2, \sigma_2^2, \sigma_2^2, \sigma_2^2, \sigma_2^2, \sigma_2^2\right). \qquad [3.39]$$


3.9.2 Nonlinear Models


A nonlinear mean function is somewhat tricky to express in matrix/vector notation. Building
on the basic component equation in §1.7.1, the systematic part for the $i$th observation can be
expressed as $f(x_{1i}, x_{2i}, \ldots, x_{ki}, \theta_1, \theta_2, \ldots, \theta_p)$. In §3.9.1 we included $x_{0i}$ to allow for an
intercept term. Nonlinear models do not necessarily possess intercepts, and the number of
parameters usually does not equal the number of regressors. The $x$'s and $\theta$'s can be collected into two
vectors to depict the systematic component for the $i$th observation as a function of covariates
and parameters as $f(x_i, \theta)$, where

$$x_i = [x_{1i}, x_{2i}, \ldots, x_{ki}]', \qquad \theta = [\theta_1, \theta_2, \ldots, \theta_p]'.$$

As an example, consider

$$Y_i = x_{1i}^{\alpha}(\beta_1 + \beta_2 x_{2i}),$$

a nonlinear model used by Cole (1975) to model the forced expiratory volume of humans ($Y_i$) as
a function of height ($x_1$) and age ($x_2$). Put $x_i = [x_{1i}, x_{2i}]'$ and $\theta = [\alpha, \beta_1, \beta_2]'$ and add a
stochastic element to the model:

$$Y_i = f(x_i, \theta) + e_i. \qquad [3.40]$$

To express model [3.40] for the vector of responses $Y = [Y_1, Y_2, \ldots, Y_n]'$, replace the
function $f()$ with vector notation and remove the index $i$ from its arguments:

$$Y = f(x, \theta) + e. \qquad [3.41]$$

[3.41] is somewhat careless notation, since $f()$ is not a vector function. We think of it as the
function $f()$ applied to the arguments $x_i$, $\theta$ in turn:

$$Y = \begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{bmatrix}, \quad f(x, \theta) = \begin{bmatrix} f(x_1, \theta) \\ f(x_2, \theta) \\ f(x_3, \theta) \\ \vdots \\ f(x_n, \theta) \end{bmatrix}, \quad e = \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ \vdots \\ e_n \end{bmatrix}.$$

3.9.3 Variance-Covariance Matrices and Clustering


In §2.6 a progression of clustering was introduced, leading from unclustered data (all data
points uncorrelated) to clustered and spatial data depending on the number and size of the
clusters in the data set. This progression corresponds to particular structures of the variance-covariance
matrix of the observations $Y$. Consider Figure 2.7 (page 56), which shows spatial
data in panel (a), unclustered data in panel (b), and clustered data in panel (c). Denote the
responses associated with the panels as $Y_a$, $Y_b$, and $Y_c$, respectively. Possible models for the
three cases are

$$\text{a)}\quad Y_{ai} = \beta_0 + \beta_1 s_{1i} + \beta_2 s_{2i} + e_i$$
$$\text{b)}\quad Y_{bij} = \mu + \tau_j + e_{ij}$$
$$\text{c)}\quad Y_{cijk} = \mu + \tau_j + e_{ij} + d_{ijk},$$

where $e_{ij}$ denotes the experimental error for replicate $i$ of treatment $j$, and $d_{ijk}$ the
subsampling error for sample $k$ of replicate $i$ of treatment $j$. The indices range as follows.

Table 3.1. Indices for the models corresponding to panels (a) to (c) of Figure 2.7

  Panel of Figure 2.7     i =          j =          k =          No. of obs.
  (a)                     1, …, 16                                16
  (b)                     1, …, 4      1, …, 4                    16
  (c)                     1, …, 2      1, …, 4      1, …, 2       16

To complete the specification of the error structure we put E[e_i] = E[e_ij] = E[d_ijk] = 0,

(a)       Cov[e_i, e_k] = σ² exp{-3h_ik/α}

(b), (c)  Cov[e_ij, e_kl] = σ²_e if i = k, j = l; 0 otherwise

          Cov[d_ijk, d_lmn] = σ²_d if i = l, j = m, k = n; 0 otherwise

          Cov[e_ij, d_ijk] = 0.

The covariance model for two spatial observations in model (a) is called the exponential model, where h_ij is the Euclidean distance between Y_ai and Y_aj. The parameter α measures the range at which observations are (practically) uncorrelated (see §7.5.2 and §9.2.2 for details). In models (b) and (c), the error structure states that experimental and subsampling errors are uncorrelated with variances σ²_e and σ²_d, respectively. The variance-covariance matrix in each model is a (16 × 16) matrix. In the case of model (a) we have

Var[Y_ai] = Cov[e_i, e_i] = σ² exp{-3h_ii/α} = σ² exp{0} = σ²
Cov[Y_ai, Y_aj] = Cov[e_i, e_j] = σ² exp{-3h_ij/α}

and the variance-covariance matrix is

              [ 1                 exp{-3h_12/α}    exp{-3h_13/α}    ...  exp{-3h_1,16/α} ]
              [ exp{-3h_21/α}     1                exp{-3h_23/α}    ...  exp{-3h_2,16/α} ]
Var[Y_a] = σ² [ exp{-3h_31/α}     exp{-3h_32/α}    1                ...  exp{-3h_3,16/α} ]
              [  ...               ...              ...             ...   ...            ]
              [ exp{-3h_16,1/α}   exp{-3h_16,2/α}  exp{-3h_16,3/α}  ...  1               ]
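Such a matrix is straightforward to construct numerically. The SAS/IML sketch below is our illustration, not from the text: the grid coordinates, σ², and α are hypothetical, and we assume the IML function DISTANCE is available to compute pairwise Euclidean distances between rows.

   proc iml;
      /* hypothetical coordinates of the 16 spatial observations (4 x 4 grid) */
      s = ( {0,1,2,3} @ j(4,1,1) ) || ( j(4,1,1) @ {0,1,2,3} );
      sigma2 = 1;  alpha = 2;            /* hypothetical sill and range parameter */
      h = distance(s);                   /* 16 x 16 matrix of Euclidean distances */
      V = sigma2 # exp(-3#h/alpha);      /* exponential model; diagonal = sigma2  */
      print (V[1:4, 1:4]);               /* upper-left corner of Var[Y_a]         */
   quit;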

In model (b) we have

Var[Y_bij] = σ²_e,   Cov[Y_bij, Y_bkl] = 0 (if i ≠ k or j ≠ l)

and the variance-covariance matrix is diagonal:


ÔÔ ],"" ×× Ô " ! ! ! ! ! â !×
ÖÖ ],"# ÙÙ Ö ! " ! ! ! ! â !Ù
ÖÖ ÙÙ Ö Ù
ÖÖ ],"$ ÙÙ Ö ! ! " ! ! ! â !Ù
ÖÖ ÙÙ Ö Ù
ÖÖ ],"% ÙÙ Ö ! ! ! " ! ! â !Ù
VarcY, d œ VarÖÖ ÙÙ œ 5 # Ö Ù.
ÖÖ ],#" ÙÙ Ö ! ! ! ! " ! â !Ù
ÖÖ ÙÙ Ö Ù
ÖÖ ],## ÙÙ Ö ! ! ! ! ! " â !Ù
ÖÖ ÙÙ Ö Ù
ã ã ã ã ã ã ã ä ã
ÕÕ ],%% ØØ Õ ! ! ! ! ! ! â "Ø

The first four entries of Y_b correspond to the first replicates of treatments 1 to 4 and so forth. To derive the variance-covariance matrix for model (c), we need to separately investigate the covariances among observations from the same experimental unit and from different units. For the former we have

Cov[Y_cijk, Y_cijk] = Var[e_ij + d_ijk] = σ²_e + σ²_d

and

Cov[Y_cijk, Y_cijl] = Cov[e_ij + d_ijk, e_ij + d_ijl]
                    = Cov[e_ij, e_ij] + Cov[e_ij, d_ijl] + Cov[d_ijk, e_ij] + Cov[d_ijk, d_ijl]
                    = σ²_e + 0 + 0 + 0.

For observations from different experimental units we have

Cov[Y_cijk, Y_clmn] = Cov[e_ij + d_ijk, e_lm + d_lmn]
                    = Cov[e_ij, e_lm] + Cov[e_ij, d_lmn] + Cov[d_ijk, e_lm] + Cov[d_ijk, d_lmn]
                    = 0 + 0 + 0 + 0.

If the elements of Y_c are arranged by grouping observations from the same cluster together, the variance-covariance matrix can be written as

                         [ 1 ρ                    ]
                         [ ρ 1                    ]
                         [     1 ρ                ]
Var[Y_c] = (σ²_e + σ²_d) [     ρ 1                ]
                         [          ...           ]
                         [                  1 ρ   ]
                         [                  ρ 1   ]

where ρ = σ²_e/(σ²_e + σ²_d) and all entries outside the (2 × 2) diagonal blocks are zero. Each (2 × 2) block along the diagonal collects the observations from a single cluster.
When data are clustered and clusters are uncorrelated but observations within a cluster are correlated, Var[Y] has a block-diagonal structure, and each block corresponds to a different cluster. If data are unclustered (cluster size 1) and uncorrelated, Var[Y] is diagonal. If the data consist of a single cluster of size n (spatial data), the variance-covariance matrix consists of a single block.
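A compact way to build the block-diagonal matrix numerically is the Kronecker product of an identity matrix with the within-cluster block. The SAS/IML sketch below is ours; the variance components are hypothetical.

   proc iml;
      se2 = 0.8;  sd2 = 0.2;            /* hypothetical variance components        */
      block = se2*j(2,2,1) + sd2*i(2);  /* within-cluster block: se2+sd2 diagonal, */
                                        /* se2 off the diagonal                    */
      V = i(8) @ block;                 /* 16 x 16 block-diagonal Var[Y_c]         */
      rho = se2/(se2 + sd2);            /* within-cluster correlation              */
      print rho, block;
   quit;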



Chapter 4

The Classical Linear Model: Least Squares and Some Alternatives

“The usual criticism is that the formulae ... can tell us nothing new, and nothing worth knowing of the biology of the phenomenon. This appears to me to be very ill-founded. In the first place, quantitative expression in place of a vague idea ... is not merely a mild convenience. It may even be a very great convenience, and it may even be indispensable in making certain systematic and biological deductions. But further, it may suggest important ideas as to the underlying processes involved.” Huxley, J.S., Problems of Relative Growth. New York: Dial Press, 1932.

4.1 Introduction
4.2 Least Squares Estimation and Partitioning of Variation
4.2.1 The Principle
4.2.2 Partitioning Variability through Sums of Squares
4.2.3 Sequential and Partial Sums of Squares and the
Sum of Squares Reduction Test
4.3 Factorial Classification
4.3.1 The Means and Effects Model
4.3.2 Effect Types in Factorial Designs
4.3.3 Sum of Squares Partitioning through Contrasts
4.3.4 Effects and Contrasts in The SAS® System
4.4 Diagnosing Regression Models
4.4.1 Residual Analysis
4.4.2 Recursive and Linearly Recovered Errors
4.4.3 Case Deletion Diagnostics
4.4.4 Collinearity Diagnostics
4.4.5 Ridge Regression to Combat Collinearity



4.5 Diagnosing Classification Models
4.5.1 What Matters?
4.5.2 Diagnosing and Combating Heteroscedasticity
4.5.3 Median Polishing of Two-Way Layouts
4.6 Robust Estimation
4.6.1 L₁-Estimation
4.6.2 M-Estimation
4.6.3 Robust Regression for Prediction Efficiency Data
4.6.4 M-Estimation in Classification Models
4.7 Nonparametric Regression
4.7.1 Local Averaging and Local Regression
4.7.2 Choosing the Smoothing Parameter



4.1 Introduction
Contemporary statistical models are the main concern of this text and the reader may right-
fully challenge us to define what this means. This is no simple task since we hope that our
notion of what is contemporary will be obsolete shortly. In fact, if the title of this text is out-
dated by the time it goes to press, we would be immensely satisfied. Past experience has led
us to believe that most of the statistical models utilized by plant and soil scientists can be cast
as classical regression or analysis of variance models to which a student will be introduced in
a one- or two-semester sequence of introductory statistics courses. A randomized complete
block design with fixed block and treatment effects, for example, is such a standard linear
model since it comprises a single error term, a linear mean structure of block and treatment
effects, and its parameters are best estimated by ordinary least squares. In matrix/vector
notation this classical model can be written as
Y = Xβ + e,   [4.1]

where E[e] = 0, Var[e] = σ²I, and X is an (n × k) matrix.
We do not consider such models as contemporary and draw the line between classical
and contemporary where the linear fixed effects model with a single error term is no longer
appropriate because
• such a simple model is no longer a good description of the data generating mechanism

or
• least squares estimators of the model parameters are no longer best suited although the
model may be correct.

It is the many ways in which model [4.1] breaks down that we discuss here. If the model
does not hold we are led to alternative model formulations; in the second case we are led to
alternative methods of parameter estimation. Figure 4.1 is an attempt to roadmap where these
breakdowns will take us. Before engaging nonlinear, generalized linear, linear mixed, non-
linear mixed, and spatial models, this chapter is intended to reacquaint the reader with the
basic concepts of statistical estimation and inference in the classical linear model and to intro-
duce some methods that have gone largely unnoticed in the plant and soil sciences. Sections
§4.2 and §4.3 are largely a review of the analysis of variance and regression methods. The
important sum of squares reduction test that will be used frequently throughout this text is
discussed in §4.2.3. Standard diagnostics for performance of regression models are discussed
in §4.4 along with some remedies for model breakdowns such as ridge regression to combat
collinearity of the regressors (§4.4.5). §4.5 concentrates on diagnosing classification models
with special emphasis on the homogeneous variance assumption. In the sections that follow
we highlight some alternative approaches to statistical estimation that (in our opinion) have
not received the attention they deserve, specifically L₁- and M-estimation (§4.6) and non-
parametric regression (§4.7). Mathematical details on these topics which reach beyond the
coverage in the main text can be found in Appendix A on the CD-ROM (§A4.8).

[Figure 4.1 appears here: a diagram of the classical linear model with Gaussian errors, Y = Xβ + e, e ~ G(0, σ²I), surrounded by boxes describing breakdowns of its components. Non-conforming Xβ: columns of X nearly linearly dependent (collinearity): ridge regression (§4.4); β unknown: nonparametric models (§4.7); Y not linear in β: nonlinear models (§5); outliers and highly influential data points: M-estimation (§4.6.2) and L₁-estimation (§4.6.1). Non-conforming e: distribution far from Gaussian: M-estimation (§4.6); errors correlated: clustered data models (§7, §8) and spatial models (§9); multiple error terms: mixed models (§7); errors uncorrelated but not equidispersed: weighted estimation (§5.8). Non-conforming Y: Y not continuous: generalized linear models (§6), transformed linear models (§4.5, §6); Y continuous with known non-Gaussian distribution: M-estimation (§4.6), generalized linear models (§6); elements of Y correlated: mixed models and clustered data models (§7, §8), spatial models (§9).]

Figure 4.1. The classical linear Gaussian model and breakdowns of its components that lead to alternative methods of estimation (underlined) or alternative statistical models (italicized).

The linear model Y = Xβ + e, e ~ (0, σ²I) encompasses a plethora of important situations. Standard analysis of variance and simple and multiple linear regression are among them. In this chapter we focus by way of example on a small subset of situations and those model breakdowns that lead to alternative estimation methods. The applications to which we will return frequently are now introduced.

Example 4.1. Lime Application.


Each of two lime types (agricultural lime (AL) and granulated lime (GL)) was applied at each of five rates (0, 1, 2, 4, 8 tons) independently on five replicate plots (Pierce and Warncke 2000). Hence, a total of 5 × 5 × 2 = 50 experimental units were involved in the experiment. The pH in soil samples obtained one week after lime application is shown in Table 4.1. If the treatments are assigned completely at random to the experimental units, the observations in Table 4.1 can be thought of as realizations of random deviations around a mean value common to a particular Lime × Rate combination. We use uppercase lettering when referring to a factor and lowercase lettering when referring to one or more levels of a factor. For example, “Lime Type” designates the factor and “lime types” the two levels.

By virtue of this random assignment of treatments to experimental units, the fifty responses are also independent and if error variability is homogeneous across treatments, the responses are homoscedastic. Under these conditions, model [4.1] applies and can be expressed, for example, as

Y_ijk = μ_ij + e_ijk
i = 1, 2; j = 1, ..., 5; k = 1, ..., 5.

Here, μ_ij is the mean pH on plots receiving Lime Type i at Rate of Application j. The e_ijk are experimental errors associated with the k replicates of the ijth treatment combination.

Table 4.1. pH in 5 × 2 factorial arrangement with five replicates

                  Agricultural Lime                     Granulated Lime
Rep.     0      1      2      4      8        0      1      2      4      8
1      5.735  5.845  5.980  6.180  6.415    5.660  5.925  5.800  6.005  6.060
2      5.770  5.880  5.965  6.060  6.475    5.770  4.740  5.760  5.940  5.985
3      5.730  5.865  5.975  6.135  6.495    5.760  5.785  5.805  5.865  5.920
4      5.735  5.875  5.975  6.205  6.455    5.720  5.725  5.805  5.910  5.845
5      5.750  5.865  6.075  6.095  6.410    5.700  5.765  5.670  5.850  6.065

Of interest to the modeler is to uncover the relationship among the 10 treatment means μ₁₁, ..., μ₂₅. Figure 4.2 displays the sample averages of the five replicates for each treatment; a sharp increase of pH with increasing rate of application is apparent for agricultural lime, and a weaker increase for granulated lime.

[Figure 4.2 appears here: sample average pH (roughly 5.6 to 6.4) plotted against rate of application (0 to 8 tons), with a steeply increasing trace for agricultural lime and a flatter trace for granulated lime.]

Figure 4.2. Sample average pH in soil samples one week after lime application.



Treatment comparisons can be approached by first partitioning the variability in Y into a source due to experimental error and a source due to treatments, with subsequent testing of linear combinations of the μ_ij. H₀: μ₁₂ - 2μ₁₃ + μ₁₄ = 0, for example, posits that the average of the 1 ton and 4 ton rates is equal to the mean pH when applying 2 tons of AL. These comparisons can be structured efficiently, since the factors Lime Type and Rate of Application are crossed; each level of Lime Type appears in combination with each level of factor Rate. The treatment variability is thus partitioned into main effects and interactions and the linear model Y = Xμ + e is similarly expanded. In §4.2.2 we discuss variability partitioning in general and in §4.3.3 specifically for classification models.
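In The SAS® System such a hypothesis can be tested with a contrast statement. The sketch below is ours, with hypothetical data set and variable names (limeapp, lime, rate, ph). Because the cell-mean contrast Σ_j c_j μ_1j involves both the Rate main effects and the Lime × Rate interaction effects in the effects parameterization, coefficients must be supplied for both terms; the coefficient order assumes the rates sort as 0, 1, 2, 4, 8 and AL sorts before GL.

   proc glm data=limeapp;
      class lime rate;
      model ph = lime rate lime*rate;
      /* H0: mu_12 - 2*mu_13 + mu_14 = 0 for agricultural lime */
      contrast 'AL: rate 1 - 2(rate 2) + rate 4'
               rate      0 1 -2 1 0
               lime*rate 0 1 -2 1 0   0 0 0 0 0;
   run; quit;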

The linear model in Example 4.1 is determined by the experimental design. Randomi-
zation ensures independence of the errors, the treatment structure determines which treatment
effects are to be included in the model and if design effects (blocking, replication at different
locations, or time points) were present they, too, would be included in the mean structure.
Analysis focuses on the predetermined questions of interest. For example,
• Do factors Lime Type and Rate of Application interact?
• Are there significant main effects of Lime Type and Rate of Application?
• How can the trend between pH and application rates be modeled and does this trend
depend on which type of lime is applied?
• At which rate of application do the lime types differ significantly in pH?

In the next example, developing an appropriate mean structure is the focal point of the
analysis. The modeler must apply a series of hypothesis tests and diagnostic procedures to
arrive at a final model on which inference and conclusions can be based with confidence.

Example 4.2. Turnip Greens. Draper and Smith (1981, p. 406) list data from a study of vitamin B₂ content in the leaves of turnip plants (Wakeley, 1949). For each of n = 27 plants, the concentration of B₂ vitamin (milligram per gram) was measured as the response of interest. Along with the vitamin content the explanatory variables

X₁ = Amount of radiation (Sunlight) during the preceding half-day (1 cal × cm⁻² × min⁻¹)
X₂ = Soil Moisture tension
X₃ = Air Temperature (°F)

were measured (Table 4.2). Only three levels of soil moisture were observed for X₂ (2.0, 7.0, 47.4) with nine plants per level, whereas only a few or no duplicate values are available for the variables Sunlight and Air Temperature.



Table 4.2. Partial Turnip Green data from Draper and Smith (1981)

Plant #   Vitamin B₂ (y_i)   Sunlight (x₁)   Soil Moisture (x₂)   Air Temp. (x₃)
1              110.4             176               7.0                 78
2              102.8             155               7.0                 89
3              101.0             273               7.0                 89
4              108.4             273               7.0                 72
5              100.7             256               7.0                 84
6              100.3             280               7.0                 87
7              102.0             280               7.0                 74
8               93.7             184               7.0                 87
...              ...              ...               ...                ...
23              61.0              76              47.4                 74
24              53.2             213              47.4                 76
25              59.4             213              47.4                 69
26              58.7             151              47.4                 75
27              58.0             205              47.4                 76

Draper, N.R. and Smith, H. (1981) Applied Regression Analysis. 2nd ed. Wiley and Sons, New York. © 1981 John Wiley and Sons, Inc. This material is used by permission of John Wiley and Sons, Inc.

None of the explanatory variables by itself seems very closely related to the vitamin content of the turnip leaves (Figure 4.3). Running separate linear regressions between Y and each of the three explanatory variables, only Soil Moisture seems to explain a significant amount of vitamin B₂ variation. Should we conclude based on this finding that the amount of sunlight and the air temperature have no effect on the vitamin content of turnip plant leaves? Is it possible that Air Temperature is an important predictor of vitamin B₂ content if we simultaneously adjust for Soil Moisture? Is a linear trend in Soil Moisture reasonable even if it appears to be significant?

Analysis of the Turnip Greens data does not utilize a linear model suggested by the processes of randomization and experimental control. It is the modeler's task to discover the importance of the explanatory variables for the response, and their interrelationships with each other, in building a model for these data. We use methods of multiple linear regression (MLR) to that end. The purposes of an MLR model can be any or all of the following:

• to determine a (small) set of explanatory variables from which the response can be predicted with reasonable confidence and to discover the interrelationships among them;

• to develop a mathematical model that describes how the mean response changes with explanatory variables;

• to predict the outcome of interest for values of the explanatory variables not in the data set.



[Figure 4.3 appears here: three scatterplots of vitamin B₂ against Sunlight, Soil Moisture, and Air Temperature.]

Figure 4.3. Vitamin B₂ in leaves of 27 turnip plants plotted against explanatory variables.

Examples 4.1 and 4.2 are analyzed with standard analysis of variance and multiple linear regression techniques. The parameters of the respective models will be estimated by ordinary least squares or one of its variations (§4.2.1) because of the efficiency of least squares estimates under standard conditions. Least squares estimates are not necessarily the best estimates. They are easily distorted in a variety of situations. Strong dependencies among the columns in the regressor matrix X, for example, can lead to numerical instabilities producing least squares estimates of inappropriate sign, inappropriate magnitude, and of low precision. Diagnosing and remedying this condition, known as multicollinearity, is discussed in §4.4.4, with additional details in §A4.8.3.

Outlying observations also have a strong (negative) influence on least squares analysis, and a single outlier can substantially distort the analysis. Methods resistant and/or robust against outliers were developed decades ago but are applied to agronomic data only infrequently. To delete suspicious observations from the analysis is a common course of action, but the fact that an observation is outlying does not warrant its removal. Outliers can be the most interesting observations in a set of data and should be investigated with extra thoroughness. Outliers can be due to a breakdown of an assumed model; in that case the model needs to be changed, not the data. One such breakdown concerns the distribution of the model errors. Compared to a Gaussian distribution, outliers occur more frequently if the distribution of the errors is heavy-tailed or skewed. Another model breakdown agronomists should be particularly aware of concerns the presence of block × treatment interactions in randomized complete block designs (RCBD). The standard analysis of an RCBD is not valid if treatment comparisons do not remain constant from block to block. A single observation, often an extreme observation, can induce a significant interaction.



Example 4.3. Dollar Spot Counts. A turfgrass experiment was conducted in a random-
ized complete block design with fourteen treatments in four blocks. The outcome of
interest was the number of dollar spot infection centers on each experimental unit
(Table 4.3).

Table 4.3. Dollar spot counts† in a randomized complete block design

                                    Treatment
Block    1    2    3    4    5    6    7    8    9   10   11   12   13   14
1       16   40   29   30   31   42   36   47   97  110  181   63   40   14
2       37   81   19   68   71   99   44   48   81  108   85   19   33   16
3       24   74   38   25   66   40   45   79   42   88  105   21   20   31
4       34   88   27   34   42   40   45   39   92  278  152   46   23   39

† Data kindly provided by David Gilstrap, Department of Crop and Soil Sciences, Michigan State University. Used with permission.

Denote the entry for block (row) i and treatment (column) j as Y_ij. Since the data are counts (without natural denominator, §2.2), one may consider the entry in each cell as a realization of a Poisson random variable with mean E[Y_ij]. Poisson random variables with mean greater than 15, say, can be well approximated by Gaussian random variables. The entries in Table 4.3 appear sufficiently large to invoke the approximation. If one analyzes these data with a standard analysis of variance and performs hypothesis tests based on the Gaussian distribution of the model errors, will the observation for treatment 10 in block 4 negatively affect the inference? Are there any other unusual or influential observations? Are there interactions between treatments and blocks? If so, could they be induced by extreme observations? Will the answers to these important questions change if a transformation of the counts is employed?

In §4.5.3 we apply the outlier-resistant method of Median Polishing to study the potential block × treatment interactions and in §4.6.4 we estimate treatment effects with an outlier-robust method.

Many agronomic data sets are comparatively small, and estimates of residual variation
(mean square errors) rest on only a few degrees of freedom. This is particularly true for
designed experiments but analysis in observational studies may be hampered too. Losing
additional degrees of freedom by removing suspicious observations is then particularly costly
since it can reduce the power of the analysis considerably. An outlier robust method that re-
tains all observations but reduces their negative influence on the analysis is then preferred.

Example 4.4. Prediction Efficiency. Mueller et al. (2001) investigate the accuracy and precision of mapping spatially variable soil attributes for site-specific fertility management. The efficiency of geostatistical prediction via kriging (§9.4) was expressed as the ratio of the kriging mean square error relative to a whole-field average prediction which does not take into account the spatial dependency of the data. Data were collected on a 20.4-ha field in Clinton County, Michigan, on a 30 × 30 meter grid. The field had been in a corn (Zea mays L.)-soybean (Glycine max L. [Merr.]) rotation for 22 years and a portion of the field had been subirrigated since 1988. Of the attributes for which spatial analyses were performed, we focus on

• pH = soil pH (1:1 soil water mixture)
• P = Bray P-1 extractable soil phosphorus
• Ca = Calcium, 1 M NH₄OAc extractable
• Mg = Magnesium, 1 M NH₄OAc extractable
• CEC = Cation exchange capacity
• Lime, Prec = Lime and P fertilizer recommendations according to tristate fertilizer recommendations (Vitosh et al. 1995) for corn with a uniform yield goal of 11.3 Mg ha⁻¹.

[Figure 4.4 appears here: prediction efficiency plotted against the range of spatial correlation (roughly 30 to 130 m) for the attributes lime, CEC, Mg, pH, Ca, P, and Prec, overlaid with a fitted quadratic polynomial.]

Figure 4.4. Prediction efficiency for kriging of various soil attributes as a function of the range of spatial correlation overlaid with predictions from a quadratic polynomial. Adapted from Figure 1.12 in Mueller (1998). Data kindly provided by Dr. Thomas G. Mueller, Department of Agronomy, University of Kentucky. Used with permission.

The precision with which observations at unobserved spatial locations can be predicted
based on geostatistical methods (§9.4) is a function of the spatial autocorrelation among
the observations. The stronger the autocorrelation, the greater the precision of spatially
explicit methods compared to whole-field average prediction which does not utilize
spatial information. The degree of autocorrelation is strongly related to the range of the
spatial process. The range is defined as the spatial separation distance beyond which
measurements of an attribute can be considered uncorrelated. It is expected that the
geostatistical efficiency increases with the range of the attribute. This is clearly seen in
Figure 4.4. Finding a model that captures (on average) the dependency between range
and efficiency is of primary interest in this application. In contrast to the Turnip Greens
study (Example 4.2), tests of hypotheses about the relationship between prediction effi-
ciency and covariates are secondary in this study.



The seven observations plotted in Figure 4.4 suggest a quadratic trend between efficiency and the range of the spatial dependency, but the Mg observation clearly stands out. It is considerably off the trend. Deleting this observation will result in cleaner least squares estimates of the relationship but also reduces the data set by 14%. In §4.6.3 we examine the impact of the Mg observation on the least squares estimates and fit a model to these data that is robust to outlying observations.

4.2 Least Squares Estimation and Partitioning of Variation

4.2.1 The Principle

Box 4.1 Least Squares

• Least squares estimation rests on a geometric principle. The parameters are estimated by those values that minimize the sum of squared deviations between observations and the model: (Y - Xβ)'(Y - Xβ). It does not require Gaussian errors.

• Ordinary least squares (OLS) leads to best linear unbiased estimators in the classical model and minimum variance unbiased estimators if e ~ G(0, σ²I).

Recall the classical linear model Y = Xβ + e where the errors are uncorrelated and homoscedastic, Var[e] = σ²I. The least squares principle chooses as estimators of the parameters β = [β₀, β₁, ..., β_{k-1}]' those values that minimize the sum of the squared residuals

S(β) = ||e||² = e'e = (Y - Xβ)'(Y - Xβ).   [4.2]

One approach is to set derivatives of S(β) with respect to β to zero and to solve. This leads to the normal equations X'Y = X'Xβ̂ with solution β̂ = (X'X)⁻¹X'Y, provided X is of full rank k. If X is rank-deficient, a generalized inverse is used instead and the estimator becomes β̂ = (X'X)⁻X'Y.

The calculus approach disguises the geometric principle behind least squares somewhat. The simple identity

Y = Xβ̂ + (Y - Xβ̂),   [4.3]

which expresses the observations as the sum of a vector Xβ̂ of fitted values and a residual vector (Y - Xβ̂), leads to the following argument. The (n × 1) vector Y is a point in an n-dimensional space ℝⁿ. If the (n × k) matrix X is of full rank, its columns generate a k-dimensional subspace of ℝⁿ. In other words, the mean values Xβ cannot be points anywhere in ℝⁿ, but can only “live” in a subspace thereof. Whatever estimator β̃ we choose, Xβ̃ will also be a point in this subspace. So why not choose Xβ̃ so that its distance from Y is minimized by projecting Y perpendicularly onto the space generated by the columns of X (Figure 4.5)? This requires that

(Y - Xβ̂)'Xβ̂ = 0  ⟺  β̂'X'Y = β̂'X'Xβ̂  ⟺  X'Y = X'Xβ̂   [4.4]

and the ordinary least squares (OLS) estimate follows.

[Figure 4.5 appears here: a schematic showing Y projected perpendicularly onto the space generated by the columns of X; the residual Y - Xβ̂ is orthogonal to the fitted vector Xβ̂.]

Figure 4.5. The geometry of least squares.
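Numerically, β̂ is obtained by solving the normal equations. The SAS/IML sketch below is ours; the toy data are hypothetical. It also verifies the orthogonality of residuals and fitted values depicted in Figure 4.5.

   proc iml;
      X = {1 1, 1 2, 1 3, 1 4};        /* toy design: intercept and one regressor */
      y = {1.1, 1.9, 3.2, 3.8};        /* hypothetical responses                  */
      b    = solve(X`*X, X`*y);        /* solves the normal equations X'Xb = X'y  */
      yhat = X*b;
      e    = y - yhat;
      print b, (e`*yhat)[label="inner product of residuals and fit (~0)"];
   quit;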

If the classical linear model holds, the least squares estimator enjoys certain optimal properties. It is a best linear unbiased estimator (BLUE) since E[β̂] = β and no other unbiased estimator that is a linear function of Y has smaller variability. These appealing features do not require Gaussianity. It is possible, however, that some other, nonlinear estimator of β would have greater precision. If the model errors are Gaussian, i.e., e ~ G(0, σ²I), the ordinary least squares estimator of β is a minimum variance unbiased estimator (MVUE), extending its optimality beyond those estimators which are linear in Y. We will frequently denote the ordinary least squares estimator as β̂_OLS, to distinguish it from the generalized least squares estimator β̂_GLS that arises if e ~ (0, V), where V is a general variance-covariance matrix. In this case we minimize e'V⁻¹e and obtain the estimator

β̂_GLS = (X'V⁻¹X)⁻¹X'V⁻¹Y.   [4.5]

Table 4.4 summarizes some properties of the ordinary and generalized least squares estimators. We see that the ordinary least squares estimator remains unbiased, even if the error variance is not σ²I. However, Var[a'β̂_OLS] is typically larger than Var[a'β̂_GLS] in this case. The OLS estimator is less efficient. Additional details about the derivation and properties of β̂_GLS can be found in §A4.8.1. A third case, positioned between ordinary and generalized least squares, arises when V is a diagonal matrix and is termed weighted least squares estimation (WLSE). It is the appropriate estimation principle if the model errors are heteroscedastic but uncorrelated. If Var[e] = Diag(σ) = W, where σ is a vector containing the variances of the e_i, then the weighted least squares estimator is β̂_WLS = (X'W⁻¹X)⁻¹X'W⁻¹Y. If V = σ²I, the GLS estimator reduces to the OLS estimator; if V = Diag(σ) then GLS reduces to WLS; and if σ = σ²1, then WLS reduces to OLS.
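Continuing with the same toy data (our sketch; the error variances in w are hypothetical), the GLS estimator [4.5] can be computed alongside OLS. Because V is diagonal here, GLS coincides with the weighted least squares special case.

   proc iml;
      X = {1 1, 1 2, 1 3, 1 4};
      y = {1.1, 1.9, 3.2, 3.8};
      w = {1, 1, 4, 4};                /* hypothetical known error variances */
      V = diag(w);                     /* diagonal V: the WLS special case   */
      Vi = inv(V);
      b_ols = solve(X`*X, X`*y);
      b_gls = solve(X`*Vi*X, X`*Vi*y); /* [4.5]; identical to WLS here       */
      print b_ols b_gls;
   quit;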

Table 4.4. Properties of ordinary and generalized least squares estimators

                 e ~ (0, σ²I)          e ~ (0, V)
E[β̂_OLS]        β                     β
E[β̂_GLS]        β                     β
Var[β̂_OLS]      σ²(X'X)⁻¹             (X'X)⁻¹X'VX(X'X)⁻¹
Var[β̂_GLS]      σ²(X'X)⁻¹             (X'V⁻¹X)⁻¹

Notes            β̂_OLS is BLUE;        β̂_GLS is BLUE;
                 if e is Gaussian,      if e is Gaussian,
                 β̂_OLS is MVUE         β̂_GLS is MVUE

Because of the linearity of β̂ in Y it is simple to derive the distribution of β̂ if the errors of the linear model are Gaussian. In the model with general error variance V, for example,

β̂_GLS ~ G(β, (X'V⁻¹X)⁻¹).

Since OLS is a special case of GLS with V = σ²I we also have

β̂_OLS ~ G(β, σ²(X'X)⁻¹).

Standard hypothesis tests can be derived based on the Gaussian distribution of the estimator and usually lead to t- or F-tests (§A4.8.2). If the model errors are not Gaussian, the asymptotic distribution of β̂ is Gaussian nevertheless. With sufficiently large sample size one can thus proceed as if Gaussianity of β̂ holds.

4.2.2 Partitioning Variability through Sums of Squares


Recall that the norm ||a|| of a (p × 1) vector a is defined as

||a|| = √(a'a) = √(a₁² + a₂² + ... + a_p²)

and measures the length of the vector. By the orthogonality of the residual vector Y - Xβ̂ and the vector of predicted values Xβ̂ (Figure 4.5), the length of the observed vector Y is related to the length of the predictions and residuals by the Pythagorean theorem:

||Y||² = ||Y - Xβ̂||² + ||Xβ̂||².   [4.6]

The three terms in [4.6] correspond to the uncorrected total sum of squares (SST = ||Y||²), the residual sum of squares (SSR = ||Y - Xβ̂||²), and the model sum of squares (SSM = ||Xβ̂||²). Some straightforward manipulations yield the simpler expressions shown in the analysis of variance table for [4.1]. For example,

SSR = (Y - Xβ̂)'(Y - Xβ̂) = (Y - Xβ̂)'Y = Y'Y - β̂'X'Y.   [4.7]

Table 4.5. Analysis of variance table for the standard linear model (r(X) denotes rank of X)

Source              df          SS                      MS
Model               r(X)        β̂'X'Y = SSM            SSM/r(X)
Residual (Error)    n - r(X)    Y'Y - β̂'X'Y = SSR      SSR/(n - r(X))
Uncorrected Total   n           Y'Y = SST
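A quick numerical check of the partitioning [4.6]-[4.7], with the hypothetical toy data used earlier (our sketch):

   proc iml;
      X = {1 1, 1 2, 1 3, 1 4};
      y = {1.1, 1.9, 3.2, 3.8};
      b   = solve(X`*X, X`*y);
      SST = y`*y;                      /* uncorrected total sum of squares */
      SSM = b`*X`*y;                   /* model sum of squares             */
      SSR = SST - SSM;                 /* residual sum of squares, [4.7]   */
      print SST SSM SSR;               /* SST = SSM + SSR                  */
   quit;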

When the X matrix contains an intercept (regression models) or grand mean (classification models), it is common to correct the entries of the analysis of variance table for the mean with the term nȲ² (Table 4.6).

Table 4.6. Analysis of variance table corrected for the mean

Source              df           SS                          MS
Model               r(X) - 1     β̂'X'Y - nȲ² = SSM_m        SSM_m/(r(X) - 1)
Residual (Error)    n - r(X)     Y'Y - β̂'X'Y = SSR          SSR/(n - r(X))
Total               n - 1        Y'Y - nȲ² = SST_m

The model sum of squares SSM measures the joint explanatory power of the variables (including the intercept). If the explanatory variables in X were unrelated to the response, we would use Ȳ to predict the mean response, which is the ordinary least squares estimate in an intercept-only model. SSM_m, which measures variability explained beyond an intercept-only model, is thus the appropriate statistic for evaluating the predictive value of the explanatory variables and we use ANOVA tables in which the pertinent terms are corrected for the mean. The coefficient of determination (R²) is defined as

R² = SSM_m/SST_m = (β̂'X'Y - nȲ²)/(Y'Y - nȲ²).   [4.8]

4.2.3 Sequential and Partial Sums of Squares and the Sum of Squares Reduction Test

To assess the contribution of individual columns (variables) in X, SSM_m is further decomposed into r(X) - 1 single degree of freedom components. Consider the regression model

Y_i = β₀ + β₁x_1i + β₂x_2i + ... + β_{k-1}x_{k-1,i} + e_i.

To underline that SSM is the joint contribution of β₀ through β_{k-1}, we use the expression

SSM = SS(β₀, β₁, ..., β_{k-1}).

Similarly, SSM_m = SS(β₁, ..., β_{k-1}|β₀) is the joint contribution of β₁, ..., β_{k-1} after adjustment for the intercept (correction for the mean, Table 4.6). A partitioning of SSM_m into sequential one degree of freedom sums of squares is

SSM_m = SS(β₁|β₀)
      + SS(β₂|β₀, β₁)
      + SS(β₃|β₀, β₁, β₂)   [4.9]
      + ...
      + SS(β_{k-1}|β₀, ..., β_{k-2}).

SS(β₂|β₀, β₁), for example, is the sum of squares contribution accounted for by adding the regressor X₂ to a model already containing an intercept and X₁. The test statistic SS(β₂|β₀, β₁)/MSR can be used to test whether the addition of X₂ to a model containing X₁ and an intercept provides significant improvement of fit and hence is a gauge for the explanatory value of the regressor X₂. If the model errors are Gaussian, SS(β₂|β₀, β₁)/MSR has an F distribution with one numerator and n - r(X) denominator degrees of freedom. Since the sum of squares SS(β₂|β₀, β₁) has a single degree of freedom, we can also express this test statistic as

F_obs = MS(β₂|β₀, β₁)/MSR,

where MS(β₂|β₀, β₁) = SS(β₂|β₀, β₁)/1 is the sequential mean square. This is a special case of a sum of squares reduction test. Imagine we wish to test whether adding regressors X₂ and X₃ simultaneously to a model containing X₁ and an intercept improves the model fit. The change in the model sum of squares is calculated as

SS(β₂, β₃|β₀, β₁) = SS(β₀, β₁, β₂, β₃) - SS(β₀, β₁).

To obtain SS(β₂, β₃|β₀, β₁) we fit a model containing four regressors and obtain its model sum of squares. Call it the full model. Then a reduced model is fit, containing only an intercept and X₁. The difference of the model sums of squares of the two models is the contribution of adding X₂ and X₃ simultaneously. The mean square associated with the addition of the two regressors is MS(β₂, β₃|β₀, β₁) = SS(β₂, β₃|β₀, β₁)/2. Since both models have the same (corrected or uncorrected) total sum of squares we can express the test mean square also in terms of a residual sum of squares difference. This leads us to the general version of the sum of squares reduction test.

Consider a full model M_f and a reduced model M_r where M_r is obtained from M_f by constraining some (or all) of its parameters. Usually the constraints mean setting one or more parameters to zero, but other constraints are possible, for example, β₁ = β₂ = 6. If (i) SSR_f is the residual sum of squares obtained from fitting the full model, (ii) SSR_r is the respective sum of squares for the reduced model, and (iii) q is the number of parameters constrained in the full model to obtain M_r, then



F_obs = [(SSR_r - SSR_f)/q] / [SSR_f/dfR_f] = [(SSR_r - SSR_f)/q] / MSR_f   [4.10]

is distributed as an F random variable on q numerator and dfR_f denominator degrees of freedom, provided that the model errors are Gaussian-distributed. Here, dfR_f are the residual degrees of freedom in the full model, i.e., the denominator of F_obs is the mean square error in the full model. [4.10] is called a sum of squares reduction statistic because the term SSR_r - SSR_f in the numerator measures how much the residual sum of squares is reduced when the constraints on the reduced model are removed.

It is now easy to see that F_obs = MSM_m/MSR is a special case of a sum of squares reduction test. In a model with k parameters (including the intercept), we have SSM = SS(β₀, β₁, β₂, ..., β_{k-1}) and can test H₀: β₁ = β₂ = ... = β_{k-1} = 0 by choosing a reduced model containing only an intercept. The model sum of squares of the reduced model is SS(β₀) and their difference,

SS(β₀, β₁, β₂, ..., β_{k-1}) - SS(β₀),

is SSM_m with k - 1 degrees of freedom. See §A4.8.2 for more details on the sum of squares reduction test.
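In SAS, sum of squares reduction tests of linear hypotheses can be requested without fitting the reduced models explicitly; the test statement of proc reg computes [4.10] directly. The sketch below is ours and assumes a data set (here called turnip, as in the proc glm code shown later) in which the squared moisture term x4 has already been created.

   proc reg data=turnip;
      model vitamin = sunlight moisture airtemp x4;
      Ho_b3:   test airtemp = 0;                  /* q = 1 constraint  */
      Ho_b1b3: test sunlight = 0, airtemp = 0;    /* q = 2 constraints */
   run; quit;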
The sequential sum of squares decomposition [4.9] is not unique; it depends on the order in which the regressors enter the model. Regardless of this order, the sequential sums of squares add up to the model sum of squares. This may appear to be an appealing feature, but in practice it is of secondary importance. A decomposition into one degree of freedom sums of squares that does not (necessarily) add up to anything useful but is much more relevant in practice is based on partial sums of squares. A partial sum of squares is the contribution made by one explanatory variable in the presence of all other regressors, not only the regressors preceding it. In a four-regressor model (not counting the intercept), for example, the partial sums of squares are

SS(β₁|β₀, β₂, β₃, β₄)
SS(β₂|β₀, β₁, β₃, β₄)
SS(β₃|β₀, β₁, β₂, β₄)
SS(β₄|β₀, β₁, β₂, β₃).

Partial sums of squares do not depend on the order in which the regressors enter a model and are usually more informative for purposes of hypothesis testing.

Example 4.2 Turnip Greens (continued). Recall the Turnip Greens data on p. 91. We need to develop a model that relates the vitamin B₂ content in turnip leaves to the explanatory variables Sunlight (X₁), Soil Moisture (X₂), and Air Temperature (X₃). Figure 4.3 on p. 92 suggests that the relationship between vitamin content and soil moisture is probably quadratic, rather than linear. We fit the following multiple regression model to the 27 observations

Y_i = β₀ + β₁x_1i + β₂x_2i + β₃x_3i + β₄x_4i + e_i,

where x_4i = x²_2i. Assume the following questions are of particular interest:


• Does the addition of Air Temperature (X₃) significantly improve the fit of the model in the presence of the other variables?
• Is the quadratic Soil Moisture term necessary?
• Is the simultaneous contribution of Air Temperature and Sunlight to the model significant?

In terms of the model parameters, the questions translate into the hypotheses

• H₀: β₃ = 0 (given X₁, X₂, X₄ are in the model)
• H₀: β₄ = 0 (given X₁, X₂, X₃ are in the model)
• H₀: β₁ = β₃ = 0 (given X₂ and X₄ are in the model).

The full model is the four-regressor model and the reduced models are given in Table 4.7 along with their residual sums of squares. Notice that the reduced model for the third hypothesis is a quadratic polynomial in soil moisture.

Table 4.7. Residual sums of squares and degrees of freedom of various models along with test statistics for sum of squares reduction test

               Reduced Model
Hypothesis     Contains        Residual SS   Residual df   F_obs    P-value
β₃ = 0         X₁, X₂, X₄      1,031.18      23             5.677   0.0263
β₄ = 0         X₁, X₂, X₃      2,243.18      23            38.206   0.0001
β₁ = β₃ = 0    X₂, X₄          1,180.60      24             4.843   0.0181
Full Model                       819.68      22

The test statistic for the sum of squares reduction tests is

F_obs = [(SSR_r - SSR_f)/q] / [SSR_f/22],

where q is the difference in residual degrees of freedom between the full and a reduced model. For example, for the test of H₀: β₃ = 0 we have

F_obs = [(1,031.18 - 819.68)/1] / [819.68/22] = 211.50/37.259 = 5.677

and for H₀: β₄ = 0

F_obs = [(2,243.18 - 819.68)/1] / [819.68/22] = 1,423.50/37.259 = 38.206.

P-values are calculated from an F distribution with q numerator and 22 denominator degrees of freedom.

Using proc glm of The SAS® System, the full model is analyzed with the statements



proc glm data=turnip;
model vitamin = sunlight moisture airtemp x4 ;
run; quit;

Output 4.1.
The GLM Procedure

Dependent Variable: vitamin


Sum of
Source DF Squares Mean Square F Value Pr > F
Model 4 8330.845545 2082.711386 55.90 <.0001
Error 22 819.684084 37.258367
Corrected Total 26 9150.529630

R-Square Coeff Var Root MSE vitamin Mean


0.910422 7.249044 6.103963 84.20370

Source DF Type I SS Mean Square F Value Pr > F


sunlight 1 97.749940 97.749940 2.62 0.1195
moisture 1 6779.103952 6779.103952 181.95 <.0001
airtemp 1 30.492162 30.492162 0.82 0.3754
X4 1 1423.499492 1423.499492 38.21 <.0001

Source DF Type III SS Mean Square F Value Pr > F


sunlight 1 71.084508 71.084508 1.91 0.1811
moisture 1 1085.104925 1085.104925 29.12 <.0001
airtemp 1 211.495902 211.495902 5.68 0.0263
X4 1 1423.499492 1423.499492 38.21 <.0001

Standard
Parameter Estimate Error t Value Pr > |t|

Intercept 119.5714052 13.67649483 8.74 <.0001


sunlight -0.0336716 0.02437748 -1.38 0.1811
moisture 5.4250416 1.00526166 5.40 <.0001
airtemp -0.5025756 0.21094164 -2.38 0.0263
X4 -0.1209047 0.01956034 -6.18 <.0001

The analysis of variance for the full model leads to SSM_m = 8,330.845, SSR = 819.684, and SST_m = 9,150.529. The mean square error estimate is MSR = 37.258. The four regressors jointly account for 91% of the variability in the vitamin B₂ content of turnip leaves. The test statistic F_obs = MSM_m/MSR = 55.90 (p < 0.0001) is used to test the global hypothesis H₀: β₁ = β₂ = β₃ = β₄ = 0. Since it is rejected, we conclude that at least one of the regressors explains a significant amount of vitamin B₂ variability in the presence of the others.

Sequential sums of squares are labeled Type I SS by proc glm (Output 4.1). For example, SS(β₁|β₀) = 97.749, SS(β₂|β₀, β₁) = 6,779.10. Also notice that the sequential sums of squares add up to SSM_m:

97.749 + 6,779.104 + 30.492 + 1,423.500 = 8,330.845 = SSM_m.
The partial sums of squares are listed as Type III SS. The partial hypothesis H₀: β₃ = 0 can be answered directly from the output of the fitted full model without actually fitting a reduced model and calculating the sum of squares reduction test. The F_obs statistics shown as F Value and the p-values shown as Pr > F in the Type III SS table are partial tests of the individual parameters. Notice that the last sequential and the last partial sums of squares are always identical. The hypothesis H₀: β₄ = 0 can be tested based on either sum of squares for X4.

The last table of Output 4.1 shows the parameter estimates and their standard errors. For example, β̂₀ = 119.571, β̂₁ = -0.0337, and so forth. The t_obs statistics shown in the column t Value are the ratios of a parameter estimate and its estimated standard error, t_obs = β̂_j/ese(β̂_j). The two-sided t-tests are identical to the partial F-tests in the Type III SS table; for airtemp, for example, t²_obs = (-2.38)² ≈ 5.68 = F_obs. See §A4.8.2 for the precise correspondence between partial F-tests for single variables and the t-tests.

In the Turnip Greens example, sequential and partial sums of squares are not identical. If they were, it would not matter in which order the regressors enter the model. When should we expect the two sets of sums of squares to be the same? When the explanatory variables are orthogonal, for example, which requires that the inner product of pairs of columns of X is zero. The sum of squares contribution of one regressor then does not depend on whether the other regressor is in the model. Sequential and partial sums of squares may coincide under conditions less stringent than orthogonality of the columns of X. In classification (ANOVA) models where the columns of X consist of dummy (design) variables, the sums of squares coincide if the data exhibit a certain balance. Hinkelmann and Kempthorne (1994, pp. 87-88) show that for the two-way classification without interaction, a sufficient condition is equal frequencies (replication) of the factor-level combinations. An analysis of variance table where sequential and partial sums of squares are identical is termed an orthogonal ANOVA. There are other differences in the sum of squares partitioning between regression and analysis of variance models. Most notably, in ANOVA models we usually test subsets rather than individual parameters and the identity of the parameters in subsets reflects informative structural relationships among the factor levels. The single degree of freedom sum of squares partitioning in ANOVA classification models can be accomplished with sums of squares of orthogonal contrasts, which are linear combinations of the model parameters. Because of these subtleties and the importance of ANOVA models in agronomic data analysis, we devote §4.3 to classification models exclusively.
Sequential sums of squares are of relatively little interest unless there is a natural order in which the various explanatory variables or effects should enter the model. A case in point are polynomial regression models where regressors reflect successively higher powers of a single variable. Consider the cubic polynomial

Y_i = β₀ + β₁x_i + β₂x_i² + β₃x_i³ + e_i.

The model sum of squares is decomposed sequentially as

SSM_m = SS(β₁|β₀) + SS(β₂|β₀, β₁) + SS(β₃|β₀, β₁, β₂).

Read from the bottom, these sums of squares can be used to answer the questions



Is there a cubic trend beyond the linear and quadratic trends?
Is there a quadratic trend beyond the linear trend?
Is there a linear trend?

Statisticians and practitioners do not perfectly agree on how to build a final model based on the answers to these questions. One school of thought is that if an interaction is found significant the associated main effects (lower-order terms) should also be included in the model. Since we can think of the cubic term as the interaction between the linear and quadratic terms (x³ = x × x²), if x³ is found to make a significant contribution in the presence of x and x², these terms would not be tested further and remain in the model. The second school of thought tests all terms individually and retains only the significant ones. A third-order term without the linear or quadratic term in the model is then possible. In observational studies where the regressors are rarely orthogonal, we adopt the first philosophy. In designed experiments where the treatments are levels on a continuous scale, such as a rate of application, we prefer to adopt the second school of thought. If the design matrix is orthogonal we can then easily test the significance of a cubic term independently from the linear or quadratic term using orthogonal polynomial contrasts.
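For unequally spaced levels such as the rates 0, 1, 2, 4, 8 tons in Example 4.1, the coefficients of orthogonal polynomial contrasts are not the familiar tabled values for equal spacing. The SAS/IML function ORPOL generates them; the short sketch below is ours.

   proc iml;
      rates = {0, 1, 2, 4, 8};
      C = orpol(rates, 3);             /* columns: constant, linear, quadratic, cubic */
      print (C[,2:4])[label="orthogonal polynomial contrast coefficients"];
   quit;

Because the columns of C are orthogonal, the corresponding single degree of freedom sums of squares partition the trend variability into nonoverlapping linear, quadratic, and cubic components in a balanced design.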

4.3 Factorial Classification


A factorial classification model is a statistical model in which two or more classification variables (factors) are related to the response of interest and the levels of the factors are crossed. Two factors A and B are said to be crossed if each level of one factor is combined with each level of the other factor. The factors in Example 4.1 are crossed, for example, since each level of factor Lime Type (AL, GL) is combined with each level of the factor Rate of Application. If in addition there are replicate values for the factor-level combinations, we can study main effects and interactions of the factors. A randomized complete block design, for example, also involves two factors (a block and a treatment factor), but since each treatment appears exactly once in each block, the design does not provide replications of the block × treatment combinations. This is the reason why block × treatment interactions cannot be studied in a randomized block design but in the generalized randomized block design where treatments are replicated within the blocks.

If the levels of one factor are not identical across the levels of another factor, the factors are said to be nested. For example, if the rates of application for agricultural lime were different from the rates of application of granulated lime, the Rate factor would be nested within the factor Lime Type. In multilocation trials where randomized block designs with the same treatments are performed at different locations, the block effects are nested within locations, because block number 1 at location 1 is not the same physical entity as block number 1 at location 2. If the same treatments are applied at either location, treatments are crossed with locations, however. In studies of heritability where a random sample of dams is mated to a particular sire, dams are nested within sires and the offspring are nested within dams. If factors are nested, a study of their interaction is not possible. This can be gleaned from the degree of freedom decomposition. Consider two factors A and B with a and b levels, respectively. If the factors are crossed, the treatment source of variability is associated with a × b - 1 degrees of freedom. It decomposes into (a - 1) degrees of freedom for the A main effect, (b - 1) degrees of freedom for the B main effect, and (a - 1)(b - 1) degrees of freedom for the interaction. If B is nested within A, the treatment degrees of freedom decompose into (a - 1) degrees of freedom for the A main effect and a(b - 1) degrees of freedom for the nested factor, combining the degrees of freedom for the B main effect and the A × B interaction: (b - 1) + (a - 1)(b - 1) = a(b - 1). In this section we focus on the two-factor crossed classification with replications in which main effects and interactions can be studied (see §4.3.2 for definitions). Rather than providing a comprehensive treatise of experimental data analysis, this section intends to highlight the standard operations and techniques that apply in classification models where treatment comparisons are of primary concern.

4.3.1 The Means and Effects Model


Box 4.2 Means and Effects Model

• The means model expresses observations as random deviations from the cell
means; its design matrix is of full rank.

• The effects model decomposes cell means into grand mean, main effects, and interactions; its design matrix is deficient in rank.

Two equivalent ways of representing a classification model are termed the means and the effects model. We prefer effects models in general, although on the surface means models are simpler. But the study of main effects and interactions, which are of great concern in classification models, is simpler in effects models. In the two-way classification with replications (e.g., Example 4.1) an observation Y_ijk for the kth replicate of Lime Type i and Rate j can be expressed as a random deviation from the mean of that particular treatment combination:

Y_ijk = μ_ij + e_ijk.   [4.11]

Here, e_ijk denotes the experimental error associated with the kth replicate of the ith lime type and jth rate of application {i = 1, ..., a = 2; j = 1, ..., b = 5; k = 1, ..., r = 5}. μ_ij denotes the mean pH of an experimental unit receiving lime type i at rate j; hence the name means model. To finish the model formulation assume the e_ijk are uncorrelated random variables with mean 0 and common variance σ². Model [4.11] can then be written in matrix-vector notation as Y = Xμ + e, where

Ô Y"" × Ô 1 0 0 0 0 0 0 0 0 0× Ô ."" × Ô e"" ×


Ö Y"# Ù Ö 0 1 0 0 0 0 0 0 0 0Ù Ö ."# Ù Ö e"# Ù
Ö Ù Ö Ù Ö Ù Ö Ù
Ö Y"$ Ù Ö 0 0 1 0 0 0 0 0 0 0Ù Ö ."$ Ù Ö e"$ Ù
Ö Ù Ö Ù Ö Ù Ö Ù
Ö Y"% Ù Ö 0 0 0 1 0 0 0 0 0 0Ù Ö ."% Ù Ö e"% Ù
Ö Ù Ö Ù Ö Ù Ö Ù
ÖY Ù Ö0 0 0 0 1 0 0 0 0 0Ù Ö ."& Ù Ö e"& Ù
Ya&!‚"b œ Ö "& Ù œ Ö Ù Ö Ù€Ö Ù. [4.12]
Ö Y#" Ù Ö 0 0 0 0 0 1 0 0 0 0Ù Ö .#" Ù Ö e#" Ù
Ö Ù Ö Ù Ö Ù Ö Ù
Ö Y## Ù Ö 0 0 0 0 0 0 1 0 0 0Ù Ö .## Ù Ö e## Ù
Ö Ù Ö Ù Ö Ù Ö Ù
Ö Y#$ Ù Ö 0 0 0 0 0 0 0 1 0 0Ù Ö .#$ Ù Ö e#$ Ù
Ö Ù Ö Ù Ö Ù Ö Ù
Y#% 0 0 0 0 0 0 0 0 1 0 .#% e#%
Õ Y#& Ø Õ 0 0 0 0 0 0 0 0 0 1Ø Õ .#& Ø Õ e#& Ø
a&!‚"!b

The a& ‚ "b vector Y"" , for example, contains the replicate observations for lime type " and

© 2003 by CRC Press LLC


application rate ! tons. Notice that X is not the identity matrix. Each 1 in X is a a& ‚ "b vec-
tor of ones and each 0 is a a& ‚ "b vector of zeros. The design matrix X is of full column
rank and the inverse of Xw X exists:

                                  [ 0.2  0    0    0   ...  0   ]
                                  [ 0    0.2  0    0   ...  0   ]
(X'X)⁻¹ = Diag((1/5)·1₁₀) =       [ 0    0    0.2  0   ...  0   ]
                                  [ 0    0    0    0.2 ...  0   ]
                                  [ ...                 ...     ]
                                  [ 0    0    0    0   ...  0.2 ]

The ordinary least squares estimate of μ is thus simply the vector of sample means in the a × b groups:

μ̂ = (X'X)⁻¹X'y = [ȳ₁₁., ȳ₁₂., ..., ȳ₁₅., ȳ₂₁., ȳ₂₂., ..., ȳ₂₅.]'.

A different parameterization of the two-way model can be derived if we think of the μ_ij as cell means in the body of a two-dimensional table in which the factors are cross-classified (Table 4.8). The row and column averages of this table are denoted μ_i. and μ._j, where the dot replaces the index over which averaging is carried out. We call μ_i. and μ._j the marginal means for factor levels i of Lime Type and j of Rate of Application, respectively, since they occupy positions in the margin of the table (Table 4.8).

Table 4.8. Cell and marginal means in Lime Application (Example 4.1)

                                Rate of Application
                       0     1     2     4     8
Agricultural lime    μ₁₁   μ₁₂   μ₁₃   μ₁₄   μ₁₅    μ₁.
Granulated lime      μ₂₁   μ₂₂   μ₂₃   μ₂₄   μ₂₅    μ₂.
                     μ.₁   μ.₂   μ.₃   μ.₄   μ.₅

Marginal means are arithmetic averages of cell means (Yandell, 1997, p. 109), even if the data are unbalanced:

μ_i. = (1/b) Σ_{j=1}^b μ_ij,   μ._j = (1/a) Σ_{i=1}^a μ_ij.

The grand mean μ is defined as the average of all cell means, μ = Σ_i Σ_j μ_ij/(ab). To construe marginal means as weighted means, where weighting would take into account the number of observations n_ij for a particular cell, would define population quantities as functions of sample quantities. The relationships of the means should not depend on how many observations are sampled or how many times treatments are replicated. See Chapters 4.6 and 4.7 in Searle (1987) for a comparison of the weighted and unweighted schemes.
We can now write the mathematical identity

μ_ij = μ + (μ_i. - μ) + (μ._j - μ) + (μ_ij - μ_i. - μ._j + μ)   [4.13]
     = μ + α_i + β_j + (αβ)_ij.

Models based on this decomposition are termed effects models since α_i = (μ_i. - μ) measures the effect of the ith level of factor A, β_j = (μ._j - μ) the effect of the jth level of factor B, and (αβ)_ij their interaction. The nature and precise interpretation of main effects and interactions is studied in more detail in §4.3.2. For now we notice that the effects obey certain constraints by construction:

Σ_{i=1}^a α_i = 0,   Σ_{j=1}^b β_j = 0,   Σ_{i=1}^a (αβ)_ij = Σ_{j=1}^b (αβ)_ij = 0.   [4.14]

A two-way factorial layout coded as an effects model can be expressed as a sum of separate vectors. We obtain

Y = 1μ + X_α α + X_β β + X_αβ φ + e,   [4.15]

where α = [α₁, α₂]', β = [β₁, ..., β₅]', φ = [(αβ)₁₁, (αβ)₁₂, ..., (αβ)₂₅]', 1 is a (50 × 1) vector of ones, and, with each 1 and 0 below denoting a (5 × 1) vector of ones or zeros,

      [ 1 0 ]         [ 1 0 0 0 0 ]
      [ 1 0 ]         [ 0 1 0 0 0 ]
      [ 1 0 ]         [ 0 0 1 0 0 ]
      [ 1 0 ]         [ 0 0 0 1 0 ]
X_α = [ 1 0 ],  X_β = [ 0 0 0 0 1 ].
      [ 0 1 ]         [ 1 0 0 0 0 ]
      [ 0 1 ]         [ 0 1 0 0 0 ]
      [ 0 1 ]         [ 0 0 1 0 0 ]
      [ 0 1 ]         [ 0 0 0 1 0 ]
      [ 0 1 ]         [ 0 0 0 0 1 ]

The matrix X_αβ is the same as the X matrix in [4.12]. Although the latter is nonsingular, it is clear that in the effects model the complete design matrix P = [1, X_α, X_β, X_αβ] is rank-deficient. The columns of X_α, the columns of X_β, and the columns of X_αβ sum to 1. This results from the linear constraints in [4.14].

We now proceed to define various effect types based on the effects model.



4.3.2 Effect Types in Factorial Designs
Box 4.3 Effect Types in Crossed Classifications

• Simple effects are comparisons of the cell means where one factor is held
fixed.

• Interaction effects are contrasts among simple effects.

• Main effects are contrasts among the marginal means and can be expressed
as averages of simple effects.

• Simple main effects are differences of cell and marginal means.

• Slices are collections of simple main effects.

Hypotheses can be expressed in terms of relationships among the cell means μ_ij or in terms of the effects α_i, β_j, and (αβ)_ij in [4.13]. These relationships are classified into the following categories (effects):
• simple effects
• interaction effects
• main effects, and
• simple main effects

A simple effect is the most elementary comparison. It is a comparison of the μ_ij where one of the factors is held fixed. For example, μ₁₃ - μ₂₃ is a comparison of lime types at rate 2 tons, and μ₂₂ - μ₂₃ is a comparison of the 1 and 2 ton application rates for granulated lime. By comparisons we do not just have pairwise tests in mind, but more generally, contrasts. A contrast is a linear function of parameters in which the coefficients of the linear function sum to zero. A simple effect of application rate for agricultural lime (i = 1) is a contrast among the μ_1j,

ℓ = Σ_{j=1}^b c_j μ_1j,

where Σ_{j=1}^b c_j = 0. The c_j are called the contrast coefficients. The simple effect μ₂₂ - μ₂₃ has contrast coefficients (0, 1, -1, 0, 0).
Interaction effects are contrasts among simple effects. Consider $\ell_1 = \mu_{11} - \mu_{21}$,
$\ell_2 = \mu_{12} - \mu_{22}$, and $\ell_3 = \mu_{13} - \mu_{23}$, which are simple Lime Type effects at rates 0, 1, and 2
tons, respectively. The contrast
$$\ell_4 = 1 \times \ell_1 - 2 \times \ell_2 + 1 \times \ell_3$$
is an interaction effect. It tests whether the difference in pH between lime types changes
linearly with application rates between 0 and 2 tons. Interactions are interpreted as the non-
constancy of contrasts in one factor as the levels of the other factor are changed. In the
absence of interactions, $\ell_4 = 0$. In this case it would be reasonable to disengage the Rate of
Application in comparisons of Lime Types and vice versa. Instead, one should focus on
contrasts among the marginal averages across rates of application and averages across lime
types.
Main effects are contrasts among the marginal means. A main effect of Lime Type is a
contrast among the $\mu_{i\cdot}$ and a main effect of Application Rate is a contrast among the $\mu_{\cdot j}$.
Since simple effects are the elementary comparisons, it is of interest to find out how main
effects relate to these. Consider the contrast $\mu_{1\cdot} - \mu_{2\cdot}$, which is the main effect of Lime Type
since the factor has only two levels. Write $\mu_{1\cdot} = 0.2(\mu_{11} + \mu_{12} + \mu_{13} + \mu_{14} + \mu_{15})$ and
similarly for $\mu_{2\cdot}$. Thus,
$$\mu_{1\cdot} - \mu_{2\cdot} = \frac{1}{5}(\mu_{11} + \mu_{12} + \mu_{13} + \mu_{14} + \mu_{15}) - \frac{1}{5}(\mu_{21} + \mu_{22} + \mu_{23} + \mu_{24} + \mu_{25})$$
$$= \frac{1}{5}(\mu_{11} - \mu_{21}) + \frac{1}{5}(\mu_{12} - \mu_{22}) + \cdots + \frac{1}{5}(\mu_{15} - \mu_{25}). \qquad [4.16]$$
Each of the five terms in this expression is a Lime Type simple effect, one at each rate of
application, and we conclude that main effects are averages of simple effects. The marginal
difference, $\mu_{1\cdot} - \mu_{2\cdot}$, will be identical to the simple effects in [4.16] if the lines in Figure 4.2
(p. 89) are parallel. This is the condition under which interactions are absent and immediately
sheds some light on the proper interpretation and applicability of main effects. If interactions
are present, some of the simple effects in [4.16] may be of positive sign, others may be of
negative sign. The marginal difference, $\mu_{1\cdot} - \mu_{2\cdot}$, may be close to zero, masking the fact that
lime types are effective at various rates of application. If the sign of the differences remains
the same across application rates, as is the case here (see Figure 4.2), an interpretation of the
main effects remains possible even in the light of interactions.
The presence of interactions implies that at least some simple effects change with the
level of the factor held fixed in the simple effects. One may thus be tempted to perform sepa-
rate one-way analyses at each rate of application and separate one-way analyses for each lime
type. The drawback of this approach is a considerable loss of degrees of freedom and hence a
loss of power. If, for example, an analysis is performed with data from 0 tons alone, the resi-
dual error will be associated with $2(5-1) = 8$ degrees of freedom, whereas $2 \times 5(5-1) = 40$
degrees of freedom are available in the two-way factorial analysis. It is more efficient to
test the hypotheses in the analysis based on the full data.
A comparison of application rates for each lime type involves tests of the hypotheses
$$①\; H_0{:}\ \mu_{11} = \mu_{12} = \mu_{13} = \mu_{14} = \mu_{15}, \qquad ②\; H_0{:}\ \mu_{21} = \mu_{22} = \mu_{23} = \mu_{24} = \mu_{25}. \qquad [4.17]$$
① is called the slice of the Lime × Rate interaction at lime type 1 and ② is called the slice of
the Lime × Rate interaction at lime type 2. Similarly, slices at the various application rates are
tests of the hypotheses

$$①\; H_0{:}\ \mu_{11} = \mu_{21}, \quad ②\; H_0{:}\ \mu_{12} = \mu_{22}, \quad ③\; H_0{:}\ \mu_{13} = \mu_{23}, \quad ④\; H_0{:}\ \mu_{14} = \mu_{24}, \quad ⑤\; H_0{:}\ \mu_{15} = \mu_{25}.$$

How do slices relate to simple and main effects? Kirk (1995, p. 377) defines as simple
main effects comparisons of the type $\mu_{ik} - \mu_{i\cdot}$ and $\mu_{kj} - \mu_{\cdot j}$. Consider the first case. If all
simple main effects are identical, then
$$\mu_{i1} - \mu_{i\cdot} = \mu_{i2} - \mu_{i\cdot} = \cdots = \mu_{ib} - \mu_{i\cdot},$$
which implies $\mu_{i1} = \mu_{i2} = \cdots = \mu_{ib}$. This is the slice at the $i$th level of factor $A$.
Schabenberger, Gregoire, and Kong (2000) made this correspondence between the various
effects in a factorial structure more precise in terms of matrices and vectors. They also proved
the following result noted earlier by Winer (1971, p. 347). If you assemble all possible slices
of $A \times B$ by the levels of $A$ (for example, ① and ② in [4.17]), the sum of squares associated
with this assembly is identical to the sum of squares for the $B$ main effect plus that for the
$A \times B$ interaction.

4.3.3 Sum of Squares Partitioning through Contrasts


The definition of main effects as contrasts among marginal means and interactions as
contrasts among simple effects seems at odds with the concept of the main effect of factor $A$
(or $B$) and the interaction of $A$ and $B$. The main effect of a factor can be represented by a
collection of contrasts among marginal means, all of which are mutually orthogonal. Two
contrasts $\ell_1 = \sum_{i=1}^{a} c_{1i}\,\mu_{i\cdot}$ and $\ell_2 = \sum_{i=1}^{a} c_{2i}\,\mu_{i\cdot}$ are orthogonal if
$$\sum_{i=1}^{a} c_{1i}\,c_{2i} = 0.$$

We do not distinguish between balanced and unbalanced cases since the definition of the
population quantities $\ell_1$ and $\ell_2$ should be independent of the sample design. A complete set
of orthogonal contrasts for a factor with $p$ levels is any set of $(p-1)$ contrasts in which the
members are mutually orthogonal. We can always fall back on the following complete set:
$$\mathbf{C}_p = \begin{bmatrix} 1 & -1 & 0 & 0 & \cdots & 0\\ 1 & 1 & -2 & 0 & \cdots & 0\\ \vdots & & & & & \vdots\\ 1 & 1 & 1 & 1 & \cdots & -(p-1) \end{bmatrix}_{((p-1)\times p)}. \qquad [4.18]$$

For the factor Lime Type with only two levels a complete set contains only a single contrast
with coefficients 1 and $-1$; for the factor Rate of Application the set would be
$$\mathbf{C}_5 = \begin{bmatrix} 1 & -1 & 0 & 0 & 0\\ 1 & 1 & -2 & 0 & 0\\ 1 & 1 & 1 & -3 & 0\\ 1 & 1 & 1 & 1 & -4 \end{bmatrix}.$$

The sum of squares for a contrast among the marginal means of application rate is calculated
as
$$SS(\ell) = \frac{\hat{\ell}^{\,2}}{\sum_{j=1}^{b} c_j^2/n_{\cdot j}},$$

and if contrasts are orthogonal, their sum of squares contributions are additive. If the contrast
sums of squares for any four orthogonal contrasts among the marginal Rate means $\mu_{\cdot j}$ are
added, the resulting sum of squares is that of the Rate main effect. Using the generic
complete set above, let the contrasts be
$$\ell_1 = \mu_{\cdot 1} - \mu_{\cdot 2}, \quad \ell_2 = \mu_{\cdot 1} + \mu_{\cdot 2} - 2\mu_{\cdot 3}, \quad \ell_3 = \mu_{\cdot 1} + \mu_{\cdot 2} + \mu_{\cdot 3} - 3\mu_{\cdot 4}, \quad \ell_4 = \mu_{\cdot 1} + \mu_{\cdot 2} + \mu_{\cdot 3} + \mu_{\cdot 4} - 4\mu_{\cdot 5}.$$

The contrast sums of squares are calculated in Table 4.9. The only contrast defining the Lime
Type main effect is $\mu_{1\cdot} - \mu_{2\cdot}$.

Table 4.9. Main effects contrasts for Rate of Application and Lime Type
($n_{\cdot j} = 10$, $n_{i\cdot} = 25$)

                     Mean                           Estimate                   $\sum c^2/n$   $SS(\ell)$
Rate 0 tons          $\bar{y}_{\cdot 1} = 5.733$    $\hat{\ell}_1 = -0.094$    0.2            0.044
Rate 1 ton           $\bar{y}_{\cdot 2} = 5.827$    $\hat{\ell}_2 = -0.202$    0.6            0.068
Rate 2 tons          $\bar{y}_{\cdot 3} = 5.881$    $\hat{\ell}_3 = -0.633$    1.2            0.334
Rate 4 tons          $\bar{y}_{\cdot 4} = 6.025$    $\hat{\ell}_4 = -1.381$    2.0            0.954
Rate 8 tons          $\bar{y}_{\cdot 5} = 6.212$
Agricultural lime    $\bar{y}_{1\cdot} = 6.038$     $\hat{\ell} = 0.205$       0.08           0.525
Granulated lime      $\bar{y}_{2\cdot} = 5.833$

The sum of squares for the Rate main effect is
$$SS(\text{Rate}) = 0.044 + 0.068 + 0.334 + 0.954 = 1.40$$
and that for the Lime Type main effect is $SS(\text{Lime}) = 0.205^2/0.08 = 0.525$.
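As a check, the first Rate contrast in Table 4.9 reproduces its sum of squares by direct substitution into $SS(\ell)$:
$$\hat{\ell}_1 = \bar{y}_{\cdot 1} - \bar{y}_{\cdot 2} = 5.733 - 5.827 = -0.094, \qquad \sum_{j=1}^{b} c_j^2/n_{\cdot j} = \frac{1^2 + (-1)^2}{10} = 0.2,$$
$$SS(\ell_1) = \frac{(-0.094)^2}{0.2} = 0.044.$$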
The contrast set from which to obtain the interaction sum of squares is constructed by un-
folding the main effects contrasts to correspond to the cell means and multiplying the contrast
coefficients element by element. Since there are four contrasts defining the Rate main effect
and one contrast for the Lime main effect, there will be $4 \times 1 = 4$ contrasts defining the inter-
actions (Table 4.10).

Table 4.10. Construction of interaction contrasts by unfolding the design
(empty cells correspond to zero coefficients)

                        $c_{11}$ $c_{12}$ $c_{13}$ $c_{14}$ $c_{15}$  $c_{21}$ $c_{22}$ $c_{23}$ $c_{24}$ $c_{25}$
Rate Main Effects
  $\ell_1$                  1      -1                                     1      -1
  $\ell_2$                  1       1      -2                             1       1      -2
  $\ell_3$                  1       1       1      -3                     1       1       1      -3
  $\ell_4$                  1       1       1       1      -4             1       1       1       1      -4
Lime Main Effect
  $\ell$                    1       1       1       1       1            -1      -1      -1      -1      -1
Interaction Contrasts
  $\ell_1 \times \ell$      1      -1                                    -1       1
  $\ell_2 \times \ell$      1       1      -2                            -1      -1       2
  $\ell_3 \times \ell$      1       1       1      -3                    -1      -1      -1       3
  $\ell_4 \times \ell$      1       1       1       1      -4            -1      -1      -1      -1       4

Sums of squares of the interaction contrasts are calculated as linear functions among the
cell means $(\hat{\mu}_{ij})$. For example,
$$SS(\ell_1 \times \ell) = \frac{(\bar{y}_{11\cdot} - \bar{y}_{12\cdot} - \bar{y}_{21\cdot} + \bar{y}_{22\cdot})^2}{\{1^2 + (-1)^2 + (-1)^2 + 1^2\}/5}.$$
The divisor of 5 stems from the fact that each cell mean is estimated as the arithmetic average
of the five replications for that treatment combination. The interaction sum of squares is then
finally
$$SS(\text{Lime} \times \text{Rate}) = SS(\ell_1 \times \ell) + SS(\ell_2 \times \ell) + SS(\ell_3 \times \ell) + SS(\ell_4 \times \ell).$$

This procedure of calculating main effects and interaction sums of squares is cumber-
some. It demonstrates, however, that a main effect with $(a-1)$ degrees of freedom can be
partitioned into $(a-1)$ mutually orthogonal single-degree-of-freedom contrasts. It also high-
lights why there are $(a-1)(b-1)$ degrees of freedom in the interaction between two factors
with $a$ and $b$ levels, respectively, because of the way they are obtained from crossing $(a-1)$
and $(b-1)$ main effects contrasts. In practice one obtains the sums of squares of contrasts and
the sum of squares partitioning into main effects and interactions with a statistical computing
package.
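The unfolding of Table 4.10 can also be automated; the following SAS/IML sketch (our own code, using Kronecker products) reproduces the four interaction contrast rows:

proc iml;
   /* generic main effect contrast sets as in [4.18] */
   C_rate = {1 -1  0  0  0,
             1  1 -2  0  0,
             1  1  1 -3  0,
             1  1  1  1 -4};                    /* 4 x 5 */
   c_lime = {1 -1};                             /* 1 x 2 */
   /* unfold to the 10 cell means (lime subscript varying slowest)
      and multiply element by element */
   rate_unf = {1 1} @ C_rate;                   /* 4 x 10 */
   lime_unf = j(4,1,1) @ (c_lime @ j(1,5,1));   /* 4 x 10 */
   interact = rate_unf # lime_unf;              /* rows of Table 4.10 */
   print interact;
quit;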

4.3.4 Effects and Contrasts in The SAS® System


Contrasts in The SAS® System are coded in the effects decomposition [4.13], not the means
model. To test for main effects, interactions, simple effects, and slices, the contrasts need to
be expressed in terms of the $\alpha_i$, $\beta_j$, and $(\alpha\beta)_{ij}$. Consider the proc glm statements for the two-
factor completely randomized design in Example 4.1. In the model statement of the code
segment
proc glm data=limereq;
class lime rate;
model ApH = lime rate lime*rate;
run;

lime refers to the main effects $\alpha_i$ of factor Lime Type, rate to the main effects of factor
Application Rate, and lime*rate to the $(\alpha\beta)_{ij}$ interaction terms. In Output 4.2 we find
sequential (Type I) and partial (Type III) sums of squares and tests for the three effects listed
in the model statement. Because the design is orthogonal the two groups of sums of squares
are identical. By default, proc glm produces these tests for any term listed on the right-hand
side of the model statement. Also, we obtain $SSR = 0.12885$ and $MSR = 0.00322$. The
main effects of Lime Type and Rate of Application are shown as sources LIME and RATE on
the output. The difference between our calculation of $SS(\text{Rate}) = 1.40$ and the calculation by
SAS® ($SS(\text{Rate}) = 1.398$) is due to round-off errors. There is a significant interaction
between the two factors ($F_{obs} = 24.12$, $p < 0.0001$). Neither of the main effects is masked.
Since the trends in application rates do not cross (Figure 4.2, p. 89) this is to be expected.

Output 4.2.
The GLM Procedure

Class Level Information

Class Levels Values


LIME 2 AG lime Pell lime
RATE 5 0 1 2 4 8

Number of observations 50

Dependent Variable: APH

Sum of
Source DF Squares Mean Square F Value Pr > F
Model 9 2.23349200 0.24816578 77.04 <.0001
Error 40 0.12885000 0.00322125
Corrected Total 49 2.36234200

R-Square Coeff Var Root MSE APH Mean


0.945457 0.956230 0.056756 5.935400

Source DF Type I SS Mean Square F Value Pr > F


LIME 1 0.52428800 0.52428800 162.76 <.0001
RATE 4 1.39845700 0.34961425 108.53 <.0001
LIME*RATE 4 0.31074700 0.07768675 24.12 <.0001

Source DF Type III SS Mean Square F Value Pr > F


LIME 1 0.52428800 0.52428800 162.76 <.0001
RATE 4 1.39845700 0.34961425 108.53 <.0001
LIME*RATE 4 0.31074700 0.07768675 24.12 <.0001

We now reconstruct the main effects and interaction tests with contrasts for expository
purposes. Recall that an $A$ factor main effect contrast is a contrast among the marginal means
$\mu_{i\cdot}$ and a $B$ factor main effect is a contrast among the marginal means $\mu_{\cdot j}$. We obtain
$$\sum_{i=1}^{a} c_i\,\mu_{i\cdot} = \sum_{i=1}^{a} c_i\,\frac{1}{b}\sum_{j=1}^{b}\bigl(\mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}\bigr) = \sum_{i=1}^{a} c_i\,\alpha_i + \frac{1}{b}\sum_{i=1}^{a}\sum_{j=1}^{b} c_i\,(\alpha\beta)_{ij}$$
and
$$\sum_{j=1}^{b} c_j\,\mu_{\cdot j} = \sum_{j=1}^{b} c_j\,\frac{1}{a}\sum_{i=1}^{a}\bigl(\mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}\bigr) = \sum_{j=1}^{b} c_j\,\beta_j + \frac{1}{a}\sum_{j=1}^{b}\sum_{i=1}^{a} c_j\,(\alpha\beta)_{ij}.$$

Fortunately, The SAS® System does not require that we specify coefficients for effects
which contain other effects for which coefficients are given. For main effect contrasts it is
sufficient to specify the contrast coefficients for lime or rate; the interaction coefficients
$c_i/b$ and $c_j/a$ are assigned automatically. Using the generic contrast set [4.18] for the main
effects and the unfolded contrast coefficients for the interaction in Table 4.10, the following
contrast statements in proc glm add Output 4.3 to Output 4.2.

proc glm data=limereq;


class lime rate;
model ApH = lime rate lime*rate;
contrast 'Lime main effect' lime 1 -1;
contrast 'Rate main effect' rate 1 -1 ,
rate 1 1 -2 ,
rate 1 1 1 -3 ,
rate 1 1 1 1 -4;
contrast 'Lime*Rate interaction' lime*rate 1 -1 0 0 0 -1 1 0 0 0,
lime*rate 1 1 -2 0 0 -1 -1 2 0 0,
lime*rate 1 1 1 -3 0 -1 -1 -1 3 0,
lime*rate 1 1 1 1 -4 -1 -1 -1 -1 4;
run;

Output 4.3.
Contrast DF Contrast SS Mean Square F Value Pr > F

Lime main effect 1 0.52428800 0.52428800 162.76 <.0001


Rate main effect 4 1.39845700 0.34961425 108.53 <.0001
Lime*Rate interaction 4 0.31074700 0.07768675 24.12 <.0001

For the simple effect $\sum_{j=1}^{b} c_j\,\mu_{ij}$ substitution of [4.13] yields
$$\sum_{j=1}^{b} c_j\,\mu_{ij} = \sum_{j=1}^{b} c_j\bigl(\mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}\bigr) = \sum_{j=1}^{b} c_j\,\beta_j + \sum_{j=1}^{b} c_j\,(\alpha\beta)_{ij}. \qquad [4.19]$$

In this case the user must specify the $c_j$ coefficients for the interaction terms and the $\beta_j$. If
only the latter are given, SAS® will apply the rules for testing main effects and assign the
average coefficient $c_j/a$ to the interaction terms, which does not produce a simple effect.
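For example, a simple effect of application rate within agricultural lime, say rate 1 vs. rate 2, could be coded as sketched below (to be added to the proc glm run above; the contrast label is ours and the level ordering follows Output 4.2):

contrast 'Rate 1 vs 2 at AG lime'
         rate      0 1 -1 0 0
         lime*rate 0 1 -1 0 0  0 0 0 0 0;

This supplies $c = (0, 1, -1, 0, 0)$ for both the $\beta_j$ and the $(\alpha\beta)_{1j}$ terms, exactly as [4.19] requires.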
Recall that interaction effects are contrasts among simple effects. Consider the question
whether the simple effect $\sum_{j=1}^{b} c_j\,\mu_{ij}$ is constant for the first two levels of factor $A$. The
interaction effect becomes
$$\sum_{j=1}^{b} c_j\,\mu_{1j} - \sum_{j=1}^{b} c_j\,\mu_{2j} = \sum_{j=1}^{b} c_j\,\beta_j + \sum_{j=1}^{b} c_j\,(\alpha\beta)_{1j} - \sum_{j=1}^{b} c_j\,\beta_j - \sum_{j=1}^{b} c_j\,(\alpha\beta)_{2j}$$
$$= \sum_{j=1}^{b} c_j\,(\alpha\beta)_{1j} - \sum_{j=1}^{b} c_j\,(\alpha\beta)_{2j}. \qquad [4.20]$$

Genuine interaction effects will involve only the $(\alpha\beta)_{ij}$ terms.
For the lime requirement data we now return to the research questions raised in the
introduction:
①: Do Lime Type and Application Rate interact?
②: Are there main effects of Lime Type and Application Rate?
③: Is there a difference between lime types at the 1 ton application rate?
④: Does the difference between lime types depend on whether 1 or 2 tons are applied?
⑤: How does the comparison of lime types change with application rate?

Questions ① and ② refer to the interactions and the main effects of the two factors.
Although we have seen in Output 4.3 how to obtain the main effects with contrasts, it is of
course much simpler to locate the particular sources in the Type III sum of squares table. ③
is a simple effect ([4.19]), ④ an interaction effect ([4.20]), and ⑤ slices of the Lime
Type × Rate interaction by application rates. The proc glm statements are as follows (Output
4.4).
proc glm data=limereq;
  class lime rate;
  model aph = lime rate lime*rate;                      /* ① and ② */
  contrast 'Lime at 1 ton (C)'
           lime 1 -1 lime*rate 0 1 0 0 0  0 -1 0 0 0;   /* ③ */
  contrast 'Lime effect at 1 vs. 2 (D)'
           lime*rate 0 1 -1 0 0  0 -1 1 0 0;            /* ④ */
  lsmeans lime*rate / slice=(rate);                     /* ⑤ */
run; quit;

In specifying contrast coefficients for an effect in SAS® , one should pay attention to (i)
the Class Level Information table printed at the top of the output (see Output 4.2) and (ii)
the order in which the factors are listed in the class statement. The Class Level
Information table depicts how the levels of the factors are ordered internally. If factor
variables are character variables, the default ordering of the levels is alphabetical, which may
be counterintuitive in the assignment of contrast coefficients (for example, “0 tons/acre”
appears before “100 tons/acre” which appears before “50 tons/acre”).



Output 4.4.
The GLM Procedure

Class Level Information


Class Levels Values
LIME 2 AG lime Pell lime
RATE 5 0 1 2 4 8

Number of observations 50

Dependent Variable: APH


Sum of
Source DF Squares Mean Square F Value Pr > F
Model 9 2.23349200 0.24816578 77.04 <.0001
Error 40 0.12885000 0.00322125
Corrected Total 49 2.36234200

Source DF Type I SS Mean Square F Value Pr > F


LIME 1 0.52428800 0.52428800 162.76 <.0001
RATE 4 1.39845700 0.34961425 108.53 <.0001
LIME*RATE 4 0.31074700 0.07768675 24.12 <.0001

Source DF Type III SS Mean Square F Value Pr > F


LIME 1 0.52428800 0.52428800 162.76 <.0001
RATE 4 1.39845700 0.34961425 108.53 <.0001
LIME*RATE 4 0.31074700 0.07768675 24.12 <.0001

Contrast DF Contrast SS Mean Squ. F Value Pr > F

Lime at 1 ton (C) 1 0.015210 0.015210 4.72 0.0358


Lime effect at 1 vs. 2 (D) 1 0.027380 0.027380 8.50 0.0058

LIME*RATE Effect Sliced by RATE for APH

Sum of
RATE DF Squares Mean Square F Value Pr > F

0 1 0.001210 0.001210 0.38 0.5434


1 1 0.015210 0.015210 4.72 0.0358
2 1 0.127690 0.127690 39.64 <.0001
4 1 0.122102 0.122102 37.91 <.0001
8 1 0.568822 0.568822 176.58 <.0001

The order in which variables are listed in the class statement determines how SAS®
organizes the cell means. The subscript of the factor listed first varies slower than the
subscript of the factor listed second and so forth. The class statement
class lime rate;

results in cell means ordered $\mu_{11}, \mu_{12}, \mu_{13}, \mu_{14}, \mu_{15}, \mu_{21}, \mu_{22}, \mu_{23}, \mu_{24}, \mu_{25}$. Contrast coeffi-
cients are assigned in the same order to the lime*rate effect in the contrast statement. If the
class statement were

class rate lime;

the cell means would be ordered $\mu_{11}, \mu_{21}, \mu_{12}, \mu_{22}, \mu_{13}, \mu_{23}, \mu_{14}, \mu_{24}, \mu_{15}, \mu_{25}$ and the arrange-
ment of contrast coefficients for the lime*rate effect changes accordingly.



The slice option of the lsmeans statement makes it particularly simple to obtain com-
parisons of Lime Types separately at each level of the Rate factor. Slices by either factor can
be obtained with the statement
lsmeans lime*rate / slice=(rate lime);

and slices of three-way interactions are best carried out as


lsmeans A*B*C / slice=(A*B A*C B*C);

so that only a single factor is being compared at each combination of two other factors.
The table of effect slices at the bottom of Output 4.4 conveys no significant difference
among lime types at 0 tons ($F_{obs} = 0.38$, $p = 0.5434$), but the $F$ statistics increase with in-
creasing rate of application. This represents the increasing separation of the trends in Figure
4.2. Since factor Lime Type has only two levels the contrast comparing Lime Types at 1 ton
is identical to the slice at that rate ($F_{obs} = 4.72$, $p = 0.0358$).
One could perform slices of Lime Type × Rate in the other direction, comparing the rates
of application at each lime type. We notice, however, that the rate of application is a quanti-
tative factor. It is thus more meaningful to test the nature of the trend between pH and Rate
of Application with regression contrasts (orthogonal polynomials). Since five rates were
applied we can test for quartic, cubic, quadratic, and linear trends. Published tables of orthog-
onal polynomial coefficients require that the levels of the factor are evenly spaced and that
the data are balanced. In this example, the factor levels are unevenly spaced. The correct contrast
coefficients can be calculated with the OrPol() function of SAS/IML®. The %orpoly macro
contained on the CD-ROM finds coefficients up to the eighth degree for any particular factor
level spacing. When the macro is executed for the factor level spacing 0, 1, 2, 4, 8 the
coefficients shown in Output 4.5 result.
data rates;
input levels @@;
datalines;
0 1 2 4 8
;;
run;
%orpoly(data=rates,var=levels);

Output 4.5.
linear quadratic cubic quartic

0 -.47434 0.54596 -.46999 0.23672


1 -.31623 0.03522 0.42226 -.72143
2 -.15811 -.33462 0.51435 0.63125
4 0.15811 -.65163 -.57050 -.15781
8 0.79057 0.40507 0.10388 0.01127
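The macro is a convenience wrapper; an equivalent direct SAS/IML call would be the following sketch (our own code):

proc iml;
   x  = {0, 1, 2, 4, 8};          /* the unevenly spaced application rates */
   op = orpol(x, 4);              /* columns: constant, linear, ..., quartic */
   print (op[,2:5])[colname={"linear" "quadratic" "cubic" "quartic"}];
quit;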

The coefficients are fractional numbers. Sometimes they can be converted to values that
are easier to code using the following trick. Since the contrast $F$ statistic is unaffected by a
rescaling of the contrast coefficients, we can multiply the coefficients for a contrast by an
arbitrary constant. Dividing the coefficients by the smallest coefficient (in absolute value) for
each trend yields the coefficients in Table 4.11.

Table 4.11. Contrast coefficients from Output 4.5 divided by the
smallest coefficient in each column

                          Order of Trend
Level       Linear    Quadratic    Cubic    Quartic
0 tons        -3        15.5        -4.5      21
1 ton         -2         1           4.05    -64
2 tons        -1        -9.5         4.95     56
4 tons         1       -18.5        -5.5     -14
8 tons         5        11.5         1         1

Since the factors interact we test the order of the trends for agricultural lime (AL) and
granulated lime (GL) separately by adding to the proc glm code (Output 4.6) the statements
contrast 'rate quart (AL)' rate 21 -64 56 -14 1
lime*rate 21 -64 56 -14 1;
contrast 'rate cubic (AL)' rate -4.5 4.05 4.95 -5.5 1
lime*rate -4.5 4.05 4.95 -5.5 1;
contrast 'rate quadr.(AL)' rate 15.5 1 -9.5 -18.5 11.5
lime*rate 15.5 1 -9.5 -18.5 11.5;
contrast 'rate linear(AL)' rate -3 -2 -1 1 5
lime*rate -3 -2 -1 1 5;

contrast 'rate quart (GL)' rate 21 -64 56 -14 1


lime*rate 0 0 0 0 0 21 -64 56 -14 1;
contrast 'rate cubic (GL)' rate -4.5 4.05 4.95 -5.5 1
lime*rate 0 0 0 0 0 -4.5 4.05 4.95 -5.5 1;
contrast 'rate quadr.(GL)' rate 15.5 1 -9.5 -18.5 11.5
lime*rate 0 0 0 0 0 15.5 1 -9.5 -18.5 11.5;
contrast 'rate linear(GL)' rate -3 -2 -1 1 5
lime*rate 0 0 0 0 0 -3 -2 -1 1 5;

Output 4.6.
Contrast DF Contrast SS Mean Square F Value Pr > F

rate quart (AL) 1 0.00128829 0.00128829 0.40 0.5307


rate cubic (AL) 1 0.00446006 0.00446006 1.38 0.2463
rate quadr.(AL) 1 0.01160085 0.01160085 3.60 0.0650
rate linear(AL) 1 1.46804112 1.46804112 455.74 <.0001

rate quart (GL) 1 0.01060179 0.01060179 3.29 0.0772


rate cubic (GL) 1 0.00519994 0.00519994 1.61 0.2112
rate quadr.(GL) 1 0.00666459 0.00666459 2.07 0.1581
rate linear(GL) 1 0.20129513 0.20129513 62.49 <.0001

For either Lime Type, one concludes that the trend of pH is linear in application rate at
the 5% significance level. For agricultural lime a slight quadratic effect is noticeable. The
interaction between the two factors should be evident in a comparison of the linear trends
among the Lime Types. This is accomplished with the contrast statement
contrast 'Linear(AL vs. GL)'
lime*rate -3 -2 -1 1 5 3 2 1 -1 -5;



Output 4.7.
Contrast DF Contrast SS Mean Square F Value Pr > F
Linear(AL vs. GL) 1 0.29106025 0.29106025 90.36 <.0001

The linear trends are significantly different ($F_{obs} = 90.36$, $p < 0.0001$). From Figure 4.2
this is evidently due to differences in the slopes of the two lime types.

4.4 Diagnosing Regression Models


Box 4.4 Model Diagnostics

• Diagnosing (criticizing) a linear regression model utilizes


— residual analysis
— case deletion diagnostics
— collinearity diagnostics and
— other diagnostics concerned with assumptions of the linear model.

• If breakdowns of a correct model cannot be remedied because of outlying or


influential observations, alternative estimation methods may be employed.

Diagnosing the model and its agreement with a particular data set is an essential step in
developing a good statistical model and sound inferences. Estimating model parameters and
drawing statistical inferences must be accompanied by sufficient criticism of the model. This
criticism should highlight whether the assumptions of the model are met and if not, to what
degree they are violated. Key assumptions of the classical linear model [4.1] are
• correctness of the model ($\mathrm{E}[\mathbf{e}] = \mathbf{0}$) and
• homoscedastic, uncorrelated errors ($\mathrm{Var}[\mathbf{e}] = \sigma^2\mathbf{I}$).

Often the assumption of Gaussian errors is added and must be diagnosed, too, not
because least squares estimation requires it, but to check the validity of exact inferences. The
importance of these assumptions, in terms of complications introduced into the analysis by
their violation, follows roughly the same order.

4.4.1 Residual Analysis


Box 4.5 Residuals

• The raw residual $\hat{e}_i = y_i - \hat{y}_i$ is not a good diagnostic tool, since it does not
mimic the behavior of the model disturbances $e_i$. The $n$ residuals are
correlated, heteroscedastic, and do not constitute $n$ pieces of information.



• Studentized residuals are suitably standardized residuals that at least
remedy the heteroscedasticity problem.

Since the model residuals (errors) $\mathbf{e}$ are unobservable it seems natural to focus model criticism
on the fitted residuals $\hat{\mathbf{e}} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$. The fitted residuals do not behave exactly as the
model residuals. If $\hat{\boldsymbol{\beta}}$ is an unbiased estimator of $\boldsymbol{\beta}$, then $\mathrm{E}[\hat{\mathbf{e}}] = \mathrm{E}[\mathbf{e}] = \mathbf{0}$, but the fitted resi-
duals are neither uncorrelated nor homoscedastic. We can write $\hat{\mathbf{e}} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{H}\mathbf{y}$, where
$\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ is called the "Hat" matrix since it produces the $\hat{\mathbf{y}}$ values when post-
multiplied by $\mathbf{y}$ (see §A4.8.3 for details). The hat matrix is a symmetric idempotent matrix,
which implies that
$$\mathbf{H}' = \mathbf{H}, \qquad \mathbf{H}\mathbf{H} = \mathbf{H}, \qquad (\mathbf{I} - \mathbf{H})(\mathbf{I} - \mathbf{H}) = (\mathbf{I} - \mathbf{H}).$$
The variance of the fitted residuals now follows easily:
$$\mathrm{Var}[\hat{\mathbf{e}}] = \mathrm{Var}[\mathbf{Y} - \mathbf{H}\mathbf{Y}] = \mathrm{Var}[(\mathbf{I} - \mathbf{H})\mathbf{Y}] = (\mathbf{I} - \mathbf{H})\,\sigma^2\mathbf{I}\,(\mathbf{I} - \mathbf{H}) = \sigma^2(\mathbf{I} - \mathbf{H}). \qquad [4.21]$$
$\mathbf{H}$ is not a diagonal matrix and the entries of its diagonal are not equal. The fitted residuals
thus are neither uncorrelated nor homoscedastic. Furthermore, $\mathbf{I} - \mathbf{H}$ is a singular matrix of rank
$n - r(\mathbf{X})$. If one fits a standard regression model with intercept and $k - 1$ regressors, only
$n - k$ least squares residuals carry information about the model disturbances; the remaining
residuals are redundant. In classification models where the rank of $\mathbf{X}$ can be large relative to
the sample size, only a few residuals are nonredundant.
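A small SAS/IML sketch (hypothetical $x$ values, our own illustration) shows how the properties in [4.21] fall out of the hat matrix:

proc iml;
   x = {1, 2, 3, 4, 6};                   /* hypothetical covariate values */
   X = j(5,1,1) || x;                     /* intercept and one regressor   */
   H = X*inv(X`*X)*X`;
   print (max(abs(H*H - H)))[label='max |HH-H| (idempotency check)'];
   print (vecdiag(H))[label='diagonal of H'];
   print (I(5) - H)[label='I - H'];       /* unequal diagonal, nonzero off-diagonal */
quit;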
Because of their heteroscedastic nature, we do not recommend using the raw residuals $\hat{e}_i$
to diagnose model assumptions. First, the residual should be properly scaled. The $i$th diagonal
value of $\mathbf{H}$ is called the leverage of the $i$th data point. Denoting it as $h_{ii}$, it follows that
$$\mathrm{Var}[\hat{e}_i] = \sigma^2(1 - h_{ii}). \qquad [4.22]$$
Standardized residuals have mean 0 and variance 1 and are obtained as $\hat{e}^*_i = \hat{e}_i/(\sigma\sqrt{1 - h_{ii}})$. Since the variance $\sigma^2$ is unknown, $\sigma$ is replaced by its estimate $\hat{\sigma}$, the square
root of the model mean square error. The residual
$$r_i = \frac{\hat{e}_i}{\hat{\sigma}\sqrt{1 - h_{ii}}} \qquad [4.23]$$

is called the studentized residual and is a more appropriate diagnostic measure. If the model
errors /3 are Gaussian, the scale-free studentized residuals are akin to > random variables
which justifies — to some extent — their use in diagnosing outliers in regression models (see
below and Myers 1990, Ch. 5.3). Plots of <3 against the regressor variables can be used to
diagnose the equal variance assumption and the need to transform regressor variables. A
graph of <3 against the fitted values sC 3 highlights the correctness of the model. In either type
of plot the residuals should appear as a stable band of random scatter around a horizontal line
at zero.



[Figure: four panels of studentized residuals plotted against the predicted value, air temperature, soil moisture, and sunlight; the legend distinguishes the fits without and with the quadratic term for soil moisture]

Figure 4.6. Studentized residuals in the turnip green data analysis for the model
$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i}$ (full circles) and $y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{2i}^2$
(open circles).

For the Turnip Greens data (Example 4.2) studentized residuals for the model with and
without the squared term for soil moisture are shown in Figure 4.6. The quadratic trend in the
residuals for soil moisture is obvious if $X_4 = X_2^2$ is omitted.
A problem of the residual by covariate plot is the interdependency of the covariates. The
plot of $r_i$ vs. Air Temperature, for example, assumes that the other variables are held con-
stant. But changing the amount of sunlight obviously changes air temperature. How this col-
linearity of the regressor variables impacts not only the interpretation but also the stability of
the least squares estimates is addressed in §4.4.4.
To diagnose the assumption of Gaussian model errors one can resort to graphical tools
such as normal quantile and normal probability plots, to formal tests for normality, for
example, the tests of Shapiro and Wilk (1965) and Anderson and Darling (1954), or to
goodness-of-fit tests based on the empirical distribution function. The normal probability plot
is a plot of the ranked studentized residuals (ordinate) against the expected value of the $i$th
smallest value in a random sample of size $n$ from a standard Gaussian distribution (abscissa).
This expected value of the $i$th smallest value can be approximated as the $(p \times 100)$th $G(0,1)$
percentile $z_p$,
$$p = \frac{i - 0.375}{n + 0.25}.$$

The reference line in a normal probability plot has intercept $-\mu/\sigma$ and slope $1/\sigma$, where $\mu$ and
$\sigma^2$ are the mean and variance of the Gaussian reference distribution. If the reference is the
$G(0,1)$, deviations of the points in a normal probability plot from the straight line through the
origin with slope 1.0 indicate the magnitude of the departure from Gaussianity. Myers (1990,
p. 64) shows how certain patterned deviations from Gaussianity can be diagnosed in this plot.
For studentized residuals the 45°-line is the correct reference since these residuals have mean
0 and variance 1. For the raw residuals the 45°-line is not the correct reference since their
variance is not 1. The normal probability plot of the studentized residuals for the Turnip
Greens analysis (Figure 4.7) suggests model disturbances that are symmetric but less heavy in the tails
than a Gaussian distribution.

[Figure: normal probability plot; ordinate: studentized residual; abscissa: standard Gaussian quantiles]

Figure 4.7. Normal probability plot of studentized residuals for full model in Turnip Greens
analysis.

Plots of raw or studentized residuals against regressor or fitted values are helpful to
visually assess the quality of a model without attaching quantitative performance measures.
Our objection concerns using these types of residuals for methods of residual analysis that
predicate a random sample, i.e., independent observations. These methods can be graphical
(such as the normal probability plot) or quantitative (such as tests of Gaussianity). Such
diagnostic tools should be based on $n - k$ residuals that are homoscedastic and uncorrelated.
These can be constructed as recursive or linearly recovered errors.

4.4.2 Recursive and Linearly Recovered Errors


The three goals of error recovery are (i) to remove the correlation among the fitted residuals,
(ii) to recover homoscedastic residuals, and (iii) to avoid the illusion that there are actually $n$
observations. From the previous discussion it is clear that if $\mathbf{X}$ contains $k$ linearly indepen-
dent columns, only $n - k$ errors can be recovered after a least squares fit, since the ordinary
least squares residuals satisfy $k$ constraints (the vector $\hat{\mathbf{e}}$ is orthogonal to each column of $\mathbf{X}$).
$P$-values in tests for normality that depend on sample size will be reported erroneously if the
set of $n$ raw or studentized residuals is used for testing. The correlations among the $n - k$
nonredundant residuals will further distort these $p$-values since the tests assume independence
of the observations. Although studentized residuals have unit variance, they still are cor-
related and, for the purpose of validating the Gaussianity assumption, they suffer from similar
problems as the raw residuals.
Error recovery is based either on employing the projection properties of the "Hat" matrix,
$\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$, or on sequentially forecasting observations based on a fit of the model to
preceding observations. Recovered errors of the first type are called Linear Unbiased Scaled
(LUS) residuals (Theil 1971); errors of the second type are called recursive or sequential
residuals. Kianifard and Swallow (1996) provide a review of the development of recursive
residuals. We will report only the most important details here.
Consider the standard linear model with Gaussian errors,
$$\mathbf{Y}_{(n \times 1)} = \mathbf{X}_{(n \times k)}\boldsymbol{\beta} + \mathbf{e}, \qquad \mathbf{e} \sim G\bigl(\mathbf{0}, \sigma^2\mathbf{I}\bigr),$$
and fit the model to $k$ data points. The remaining $n - k$ data points are then entered sequen-
tially and the $j$th recursive residual is the scaled difference of predicting the next observation
from the model fit to the previous observations. More formally, let $\mathbf{X}_{j-1}$ be the matrix con-
sisting of the first $j-1$ rows of $\mathbf{X}$. If $\mathbf{X}'_{j-1}\mathbf{X}_{j-1}$ is nonsingular and $j \geq k + 1$, the parameter
vector $\boldsymbol{\beta}$ can be estimated as
$$\hat{\boldsymbol{\beta}}_{j-1} = \bigl(\mathbf{X}'_{j-1}\mathbf{X}_{j-1}\bigr)^{-1}\mathbf{X}'_{j-1}\mathbf{y}_{j-1}. \qquad [4.24]$$
Now consider adding the next observation $y_j$. Define as the (unstandardized) recursive
residual $w^*_j$ the difference between $y_j$ and the predicted value based on fitting the model to the
preceding observations,
$$w^*_j = y_j - \mathbf{x}'_j\hat{\boldsymbol{\beta}}_{j-1}.$$

Finally, scale $w^*_j$ and define the recursive residual (Brown et al. 1975) as
$$w_j = \frac{w^*_j}{\sqrt{1 + \mathbf{x}'_j\bigl(\mathbf{X}'_{j-1}\mathbf{X}_{j-1}\bigr)^{-1}\mathbf{x}_j}}, \qquad j = k+1, \ldots, n. \qquad [4.25]$$
The $w_j$ are independent random variables with mean 0 and variance $\mathrm{Var}[w_j] = \sigma^2$, just as the
model disturbances. However, only $n - k$ of them are available. Recursive residuals are
unfortunately not unique. They depend on the set of $k$ points chosen initially and on the order
of the remaining data. It is not at all clear how to compute the best set of recursive residuals.
Because data points with high leverage can have negative impact on the analysis, one
possibility is to order the data by leverage to circumvent calculation of recursive residuals for
potentially influential data points. In other circumstances, for example, when the detection of
outliers is paramount, one may want to produce recursive residuals for precisely these obser-
vations. On occasion a particular ordering of the data suggests itself, for example, if a
covariate relates to time or distance. A second complication of initially fitting the model to $k$
observations is the need for $\mathbf{X}'_{j-1}\mathbf{X}_{j-1}$ to be nonsingular. When columns of $\mathbf{X}$ contain repeat
values, for example, treatment variables or soil moisture in the Turnip Greens data (Table
4.2), the data must be rearranged or the number of data points in the initial fit must be
enlarged. The latter leads to the recovery of fewer errors.
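A minimal SAS/IML sketch of [4.24] and [4.25] (hypothetical data and variable names of our choosing) makes the sequential construction explicit:

proc iml;
   /* hypothetical data; k = 2 (intercept and slope) */
   x = {1, 3, 4, 6, 8, 9, 11, 13};
   y = {2.1, 5.8, 8.3, 11.9, 16.4, 18.1, 22.3, 26.0};
   n = nrow(x);  k = 2;
   X = j(n,1,1) || x;
   w = j(n-k,1,.);
   do jj = k+1 to n;
      Xp = X[1:jj-1, ];                      /* first j-1 rows            */
      b  = inv(Xp`*Xp)*Xp`*y[1:jj-1];        /* beta-hat_(j-1), eq [4.24] */
      xj = X[jj, ];
      w[jj-k] = (y[jj] - xj*b) / sqrt(1 + xj*inv(Xp`*Xp)*xj`);   /* [4.25] */
   end;
   print w[label='recursive residuals'];
quit;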
Recursive residuals in regression models can be calculated with proc autoreg of The
SAS® System (Release 7.0 or higher). The following code simulates a data set with fifteen
observations and a systematic quadratic trend. The error distribution was chosen as $t$ with seven
degrees of freedom and is thus slightly heavier in the tails than the standard Gaussian distri-
bution. The normal probability plots of the raw and recursive residuals are shown in Figure
4.8.
data simu;
do i = 1 to 15;
x = ranuni(4355)*30;
u = ranuni(6573);
e = tinv(u,7);
x2 = x*x;
fn = 0.1 + 0.2*x - 0.04*x2;
y = fn + e;
output;
end; drop i;
run;

proc reg data=simu noprint;


model y = x x2;
output out=regout student=student rstudent=rstudent residual=res h=h;
run; quit;
/* Sort by descending leverage to eliminate high leverage points first */
proc sort data=regout; by descending h; run;

proc autoreg data=regout noprint;


model y = x x2;
output out=residuals residual=res blus=blus recres=recres;
run;
proc univariate data=residuals normal plot; var student recres; run;

[Figure: normal probability plots of the studentized and the recursive residuals; ordinate: residual; abscissa: standard Gaussian quantile]

Figure 4.8. Normal probability plots of studentized and recursive residuals for simulated
data. Solid line represents standard Gaussian distribution.



The recursive residuals show the deviation from Gaussianity more clearly than the
studentized residuals, which tend to cluster around the reference line. The $p$-values for the
Shapiro-Wilk test for Gaussianity were 0.7979 and 0.1275 for the studentized and recursive
residuals, respectively. Applying the Shapiro-Wilk test incorrectly to the correlated raw resi-
duals leads to considerably less evidence for detecting a departure from Gaussianity. The $p$-
values for the Anderson-Darling test for Gaussianity were $> 0.25$ and 0.1047, respectively.
In all instances the recursive residuals show more evidence against Gaussianity.
We emphasized recursive residuals for diagnosing the assumption of normal distur-
bances. This is certainly not the only possible application. Recursive residuals are also key in
detecting outliers, detecting changes in the regression coefficients, testing for serial corre-
lation, and testing for heteroscedasticity. For details and further references, see Galpin and
Hawkins (1984) and Kianifard and Swallow (1996).
The second method of error recovery leads to Linearly Unbiased Scaled (LUS) estimates
of $e_i$ and is due to Theil (1971). It exploits the projection properties in the standard linear
model. Denote the $((t = n - k) \times 1)$ vector of LUS residuals as
$$\mathbf{R}_{(t)} = [R_1, R_2, \ldots, R_{n-k}]'.$$
The name reflects that $\mathbf{R}_{(t)}$ is a linear function of $\mathbf{Y}$, has the same expectation as the model
disturbances ($\mathrm{E}[\mathbf{R}] = \mathrm{E}[\mathbf{e}] = \mathbf{0}$), and has a scalar covariance matrix ($\mathrm{Var}[\mathbf{R}_{(t)}] = \sigma^2\mathbf{I}$). In con-
trast to recursive residuals it is not necessary to fit the model to an initial set of $k$ observa-
tions, since the process is not sequential. This allows the recovery of $t = n - k$ uncorrelated,
homoscedastic residuals in the presence of classification variables. The error recovery pro-
ceeds as follows. Let $\mathbf{M} = \mathbf{I} - \mathbf{H}$ and recall that $\mathrm{Var}[\hat{\mathbf{e}}] = \sigma^2\mathbf{M}$. If one premultiplies $\hat{\mathbf{e}}$ with a
matrix $\mathbf{Q}'$ such that
$$\mathrm{Var}[\mathbf{Q}'\hat{\mathbf{e}}] = \sigma^2\begin{bmatrix} \mathbf{I}_t & \mathbf{0}\\ \mathbf{0} & \mathbf{0}\end{bmatrix},$$
then the first $t$ elements of $\mathbf{Q}'\hat{\mathbf{e}}$ are the LUS estimates of $\mathbf{e}$. Does such a matrix $\mathbf{Q}$ exist? This
is indeed the case and unfortunately there are many such matrices. The spectral decomposi-
tion of a real symmetric matrix $\mathbf{A}$ is $\mathbf{P}\mathbf{D}\mathbf{P}' = \mathbf{A}$, where $\mathbf{P}$ is an orthogonal matrix containing
the ordered eigenvectors of $\mathbf{A}$, and $\mathbf{D}$ is a diagonal matrix containing the ordered eigenvalues
of $\mathbf{A}$. Since $\mathbf{M}$ is symmetric idempotent (a projector) it has a spectral decomposition. Further-
more, since the eigenvalues of a projector are either 1 or 0, and the number of nonzero eigen-
values equals the rank of a matrix, there are $t = n - k$ eigenvalues of value 1 and the remain-
ing values are 0. $\mathbf{D}$ thus has precisely the structure
$$\mathbf{D} = \begin{bmatrix} \mathbf{I}_t & \mathbf{0}\\ \mathbf{0} & \mathbf{0}\end{bmatrix}$$
we are looking for. So, if the spectral decomposition of $\mathbf{M}$ is $\mathbf{M} = \mathbf{Q}\mathbf{D}\mathbf{Q}'$, then $\mathbf{Q}'\mathbf{M}\mathbf{Q} = \mathbf{D}$,
since $\mathbf{Q}$ is an orthogonal matrix ($\mathbf{Q}'\mathbf{Q} = \mathbf{I}$). Jensen and Ramirez (1999) call the first $t$
elements of $\mathbf{Q}'\hat{\mathbf{e}}$ the linearly recovered errors. Their stochastic properties in $t$-dimensional
space are identical to those of the model disturbances $\mathbf{e}$ in $n$-dimensional space. The non-
uniqueness of the LUS estimates stems from the nondistinctness of the eigenvalues of $\mathbf{M}$.
Any orthogonal rotation $\mathbf{B}\mathbf{Q}'$ will also produce a LUS estimator for $\mathbf{e}$. Theil (1971) settled the
problem of how to make the LUS unique by defining as best the set of LUS residuals that


minimize the expected sum of squares of estimation errors and termed those the BLUS
residuals. While this is a sensible criterion, Kianifard and Swallow (1996) point out that there
is no reason why recovered errors being best in this sense should necessarily lead to tests with
high power. To diagnose normality of the model disturbances, Jensen and Ramirez (1999)
observe that the skewness and kurtosis of the least square residuals are closer to the values
expected under Gaussianity than are those of nonnormal model errors. The diagnostic tests
are thus slanted toward Gaussianity. These authors proceed to settle the nonuniqueness of the
LUS residuals by finding that orthogonal rotation BQw which maximizes the kurtosis (4th
moment) of the recovered errors. With respect to the fourth moment, this recovers (rotated)
errors that are as ugly as possible, slanting diagnostic tests away from Gaussianity.
Particularly with small sample sizes this may be of importance to retain power. Linearly
recovered errors and recursive residuals are no longer in a direct correspondence with the
data points. The 4th recursive residual does not represent the residual for the 4th observation.
Their use as diagnostic tools is thus restricted to those inquiries where the individual
observation is immaterial, e.g., tests for normality.
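To make the construction concrete, the following SAS/IML sketch (hypothetical data) recovers $t = n - k$ LUS residuals from the spectral decomposition of $\mathbf{M}$; recall that any orthogonal rotation of the result is an equally valid LUS set:

proc iml;
   /* hypothetical data; k = 2 */
   x = {1, 2, 4, 5, 7, 9};
   y = {1.2, 2.1, 3.9, 5.2, 6.8, 9.1};
   n = nrow(x);  k = 2;
   X = j(n,1,1) || x;
   M = I(n) - X*inv(X`*X)*X`;             /* M = I - H                     */
   call eigen(lam, Q, M);                 /* eigenvalues sorted descending */
   ehat = M*y;                            /* fitted residuals              */
   lus  = (Q`*ehat)[1:n-k];               /* first t = n - k elements      */
   print lam[label='eigenvalues of M'] lus[label='LUS residuals'];
quit;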

4.4.3 Case Deletion Diagnostics


Box 4.6 Influential Data Point

• A data point is called highly influential if an important aspect of the fitted


model changes considerably if the data point is removed.

• A high leverage point is unusual with respect to other x values; it has the
potential to be an influential data point.

• An outlying observation is unusual with respect to other $y$ values.

Case deletion diagnostics assess the influence of individual observations on the overall
analysis. The idea is to remove a data point and refit the model without it. The change in a
particular aspect of the fitted model, e.g., the residual for the deleted data point or the least
squares estimates, is a measure for its influence on the analysis. Problematic are highly in-
fluential points (hips), since they have a tendency to dominate the analysis. Fortunately, in
linear regression models, these diagnostics can be calculated without actually fitting a regres-
sion model $n$ times but in a single fit of the entire data set. This is made possible by the Sher-
man-Morrison-Woodbury theorem, which is given in §A4.8.3. For a data point to be a hip, it
must be either an outlier or a high leverage data point. Outliers are data points that are un-
usual relative to the other $y$ values. The Magnesium observation in the prediction efficiency
data set (Example 4.4, p. 93) might be an outlier. The attribute outlier does not have a nega-
tive connotation. It designates a data point as unusual and does not automatically warrant
deletion. An outlying data point is only outlying with respect to a particular statistical model
or criterion. If, for example, the trend in $y$ is quadratic in $x$, but a simple linear regression
model, $y_i = \beta_0 + \beta_1 x_i + e_i$, is fit to the data, many data points may be classified as outliers
because they do not agree with the model. The reason is an incorrect model, not
erroneous observations. According to the commonly applied definition of what constitutes
outliers based on the box-plot of a set of data, outliers are those values that are more than 1.5
times the interquartile range above the third or below the first quartile of the data. When ran-
domly sampling 1,000 observations from a Gaussian distribution, one should expect to see on
average about seven outliers according to this definition.
To understand the concept of leverage, we concentrate on the "Hat" matrix
$$\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \qquad [4.26]$$
introduced in the previous section. This matrix does not depend on the observed responses $\mathbf{y}$,
only on the information in $\mathbf{X}$. Its $i$th diagonal element, $h_{ii}$, measures the leverage of the $i$th
observation and expresses how unusual or extreme the covariate record of this observation is
relative to the other observations. In a simple linear regression, a single $x$ value far removed
from the bulk of the $x$-data is typically a data point with high leverage. If $\mathbf{X}$ has full rank $k$, a
point is considered a high leverage point if
$$h_{ii} > 2k/n.$$

High leverage points deserve special attention because they may be influential. They have the
potential to pull the fitted regression toward them. A high leverage point that follows the regres-
sion trend implied by the remaining observations will not exert undue influence on the least
squares estimates and is of no concern. The decision whether a high leverage point is influen-
tial thus rests on combining information about leverage with the magnitude of the residual. As
leverage increases, smaller and smaller residual values are needed to declare a point as
influential. Two statistics are particularly useful in this regard. The RStudent residual is a
studentized residual that combines leverage and the fitted residual similar to $r_i$ in [4.23],
$$\text{RStudent}_i = \frac{\hat{e}_i}{\hat{\sigma}_{-i}\sqrt{1 - h_{ii}}}. \qquad [4.27]$$
Here, $\hat{\sigma}^2_{-i}$ is the mean square error estimate obtained after removal of the $i$th data point. The
DFFITS (difference in fit, standardized) statistic measures the change in fit in terms of
standard error units when the $i$th observation is deleted:
$$\text{DFFITS}_i = \text{RStudent}_i\sqrt{h_{ii}/(1 - h_{ii})}. \qquad [4.28]$$

A $\text{DFFITS}_i$ value of 2.0, for example, implies that the fit at $y_i$ will change by two standard
error units if the $i$th data point is removed. DFFITS are useful to assess whether a data point is
highly influential and RStudent residuals are good at determining outlying observations.
According to [4.27] a data point may be a hip if it has a moderate residual $\hat{e}_i$ and high
leverage or if it is not a high leverage point ($h_{ii}$ small), but unusual in the $\mathbf{y}$ space ($\hat{e}_i$ large).
RStudent residuals and DFFITS measure changes in residuals or fitted values as the $i$th
observation is deleted. The influence of removing the $i$th observation on the least squares
estimate $\hat{\boldsymbol{\beta}}$ is measured by Cook's Distance $D_i$ (Cook 1977):
$$D_i = r_i^2\,h_{ii}/\bigl(k(1 - h_{ii})\bigr), \qquad [4.29]$$
where $k$ is the number of parameters (regressors plus intercept).

When diagnosing (criticizing) a particular model, one or more of these statistics may be
important. If the purpose of the model is mainly predictive, the change in the least squares
estimates ($D_i$) is secondary to RStudent residuals and DFFITS. If the purpose of the model
lies in testing hypotheses about $\boldsymbol{\beta}$, Cook's Distance will gain importance. Rule of thumb (rot)
values for the various statistics are shown in Table 4.12, but caution should be exercised in
their application:
• If a decision is made to delete a data point because of its influence, the data set changes
and the leverage, RStudent values, etc. of the remaining observations also change.
Data points not influential before may be designated influential now and continuance
in this mode leads to the elimination of too many data points.
• Data points often are unusual in groups. This is especially true for clustered data where
entire clusters may be unusual. The statistics discussed here apply to single-case dele-
tion only.
• The fact that a data point is influential does not imply that the data point must be
deleted. It should prompt the investigator to question the validity of the observation
and the validity of the model.
• Several of the case deletion diagnostics have known distributional properties under
Gaussianity of the model errors. When these properties are tractable, relying on rules
of thumb instead of exact $p$-values is difficult to justify. For important results about
the distributional properties of these diagnostics see, for example, Jensen and Ramirez
(1998) and Dunkl and Ramirez (2001).
• Inexperienced modelers tend to delete more observations based on these criteria than
warranted.

Table 4.12. Rules of thumb (rot) for leverage and case deletion diagnostics: $k$ denotes
the number of parameters (regressors plus intercept), $n$ the number of observations

Statistic           Formula                                             Rule of thumb                       Conclusion
Leverage            $h_{ii}$                                            $h_{ii} > 2k/n$                     high leverage point
RStudent            $\hat{e}_i/(\hat{\sigma}_{-i}\sqrt{1-h_{ii}})$      $|\text{RStudent}_i| > 2$           outlier (hip)
DFFITS              $\text{RStudent}_i\sqrt{h_{ii}/(1-h_{ii})}$         $|\text{DFFITS}_i| > 2\sqrt{k/n}$   hip
Cook's Distance     $D_i = r_i^2 h_{ii}/(k(1-h_{ii}))$                  $D_i > 1$                           hip

Example 4.4. Prediction Efficiency (continued). Recall the Prediction Efficiency
example introduced on p. 93 and the data shown in Figure 4.4. Although Mg is
obviously a strange observation in this data set, the overall trend between prediction
efficiency and the range of the spatial correlation could be quadratic.

Let $X_i$ denote the spatial range of the $i$th attribute and $Y_i$ its prediction efficiency; we
commence by fitting a quadratic polynomial
$$Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + e_i \qquad [4.30]$$
with proc reg in The SAS® System:


proc reg data=range;
model eff30 = range range2;
run; quit;



Output 4.8.
The REG Procedure
Model: MODEL1
Dependent Variable: eff30

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 2 854.45840 427.22920 3.34 0.1402


Error 4 511.52342 127.88086
Corrected Total 6 1365.98182

Root MSE 11.30844 R-Square 0.6255


Dependent Mean 25.04246 Adj R-Sq 0.4383
Coeff Var 45.15707

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 -39.12105 37.76100 -1.04 0.3587


range 1 1.39866 1.00851 1.39 0.2378
range2 1 -0.00633 0.00609 -1.04 0.3574

Ordinary least squares estimates are $\hat{\beta}_0 = -39.121$, $\hat{\beta}_1 = 1.398$, and $\hat{\beta}_2 = -0.00633$. The coefficient of determination is
$$R^2 = 1 - \frac{SSR}{SS_{Tm}} = 0.6255.$$
Removing the Mg observation and refitting the quadratic polynomial leads to
$\hat{\beta}_0 = -75.102$, $\hat{\beta}_1 = 2.414$, and $\hat{\beta}_2 = -0.0119$, and the coefficient of determination
for the quadratic model increases from 0.6255 to 0.972 (output not shown). Obviously,
this data point exerts large influence on the analysis.

The various case deletion diagnostics are listed in Table 4.13. The purpose of their
calculation is to determine whether the Mg observation is an influential data point. The
rule of thumb value for a high leverage point is $2k/n = 2 \times 3/7 = 0.857$ and for a
DFFITS is $2\sqrt{3/7} = 1.309$. The diagnostics were calculated with the /influence
option of the model statement in proc reg. This option produces raw residuals,
RStudent residuals, leverages, DFFITS, and other diagnostics:
proc reg data=range;
model eff30 = range range2 / influence;
output out=cookd cookd=cookd;
run; quit;
proc print data=cookd; run;

The two most extreme regressor values, for P and lime, have the largest leverages. P is
a borderline high leverage point (Table 4.13). The residual of the P observation is not
too large, so it is not considered an outlying observation as judged by RStudent. It
has considerable influence on the fitted values ($\text{DFFITS}_2$) and the least squares esti-
mates ($D_2$). Removing the P observation will change the fit dramatically. Since it is not
a candidate observation for deletion, it remains in the data set. The Mg data point is not
a high leverage point, since it is near the center of the observed range (Figure 4.4). Its
unusual behavior in the $\mathbf{y}$-space is evidenced by a very large RStudent residual and a
large DFFITS value. It is an outlier with considerable influence on fitted values. Its in-
fluence on the least squares estimates as judged by Cook's Distance is not critical. Since
the purpose of modeling these data is the derivation of a predictive equation of efficien-
cy as a function of spatial range, influence on fitted (predicted) values is more impor-
tant than influence on the actual estimates.

Table 4.13. Case deletion diagnostics for quadratic polynomial in Prediction Efficiency
application (rot denotes a rule-of-thumb value)

Obs.  Attribute   $\hat{e}_i$   $h_{ii}$         RStudent$_i$   DFFITS$_i$       $D_i$
                                (rot = 0.857)    (rot = 2)      (rot = 1.309)    (rot = 1)
1     pH            3.6184       0.2157           0.3181         0.1668           0.0119
2     P             3.3750       0.9467           1.4673         6.1842           9.8956
3     Ca            9.9978       0.2309           1.0109         0.5539           0.1017
4     Mg          -17.5303       0.3525          -6.2102        -4.5820           0.6734
5     CEC           0.7828       0.2711           0.0703         0.0429           0.0007
6     Lime         -6.4119       0.6307          -0.9135        -1.1936           0.4954
7     Prec          6.1682       0.3525           0.6240         0.4604           0.0834
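As a check on [4.28] against Table 4.13, the Mg row reproduces its DFFITS value from its RStudent residual and leverage:
$$\text{DFFITS}_4 = \text{RStudent}_4\sqrt{\frac{h_{44}}{1 - h_{44}}} = -6.2102 \times \sqrt{\frac{0.3525}{0.6475}} = -4.582.$$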

4.4.4 Collinearity Diagnostics


Case deletion diagnostics focus on the relationship among the rows of $\mathbf{Y}$ (outliers) and rows
of $\mathbf{X}$ (leverage) and their impact on the analysis. Collinearity is a condition among the
columns of $\mathbf{X}$. The two extreme cases are (i) complete independence of its columns (orthogo-
nality) and (ii) one or more exact linear dependencies among the columns. In the first case,
the inner product (§3.3) between any columns of $\mathbf{X}$ is 0 and $\mathbf{X}'\mathbf{X}$ is a diagonal matrix. Since
$\mathrm{Var}[\hat{\boldsymbol{\beta}}] = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$, the variance-covariance matrix is diagonal and the least squares esti-
mates are uncorrelated. Removing a covariate from $\mathbf{X}$ or adding a covariate to $\mathbf{X}$ that is
orthogonal to all of the other columns does not affect the least squares estimates.
If one or more exact linear dependencies exist, $\mathbf{X}$ is rank-deficient, the inverse of $(\mathbf{X}'\mathbf{X})$
does not exist, and ordinary least squares estimates $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ cannot be calculated.
Instead, a generalized inverse (§3.4) is used and the solution $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}$ is known to be
nonunique.
The collinearity problem falls between these two extremes. In a multiple regression
model the columns of X are almost never orthogonal and exact dependencies among the
columns also rarely exist. Instead, the columns of X are somewhat interrelated. One can
calculate pairwise correlation coefficients among the columns to detect which columns relate
to one another (linearly) in pairs. For the Turnip Greens data the pairwise correlations among

the columns of $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4]$ are shown in Table 4.14. Substantial correlations exist
between Sunlight and Air Temperature ($X_1, X_3$) and Soil Moisture and its square ($X_2, X_4$).

Table 4.14. Pairwise Pearson correlation coefficients for covariates in Turnip Greens data

           Sunlight    Soil Moisture    Air Temp.    Soil Moisture$^2$
           $X_1$       $X_2$            $X_3$        $X_4$
$X_1$        .           0.0112           0.5373      -0.0380
$X_2$       0.0112        .              -0.0149       0.9965
$X_3$       0.5373      -0.0149            .          -0.0702
$X_4$      -0.0380       0.9965          -0.0702        .

Large pairwise correlations are a sufficient but not a necessary condition for collinearity.
A collinearity problem can exist even if pairwise correlations among the regressors are small
and a near-linear dependency involves more than two regressors. For example, if
$$\sum_{j=1}^{k} c_j\,\mathbf{x}_j = c_1\mathbf{x}_1 + c_2\mathbf{x}_2 + \cdots + c_k\mathbf{x}_k \approx \mathbf{0}$$
holds, $\mathbf{X}$ will be almost rank-deficient although the pairwise correlations among the $\mathbf{x}_j$ may be
small. Collinearity negatively affects all regression calculations that involve the $(\mathbf{X}'\mathbf{X})^{-1}$
matrix: the least squares estimates, their standard error estimates, the precision of predictions,
test statistics, and so forth. Least squares estimates tend to be unstable (large standard errors),
large in magnitude, and have signs at odds with the subject matter.

Example 4.2. Turnip Greens (continued). Collinearity can be diagnosed in a first step
by removing columns from X. The estimates of the remaining parameters will change
unless the columns of X are orthogonal. The table that follows shows the coefficient
estimates and their standard errors when $X_4$, the quadratic term for Soil Moisture, is in
the model (full model) and when it is removed.

Table 4.15. Effect on the least squares estimates of removing
$X_4 =$ Soil Moisture$^2$ from the full model

                               Full Model                       Reduced Model
Parameter                Estimate    Est. Std. Error      Estimate    Est. Std. Error
$\beta_0$                119.571      13.676               82.0694     19.8309
$\beta_1$ (sunlight)      -0.0336      0.0244               0.0227      0.0365
$\beta_2$ (soil moist.)    5.4250      1.0053              -0.7783      0.0935
$\beta_3$ (air temp.)     -0.5025      0.2109               0.1640      0.2933
$\beta_4$ (soil moist$^2$) -0.1209     0.0195                 .           .
$R^2$                      0.9104                           0.7549
The estimates of "! through "$ change considerably in size and sign depending on
whether \% is in the model. While for a given soil moisture an increase in air tempera-
ture of "° Fahrenheit reduces vitamin F# concentration by !Þ&!#& milligrams ‚ gram"
in the full model, it increases it by !Þ"'% milligrams ‚ gram" in the reduced model.

© 2003 by CRC Press LLC


The effect of radiation (\" ) also changes sign. The coefficients appear to be unstable. If
theory suggests that vitamin F# concentration should increase with additional sunlight
exposure and increases in air temperature, the coefficients are not meaningful from a
biological point of view in the full model, although this model provides significantly
better fit of the data than the model without \% .

A simple but efficient collinearity diagnostic is based on variance inflation factors (VIFs)
of the regression coefficients. The VIF of the $j$th regression coefficient is obtained from the
coefficient of determination ($R^2$) in a regression model where the $j$th covariate is the
response and all other covariates form the $\mathbf{X}$ matrix (Table 4.16). If the coefficient of determi-
nation from this regression is $R_j^2$, then
$$VIF_j = \frac{1}{1 - R_j^2}. \qquad [4.31]$$

Table 4.16. Variance inflation factors for the full model in the Turnip Greens example

Response   Regressors           $j$    $R_j^2$    $VIF_j$
$X_1$      $X_2, X_3, X_4$       1     0.3888      1.636
$X_2$      $X_1, X_3, X_4$       2     0.9967      303.0
$X_3$      $X_1, X_2, X_4$       3     0.4749      1.904
$X_4$      $X_1, X_2, X_3$       4     0.9966      294.1

If a covariate is orthogonal to all other columns, its variance inflation factor is 1. With
increasing linear dependence, the VIFs increase. As a rule of thumb, variance inflation factors
greater than 10 are an indication of a collinearity problem. VIFs greater than 30 indicate a
severe problem.
Other collinearity diagnostics are based on the principal value decomposition of a
centered and scaled regressor matrix (see §A4.8.3 for details). A variable is centered and
scaled by subtracting its sample mean and dividing by its sample standard deviation:
$$x^*_{ij} = (x_{ij} - \bar{x}_j)/s_j,$$
where $\bar{x}_j$ is the sample mean of the $x_{ij}$ and $s_j$ is the standard deviation of the $j$th column of $\mathbf{X}$.
Let $\mathbf{X}^*$ be the regressor matrix
$$\mathbf{X}^* = [\mathbf{x}^*_1, \ldots, \mathbf{x}^*_k]$$
and rewrite the model as
$$\mathbf{Y} = \beta^*_0 + \mathbf{X}^*\boldsymbol{\beta}^* + \mathbf{e}. \qquad [4.32]$$

This is called the centered regression model and $\boldsymbol{\beta}^*$ is termed the vector of standardized
coefficients. Collinearity diagnostics examine the conditioning of $\mathbf{X}^{*\prime}\mathbf{X}^*$. If $\mathbf{X}$ is an orthogo-
nal matrix, then $\mathbf{X}^{*\prime}\mathbf{X}^*$ is orthogonal and its eigenvalues $\lambda_j$ are all unity. If an exact linear
dependency exists, at least one eigenvalue is zero (the rank of a matrix is equal to the number
of nonzero eigenvalues). If eigenvalues of $\mathbf{X}^{*\prime}\mathbf{X}^*$ are close to zero, a collinearity problem
exists. The $j$th condition index of $\mathbf{X}^{*\prime}\mathbf{X}^*$ is the square root of the ratio between the largest
and the $j$th eigenvalue,
$$\kappa_j = \sqrt{\frac{\max\{\lambda_j\}}{\lambda_j}}. \qquad [4.33]$$

A condition index in excess of $! indicates a near linear dependency.
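
Eigenvalues and condition indices are also simple to compute directly. The following
SAS/IML sketch is a minimal illustration of [4.33]; it assumes the turnip data set and the
regressor names used in this example, and it takes the eigenvalues from the correlation
matrix of the regressors (a common scaling of X*′X*, consistent with Table 4.17 below,
where the eigenvalues sum to the number of regressors):

proc iml;
  use turnip;
  read all var {sun moist airtemp x4} into X;
  close turnip;
  R = corr(X);                      /* correlation matrix of the regressors */
  lambda = eigval(R);               /* eigenvalues, in descending order     */
  psi = sqrt(max(lambda)/lambda);   /* condition indices, equation [4.33]   */
  print lambda psi;
quit;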


Variance inflation factors and condition indices are obtained with the vif and
collinoint options of the model statement in proc reg. For example,

proc reg data=turnip;
  model vitamin = sun moist airtemp x4 / vif collinoint;
run; quit;

The collin option of the model statement also calculates eigenvalues and condition
indices, but does not adjust them for the intercept as the collinoint option does. We prefer
the latter. With the intercept-adjusted collinearity diagnostics, one obtains as many eigen-
values and condition indices as there are explanatory variables in X. They do not stand in a
one-to-one correspondence with the regressors, however. If the first condition index is larger
than 30, this does not imply that the first regressor must be removed. It means that there is at
least one near-linear dependency among the regressors that may or may not involve X_1.

Table 4.17. Eigenvalues and condition indices for the full model in the Turnip Greens
data and for the reduced model without X_4

                        Eigenvalues                     Condition Indices
                λ_1     λ_2     λ_3     λ_4         ψ_1     ψ_2     ψ_3     ψ_4
Full Model      2.00    1.53    0.46    0.001       1.00    1.14    2.08    34.86
Reduced Model   1.54    1.00    0.46                1.00    1.24    1.82

For the Turnip Greens data the (intercept-adjusted) eigenvalues of X*′X* and the con-
dition indices are shown in Table 4.17. The full model containing X_4 has one eigenvalue that
is close to zero, and the associated condition index is greater than the rule-of-thumb value of
30. Removing X_4 leads to a model without a collinearity problem, but early on we found that
the model without X_4 does not fit the data well. To keep the four-regressor model and reduce
the negative impact of collinearity on the least squares estimates, we now employ ridge
regression.

4.4.5 Ridge Regression to Combat Collinearity

Box 4.7 Ridge Regression

• Ridge regression is an ad hoc regression method to combat collinearity. The
ridge regression estimator allows for some bias in order to reduce the mean
square error compared to OLS.

• The method has an ad hoc character because the user must choose the ridge
factor, a small number by which to shrink the least squares estimates.



Least squares estimates are best linear unbiased estimators in the model

    Y = Xβ + e,  e ~ (0, σ²I),

where best implies that the variance of the linear combination a′β̂_OLS is smaller than the
variance of a′β̃, where β̃ is any other estimator that is linear in Y. In general, the mean
square error (MSE) of an estimator f(Y) for a target parameter θ is defined as

    MSE[f(Y), θ] = E[(f(Y) − θ)²] = E[(f(Y) − E[f(Y)])²] + E[f(Y) − θ]²
                 = Var[f(Y)] + Bias[f(Y), θ]².    [4.34]

The MSE has a variance and a (squared) bias component that reflect the estimator's precision
and accuracy, respectively (low variance = high precision). The mean square error is a more
appropriate measure of the performance of an estimator than the variance alone since it takes
both precision and accuracy into account. If f(Y) is unbiased for θ, then MSE[f(Y), θ] =
Var[f(Y)] and choosing among unbiased estimators on the basis of their variances is reason-
able. But unbiasedness is not necessarily the most desirable property. An estimator that is
highly variable and unbiased may be less preferable than an estimator that has a small bias
and high precision. Being slightly off target if one is always close to the target should not
bother us as much as being on target on average but frequently far from it.

The relative efficiency of f(Y) compared to g(Y) as an estimator for θ is measured by
the ratio of their respective mean square errors:

    RE[f(Y), g(Y)|θ] = MSE[g(Y), θ] / MSE[f(Y), θ]
                     = (Var[g(Y)] + Bias[g(Y), θ]²) / (Var[f(Y)] + Bias[f(Y), θ]²).    [4.35]

If RE[f(Y), g(Y)|θ] > 1 then f(Y) is preferred, and g(Y) is preferred if the efficiency is less
than 1. Assume that f(Y) is an unbiased estimator and that we choose g(Y) by shrinking
f(Y) by some multiplicative factor c (0 < c < 1). Then

    MSE[f(Y), θ] = Var[f(Y)]
    MSE[g(Y), θ] = c²Var[f(Y)] + (c − 1)²θ².

The efficiency of the unbiased estimator f(Y) relative to the shrinkage estimator
g(Y) = cf(Y) is

    RE[f(Y), cf(Y)|θ] = (c²Var[f(Y)] + (c − 1)²θ²) / Var[f(Y)] = c² + θ²(c − 1)²/Var[f(Y)].

If c is chosen such that c² + θ²(c − 1)²/Var[f(Y)] is less than 1, the biased estimator is more
efficient than the unbiased estimator and should be preferred.
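
The value of c that makes the shrinkage estimator most efficient follows by minimizing the
mean square error, a step the text leaves implicit. Differentiating and setting to zero,

    \frac{d}{dc}\left\{c^2\,\mathrm{Var}[f(\mathbf{Y})] + (c-1)^2\theta^2\right\}
      = 2c\,\mathrm{Var}[f(\mathbf{Y})] + 2(c-1)\theta^2 = 0
    \;\Longrightarrow\;
    c^{*} = \frac{\theta^2}{\theta^2 + \mathrm{Var}[f(\mathbf{Y})]},

which always lies in (0, 1), and at c* the relative efficiency equals c* < 1. Some shrinkage
therefore always reduces the mean square error, although c* depends on the unknown θ.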
How does this relate to least squares estimation in the classical linear model? The opti-
mality of the least squares estimator of β, since it is restricted to the class of unbiased estima-
tors, implies that no other unbiased estimator will have a smaller mean square error. If the
regressor matrix X exhibits near-linear dependencies (collinearity), the variance of β̂_j may be
inflated, and the sign and magnitude of β̂_j may be distorted. The least squares estimator, albeit
unbiased, has become unstable and imprecise. A slightly biased estimator with greater stabili-
ty can be more efficient because it has a smaller mean square error. To motivate this esti-
mator, consider the centered model [4.32]. The unbiased least squares estimator is

    β̂*_OLS = (X*′X*)⁻¹X*′Y.

The ill-conditioning of the X*′X* matrix due to collinearity causes the instability of the esti-
mator. To improve the conditioning of X*′X*, add a small positive amount δ to its diagonal
values,

    β̂*_R = (X*′X* + δI)⁻¹X*′Y.    [4.36]

This estimator is known as the ridge regression estimator of β (Hoerl and Kennard 1970a,
1970b). It is applied in the centered rather than the original model to remove the scale effects
of the columns of X. Whether a covariate is measured in inches or meters, the same correction
should apply, since a single ridge factor δ is added to all diagonal elements. Centering the
model removes scaling effects; every column of X* has a sample mean of 0 and a sample
variance of 1. The value δ, a small positive number chosen by the user, is called the ridge
factor.
The ridge regression estimator is not unbiased. Its bias (see §A4.8.4) is

    E[β̂*_R − β*] = −δ(X*′X* + δI)⁻¹β*.

If δ = 0 the unbiased ordinary least squares estimator results, and with increasing ridge factor
δ, the ridge estimator applies more shrinkage.

To see that the ridge estimator is a shrinkage estimator, consider a simple linear regres-
sion model in centered form,

    Y_i = β*_0 + x*_i β*_1 + e_i,

where x*_i = (x_i − x̄)/s_x. The least squares estimate of the standardized slope is

    β̂*_1 = Σ_{i=1}^n x*_i y_i / Σ_{i=1}^n x*_i²,

while the ridge estimate is

    β̂*_1R = Σ_{i=1}^n x*_i y_i / (δ + Σ_{i=1}^n x*_i²).

Since δ is positive, |β̂*_1R| < |β̂*_1|. The standardization of the coefficient is removed by dividing
it by s_x, and we then also have |β̂_1R| < |β̂_1|.
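
Equation [4.36] is also easy to apply directly. The following SAS/IML sketch uses the turnip
data and the centering of this section; note that X*′X* here equals (n − 1) times the correlation
matrix of the regressors, so the scale of δ need not coincide with the ridge= values used by
proc reg below.

proc iml;
  use turnip;
  read all var {sun moist airtemp x4} into X;
  read all var {vitamin} into y;
  close turnip;
  n  = nrow(X);
  Xs = (X - repeat(mean(X), n, 1)) / repeat(std(X), n, 1);  /* centered, scaled */
  delta = 0.01;                                             /* ridge factor     */
  b_ridge = inv(Xs`*Xs + delta*I(ncol(Xs))) * Xs`*(y - mean(y));  /* eq. [4.36] */
  print b_ridge;
quit;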

The choice of δ is made by the user, which has led to some disapproval of ridge
regression among statisticians. Although numerical methods exist to estimate the best value
of the ridge factor from data, the most frequently used technique relies on graphs of the ridge
trace. The ridge trace is a plot of β̂_jR for various values of δ. The ridge factor is chosen as
the value for which the estimates stabilize (Figure 4.9).


Figure 4.9. Ridge traces for full model in the Turnip Greens example. Also shown (empty
circles) is the variance inflation factor for the Soil Moisture coefficient.

For the full model in the Turnip Greens example the two coefficients affected by the col-
linearity are β̂_2 and β̂_4. Adding only a small amount δ = 0.01 to the diagonals of the X*′X*
matrix already stabilizes the coefficients (Figure 4.9). The variance inflation factor of the Soil
Moisture coefficient is also reduced dramatically. When studying a ridge trace such as Figure
4.9 we look for various indications of having remedied collinearity:
• The smallest value of δ at which the parameter estimates are stabilized (do not change
rapidly). This is not necessarily the value where the ridge trace becomes a flat line.
Inexperienced users tend to choose a δ value that is too large, taking the notion of
stability too literally.
• Changes in sign. One of the effects of collinearity is that signs of the coefficients are at
odds with theory. The ridge factor δ should be chosen in such a way that the signs of
the ridge estimates are meaningful on subject-matter grounds.
• Coefficients that change the most when the ridge factor is altered are associated with
variables involved in a near-linear dependency. The more the ridge traces swoop, the
higher the degree of collinearity.
• If the degree of collinearity is high, a small value of δ will suffice, although this result
may seem counterintuitive.

Table 4.18 shows the standardized ordinary least squares coefficients and the ridge esti-
mates for δ = 0.01 along with their estimated standard errors. Sign, size, and standard errors
of the coefficients change drastically with only a small adjustment to the diagonal of X*′X*, in
particular for β_2 and β_4.

Table 4.18. Ordinary least squares and ridge regression estimates (δ = 0.01)
along with their estimated standard errors

               Ordinary Least Squares          Ridge Regression
Coefficient    Estimate    Est. Std. Error     Estimate    Est. Std. Error
β_1            −0.11274    0.02438             0.03891     0.03240
β_2            5.98982     1.00526             0.47700     0.21146
β_3            −0.20980    0.21094             0.01394     0.26045
β_4            −6.87490    0.01956             −1.34614    0.00411

The ridge regression estimates can be calculated in SAS® with the ridge= option of the
reg procedure. The outest= option is required to identify a data set in which SAS® collects
the ridge estimates. By adding the outstb and outvif options, SAS® will also save the standard-
ized coefficients along with the variance inflation factors in the outest data set. In the Turnip
Greens example, the following code calculates ridge regression estimates for ridge factors
δ = 0, 0.01, 0.02, ..., 0.5, 0.75, 1, 1.25, 1.5, 1.75, and 2. The subsequent proc print step
displays the results.

proc reg data=turnip outest=regout outstb outvif
         ridge=0 to 0.5 by 0.01 0.5 to 2 by 0.25;
  model vitamin = sun moist airtemp x4;
run; quit;
proc print data=regout; run;
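
The outest data set regout can then be used to draw the ridge trace itself. A minimal sketch
with a modern SAS release follows; proc reg stores the standardized ridge coefficients in
observations with _TYPE_='RIDGESTB' and the ridge factor in the automatic variable _RIDGE_.

proc sgplot data=regout(where=(_type_='RIDGESTB'));
  series x=_ridge_ y=sun;      /* one trace per regressor */
  series x=_ridge_ y=moist;
  series x=_ridge_ y=airtemp;
  series x=_ridge_ y=x4;
  xaxis label='Ridge factor';
  yaxis label='Standardized coefficient';
run;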

For more details on ridge regression and its applications the reader is referred to Myers
(1990, Ch. 8).

4.5 Diagnosing Classification Models

4.5.1 What Matters?

Diagnosis of a classification model, such as an analysis of variance model in an experimental
design, does not focus on the mean function as much as is the case for a regression model. In
the classification model Y = Xθ + e, where X is a matrix of dummy (design) variables,
collinearity issues do not arise. While there are no near-linear dependencies among the
columns of X, the columns exhibit perfect linear dependencies and X is not a full-rank matrix.
The implications of this rank deficiency have been explored in §3.4 and are further discussed
in §A4.8.1 and §A4.8.2. Essentially, we replace (X′X)⁻¹ in all formulas with a generalized
inverse (X′X)⁻; the least squares estimates are biased and not unique, but estimable linear
combinations a′θ̂ are unique and unbiased. Also, which effects are included in the X matrix,
that is, the structure of the mean function, is typically not of great concern to the modeler. In
the Lime Application example (Example 4.1), the mean structure consists of Lime Type and
Rate of Application main effects and the Lime Type × Rate interactions. Even if the inter-
actions were not significant, one would not remove them from the model, in contrast to a
regression context where a model for the mean response is built from scratch.

Diagnostics in classification models focus primarily on the stochastic part, the model
errors. Of concern are the traditional assumptions that e is a vector of homoscedastic, independent, Gaussian
disturbances. Performance evaluations of the analysis of variance F test,
multiple comparison procedures, and other ANOVA components under departures from these
assumptions have a long history. Based on the results of these evaluations we rank the three error
assumptions in increasing order of importance:
• Gaussian distribution of e;
• homogeneous variances;
• independence of (lack of correlation among) error terms.

The analysis of variance F test is generally robust against mild and moderate departures
of the error distribution from the Gaussian model. This result goes back to Pearson (1931),
Geary (1947), and Gayen (1950). Box and Andersen (1955) note that the robustness of the test
increases with the sample size per group (number of replications per treatment). The reason
for the robustness of the F test is that the group sample means are unbiased estimates of the
group means and, for sufficient sample size, are Gaussian-distributed by the Central Limit
Theorem regardless of the parent distribution from which the data are drawn. Finally, it must
be noted that in designed experiments under randomization one can (and should) apply signi-
ficance tests derived from randomization theory (§A4.8.2). These tests do not require
Gaussian-distributed error terms. As shown by, e.g., Box and Andersen (1955), the analysis
of variance F test is a good approximation to the randomization test.

A departure from the equal variance assumption is more troublesome. If the data are
Gaussian-distributed and the sample sizes in the groups are (nearly) equal, the ANOVA F
test retains a surprising robustness against moderate violations of this assumption (Box
1954a, 1954b; Welch 1937). This is not true for one-degree-of-freedom contrasts, which are
very much affected by unequal variances even if the group sizes are the same. As group sizes
become more unbalanced, small departures from the equal variance assumption can have very
negative effects. Box (1954a) studied the significance level of the F test in group comparisons
as a function of the ratio of the smallest group variance to the largest group variance. With
equal replication and a variance ratio of 1:3, the nominal significance level of 0.05 was only
slightly exceeded (0.056 to 0.074), but with unequal group sizes the actual significance level
was as high as 0.12. The test becomes liberal, rejecting more often than it should.

Finally, the independence assumption is the most critical since it affects the F test most
severely. The essential problem has been outlined before. If correlations among observations
are positive, the p-values of the standard tests that treat the data as if they were uncorrelated are
too small, as are the standard error estimates.

There is no remedy for lack of independence except observing proper experimental
procedure, for example, randomizing treatments to experimental units and ensuring that ex-
perimental units not only receive treatments independently but also respond independently.
Even then, data may be correlated, for example, when measurements are taken repeatedly
over time. There is no univariate data transformation to uncorrelate observations, but there are,
of course, methods for diagnosing and correcting departures from the Gaussian and the homoscedasticity assumptions.



The remedies are often sought in transformations of the data, in the hope
that the transformation that stabilizes the variance also achieves greater symmetry in the
data, so that a Gaussian assumption is more reasonable. In what follows we discuss various
transformations and illustrate corrective measures for unequal group variances by example.
Because of the robustness of the F test against moderate departures from Gaussianity, we
focus on transformations that stabilize variability.

For two-way layouts without replication (a randomized block design, for example), an
exploratory technique called median polishing is introduced in §4.5.3. We find it particularly
useful for diagnosing outliers and interactions, which are problematic in the absence of replication.

4.5.2 Diagnosing and Combating Heteroscedasticity

Table 4.19 contains simulated data from a one-way classification with five groups (treat-
ments) and ten samples per group (replications per treatment). A look at the sample means
and sample standard deviations at the bottom of the table suggests that the variability in-
creases with the treatment means. Are the discrepancies among the treatment sample standard
deviations large enough to worry about the equal variance assumption? To answer this
question, numerous statistical tests have been developed, e.g., Bartlett's test (Bartlett
1937a,b), Cochran's test (Cochran 1941), and Hartley's F-Max test (Hartley 1950).

These tests have in common a discouraging sensitivity to departures from the Gaussian
assumption. Depending on whether the data have kurtosis higher or lower than that of the
Gaussian distribution, they tend to exaggerate or mask differences among the variances. Sahai
and Ageel (2000, p. 107) argue that when these tests are used as a check before the analysis
of variance and lead to rejection, one can conclude that either the Gaussian error or the
homogeneous variance assumption or both are violated. We discourage the practice of using a
test designed for detecting heterogeneous variances to draw conclusions about departures
from the Gaussian distribution, simply because the test performs poorly in this circumstance.
If the equal variance assumption is to be tested, we prefer a modification by Brown and
Forsythe (1974) of a test by Levene (1960). Let

    U_ij = |Y_ij − median_j{Y_ij}|

denote the absolute deviation of an observation in group i from the group median. Then
calculate an analysis of variance for the U_ij and reject the hypothesis of equal variances when
significant group differences among the U_ij exist.

Originally, the Levene test was based on U_ij = (Y_ij − Ȳ_i.)², rather than absolute
deviations from the median. Other variations of the test include analyses of variance based on

    U_ij = |Y_ij − Ȳ_i.|
    U_ij = ln{|Y_ij − Ȳ_i.|}
    U_ij = √|Y_ij − Ȳ_i.|.

The first of these deviations as well as U_ij = (Y_ij − Ȳ_i.)² are implemented through the
hovtest= option of the means statement in proc glm of The SAS® System.
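
The Brown-Forsythe variant is also easy to compute from first principles. The following
sketch assumes a data set hetero with group variable tx and response y, sorted by tx (the
data set used in the example below):

proc means data=hetero noprint;
  by tx;
  var y;
  output out=med(keep=tx med) median=med;
run;
data bf;
  merge hetero med;
  by tx;
  u = abs(y - med);      /* |Y_ij - group median| */
run;
proc glm data=bf;
  class tx;
  model u = tx;          /* ANOVA F test on the U_ij */
run; quit;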



Table 4.19. Data set from a one-way classification with 10 replicates

                                 Treatment
Replicate    i = 1      2          3          4          5
1            1.375      2.503      5.278      12.301     7.219
2            2.143      0.431      4.728      11.139     24.166
3            1.419      16.493     6.833      19.349     37.613
4            0.295      11.096     9.609      7.166      9.910
5            2.572      2.436      10.428     10.372     8.509
6            0.312      7.774      7.221      8.956      17.380
7            2.272      6.520      12.753     9.362      19.566
8            1.806      6.386      10.694     9.237      15.954
9            1.559      5.927      2.534      12.929     11.878
10           1.409      8.883      7.093      6.485      15.941
ȳ_i.         1.514      6.845      7.717      10.729     16.814
s_i          0.757      4.676      3.133      3.649      9.000

For the data in Table 4.19 we fit a simple one-way model:

    Y_ij = μ + τ_i + e_ij,  i = 1, ..., 5; j = 1, ..., 10,

where Y_ij denotes the jth replicate value observed for the ith treatment, and obtain the Levene
test with the statements

proc glm data=hetero;
  class tx;
  model y = tx / ss3;
  means tx / hovtest=levene(type=abs);
run; quit;

The Levene test clearly rejects the hypothesis of equal variances
(p = 0.0081, Output 4.9). Even with equal sample sizes per group, the test for equality of the
treatment means (F_obs = 12.44, p < 0.0001) is not to be trusted when variance heterogeneity is
this strong.

Output 4.9. The GLM Procedure

Class Level Information


Class Levels Values
tx 5 1 2 3 4 5

Number of observations 50

Dependent Variable: y
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 4 1260.019647 315.004912 12.44 <.0001
Error 45 1139.207853 25.315730
Corrected Total 49 2399.227500

R-Square Coeff Var Root MSE y Mean


0.525177 57.67514 5.031474 8.723818

Source DF Type III SS Mean Square F Value Pr > F


tx 4 1260.019647 315.004912 12.44 <.0001



Output 4.9 (continued).
Levene's Test for Homogeneity of y Variance
ANOVA of Absolute Deviations from Group Means

Sum of Mean
Source DF Squares Square F Value Pr > F
tx 4 173.6 43.4057 3.93 0.0081
Error 45 496.9 11.0431

Level of --------------y--------------
tx N Mean Std Dev
1 10 1.5135500 0.75679658
2 10 6.8449800 4.67585820
3 10 7.7172000 3.13257166
4 10 10.7295700 3.64932521
5 10 16.8137900 9.00064885


Figure 4.10. Raw residuals of one-way classification model fit to data in Table 4.19.

A graphical means of diagnosing heterogeneous variances is based on plotting the
residuals against the treatment means. As is often the case with biological data, variability in-
creases with the means, and this should be obvious from the residual plot. Figure 4.10 shows a
plot of the raw residuals ê_ij = y_ij − ŷ_ij against the treatment means ȳ_i.. Not only is it obvious
that the variability of the residuals is very heterogeneous across the groups, it also appears that the
distribution of the residuals is right-skewed rather than Gaussian.

Can the data be transformed to make the residual distribution more symmetric and its
variability homogeneous across the treatments? Transformations to stabilize the variance
exploit a functional relationship between group means and group variances. Let μ_i = μ + τ_i
denote the mean in the ith group and σ_i² the error variance in the group. If μ_i = σ_i², the
variance-stabilizing transform is the square root transformation U_ij = √Y_ij. The Poisson
distribution (see §6.2) for counts has this property of mean-variance identity. For Binomial
counts, such as the number of successes out of n independent trials, each of which can result
in either a failure or a success, the relationship between the mean and variance is E[Y_ij] = n_ij π_i,
Var[Y_ij] = n_ij π_i(1 − π_i), where π_i is the probability of a success in any one trial. The
variance-stabilizing transform for this random variable is the arcsine transformation
U_ij = sin⁻¹(√(Y_ij/n_ij)). The transformed variable has variance 1/(4n_ij). Anscombe (1948)
recommended modifications of these basic transformations. In the Poisson case when the
average count is small, U_ij = √(Y_ij + 3/8) has been found to stabilize the variances better, and
in the Binomial case

    U_ij = sin⁻¹( √( (Y_ij + 3/8) / (n_ij + 3/4) ) )

appears to be superior.
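
In SAS these transformations are one-line data-step computations. A minimal sketch follows,
with hypothetical variable names y (the count) and n (the Binomial denominator):

data transformed;
  set counts;                                  /* hypothetical input data set */
  u_pois = sqrt(y + 3/8);                      /* Anscombe, Poisson counts    */
  u_bin  = arsin(sqrt((y + 3/8)/(n + 3/4)));   /* Anscombe, Binomial counts   */
run;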
For continuous data, the relationship between mean and variance is not known unless the
distribution of the responses is known (see §6.2 on how mean-variance relationships for
continuous data are used in formulating generalized linear models). If, however, it can be
established that the variances are proportional to some power of the mean, a transformation can
be derived. Box and Cox (1964) assume that

    √Var[Y_ij] = σ_i ∝ μ_i^β.    [4.37]

If one takes the transformation U_ij = Y_ij^λ, then √Var[U_ij] ∝ μ_i^(β+λ−1). Since variance homoge-
neity implies σ_i ∝ 1, the transformation with λ = 1 − β stabilizes the variance. If λ = 0, the power
transformation is not defined and the logarithmic transformation is chosen.
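
The exponent β + λ − 1 follows from a first-order Taylor (delta method) argument that the
text leaves implicit: for U = Y^λ with σ = αμ^β,

    \sqrt{\mathrm{Var}[Y^{\lambda}]}
      \approx \left|\lambda\,\mu^{\lambda-1}\right|\sqrt{\mathrm{Var}[Y]}
      = \lambda\,\mu^{\lambda-1}\cdot\alpha\,\mu^{\beta}
      \propto \mu^{\beta+\lambda-1},

which is constant in μ exactly when λ = 1 − β.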
For the data in Table 4.19, sample means and sample standard deviations are available for
each group. We can empirically determine which power will yield a variance-stabilizing
transformation. For data in which the standard deviations are proportional to a power
of the mean, the relationship σ_i = αμ_i^β can be linearized by taking logarithms: ln{σ_i} =
ln{α} + β ln{μ_i}. Substituting estimates s_i for σ_i and ȳ_i. for μ_i, this is a linear regression of
the log standard deviations on the log sample means: ln{s_i} = β_0 + β ln{ȳ_i.} + e_i*. Figure
4.11 shows a plot of the five sample standard deviation/sample mean points after taking
logarithms. The trend is linear and the assumption that σ_i is proportional to a power of the
mean is reasonable. The statements

proc means data=hetero noprint;
  by tx;
  var y;
  output out=meanstd(drop=_type_) mean=mean std=std;
run;
data meanstd; set meanstd;
  logstd = log(std);
  logmn = log(mean);
run;
proc reg data=meanstd;
  model logstd = logmn;
run; quit;

calculate the treatment-specific sample means and standard deviations, take their logarithms,
and fit the simple linear regression model (Output 4.10). The estimate of the slope is
β̂ = 0.958, which suggests that 1 − 0.958 = 0.042 is the power for the variance-stabilizing
transform. Since 1 − β̂ is close to zero, one could also opt for a logarithmic transformation of
the data. The analysis of variance and the Levene homogeneity-of-variance test for the trans-
formed values are obtained with

data hetero; set hetero;
  powery = y**0.042;
run;
proc glm data=hetero;
  class tx;
  model powery = tx / ss3;
  means tx / hovtest=levene(type=abs);
run; quit;

and results are shown in Output 4.11.



Figure 4.11. Logarithm of sample standard deviations vs. logarithm of sample means for data
in Table 4.19 suggests a linear trend.

Output 4.10. The REG Procedure


Dependent Variable: logstd

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F
Model 1 3.02949 3.02949 32.57 0.0107
Error 3 0.27901 0.09300
Corrected Total 4 3.30850

Root MSE 0.30497 R-Square 0.9157


Dependent Mean 1.17949 Adj R-Sq 0.8876
Coeff Var 25.85570

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 -0.65538 0.34923 -1.88 0.1572
logmn 1 0.95800 0.16785 5.71 0.0107



Output 4.11. The GLM Procedure

Class Level Information


Class Levels Values
tx 5 1 2 3 4 5

Number of observations 50

Dependent Variable: powery


Sum of
Source DF Squares Mean Square F Value Pr > F
Model 4 0.07225884 0.01806471 20.76 <.0001
Error 45 0.03915176 0.00087004
Corrected Total 49 0.11141060

R-Square Coeff Var Root MSE powery Mean


0.648581 2.736865 0.029496 1.077745

Source DF Type III SS Mean Square F Value Pr > F


tx 4 0.07225884 0.01806471 20.76 <.0001

Levene's Test for Homogeneity of powery Variance


ANOVA of Absolute Deviations from Group Means

Sum of Mean
Source DF Squares Square F Value Pr > F
tx 4 0.00302 0.000755 2.19 0.0852
Error 45 0.0155 0.000345

Level of ------------powery-----------
tx N Mean Std Dev

1 10 1.00962741 0.03223304
2 10 1.07004861 0.04554636
3 10 1.08566961 0.02167296
4 10 1.10276544 0.01446950
5 10 1.12061483 0.02361509

The Levene test is (marginally) nonsignificant at the 5% level. Since we are interested in
accepting the null hypothesis of equal variances in the five groups, this result is not too
convincing; we would like the p-value for the Levene test to be larger. The Levene test based
on squared deviations (Z_ij − Z̄_i.)², with Z denoting the transformed response, has a p-value
of 0.189, which is more appropriate. The group-specific standard deviations are noticeably
less heterogeneous than for the untransformed data. The ratio of the largest to the smallest
group sample variance for the original data was 9.0²/0.7568² = 141.42, whereas this ratio is
only 0.0322²/0.0145² = 4.93 for the transformed data. The 5% critical value for Hartley's
F-Max test is 7.11, and we fail to reject the hypothesis of equal variances with this test as well.

4.5.3 Median Polishing of Two-Way Layouts

A two-way layout is a cross-tabulation of two discrete variables. While common in the
analysis of categorical data, where the body of the table depicts absolute or relative
frequencies, two-way layouts also arise in field experiments and in the study of spatial data. For
example, a randomized complete block design can be thought of as a two-way layout with
blocks as the row variable, treatments as the column variable, and the observed outcomes in
the body of the table. In this subsection we consider only nonreplicated two-way layouts with
at most one value in each cell. If replicate values are available, it is assumed that the data were
preprocessed to reduce the replicates to a summary value (the total, median, or average).
Because of the importance of randomized complete block designs (RCBDs) in the plant and
soil sciences we focus on these in particular.

An important issue that is often overlooked in RCBDs is the presence of
block × treatment interactions. The crux of the standard RCBD with t treatments in b blocks
is that each treatment appears only once in each block. To estimate the experimental error
variance one cannot rely on replicate values of treatments within each block as one can, for example,
in the generalized randomized block design (Hinkelmann and Kempthorne 1994, Ch. 9.7).
Instead, the experimental error (residual) sum of squares in an RCBD is calculated as

    SSR = Σ_{i=1}^b Σ_{j=1}^t (y_ij − ȳ_i. − ȳ_.j + ȳ..)²,    [4.38]

which is an interaction sum of squares. If treatment comparisons depend on the block in
which the comparison is made, MSR = SSR/{(b − 1)(t − 1)} is no longer an unbiased
estimator of the experimental error variance, and test statistics having MSR in the denomi-
nator no longer have the needed distributional properties. It is the absence of
block × treatment interactions that ensures that MSR is an unbiased estimator of the
experimental error variance. One also cannot estimate the full set of block × treatment interactions,
as this would use up (b − 1)(t − 1) degrees of freedom, leaving nothing for the estimation of
the experimental error variance. Tukey (1949) developed a one-degree-of-freedom test for
nonadditivity that tests for a particularly structured type of interaction in nonreplicated two-
way layouts.

The interaction of blocks and treatments in a two-way layout can be caused by only a few
extreme observations in some blocks. Hence it is important to determine whether outliers are
present in the data and whether they induce a block × treatment interaction. In that case, redu-
cing their influence on the analysis by using robust estimation methods (see §4.6.4) can lead
to a valid analysis of the block design. If the interaction is not due to outlying observations,
then it must be incorporated in the model in such a way that degrees of freedom for the esti-
mation of the experimental error variance remain. To examine the possibility of block × treat-
ment interaction and the presence of outliers we reach into the toolbox of exploratory data
analysis. The method we prefer is termed median polishing (Tukey 1977, Emerson and
Hoaglin 1983, Emerson and Wong 1985). Median polishing is also a good device for checking
the Gaussianity of the model errors prior to an analysis of variance. In the analysis of spatial
data, median polishing also plays an important role as an exploratory tool and has been
successfully applied as an alternative to traditional methods of geostatistical prediction
(Cressie 1986).

Testing nonreplicated two-way data for Gaussianity presents a considerable problem. The
assumption of Gaussianity in the analysis of variance for a randomized complete block design
requires that each of the treatment populations is Gaussian-distributed. It is thus not sufficient
to simply calculate a normal probability plot or a test of normality for the entire two-way data set,
since this assumes that the treatment means (and variances) are equal. Normal probability
(and similar) plots of least squares residuals have their own problems. Although they have
zero mean, they are not homoscedastic and not uncorrelated (§4.4.1). Furthermore, since least squares
estimates are not outlier-resistant, least squares residuals may not reflect how extreme observations
really are. With a method of fitting that accounts for the mean structure in an outlier-
resistant fashion, one can instead expect to see the full force of the outlying observations in the
residuals. Error recovery as an alternative (§4.4.2) produces only a few residuals because
error degrees of freedom are often small.
To fix ideas, let Y_ij be the observation in row i and column j of a two-way table. In the
following 3 × 3 table, for example, Y_13 = 42.

             Variety
Fertilizer   V1   V2   V3
B1           38   60   42
B2           13   32   29
B3           27   43   38
To decompose the variability in this table we apply the simple decomposition

    Data = All + Row + Column + Residual,

where All refers to an overall effect, Row to the row effects, and so forth. In a block design,
Row could denote the blocks, Column the treatments, and All an overall effect (intercept).
With familiar notation the decomposition can be expressed mathematically as

    Y_ij = μ + α_i + β_j + e_ij,  i = 1, ..., I; j = 1, ..., J,

where I and J denote the number of rows and columns, respectively. If there are no missing
entries in the table, the least squares estimates of the effects are simple linear combinations of
the various row and column averages:

    μ̂ = (1/(IJ)) Σ_{i=1}^I Σ_{j=1}^J y_ij = ȳ..
    α̂_i = ȳ_i. − ȳ..
    β̂_j = ȳ_.j − ȳ..
    ê_ij = y_ij − ȳ_i. − ȳ_.j + ȳ..

Notice that the ê_ij are the terms in the residual sum of squares [4.38]. The estimates have the
desirable property that Σ_{i=1}^I Σ_{j=1}^J ê_ij = 0, mimicking a property of the model disturbances e_ij,
provided the decomposition is correct. The sample means unfortunately are sensitive to
extreme observations (outliers). In least squares analysis based on means, the residual in any
cell of the two-way table affects the fitted values in all other cells, and the effect of an outlier
is to leak its contribution across the estimates of the overall, row, and column effects
(Emerson and Hoaglin 1983). Tukey (1977) suggested basing estimation of the effects on
medians, which are less affected by extreme observations. Leakage of outliers into estimates
of overall, row, and column effects is then minimized. The method, termed median
polishing, is iterative and sweeps medians out of rows, then column medians out of the row-
adjusted residuals, then row medians out of the column-adjusted residuals, and so forth. The process



stops when the following conditions are met:

    median_i{α̃_i} = 0
    median_j{β̃_j} = 0    [4.39]
    median_i{ẽ_ij} = median_j{ẽ_ij} = 0.

In [4.39], α̃_i denotes the median-polished effect of the ith row, β̃_j the median-polished
effect of the jth column, and the residual is calculated as ẽ_ij = y_ij − μ̃ − α̃_i − β̃_j.

A short example of median polishing follows, kindly made available by Dr. Jeffrey B.
Birch¹ at Virginia Tech. The 3 × 3 layout represents the yield per plot as a function of type of
fertilizer (B1, B2, B3) and wheat variety (V1, V2, V3).

             Variety
Fertilizer   V1   V2   V3
B1           38   60   42
B2           13   32   29
B3           27   43   38
Step 1. Obtain row medians and subtract them from the y_ij.

     0    0   0   0   (0)          0    0    0   0
     0   38  60  42   (42)   →    42   −4   18   0
     0   13  32  29   (29)        29  −16    3   0
     0   27  43  38   (38)        38  −11    5   0

Some housekeeping is done in the first row and column, storing μ̃ in the first cell, the α̃_i
in the first column, and the β̃_j in the first row of the table. The table on the right-hand side is
obtained from the one on the left by adding the row medians to the first column and subtract-
ing them from the remainder of the table.

Step 2. Obtain column medians and subtract them from the residuals (right-hand table from
step 1).

     0    0    0   0         38  −11    5   0
    42   −4   18   0    →     4    7   13   0
    29  −16    3   0         −9   −5   −2   0
    38  −11    5   0          0    0    0   0
   (38) (−11) (5) (0)

The right-hand table in this step is obtained by adding the column medians to the first
row and subtracting them from the remainder of the table.

Step 3. Repeat, sweeping the row medians.

    38  −11    5   0   (0)         38  −11    5   0
     4    7   13   0   (7)   →     11    0    6  −7
    −9   −5   −2   0   (−2)       −11   −3    0   2
     0    0    0   0   (0)          0    0    0   0

The right-hand table is obtained by adding the row medians to the first column and sub-
tracting them from the rest of the table.

Step 4. Repeat, sweeping the column medians.

    38  −11    5   0
    11    0    6  −7
   −11   −3    0   2
     0    0    0   0
    (0)  (0)  (0) (0)

At this point the iterative sweep stops. Nothing more is to be added to the first row or subtrac-
ted from the rest of the table. Notice that the residuals in the body of the table do not sum to
zero.

¹ Birch, J.B. Unpublished manuscript: Contemporary Applied Statistics: An Exploratory and Robust Data Analysis Approach.
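
The sweep is also easy to program. The following SAS/IML sketch is a minimal version of
the algorithm (it is not the %MedPol() macro used below); applied to the 3 × 3 table above, it
reproduces the final decomposition: grand effect 38, row effects (11, −11, 0), and column
effects (−11, 5, 0).

proc iml;
  y = {38 60 42, 13 32 29, 27 43 38};       /* complete two-way table        */
  res = y;  grand = 0;
  rowEff = j(nrow(y), 1, 0);
  colEff = j(1, ncol(y), 0);
  do sweep = 1 to 20;                       /* iterate until medians are 0   */
    rm = T(median(T(res)));                 /* row medians                   */
    rowEff = rowEff + rm;
    res = res - repeat(rm, 1, ncol(y));
    grand  = grand + median(T(colEff));     /* sweep median of column effects */
    colEff = colEff - median(T(colEff));
    cm = median(res);                       /* column medians                */
    colEff = colEff + cm;
    res = res - repeat(cm, nrow(y), 1);
    grand  = grand + median(rowEff);        /* sweep median of row effects   */
    rowEff = rowEff - median(rowEff);
  end;
  print grand, rowEff, colEff, res;
quit;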
The swept row and column effects represent the large-scale variation (the mean
structure) of the data. The median-polished residuals represent the errors and are used to check
for Gaussianity, outliers, and interactions. When row and column effects are not additive,
Tukey (1977) suggests augmenting the model with an interaction term

    Y_ij = μ + α_i + β_j + θ(α_i β_j / μ) + e*_ij.    [4.40]

If this type of patterned interaction is present, then θ should differ from zero. After median
polishing the no-interaction model, it is recommended to plot the residuals ẽ_ij against the
comparison values α̃_i β̃_j / μ̃. A trend in this plot indicates the presence of row × column
interaction. To decide whether the interaction is induced by outlying observations or is a more
general phenomenon, fit a simple linear regression between ẽ_ij and the comparison values by
an outlier-robust method such as M-Estimation (§4.6.2) and test the slope against zero. Also
fit a simple regression by least squares. When the robust M-estimate of the slope is not
different from zero but the least squares estimate is, the interaction between rows and
columns is caused by outlying observations. Transforming the data or removing the outlier
should then eliminate the interaction. If the M-estimate of the slope differs significantly from
zero, the interaction is not caused by outliers alone. The slope of the diagnostic plot is helpful in
determining a transformation that can reduce the nonadditivity of row and column effects;
power transformations of approximately 1 − θ̂ are in order (Emerson and Hoaglin 1983). In
the interaction test, a failure to reject H₀: θ = 0 is the result of interest, since one can then
proceed with a simpler analysis without interaction. Even if the data are balanced (all cells of
the table are filled), the analysis of variance based on the interaction model [4.40] is non-
orthogonal and treatment means are not estimated by arithmetic averages. Because of the
possibility of a Type II error, one should not accept H₀ when the p-value of the test is barely
larger than 0.05 but instead require p-values in excess of 0.25 to 0.3.

Example 4.3. Dollar Spot Counts (continued). Recall the turfgrass experiment in a
randomized complete block design with t = 14 treatments in b = 4 blocks (p. 93). The
outcome of interest was the number of leaves infected with dollar spot in a reference
area of each experimental unit. One count was obtained from each unit.

In the chapter introduction we noted that count data are often modeled as Poisson
variates, which can be approximated well by Gaussian variates if the mean count is
sufficiently large (>15, say). This does not eliminate the heterogeneous variance
problem, since for Poisson variates the mean equals the variance. The Gaussian distribu-
tion approximating the distribution of counts for treatments with low dollar spot inci-
dence thus has a smaller variance than that for treatments with high incidence. As an
alternative, transformations of the counts are frequently employed to stabilize the
variance; the square-root and logarithmic transformations in particular are popular. In
§6, we discuss generalized linear models that utilize the actual distribution of the count
data, rather than relying on Gaussian approximations or transformations.

Here we perform median polishing of the dollar spot data for the original counts
and the log-transformed counts. Particular attention will be paid to the observation for
treatment 10 in block 4, which appears extreme compared to the remainder of the data (see
Table 4.3, p. 93). Median polishing with The SAS® System is possible with the
%MedPol() macro contained on the CD-ROM (\SASMacros\MedianPolish.sas). It
requires an installation of the SAS/IML® module. The macro does not produce any
output apart from the residual interaction plot if the plot=1 option is active (which is
the default). The results of median polishing are stored in a SAS® data set termed
_medpol with the following structure.

Variable    Value
VALUE       contains the estimates of Grand, Row, Col, and Residual
_TYPE_      0 = grand, 1 = row effect, 2 = col effect, 3 = residual
ROW         indicator of the row number
COL         indicator of the column number
IA          interaction term α̃_i β̃_j / μ̃
SIGNAL      y_ij − ẽ_ij

The median polish of the original counts and a printout of the first 25 observations of
the result data set are produced by the statements

%include 'DriveLetterOfCDROM:\SASMacros\MedianPolish.sas';
title1 'Median Polishing of dollar spot counts';
%MedPol(data=dollarspot,row=block,col=tmt,y=count,plot=1);
proc print data=_medpol(obs=25); run;

Median Polishing of dollar spot counts

Obs Value Row Col _Type_ IA Signal

1 43.3888 0 0 0 . .
2 -0.6667 1 0 1 . .
3 0.6667 2 0 1 . .
4 -6.0000 3 0 1 . .
5 3.4445 4 0 1 . .
6 -13.1111 0 1 2 . .
7 36.7778 0 2 2 . .
8 -16.7777 0 3 2 . .
9 -12.5555 0 4 2 . .
10 11.0556 0 5 2 . .
11 0.9445 0 6 2 . .



12 -0.9444 0 7 2 . .
13 4.1112 0 8 2 . .
14 41.0556 0 9 2 . .
15 65.6112 0 10 2 . .
16 86.3889 0 11 2 . .
17 -8.6111 0 12 2 . .
18 -14.2222 0 13 2 . .
19 -17.9444 0 14 2 . .
20 -13.6111 1 1 3 0.20145 29.6111
21 -39.5000 1 2 3 -0.56509 79.5000
22 3.0556 1 3 3 0.25779 25.9444
23 -0.1667 1 4 3 0.19292 30.1667
24 -22.7777 1 5 3 -0.16987 53.7777
25 -1.6667 1 6 3 -0.01451 43.6667

A graph of the column (treatment) effects against column index gives an indication of
treatment differences to be expected when a formal comparison procedure is invoked
(Figure 4.12).

Figure 4.12. Column (treatment) effects β̃_j for dollar spot counts.

Whether a formal analysis can proceed without accounting for the block × treatment inter-
action cannot be gleaned from a plot of the column effects alone. To this end, plot the
median-polished residuals against the interaction term (Figure 4.13) and calculate least
squares and robust M-estimates (§4.6) of the simple linear regression after the call to
%MedPol():

%include 'DriveLetterOfCDROM:\SASMacros\MEstimation.sas';
title1 'M-Regression for median polished residuals of dollar spot counts';
%MEstim(data=_medpol(where=(_type_=3)),
        stmts=%str(model value = ia /s;)
       );


Figure 4.13. Median-polished residual plots vs. interaction terms α̃_i β̃_j/μ̃ for the
original counts (left panel) and the log-transformed counts (right panel), each with
least squares (OLS) and robust (M) regression lines. The outlier (treatment 10 in block
4) is circled.

The outlying observation for treatment 10 in block 4 pulls the least squares regression
line toward it (Figure 4.13). The interaction is significant (p = 0.0094, Table 4.20) and
remains significant when the influence of the outlier is reduced by robust M-Estimation
(p = 0.0081). The interaction is thus not induced by the outlier alone; a general
block × treatment interaction exists for these data. For the log-transformed counts
(Figure 4.13, right panel) the interaction is not significant (p = 0.2831 for OLS and
p = 0.3898 for M-Estimation) and the least squares and robust regression lines are
almost indistinguishable from each other. The observation for treatment 10 in block 4 is
certainly not as extreme relative to the remainder of the data as is the case for the
original counts.

Table 4.20. P-values for testing the block × treatment interaction, H₀: θ = 0
Response      Least Squares    M-Estimation
Count         0.0094           0.0081
log{Count}    0.2831           0.3898

Based on these findings a log transformation appears reasonable. It reduces the
influence of extreme observations and allows analysis of the data without an interaction term. It
also improves the symmetry of the data: the Shapiro-Wilk test for normality has p-
values of < 0.0001 for the original data and 0.1069 for the log-transformed counts.
The p-values of the interaction test for the log-transformed data are sufficiently large to
proceed with confidence with an analysis of the transformed data without an interaction
term. The larger p-value for the M-estimate of θ suggests using a robustified analysis
of variance rather than the standard, least squares-based analysis. This analysis is
carried out in §4.6.4 after the details of robust parameter estimation have been covered.



4.6 Robust Estimation

Robustness of a statistical inference procedure refers to its stability when model assumptions
are violated. For example, a procedure may be robust to an incorrectly specified mean model,
heterogeneous variances, or contamination of the error distribution. Resistance, on the
other hand, is the property of a statistic or estimator to remain stable when the data are grossly
contaminated (many outliers). The sample median, for example, is a resistant estimator of the
central tendency of sample data, since it is not affected by the absolute values of the observations,
only by whether they occur to the right or the left of the middle. Statistical procedures that are
based on sample medians instead of sample means, such as median polishing (§4.5.3), are
resistant procedures. The claim of robustness must be made with care, since robustness with
respect to one model breakdown does not imply robustness with respect to other model break-
downs. M-Estimators (§4.6.2, §4.6.3), for example, are robust against moderate outlier con-
tamination of the data, but not against high-leverage values. A procedure robust against mild
departures from Gaussianity may be much less robust against violations of the homogeneous
variance assumption.

In this section we focus in particular on mild-to-moderate outlier contamination of the
data in regression and analysis of variance models. Rather than deleting outlying observa-
tions, we wish to keep them in the analysis to prevent a loss of degrees of freedom, but reduce
their negative influence. As mentioned earlier, deletion of diagnosed outliers is not an
appropriate strategy unless the observation can be clearly identified as being in error. Anscombe
(1960) concludes that “if we could be sure that an outlier was caused by a large measurement
or execution error which could not be rectified (and if we had no interest in studying such
errors for their own sake), we should be justified in entirely discarding the observation and all
memory of it.” Barnett and Lewis (1994) distinguish three origins of outlying observations:
• The variability of the phenomenon being studied. An unusual observation may well be
within the range of natural variability; its appearance reflects the distributional proper-
ties of the response.
• Erroneous measurements. Mistakes in reading, recording, or calculating data.
• Imperfect data collection (execution errors). Collecting a biased sample or observing
individuals not representing or belonging to the population under study.

If the variability of the phenomenon being studied is the cause of the outliers, there is no
justification for their deletion. Two alternatives to least squares estimation, L1- and M-
Estimation, are introduced in §4.6.1 and §4.6.2. The prediction efficiency data are revisited in
§4.6.3, where the quadratic response model is fit by robust methods to reduce the negative
influence of the outlying Mg observation without deleting it from the data set. In §4.6.4 we
apply the M-Estimation principle to classification models.



4.6.1 L1-Estimation

Box 4.8 Least Absolute Deviation (LAD) Regression

• L1-Regression minimizes the sum of the absolute residuals rather than the sum of
squared residuals, as does ordinary least squares estimation. An alternative
name is hence Least Absolute Deviation (LAD) Regression.

• The L1-norm of a vector a (k × 1) is defined as Σ_{i=1}^k |a_i|, and the L2-norm is
√(Σ_{i=1}^k a_i²). Least squares is an L2-norm method.

• The fitted trend of an L1-Regression with k − 1 regressors and an intercept
passes through k data points.

The Euclidean norm of a (k × 1) vector a,

    ||a|| = √(a′a) = √(Σ_{i=1}^k a_i²),

is also called its L2-norm, and the L1-norm of a vector is the sum of the absolute values of its
elements: Σ_{i=1}^k |a_i|. In L1-Estimation the objective is to find not the estimates of the model
parameters that minimize the sum of squared residuals (the L2-norm) Σ_{i=1}^n e_i², but those that
minimize the sum of the absolute values of the residuals, Σ_{i=1}^n |e_i|. Hence the alternative name of Least Absolute
Deviation (LAD) estimation. In terms of a linear model Y = Xβ + e, the L1-estimates β̂_L are
the values of β that minimize

    Σ_{i=1}^n |y_i − x_i′β|.    [4.41]

Squaring the residuals in least squares estimation gives more weight to large residuals
than taking their absolute values (see Figure 4.15 below).

A feature of L1-Regression is that the fitted model passes exactly through some of the data
points. In the case of a simple linear regression, y_i = β_0 + β_1 x_i + e_i, the line β̂_0L +
β̂_1L x passes through two data points, and in general, if X is an (n × k) matrix of full rank k, the
model passes through k data points. This fact can be used to devise a brute-force method of
finding the LAD estimates. Force the model through k data points and calculate its sum of
absolute deviations SAD = Σ_{i=1}^n |ê_i|. Repeat this for all possible sets of k points and choose
the set that has the smallest SAD.

Example 4.5. Galton (1886), who introduced the concept of regression in studies of
inheritance, noted that the diameter of offspring peas increased linearly with the
diameter of the parent peas. Seven of the data points from his 1886 publication are
shown in Figure 4.14. When a simple linear regression model is fit to these data, there are
7 × 6/2 = 21 possible least absolute deviation regressions. Ten of them are shown along
with the sums of their absolute deviations. SAD achieves its minimum of 1.80 for the line
that passes through the first and last data points.

Figure 4.14. Ten of the 21 possible least absolute deviation lines for seven observations
from Galton's pea diameter data, with SAD values ranging from 1.80 to 7.50. The line
that minimizes Σ_{i=1}^n |ê_i| passes through the first and last data points: ŷ_i = 9.8 + 0.366x.
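
The brute-force search itself is only a few lines of SAS/IML. The sketch below uses the seven
Galton points of Figure 4.14; the offspring values shown are the ones commonly quoted with
this data set, and they reproduce the fitted line ŷ = 9.8 + 0.366x and the minimum SAD of
1.80 reported above.

proc iml;
  x = {15, 16, 17, 18, 19, 20, 21};                /* parent diameters    */
  y = {15.3, 16.0, 15.6, 16.3, 16.0, 17.3, 17.5};  /* offspring diameters */
  n = nrow(x);
  best = 1e15;  b0 = .;  b1 = .;
  do i = 1 to n-1;                      /* all pairs of data points       */
    do j = i+1 to n;
      s = (y[j]-y[i]) / (x[j]-x[i]);    /* line through points i and j    */
      a = y[i] - s*x[i];
      sad = sum(abs(y - a - s*x));      /* sum of absolute deviations     */
      if sad < best then do;  best = sad;  b0 = a;  b1 = s;  end;
    end;
  end;
  print b0 b1 best;                     /* yields 9.8, 0.3667, and 1.80   */
quit;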

This brute-force approach is computationally expensive, since the total number of models
being fit is large unless either n or k is small. The number of models that must be evaluated
with this method is

    \binom{n}{k} = \frac{n!}{k!\,(n-k)!}.

Fitting a four-regressor model with intercept to a data set with n = 30 observations requires the
evaluation of 142,506 models. In practice we rely on iterative algorithms to reduce the
number of evaluations. One such algorithm is discussed in §A4.8.5 and implemented in the
SAS® macro %LAD() (\SASMacro\L1_Regression.sas on CD-ROM). One can also fit least
absolute deviation regressions with the SAS/IML® function lav().
To test the hypothesis H₀: Aβ = d in L1-Regression we use an approximate test analo-
gous to the sum of squares reduction test in ordinary least squares (§4.2.3 and §A4.8.2). If
SAD_f is the sum of absolute deviations for the full model and SAD_r is the sum of absolute devia-
tions for the model reduced under the hypothesis, then

    F_obs = ((SAD_r − SAD_f)/q) / (λ̂/2)    [4.42]

is an approximate F statistic with q numerator and n − k denominator degrees of freedom.
Here, q is the rank of A and k is the rank of X. An estimator of λ suggested by McKean and
Schrader (1987) is described in §A4.8.5. Since the asymptotic variance-covariance matrix of
β̂_L is Var[β̂_L] = λ(X′X)⁻¹, one can construct an approximate test of H₀: β_j = 0 as

    t_obs = β̂_j / c_jj,

where c_jj is the square root of the jth diagonal element of λ̂(X′X)⁻¹, the estimated standard
error of β̂_j. L1-Regression is applied to the Prediction Efficiency data in §4.6.3.

4.6.2 M-Estimation

Box 4.9 M-Estimation

• M-Estimation is an estimation technique robust to outlying observations.

• The idea is to curtail residuals that exceed a certain threshold, restricting
the influence of observations with large residuals on the parameter
estimates.

M-Estimation was introduced by Huber (1964, 1973) as a robust technique for estimating
location parameters (means) in data sets containing outliers. It can also be applied to the esti-
mation of parameters in the mean function of a regression model. The idea of M-Estimation is
simple (additional details in §A4.8.6). In least squares the objective function can be written as

    Q = (1/σ²) Σ_{i=1}^n (y_i − x_i′β)² = Σ_{i=1}^n (e_i/σ)²,    [4.43]

a function of the squared residuals. If the contribution of large residuals e_i to the objective
function Q can be reduced, the estimates should be more robust to extreme deviations. In M-
Estimation we minimize the sum of a function ρ(•) of the residuals, the function being chosen
so that large residuals are properly weighted. The objective function for M-Estimation can be
written as

    Q* = Σ_{i=1}^n ρ((y_i − x_i′β)/σ) = Σ_{i=1}^n ρ(e_i/σ) = Σ_{i=1}^n ρ(u_i).    [4.44]

Least squares estimation is a special case of M-Estimation with ρ(u) = u², and maximum
likelihood estimators are obtained for ρ(u) = −ln{f(u)}, where f(u) is the probability den-
sity function of the model errors. The name M-Estimation derives from this relationship
with maximum likelihood estimation.

The estimating equations of the minimization problem can be written as

    Σ_{i=1}^n ψ(û_i)x_i = 0,    [4.45]

where ψ(u) is the first derivative of ρ(u). An iterative algorithm for this problem is discussed
in our §A4.8.6 and in Holland and Welsch (1977), Coleman et al. (1980), and Birch and
Agard (1993). Briefly, the algorithm rests on rewriting the psi function as ψ(u) =
(ψ(u)/u)·u and letting w = ψ(u)/u. The estimation problem is then expressed as a weighted
least squares problem with weights w. In order for the estimates to be unique, consistent, and
asymptotically Gaussian, ρ(u) must be a convex function, so that its derivative ψ(u) is mono-
tonic.
/3# l/3 l Ÿ 5
3a/3 b œ œ [4.46]
#5l/3 l  5 # l/3 l ž 5 .

If a residual is less than some number 5 in absolute value it is retained; otherwise it is cur-
tailed to #5l/3 l  5 # . This truncation function is a compromise between using squared
residuals if they are small, and a value close to l/3 l if the residual is large. It is a compromise
between least squares and P" -Estimation (Figure 4.15) combining the efficiency of the former
with the robustness of the latter. Large residuals are not curtailed to l/3 l but #5l/3 l  5 # to
ensure that 3a/b is convex (Huber 1964). Huber (1981) suggested to choose 5 œ "Þ& ‚ 5 s
where 5 s is an estimate of the standard deviation. In keeping with the robust/resistant theme
one can choose the median absolute deviation 5 s œ "Þ%)#' ‚ medianÐl/ s3 lÑ or its rescaled
version (Birch and Agard 1993)
5
s œ "Þ%)#' ‚ medianal/
s3  medianas/3 blb.

The value 5 œ "Þ$%&s 5 was suggested by Holland and Welsch (1977). If the data are
Gaussian-distributed this leads to M-estimators with relative efficiency of *&% compared to
least squares estimators. For non-Gaussian data prone to outliers, "Þ$%&s 5 yields more
efficient estimators than "Þ&s
5.

Figure 4.15. Residual transformations in least squares (ρ(e) = e²), L1-Estimation
(ρ(e) = |e|), and M-Estimation ([4.46] for k = 1.0, 1.5, 2.0).

Since <a?
sb serves as a weight of a residual, it is sometimes termed the residual weighing
function and presents an alternative way of defining the residual transformation in M-
Estimation. For Huber's transformation with tuning constant 5 œ "Þ$%&s 5 the weighing
function is

© 2003 by CRC Press LLC


Ú/3 l/3 l Ÿ 5
<a/3 b œ Û  5 /3   5
Ü 5 /3 ž 5 .

A weighing function due to Andrews et al. (1972) is

$$\psi(e_i) = \begin{cases} \sin(e_i/k) & |e_i| \le k\pi \\ 0 & |e_i| > k\pi \end{cases}, \qquad k = 2.1,$$

while Hampel (1974) suggested a step function

$$\psi(e_i) = \begin{cases} e_i & |e_i| \le a \\ a\,\mathrm{sign}(e_i) & a < |e_i| \le b \\ a\big(c\,\mathrm{sign}(e_i) - e_i\big) & b < |e_i| \le c \\ 0 & |e_i| > c. \end{cases}$$

The user must set values for the tuning constants $a$, $b$, and $c$, similar to the choice of $k$ in other weighing functions. Beaton and Tukey (1974) define the biweight M-estimate through the weight function

$$\psi(e_i) = \begin{cases} e_i\left(1 - \dfrac{e_i^2}{k^2}\right)^2 & |e_i| \le k \\ 0 & |e_i| > k \end{cases}$$

and suggest the tuning constant $k = 4.685$. Whereas Huber's weighing function is an example of a monotonic function, the Beaton and Tukey $\psi$ function is a redescending function, which is preferred if the data are contaminated with gross outliers. A weight function that produces maximum likelihood estimates for Cauchy-distributed data (a $t$ distribution with a single degree of freedom) is

$$\psi(e_i) = 2e_i\big/\big(1 + e_i^2\big).$$

The Cauchy distribution is symmetric but much heavier in the tails than the Gaussian distribution and permits more extreme observations than the Gaussian model. A host of other weighing functions has been proposed in the literature. See Hampel et al. (1986) and Barnett and Lewis (1994) for a more detailed description.
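The implied case weights $w = \psi(e)/e$ make the differences among these proposals easy to inspect. The short DATA step below is our own illustration; it tabulates the Huber ($k = 1.345$) and Beaton-Tukey biweight ($k = 4.685$) weights over a grid of residuals. Plotting w_huber and w_tukey against e displays the monotone versus redescending behavior.

data wfun;
   k_hub = 1.345;  k_bi = 4.685;
   do e = -6 to 6 by 0.1;
      /* Huber: full weight inside [-k,k], k/|e| outside */
      if abs(e) <= k_hub then w_huber = 1;
      else w_huber = k_hub / abs(e);
      /* biweight: (1-(e/k)**2)**2 inside [-k,k], zero outside */
      if abs(e) <= k_bi then w_tukey = (1 - (e/k_bi)**2)**2;
      else w_tukey = 0;
      output;
   end;
run;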
Hypothesis tests and confidence intervals for the elements of $\boldsymbol{\beta}$ in M-Estimation rely on the asymptotic Gaussian distribution of $\hat{\boldsymbol{\beta}}_M$. For finite sample sizes the $p$-values of the tests are only approximate. Several tests have been suggested in the literature. To test the general linear hypothesis, $H_0\!: \mathbf{A}\boldsymbol{\beta} = \mathbf{d}$, an analog of the sum of squares reduction test can be used. The test rests on comparing the sum of transformed residuals $\big(STR = \sum_{i=1}^{n}\rho(\hat{e}_i)\big)$ in a full and reduced model. If $STR_r$ denotes this sum in the reduced and $STR_f$ in the full model, the test statistic is

$$F_{obs} = \frac{(STR_r - STR_f)/q}{\hat{\tau}}, \qquad [4.47]$$

where $q$ is the rank of $\mathbf{A}$ and $\hat{\tau}$ is an estimate of error variability playing a similar role in M-Estimation as $\hat{\sigma}^2$ in least squares estimation. If $s_i = \max\{-k, \min\{e_i, k\}\}$ then

$$\hat{\tau} = \frac{n}{m(n-k)}\sum_{i=1}^{n}s_i^2. \qquad [4.48]$$

This variation of the sum of squares reduction test is due to Schrader and McKean (1977) and Schrader and Hettmansperger (1980). $P$-values are approximated as $\Pr\big(F_{obs} > F_{q,n-k}\big)$. The test of Schrader and Hettmansperger (1980) does not adjust the variance of the M-estimates for the fact that they are weighted unequally as one would in weighted least squares. This is justifiable since the weights are random, whereas they are considered fixed in weighted least squares. To test the general linear hypothesis $H_0\!: \mathbf{A}\boldsymbol{\beta} = \mathbf{d}$, we prefer a test proposed by Birch and Agard (1993) which is a direct analog of the $F$ test in Gaussian linear models with unequal variances. The test statistic is

$$F_{obs} = (\mathbf{A}\hat{\boldsymbol{\beta}} - \mathbf{d})'\Big[\mathbf{A}\big(\mathbf{X}'\mathbf{W}\mathbf{X}\big)^{-1}\mathbf{A}'\Big]^{-1}(\mathbf{A}\hat{\boldsymbol{\beta}} - \mathbf{d})\Big/\big(qs^2\big),$$

where $q$ is the rank of $\mathbf{A}$ and $s^2$ is an estimator of $\sigma^2$ proposed as

$$s^2 = \frac{n}{n-p}\,\hat{\sigma}^2\,\frac{n^{-1}\sum_{i=1}^{n}\psi(\hat{u}_i)^2}{\Big(n^{-1}\sum_{i=1}^{n}\psi'(\hat{u}_i)\Big)^2}. \qquad [4.49]$$

Here, $\psi'(u)$ is the first derivative of $\psi(u)$ and the diagonal weight matrix $\mathbf{W}$ has entries $\psi(\hat{u})/\hat{u}$. Through simulation studies it was shown that this test has very appealing properties with respect to size (significance level) and power. Furthermore, if $\psi(u) = u$, which results in least squares estimates, $s^2$ is the traditional residual mean square error estimate.
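For Huber's $\psi$, the pieces of [4.49] are simple to evaluate. The SAS/IML® fragment below is a sketch that assumes the scaled residuals, the scale estimate, and the parameter count $p$ are already available from a fit; the numeric values shown are made-up examples.

proc iml;
   /* assumed inputs: uhat (scaled residuals e/sigma-hat), sigma2, p */
   uhat   = {0.3, -1.9, 0.8, 2.6, -0.4, 1.1, -0.2};   /* example values */
   sigma2 = 1;  p = 3;  k = 1.345;
   n    = nrow(uhat);
   psi  = choose(abs(uhat) <= k, uhat, k # sign(uhat));  /* Huber psi(u) */
   dpsi = choose(abs(uhat) <= k, 1, 0);                  /* psi'(u)      */
   s2   = (n/(n-p)) # sigma2 # (sum(psi##2)/n) / (sum(dpsi)/n)##2;
   print s2;
quit;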

4.6.3 Robust Regression for Prediction Efficiency Data


When fitting a quadratic polynomial to the Prediction Efficiency data by least squares and invoking appropriate model diagnostics, it was noted that the data point corresponding to the Magnesium observation was an outlier (see Table 4.13, p. 130). There is no evidence suggesting that the particular observation is due to a measurement or execution error and we want to retain it in the data set. The case deletion diagnostics in Table 4.13 suggest that the model fit may change considerably, depending on whether the data point is included or excluded. We fit the quadratic polynomial to these data with several methods: ordinary least squares with all data points (OLS$_{all}$), ordinary least squares with the Mg observation deleted (OLS$_{-Mg}$), $L_1$-Regression, and M-Regression based on Huber's residual transformation [4.46].
M- and $L_1$-Estimation are implemented with various SAS® macros contained on the companion CD-ROM. Three of the macros require an installation of the SAS/IML® module. These are %MRegress(), %MTwoWay(), and %LAD(). A fourth, general-purpose macro, %MEstim(), performs M-Estimation in any linear model and does not require SAS/IML®. %MRegress() in file \SASMacros\M_Regression.sas is specifically designed for regression models. It uses either the Huber weighing function or Beaton and Tukey's biweight function. The macro %MTwoWay() contained in file \SASMacros\M_TwoWayAnalysis.sas implements M-Estimation for the analysis of two-way layouts without replications (a randomized block design, for example). It also uses either the Huber or Tukey weighing functions. Finally, least absolute deviation regression is implemented in macro %LAD() (\SASMacros\L1_Regression.sas).

The most general macro to fit linear models by M-Estimation is macro %MEstim() (\SASMacros\MEstimation.sas). This macro is considerably more involved than %MRegress() or %MTwoWay() and thus executes slightly slower. On the upside it can fit regression and classification models alike, does not require SAS/IML®, and allows choosing from nine different residual weighing functions, user-modified tuning constants, tests of main effects and interactions, multiple comparisons, interaction slices, contrasts, etc. The call to the macro and its arguments mimic statements and options of the mixed procedure of The SAS® System. The macro output is also formatted similar to that produced by the mixed procedure.

To fit the quadratic model to the Prediction Efficiency data by least squares (with and without outlier), M-Estimation, and $L_1$-Estimation, we use the following statements.

title 'Least Squares regression line with outlier';


proc reg data=range;
model eff30 = range range2;
run; quit;

title 'Least Squares regression line without outlier';


proc reg data=range(where=(covar ne 'Mg'));
model eff30 = range range2;
run; quit;

%include 'DriveLetterOfCDROM:\SASMacros\MEstimation.sas';
/* M-Estimation */
%MEstim(data=range,
stmts=%str(model eff30 = range range2 /s;) )
proc print data=_predm;
var eff30 _wght Pred Resid StdErrPred Lower Upper;
run;

%include 'DriveLetterOfCDROM:\SASMacros\L1_Regression.sas';
title 'L1-Regression for Prediction Efficacy Data';
%LAD(data=range,y=eff30,x=range range2);

P" -estimates can also be calculated with the LAV() function call of the SAS/IML®
module which provides greater flexibility in choosing the method for determining standard
errors and tailoring of output than the %LAD() macro. The code segment
proc iml;
use range; read all var {eff30} into y;
read all var {range range2} into x;
close range;
x = J(nrow(x),1,1) || x;
opt = {. 3 0 . };
call lav(rc,xr,X,y,,opt);
quit;

produces the same analysis as the previous call to the %LAD() macro.
Output 4.12 reiterates the strong negative influence of the Mg observation. When it is contained in the model, the residual sum of squares is $SSR = 511.52$, the error mean square estimate is $MSE = 127.881$, and neither the linear nor the quadratic term is (partially) significant at the 5% level. Removing the Mg observation changes things dramatically. The mean square error estimate is now $MSE = 12.306$ and the linear and quadratic terms are (partially) significant.



Output 4.12.
Least Squares regression with outlier

The REG Procedure


Model: MODEL1
Dependent Variable: eff30

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F
Model 2 854.45840 427.22920 3.34 0.1402
Error 4 511.52342 127.88086
Corrected Total 6 1365.98182

Root MSE 11.30844 R-Square 0.6255


Dependent Mean 25.04246 Adj R-Sq 0.4383
Coeff Var 45.15707

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 -39.12105 37.76100 -1.04 0.3587


range 1 1.39866 1.00851 1.39 0.2378
range2 1 -0.00633 0.00609 -1.04 0.3574

Least Squares regression without outlier

The REG Procedure


Model: MODEL1
Dependent Variable: eff30

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F
Model 2 1280.23575 640.11788 52.02 0.0047
Error 3 36.91880 12.30627
Corrected Total 5 1317.15455

Root MSE 3.50803 R-Square 0.9720


Dependent Mean 26.12068 Adj R-Sq 0.9533
Coeff Var 13.43008

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 -75.10250 13.06855 -5.75 0.0105


range 1 2.41407 0.35300 6.84 0.0064
range2 1 -0.01198 0.00210 -5.72 0.0106

The M-estimates of the regression coefficients are $\hat{\beta}_{0M} = -65.65$, $\hat{\beta}_{1M} = 2.126$, and $\hat{\beta}_{2M} = -0.0103$ (Output 4.13), values similar to the OLS estimates after the Mg observation has been deleted. The estimate of the residual variability $(14.256)$ is similar to the mean square error estimate based on OLS estimates in the absence of the outlying observation. Of particular interest is the printout of the data set _predm that is generated automatically by the macro (Output 4.13). It contains the raw fitted residuals $\hat{e}_i$ and the weights of the observations in the weighted least squares algorithm that underpins estimation. The Mg observation with a large residual of $\hat{e}_4 = -24.3204$ receives the smallest weight in the analysis (_wght=0.12808).

Output 4.13.
Results - The MEstim Macro - Author: Oliver Schabenberger
Model Information

Data Set WORK.RANGE


Dependent Variable EFF30
Weighing Function HUBER
Covariance Structure Variance Components
Estimation Method REML
Residual Variance Method Parameter
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Between-Within
Weighing constant 1.345

Dimensions

Covariance Parameters 1
Columns in X 3
Columns in Z 0
Subjects 7
Max Obs Per Subject 1
Observations Used 7
Observations Not Used 0
Total Observations 7

Fit Statistics

OLS Residual variance 127.8809


Rescaled MAD 2.31598
Birch and Agard estimate 14.25655
Observations used 7
Sum of weights (M) 5.605905
Sum of residuals (M) -17.8012
Sum of abs. residuals (M) 37.02239
Sum of squ. residuals (M) 643.6672
Sum of residuals (OLS) -142E-15
Sum of abs. residuals (OLS) 47.88448
Sum of squ. residuals (OLS) 511.5234

Covariance Parameters

Parameter Estimate
Residual 14.2566

Type 3 Tests of Fixed Effects

Num Den
Effect DF DF F Value Pr > F
range 1 4 30.93 0.0051
range2 1 4 20.31 0.0108



Output 4.13 (continued).
Solution for Fixed Effects

Standard
Effect Estimate Error DF t Value Pr > |t|

Intercept -65.6491 14.0202 4 -4.68 0.0094


range 2.1263 0.3823 4 5.56 0.0051
range2 -0.01030 0.002285 4 -4.51 0.0108

StdErr
Obs eff30 _wght Pred Resid Pred Lower Upper

1 25.2566 1.00000 24.3163 0.9403 2.03354 18.6703 29.9623


2 40.1652 1.00000 39.0396 1.1256 3.70197 28.7613 49.3179
3 33.7026 0.47782 27.1834 6.5191 2.17049 21.1572 33.2097
4 18.5732 0.12808 42.8935 -24.3204 2.82683 35.0450 50.7421
5 15.3946 1.00000 14.3691 1.0256 2.03142 8.7289 20.0092
6 -0.0666 1.00000 2.4030 -2.4696 3.08830 -6.1714 10.9775
7 42.2717 1.00000 42.8935 -0.6218 2.82683 35.0450 50.742

Notice that the raw residuals sum to zero in the ordinary least squares analysis but not in M-Estimation, due to the fact that residuals are weighted unequally. The results of $L_1$-Estimation of the model parameters are shown in Output 4.14. The estimates of the parameters are very similar to the M-estimates.

Output 4.14.
Least Absolute Deviation = L1-Norm Regression
Author: Oliver Schabenberger
Model: eff30 = intcpt range range2

Data Set range


Number of observations 7
Response eff30
Covariates range range2

LAD Regression Results

Median of response 1366.302766


Sum abs. residuals full model 35.363442
Sum abs. residuals null model 82.238193
Sum squared residuals full m. 624.274909
Tau estimate 15.085416

Parameter Estimates and Standard Errors

Estimate Std.Err T-Value Pr(> |T|)

INTERCPT -57.217612 50.373024 . .


RANGE 1.917134 1.345349 1.425 0.227280
RANGE2 -0.009095 0.008121 -1.120 0.325427



The estimates of the various fitting procedures are summarized in Table 4.21. The coefficients for $L_1$-, M-Regression, and OLS$_{-Mg}$ are quite similar, but the former methods retain all observations in the analysis.

Table 4.21. Estimated coefficients for quadratic polynomial $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + e_i$ for Prediction Efficacy data

                   OLS_all    OLS_-Mg    L1-Regression   M-Regression
  beta0-hat        -39.121    -75.103      -57.217         -65.649
  beta1-hat          1.398      2.414        1.917           2.126
  beta2-hat         -0.0063    -0.0119      -0.009          -0.010
  n                    7          6            7               7
  SSR               511.523     36.919      624.275         643.667
  SAD                47.884       --         35.363          37.022

Among the fits based on all observations $(n = 7)$, the residual sum of squares is minimized for ordinary least squares regression, as it should be. By the same token, $L_1$-Estimation yields the smallest sum of absolute deviations ($SAD$). M-Regression with Huber's weighing function has a residual sum of squares slightly larger than the least absolute deviation regression. Notice that M-estimates are obtained by weighted least squares, and $SSR$ is not the criterion being minimized. The sum of absolute deviations of M-Estimation is smaller than that of ordinary least squares and only slightly larger than that of the $L_1$ fit.

The similarity of the predicted trends for OLS$_{-Mg}$, $L_1$-, and M-Regression is apparent in Figure 4.16. While the Mg observation pulls the least squares regression line toward it, $L_1$- and M-estimates are not greatly affected by it. The predictions for $L_1$- and M-Regression are hard to distinguish; they are close to least squares predictions obtained after outlier deletion but do not require removal of the offending data point. The $L_1$-Regression passes through three data points since the model contains two regressors and an intercept. The data points are CEC, Prec, and P.

[Figure: Prediction Efficacy plotted against Range of Spatial Correlation (30 to 130), with fitted curves for OLS with and without the Mg outlier and for M- and $L_1$-Norm regression; data points labeled CEC, Prec, P, Ca, pH, Mg, and lime.]

Figure 4.16. Fitted and predicted values in prediction efficacy example for ordinary least squares with and without outlying Mg observation, $L_1$- and M-Regression.


4.6.4 M-Estimation in Classification Models
Analysis of variance (classification) models that arise in designed experiments often have
relatively few observations. An efficiently designed experiment will use as few replications as
necessary to maintain a desirable level of test power, given a particular error-control and
treatment design. As a consequence, error degrees of freedom in experimental designs are
often small. A latin square design with four treatments, for example, has only six degrees of
freedom for the estimation of experimental error variability. Removal of outliers from the
data further reduces the degrees of freedom and the loss of observations is particularly
damaging if only few degrees of freedom are available to begin with. It is thus not only with
an eye toward stabilizing the estimates of block, treatment, and other effects that robust
analysis of variance in the presence of outliers is important. Allowing outliers to remain in the
data but appropriately downweighing their influence retains error degrees of freedom and
inferential power. In this subsection, outlier-resistant M-Estimation is demonstrated for
analysis of variance models using the %MEstim() macro written for The SAS® System. No
further theoretical details are needed in this subsection beyond the discussion in §4.6.2 and
the mathematical details in §A4.8.6 except for taking into account that the design matrix in
analysis of variance models is rank-deficient. We discuss M-Estimation in classification
models by way of example.

Example 4.3 Dollar Spot Counts (continued). Median polishing of the dollar spot count data suggested to analyze the log-transformed counts. We apply M-Estimation here with Huber's weighing function and the tuning constant 1.345. These are default settings of %MEstim(). The statements
%include 'DriveLetter:\SASMacros\MEstimation.sas';
%MEstim(data=dollarspot,
stmts=%str(
class block tmt;
model lgcnt = block tmt;
lsmeans tmt / diff;) );

ask for a robust analysis of the log-transformed dollar spot counts in the block design
and pairwise treatment comparisons (lsmeans tmt / diff).

Output 4.15. The MEstim Macro - Author: Oliver Schabenberger

Model Information
Data Set WORK.DOLLSPOT
Dependent Variable LGCNT
Weighing Function HUBER
Covariance Structure Variance Components
Estimation Method REML
Residual Variance Method Parameter
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Between-Within
Weighing constant 1.345

Class Level Information


Class Levels Values
BLOCK 4 1 2 3 4
TMT 14 1 2 3 4 5 6 7 8 9 10 11 12 13 14



Output 4.15 (continued).
Dimensions
Covariance Parameters 1
Columns in X 19
Columns in Z 0
Subjects 56
Max Obs Per Subject 1
Observations Used 56
Observations Not Used 0
Total Observations 56

Fit Statistics
OLS Residual variance 0.161468
Rescaled MAD 0.38555
Birch and Agard estimate 0.162434
Observations used 56
Sum of weights (M) 55.01569
Sum of residuals (M) 0.704231
Sum of abs. residuals (M) 15.68694
Sum of squ. residuals (M) 6.336998
Sum of residuals (OLS) -862E-16
Sum of abs. residuals (OLS) 15.89041
Sum of squ. residuals (OLS) 6.297243

Covariance Parameters
Parameter Estimate
Residual 0.1624

Type 3 Tests of Fixed Effects


Num Den
Effect DF DF F Value Pr > F
BLOCK 3 39 0.78 0.5143
TMT 13 39 7.38 <.0001

M-Estimated Least Square Means


Standard
Obs Effect TMT Estimate Error DF t Value Pr > |t|
1 TMT 1 3.2720 0.2015 39 16.24 <.0001
2 TMT 2 4.2162 0.2015 39 20.92 <.0001
3 TMT 3 3.3113 0.2015 39 16.43 <.0001
4 TMT 4 3.5663 0.2058 39 17.33 <.0001
5 TMT 5 3.9060 0.2015 39 19.38 <.0001
6 TMT 6 3.8893 0.2076 39 18.73 <.0001
7 TMT 7 3.7453 0.2015 39 18.59 <.0001
8 TMT 8 3.9386 0.2015 39 19.54 <.0001
... and so forth ...

Differences of M-Estimated Least Square Means


Standard
Effect TMT _TMT Estimate Error DF t Value Pr > |t|
TMT 1 2 -0.9442 0.2850 39 -3.31 0.0020
TMT 1 3 -0.03931 0.2850 39 -0.14 0.8910
TMT 1 4 -0.2943 0.2880 39 -1.02 0.3132
TMT 1 5 -0.6340 0.2850 39 -2.22 0.0320
TMT 1 6 -0.6173 0.2893 39 -2.13 0.0392
TMT 1 7 -0.4733 0.2850 39 -1.66 0.1048
TMT 1 8 -0.6666 0.2850 39 -2.34 0.0245
TMT 1 9 -1.0352 0.2850 39 -3.63 0.0008
TMT 1 10 -1.5608 0.2894 39 -5.39 <.0001
TMT 1 11 -1.5578 0.2850 39 -5.47 <.0001
TMT 1 12 -0.1448 0.2921 39 -0.50 0.6229
TMT 1 13 -0.05717 0.2850 39 -0.20 0.8420



Output 4.15 (continued).
TMT 1 14 0.1447 0.2850 39 0.51 0.6145
TMT 2 3 0.9049 0.2850 39 3.18 0.0029
TMT 2 4 0.6499 0.2880 39 2.26 0.0297
TMT 2 5 0.3102 0.2850 39 1.09 0.2831
TMT 2 6 0.3269 0.2893 39 1.13 0.2654
TMT 2 7 0.4709 0.2850 39 1.65 0.1065
TMT 2 8 0.2776 0.2850 39 0.97 0.3360
TMT 2 9 -0.09097 0.2850 39 -0.32 0.7513
TMT 2 10 -0.6166 0.2894 39 -2.13 0.0395
TMT 2 11 -0.6136 0.2850 39 -2.15 0.0376
TMT 2 12 0.7994 0.2921 39 2.74 0.0093
TMT 2 13 0.8870 0.2850 39 3.11 0.0035
TMT 2 14 1.0889 0.2850 39 3.82 0.0005
TMT 3 4 -0.2550 0.2880 39 -0.89 0.3814
... and so forth ...

The least squares estimates minimize the sum of squared residuals (6.297, Table 4.22). The corresponding value in the robust analysis is very close, as are the $F$ statistics for the treatment effects (7.60 and 7.38).

The treatment estimates are also quite close. Because of the orthogonality and the constant weights in the least squares analysis, the standard errors of the treatment effects are identical. In the robust analysis, the standard errors depend on the weights, which differ from observation to observation depending on the size of their residuals. The subtle difference in precision is exhibited in the standard error of the estimate for treatment 6. Four observations are downweighed substantially in the robust analysis. In particular, treatment 6 in block 2 ($y_{26} = 99$, Table 4.3) receives weight 0.77 and treatment 12 in block 1 ($y_{1,12} = 63$) receives weight 0.64. These observations have not been identified as potential outliers before. Combined with treatment 10 in block 4, these observations accounted for three of the four largest median polished residuals.

Table 4.22. Analysis of variance for Dollar Spot log(counts) by least squares and M-Estimation (Huber's weighing function)

                                    Least Squares    M-Estimation
  Sum of squared residuals              6.29             6.34
  Sum of absolute residuals            15.89            15.69
  F_obs for treatment effect            7.60             7.38
  p-value for treatment effect        <0.0001          <0.0001
  Treatment Estimates (Std.Err)
    1                                3.27 (0.20)      3.27 (0.20)
    2                                4.21 (0.20)      4.21 (0.20)
    3                                3.31 (0.20)      3.31 (0.20)
    4                                3.59 (0.20)      3.56 (0.20)
    5                                3.90 (0.20)      3.90 (0.20)
    6                                3.93 (0.20)      3.88 (0.21)
    ...                                  ...              ...



[Figure: grid of pairwise comparisons among treatments 1 to 14, marking which comparisons are significant under least squares and under M-Estimation.]

Figure 4.17. Results of pairwise treatment comparisons in robust ANOVA and ordinary least squares estimation. Dots reflect significance of treatment comparison at the 5% level in ANOVA, circles significance at the 5% level in M-Estimation.

The robust analysis downweighs the influence of $y_{26} = 99$ whereas the least squares analysis weighs all observations equally. Such differences, although small, can have a measurable impact on treatment comparisons. At the 5% significance level the least squares and robust analyses agree closely. However, treatments 3 and 6 as well as 6 and 13 are significantly different in the least squares analysis and not so in the robust analysis (Figure 4.17).

The detrimental effect of outlier deletion in experimental designs with few degrees of freedom to spare can be demonstrated with the following split-plot design. The whole-plot design is a randomized complete block design with four treatments in two blocks. Each whole-plot is then subdivided into three sub-plots to which the levels of the sub-plot treatment factor are randomly assigned. The analysis of variance of this design contains separate error terms for whole- and sub-plot factors. Tests of whole-plot effects use the whole-plot error, which has only 3 degrees of freedom (Table 4.23).

Table 4.23. Analysis of variance for split-plot design

  Source        Degrees of freedom
  Block                 1
  A                     3
  Error(A)              3
  B                     2
  A x B                 6
  Error(B)              8
  Total                23

The sub-plot error is associated with 8 degrees of freedom. Now assume that the observations from one of the eight whole-plots appear errant. It is surmised that either the experimental unit is not comparable to other units, that the treatment was applied incorrectly, or that measurement errors were committed. Removing the three observations for the whole-plot changes the analysis of variance (Table 4.24).

Table 4.24. Analysis of variance for split-plot design after removal of one whole-plot

  Source        Degrees of freedom
  Block                 1
  A                     3
  Error(A)              2
  B                     2
  A x B                 6
  Error(B)              6
  Total                20

At the 5% significance level the critical value in the $F$-test for the whole-plot (A) main effect is $F_{0.05,3,3} = 9.28$ in the complete design and $F_{0.05,3,2} = 19.16$ in the design with a lost whole-plot. The test statistic $F_{obs} = MS(A)/MS(\mathrm{Error}(A))$ must double in value to find a significant difference among the whole-plot treatments. An analysis which retains extreme observations and reduces their impact is to be preferred.

Example 4.6. The data from a split-plot design with four whole-plot treatments arranged in a randomized complete block design with three blocks and three sub-plot treatments are given in the table below.

Table 4.25. Observations from split-plot design

           Whole-Plot       Sub-Plot Treatment
  Block    Treatment       B1      B2      B3
    1         A1          3.8     5.3     6.2
              A2          5.2     5.6     5.4
              A3          6.0     5.6     7.6
              A4          6.5     7.1     7.7
    2         A1          3.9     5.4     4.5
              A2          6.0     6.1     6.2
              A3          7.0     6.4     7.4
              A4          7.4     8.3     6.9
    3         A1          4.9     6.4     3.5
              A2          4.6     5.0     6.4
              A3          6.8     6.2     7.7
              A4          4.7     7.3     7.4

It is assumed that the levels of factor A are quantitative and equally spaced. Apart from tests for main effects and interactions, we are interested in testing for trends of the response with levels of A. We present a least squares analysis of variance and a robust analysis of variance using all the data. The linear statistical model underlying this analysis is

$$Y_{ijk} = \mu + \rho_i + \alpha_j + d_{ij} + \beta_k + (\alpha\beta)_{jk} + e_{ijk},$$

where $\rho_i$ denotes the block effect $(i = 1,\dots,3)$, $\alpha_j$ the main effects of factor A $(j = 1,\dots,4)$, $d_{ij}$ is the random whole-plot experimental error with mean 0 and variance $\sigma_d^2$, $\beta_k$ are the main effects of factor B $(k = 1,2,3)$, $(\alpha\beta)_{jk}$ are the interaction effects, and $e_{ijk}$ is the random sub-plot experimental error with mean 0 and variance $\sigma_e^2$.

The tabulated data do not suggest any data points as problematic. A graphical display adds more insight. The value 4.7 for the fourth level of factor A and the first level of B in the third block appears suspiciously small compared to the remainder of the data for the whole-plot factor level (Figure 4.18).

[Figure: observed responses (roughly 3 to 9) plotted against whole-plot factor level $A_1$ to $A_4$, with points labeled $B_1$ to $B_3$; the $A_4B_1$ value of 4.7 in block 3 stands apart from the rest of its group.]

Figure 4.18. Data in split-plot design. Labels $B_1,\dots,B_3$ in the graph area are drawn at the value of the response for the particular combination of factors A (horizontal axis) and B. Values from block 1 are underlined, appear in regular type for block 2, and are italicized for block 3.

Split-plot designs are special cases of mixed models (§7) that are best fit with the mixed procedure of The SAS® System. The standard and robust analyses using Huber's weight function with tuning constant 1.345 are produced with the statements

proc mixed data=spd;


class block a b;
model y = block a b a*b ;
random block*a;
lsmeans a b a*b / diff;
lsmeans a*b / slice=(a b);
contrast 'A cubic @ B1' a -1 3 -3 1 a*b -1 0 0 3 0 0 -3 0 0 1 0 0;
contrast 'A quadr.@ B1' a 1 -1 -1 1 a*b 1 0 0 -1 0 0 -1 0 0 1 0 0;
contrast 'A linear@ B1' a -3 -1 1 3 a*b -3 0 0 -1 0 0 1 0 0 3 0 0;



/* and so forth for contrasts @ B2 and @B3 */
run;

/* Robust Analysis */
%include 'DriveLetter:\SASMacros\MEstimation.sas';
%MEstim(data=spd,
stmts=%str(class block a b;
model y = block a b a*b ;
random block*a;
parms / nobound;
lsmeans a b a*b / diff;
lsmeans a*b / slice=(a b);
contrast 'A cubic @ B1' a -1 3 -3 1
a*b -1 0 0 3 0 0 -3 0 0 1 0 0;
contrast 'A quadr.@ B1' a 1 -1 -1 1
a*b 1 0 0 -1 0 0 -1 0 0 1 0 0;
contrast 'A linear@ B1' a -3 -1 1 3
a*b -3 0 0 -1 0 0 1 0 0 3 0 0;
/* and so forth for contrasts @ B2 and @B3 */
),converge=1E-4,fcn=huber );

Table 4.26. Analysis of variance results in split-plot design

                               Least Squares         M-Estimation
                             F Value   p-Value     F Value   p-Value
  Model Effects
    A Main Effect             16.20    0.0028       26.68    0.3312
    B Main Effect              4.21    0.0340        3.93    0.0408
    A x B Interaction          1.87    0.1488        3.03    0.0355
  Trend Contrasts
    Cubic trend @ B1           1.07    0.3153        1.12    0.3051
    Quadratic trend @ B1       2.89    0.1085        2.51    0.1325
    Linear trend @ B1         14.45    0.0016       24.75    0.0001

    Cubic trend @ B2           0.04    0.8517        0.05    0.8198
    Quadratic trend @ B2       3.58    0.0766        5.32    0.0348
    Linear trend @ B2         10.00    0.0060       14.84    0.0014

    Cubic trend @ B3           1.18    0.2925        1.12    0.3055
    Quadratic trend @ B3       3.02    0.1013        6.84    0.0187
    Linear trend @ B3         23.57    0.0002       40.07    0.0001

The interaction is not significant in the least squares analysis ($p = 0.1488$, Table 4.26) and both main effects are. The robust analysis reaches a different conclusion. It indicates a significant $A \times B$ interaction ($p = 0.0355$) and a masked A main effect ($p = 0.3312$). At the 5% significance level the least squares results suggest linear trends of the response in A for all levels of B. A marginal quadratic trend can be noted for $B_2$ ($p = 0.0766$, Table 4.26). The robust analysis concludes a stronger linear effect at $B_1$ and quadratic effects at $B_2$ and $B_3$.



Can these discrepancies between the two analyses be explained by a single observation? Studying the weights $w_i$ in M-Estimation for these data reveals that there are in fact two extreme observations. All weights are one except for $A_4B_1$ in block 3 $(y = 4.7)$ with $w = 0.4611$ and $A_1B_3$ in block 1 $(y = 6.2)$ with $w = 0.3934$. While the first observation was identified as suspicious in Figure 4.18, most analysts would probably have failed to recognize the large influence of the second observation. As can be seen from the magnitude of the weight, its residual is even larger than that of $A_4B_1$ in block 3. The weights can be viewed by printing the data set _predm which is generated by the %MEstim() macro (not shown here).

The estimated treatment cell means are very much the same for both analyses with the exception of $\hat{\mu}_{13}$ and $\hat{\mu}_{41}$ (Figure 4.19). The least squares estimate of $\mu_{13}$ is too large and the estimate of $\mu_{41}$ is too small. For the other treatment combinations the estimates are identical (because of the unequal weights the precision of the treatment estimates is not the same in the two analyses, even if the estimates agree). Subtle differences in estimates of treatment effects contribute to the disagreement in conclusions from the two analyses in Table 4.26. A second source of disagreement is the estimate of experimental error variance. The sub-plot error variance $\sigma_e^2$ was estimated as $\hat{\sigma}^2 = 0.5584$ in the least squares analysis and as 0.3761 in the robust analysis. Comparisons of the treatments whose estimates are not affected by the outliers will be more precise and powerful in the robust analysis.

[Figure: estimated treatment means plotted against whole-plot factor level $A_1$ to $A_4$, with sub-plot labels $B_1$ to $B_3$.]

Figure 4.19. Estimated treatment means $\hat{\mu}_{jk}$ in split-plot design. The center of the circles denotes the estimates in the robust analysis, the labels the location of the estimates in the least squares analysis.

The disagreement does not stop here. Comparing the levels of B at each level of A via slicing (see §4.3.2), the least squares analysis fails to find significant differences among the levels of B at the 5% level for any level of A. The robust analysis detects B effects at $A_1$ and $A_3$ (Table 4.27). The marginally significant slice at level $A_4$ in the least squares analysis $(p = 0.086)$ is due to a greater separation of the means compared to the robust analysis (Figure 4.19).

Table 4.27. P-values for slices in split-plot design

  Factor being            Least Squares        M-Estimation
  compared   at level   F Value   p-value    F Value   p-value
     B          A1        3.11    0.0725       5.38    0.0163
                A2        0.73    0.4972       1.08    0.3618
                A3        3.11    0.0725       4.61    0.0262
                A4        2.87    0.0860       2.14    0.1505

4.7 Nonparametric Regression


Box 4.10   Nonparametric Regression

• Nonparametric regression estimates the conditional expectation $f(x) = \mathrm{E}[Y\,|\,X = x]$ by a smooth but otherwise unspecified function.

• Nonparametric regression is based on estimating $f(x)$ within a local neighborhood of $x$ by weighted averages or polynomials. This process is termed smoothing.

• The trade-off between bias and variance of a smoother is resolved by selecting a smoothing parameter that optimizes some goodness-of-fit criterion.

Consider the case of a single response variable $Y$ and a single covariate $X$. The regression of $Y$ on $X$ is the conditional expectation

$$f(x) = \mathrm{E}[Y\,|\,X = x]$$

and our analysis of the relationship between the two variables so far has revolved around a parametric model for $f(x)$, for example a quadratic polynomial $f(x) = \beta_0 + \beta_1 x + \beta_2 x^2$. Inferences drawn from the analysis are dependent on the model for $f(x)$ being correct. How are we to proceed in situations where the data do not suggest a particular class of parametric models? What can be gleaned about the conditional expectation $\mathrm{E}[Y|x]$ in an exploratory fashion that can aid in the development of a parametric model?

A starting point is to avoid any parametric specification of the mean function and to consider the general model

$$Y_i = f(x_i) + e_i. \qquad [4.50]$$

Rather than placing the onus on the user to select a parametric model for $f(x_i)$, we let the data guide us to a nonparametric estimate of $f(x_i)$.

Example 4.7. Paclobutrazol Growth Response. During the 1995 growing season the
growth regulator Paclobutrazol was applied May 1, May 29, June 29, and July 24 on
turf plots. If turfgrass growth is expressed relative to the growth of untreated plots we
expect a decline of growth shortly after each application of the regulator and increasing
growth as the regulator's effect wears off. Figure 4.20 shows the clipping percentages
removed from Paclobutrazol-treated turf by regular mowing. The amount of clippings
removed relative to the control is a surrogate measure of growth in this application.

The data points show the general trend that is expected: decreased growth shortly after application, with growth recovery before the next application. It is not obvious, however, how to parametrically model the clipping percentages over time. A single polynomial function would require trends of high order to pick up the fluctuations in growth response. One could also fit separate quadratic or cubic polynomials to the intervals $[0,2]$, $[2,3]$, and $[3,4]$ months. Before examining complicated parametric models, a nonparametric smooth of the data can (i) highlight pertinent features of the data, (ii) provide guidance for the specification of possible parametric structures, and (iii) may answer the questions of interest.

[Figure: Clippings (% of Control, 0 to 200) plotted against months (0 to 4), with vertical lines at the four application dates.]

Figure 4.20. Clipping percentages of Paclobutrazol-treated turf plots relative to an untreated control. Vertical lines show times of treatment applications. Data kindly provided by Mr. Ronald Calhoun, Department of Crop and Soil Sciences, Michigan State University. Used with permission.


The result of a nonparametric regression analysis is a smoothed function $\hat{f}(x)$ of the data, and the terminology data smoothing is sometimes used in this context. We point out, however, that parametric regression analysis is also a form of data smoothing. The least smooth description of the data is obtained through interpolation of the data points, by connecting the dots. The parametric regression line, on the other hand, is a very smooth representation of the data. It passes through the center of the data scatter, but not necessarily any particular data pair $(y_i, x_i)$. Part of the difficulty in modeling data nonparametrically lies in determining the appropriate degree of smoothness so that $\hat{f}(x)$ retains the pertinent features of the data while filtering its random fluctuations.

Nonparametric statistical methods are not assumption-free, as is sometimes asserted. $f(x)$ is assumed to belong to some collection of functions, possibly of infinite dimension, that share certain properties. It may be required that $f(x)$ is differentiable, for example. Secondly, the methods discussed in what follows also require that the errors $e_i$ are uncorrelated and homoscedastic $(\mathrm{Var}[e_i] = \sigma^2)$.

4.7.1 Local Averaging and Local Regression


The unknown target function $f(x)$ is the conditional mean of the response if the covariate value is $x$. It is thus a small step to consider as an estimate of $f(x_i)$ some form of average calculated from the observations at $x_i$. If data are replicated such that $r$ responses are observed for each value $x_i$, one technique of estimating $f(x_i)$ is to average the $r$ response values and to interpolate the means. Since this procedure breaks down if $r = 1$, we assume now that each covariate value $x_i$ is unique in the data set and, if replicate measurements were made, that we operate with the average values at $x_i$. In this setup, estimation will be possible whether the $x_i$'s are unique or not. The data set now comprises as many observations as there are unique covariate values, $n$, say. Instead of averaging $y_i$'s at a given $x_i$ we can also estimate $f(x_i)$ by averaging observations in the neighborhood of $x_i$. Also, the value of $X$ at which a prediction is desired can be any value, whether it is part of the observed data or not. We simply refer to $x_0$ as the value at which the function $f(\cdot)$ is to be estimated.

If $Y$ and $X$ are unrelated and $f(x)$ is a flat line, it does not matter how close to the target $x_0$ we select observations. But as $\mathrm{E}[Y|x]$ changes with $x$, observations far removed from the point $x_0$ should not contribute much or anything to the estimate, to avoid bias. If we denote the set of points that are allowed to contribute to the estimation of $f$ at $x_0$ with $N(x_0)$, we need to decide (i) how to select the neighborhood $N(x_0)$ and (ii) how to weigh the points that are in the neighborhood. Points not in the neighborhood have zero weight and do not contribute to the estimation of $f(x_0)$, but points in the neighborhood can be weighted unequally. To assign larger weights to points close to $x_0$ and smaller weights to points far from $x_0$ is reasonable. The local estimate can be written as a linear combination of the responses,

$$\hat{f}(x_0) = \sum_{i=1}^{n}W_0(x_i; \lambda)\,y_i. \qquad [4.51]$$

$W_0(\cdot)$ is a function that assigns a weight to $y_i$ based on the distance of $x_i$ from $x_0$. The weights depend on the form of the weight function and the parameter $\lambda$, called the smoothing parameter. $\lambda$ essentially determines the width of the neighborhood.


The simplest smoother is the moving average, where $\hat{f}(x_0)$ is calculated as the average of the $\lambda$ points to the right or left of $x_0$. If $x_0$ is an observed value, then points within $\lambda$ points of $x_0$ receive weight $W_0(x_i; \lambda) = 1/(2\lambda + 1)$; all other points receive zero weight. In the interior of the data, each predicted value is thus the average of $100 \times (2\lambda + 1)/n$ percent of the data. Instead of this symmetric nearest neighborhood the $2\lambda$ points closest to $x_0$ (the nearest neighborhood) may be selected. Because of the bias incurred by the moving average near the ends of the $X$ data, the symmetric nearest neighborhood is usually preferred; see Hastie and Tibshirani (1990, p. 32). A moving average is not very smooth and has a jagged look if $\lambda$ is chosen small. Figure 4.21a shows 819 observations of the light transmittance (PPFD) in the understory of a longleaf pine stand collected during a single day. Moving averages were calculated with symmetric nearest neighborhoods containing 1%, 5%, and 30% of the data, corresponding roughly to smoothing parameters $\lambda = 4$, $\lambda = 20$, and $\lambda = 123$. With increasing $\lambda$ the fitted profile becomes more smooth. At the same time, the bias in the profile increases. Near the center of the data, the large neighborhood $\lambda = 123$ in Figure 4.21d averages values to the left and right of the peak. The predicted values thus are considerably below what appears to be the maximum PPFD. For the small smoothing neighborhood $\lambda = 4$ in Figure 4.21b, the resulting profile is not smooth at all. It nearly interpolates the data.
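In code, the symmetric nearest neighbor average is only a few lines. The SAS/IML® sketch below assumes the response has been sorted by the covariate and shrinks the neighborhood near the ends of the series; the data set work.light and variable ppfd are hypothetical placeholders.

proc iml;
   use work.light;  read all var {ppfd} into y;  close work.light;
   n = nrow(y);  lambda = 20;
   fhat = j(n,1,.);
   do i = 1 to n;
      lo = max(1, i-lambda);          /* neighborhood shrinks at the ends */
      hi = min(n, i+lambda);
      fhat[i] = sum(y[lo:hi]) / (hi - lo + 1);  /* local mean as in [4.51] */
   end;
quit;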

[Figure: four panels plotting PPFD against time of day (6:00 to 21:00): (a) raw data, (b) $\lambda = 4$, (c) $\lambda = 20$, (d) $\lambda = 123$.]

Figure 4.21. Light transmittance expressed as photosynthetic photon flux density (PPFD, $\mu$mol m$^{-2}$ s$^{-1}$) in understory of longleaf pine stand. (a): raw data measured in 10-second intervals. (b) to (d): moving average smoothers with symmetric nearest neighborhoods of 1% (b), 5% (c), and 30% (d) of the $n = 819$ data points. Data kindly provided by Dr. Paul Mou, Department of Biology, University of North Carolina at Greensboro. Used with permission.


An improvement over the moving average is to fit linear regression lines to the data in each neighborhood, usually simple linear regressions or quadratic polynomials. The smoothing parameter $\lambda$ must be chosen so that each neighborhood contains a sufficient number of points to estimate the regression parameters. As in the case of the moving average, the running-line smoother becomes smoother as $\lambda$ increases. If the neighborhood includes all $n$ points, it is identical to a parametric least squares polynomial fit.

As one moves through the data from point to point to estimate $f(x_0)$, the weight assigned to the point $x_i$ changes. In case of the moving average or running-line smoother, the weight of $x_i$ makes a discrete jump as the point enters and leaves the neighborhood. This accounts for their sometimes jagged look. Cleveland (1979) implemented a running-line smoother that eliminates this problem by varying the weights within the neighborhood in a smooth manner. The target point $x_0$ is given the largest weight and weights decrease with increasing distance $|x_i - x_0|$. The weight is exactly zero at the point in the neighborhood farthest from $x_0$. Cleveland's tri-cube weights are calculated as

$$W_0(x_i; \lambda) = \begin{cases} \left(1 - \left(\dfrac{|x_i - x_0|}{\lambda}\right)^3\right)^3 & 0 \le |x_i - x_0| < \lambda \\ 0 & \text{otherwise.} \end{cases}$$
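For reference, the tri-cube weight translates into a one-line SAS/IML® module; the sketch below, with made-up design points, evaluates the weights for a target point x0 = 5 and neighborhood radius lambda = 4.

proc iml;
   start tricube(x, x0, lambda);
      t = abs(x - x0) / lambda;                  /* scaled distances      */
      return( choose(t < 1, (1 - t##3)##3, 0) ); /* zero outside radius   */
   finish;
   x = do(1, 10, 1)`;                            /* example design points */
   w = tricube(x, 5, 4);
   print x w;
quit;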

[Figure: four panels as in Figure 4.21, with loess fits replacing the moving averages.]

Figure 4.22. Loess fit of light transmittance data with smoothing parameters identical to those in Figure 4.21.

Cleveland's proposal was to initially fit a $d$th degree polynomial with weights $W_0(x_i; \lambda)$ and to obtain the residual $y_i - \hat{f}(x_i)$ at each point. Then a second set of weights $\delta_i$ is defined based on the size of the residual, with $\delta_i$ being small for large residuals. The local polynomial fit is repeated with $W_0(x_i; \lambda)$ replaced by $\delta_i W_0(x_i; \lambda)$ and the residual weights are again updated based on the size of the residuals after the second fit. This procedure is repeated until some convergence criterion is met. This robustified local polynomial fit was called robust locally weighted regression by Cleveland (1979) and is also known as loess regression or lowess regression (see also Cleveland et al. 1988).

Beginning with Release 8.0 of The SAS® System the loess procedure has been available to fit local polynomial models with and without robustified reweighing. Figure 4.22 shows loess fits for the light transmittance data with smoothing parameters identical to those in Figure 4.21, produced with proc loess. For the same smoothing parameter the estimated trend $\hat{f}(x_i)$ is less jagged and erratic compared to a simple moving average. Small neighborhoods ($\lambda = 4$) still produce highly undersmoothed estimates with large variance.
A closely related class of smoothers uses a weight function that is a symmetric density function with a scale parameter $\lambda$, called the bandwidth, and calculates weighted averages of points in the neighborhood. The weight functions trail off with increasing distance of $x_i$ from $x_0$ but do not have to reach zero. These weight functions are called kernel functions or simply kernels and the resulting estimators are called kernel estimators. We require the kernels to have certain properties. Their support should be bounded so that the function can be rescaled to the $[-1,1]$ interval, and the function should be symmetric and integrate to one. Because of these requirements, the Gaussian probability density function

$$K(t) = \frac{1}{\sqrt{2\pi}}\exp\big\{-\tfrac{1}{2}t^2\big\}$$

is a natural choice. Other frequently used kernels are the quadratic kernel

$$K(t) = \begin{cases} 0.75\,(1 - t^2) & |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$

due to Epanechnikov (1969), the triangular kernel $K(t) = (1 - |t|) \times I(|t| \le 1)$, and the minimum variance kernel

$$K(t) = \begin{cases} \dfrac{3}{8}\,(3 - 5t^2) & |t| \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

For a discussion of these kernels see Hastie and Tibshirani (1990), Härdle (1990), and Eubank (1988). The weight assigned to the point $x_i$ in the estimation of $\hat{f}(x_0)$ is now

$$W_0(x_i; \lambda) = \frac{c}{\lambda}K\!\left(\frac{|x_i - x_0|}{\lambda}\right).$$

If the constant $c$ is chosen so that the weights sum to one, i.e.,

$$c^{-1} = \frac{1}{\lambda}\sum_{i=1}^{n}K\!\left(\frac{|x_i - x_0|}{\lambda}\right),$$

one arrives at the popular Nadaraya-Watson kernel estimator (Nadaraya 1964, Watson 1964)


$$\hat{f}(x_0) = \frac{\sum_i K(|x_i - x_0|/\lambda)\,y_i}{\sum_i K(|x_i - x_0|/\lambda)}. \qquad [4.52]$$

Compared to the choice of bandwidth $\lambda$, the choice of kernel function is usually of lesser consequence for the resulting estimate $\hat{f}(x)$.
The Nadaraya-Watson kernel estimator is a weighted average of the observations where the weights depend on the kernel function, bandwidth, and placement of the design points $x_1,\dots,x_n$. Rather than estimating an average locally, one can also estimate a local mean function that depends on $X$. This leads to kernel regression. A local linear kernel regression estimate models the mean at $x_0$ as $\mathrm{E}[Y] = \beta_0^{(0)} + \beta_1^{(0)}x_0$ by weighted least squares, where the weights for the sum of squares are given by the kernel weights. Once estimates are obtained, the mean at $x_0$ is estimated as

$$\hat{\beta}_0^{(0)} + \hat{\beta}_1^{(0)}x_0.$$

As $x_0$ is changed to the next location at which a mean prediction is desired, the kernel weights are recomputed and the weighted least squares problem is solved again, yielding new estimates $\hat{\beta}_0^{(0)}$ and $\hat{\beta}_1^{(0)}$.
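Both the Nadaraya-Watson average [4.52] and its local linear refinement reduce to a loop over the points at which predictions are wanted. The SAS/IML® sketch below uses a Gaussian kernel on made-up data; for the local linear version, the weighted average inside the loop would be replaced by a weighted least squares fit at each target point.

proc iml;
   start nwsmooth(xgrid, x, y, lambda);
      m = nrow(xgrid);  fhat = j(m,1,.);
      do j = 1 to m;
         kw = exp(-0.5 # ((x - xgrid[j])/lambda)##2); /* kernel weights */
         fhat[j] = sum(kw # y) / sum(kw);             /* estimator [4.52] */
      end;
      return(fhat);
   finish;
   x = do(0, 10, 0.25)`;                      /* made-up example data */
   y = sin(x) + 0.3 # rannor(j(nrow(x),1,1));
   fhat = nwsmooth(x, x, y, 0.75);
   print x y fhat;
quit;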

4.7.2 Choosing the Smoothing Parameter


The smoothing parameter $\lambda$ (the size of the nearest neighborhood in the moving average, running-line, and loess estimators, and the scaling parameter of the kernel in kernel smoothing) has considerable influence on the shape of the estimated mean function $f(x)$. Undersmoothed fits, resulting from choosing $\lambda$ too small, are jagged and erratic, and do not allow the modeler to discern the pertinent features of the data. Oversmoothed fits result from choosing $\lambda$ too large and hide important features of the data. To facilitate the choice of $\lambda$ a customary approach is to consider the trade-off between the bias of an oversmoothed function and the variability of an undersmoothed function. The bias of a smoother is caused by the fact that the true expectation function $f(x)$ in the neighborhood around $x_0$ does not follow the trend assumed by the smoother in the local neighborhood. Kernel estimates and moving averages assume that $f(x)$ is flat in the vicinity of $x_0$, and local polynomials assume a linear or quadratic trend. A second source of bias is incurred at the endpoints of the $x$ range, because the neighborhood is one-sided. Ignoring this latter effect and focusing on a moving average we estimate at point $x_0$,

$$\hat{f}(x_0) = \frac{1}{2\lambda + 1}\sum_{i \in N(x_0)}y_i \qquad [4.53]$$

with expectation $\mathrm{E}[\hat{f}(x_0)] = (2\lambda + 1)^{-1}\sum_{i \in N(x_0)}f(x_i)$ and variance $\mathrm{Var}[\hat{f}(x_0)] = \sigma^2/(2\lambda + 1)$. As $\lambda$ is increased ($\hat{f}(x)$ becomes smoother), the bias $\mathrm{E}[\hat{f}(x_0)] - f(x_0)$ increases and the variance decreases. To balance bias and variance, we minimize over the bandwidth a criterion combining both. Attempting to select the bandwidth that minimizes the residual variance alone is not meaningful, since this invariably will lead to a small bandwidth that connects the data points. If the $x_i$ are not replicated in the data the residual sum of squares



would be exactly zero in this case. Among the criteria that combine accuracy and precision are the average mean square error $AMSE(\lambda)$ and the average prediction error $APSE(\lambda)$ at the observed values:

$$AMSE(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\mathrm{E}\Big[\big\{\hat{f}(x_i) - f(x_i)\big\}^2\Big]$$
$$APSE(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\mathrm{E}\Big[\big\{Y_i - \hat{f}(x_i)\big\}^2\Big] = AMSE(\lambda) + \sigma^2.$$

The prediction error focuses on the prediction of a new observation and thus has an additional term $(\sigma^2)$. The bandwidth which minimizes one criterion also minimizes the other. It can be shown that

$$AMSE(\lambda) = \frac{1}{n}\sum_i \mathrm{Var}\big[\hat{f}(x_i)\big] + \frac{1}{n}\sum_i \Big(f(x_i) - \mathrm{E}\big[\hat{f}(x_i)\big]\Big)^2.$$

The term $f(x_i) - \mathrm{E}[\hat{f}(x_i)]$ is the bias of the smooth at $x_i$, and the average mean square error is the sum of the average variance and the average squared bias. The cross-validation statistic

$$CV = \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \hat{f}_{-i}(x_i)\Big)^2 \qquad [4.54]$$

is an estimate of $APSE(\lambda)$, where $\hat{f}_{-i}(x_i)$ is the leave-one-out prediction of $f(\cdot)$ at $x_i$. That is, the nonparametric estimate of $f(x_i)$ is obtained after removing the $i$th data point from the data set. The cross-validation statistic is obviously related to the PRESS (prediction error sum of squares) statistic in parametric regression models (Allen, 1974),

$$PRESS = \sum_{i=1}^{n}\big(y_i - \hat{y}_{i,-i}\big)^2, \qquad [4.55]$$

where $\hat{y}_{i,-i}$ is the predicted mean of the $i$th observation if that observation is left out in the estimation of the regression coefficients. Bandwidths that minimize the cross-validation statistic [4.54] are often too small, creating fitted values with too much variability. Various adjustments of the basic $CV$ statistic have been proposed. The generalized cross-validation statistic of Craven and Wahba (1979),

$$GCV = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{f}(x_i)}{(n - \nu)/n}\right)^2, \qquad [4.56]$$

simplifies the calculation of $CV$ and penalizes it at the same time. If the vector of fitted values at the observed data points is written as $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$, then the degrees of freedom $n - \nu$ are $n - \mathrm{tr}(\mathbf{H})$. Notice that the difference in the numerator term is no longer the leave-one-out residual, but the residual where $\hat{f}$ is based on all $n$ data points. If the penalty $n - \nu$ is applied directly to $CV$, a statistic results that Mays, Birch, and Starnes (2001) term

$$PRESS^{*} = \frac{PRESS}{n - \nu}.$$

Whereas bandwidths selected on the basis of $CV$ are often too small, those selected based on $PRESS^{*}$ tend to be large. A penalized $PRESS$ statistic that is a compromise between $CV$ and $GCV$ and appears to be just right (Mays et al. 2001) is

$$PRESS^{**} = \frac{PRESS}{n - \nu + (n-1)\big\{SSR_{max} - SSR(\lambda)\big\}/SSR_{max}}.$$

$SSR_{max}$ is the largest residual sum of squares over all possible values of $\lambda$ and $SSR(\lambda)$ is the residual sum of squares for the value of $\lambda$ investigated.
the residual sum of squares for the value of - investigated.
$CV$ and related statistics select the bandwidth based on the ability to predict a new observation. One can also concentrate on the ability to estimate the mean $f(x_i)$, which leads to consideration of $Q = \sum_{i=1}^{n}\big(f(x_i) - \hat{f}(x_i)\big)^2$ as a selection criterion under squared error loss. Mallows' $C_p$ statistic

$$C_p(\lambda) = n^{-1}SSR(\lambda) + 2\hat{\sigma}^2\,\mathrm{tr}(\mathbf{H})/n, \qquad [4.57]$$

is an estimate of $\mathrm{E}[Q]$. In parametric regression models $C_p$ is used as a model-building tool to develop a model that fits well and balances the variability of the coefficients in an over-fit model with the bias of an under-fit model (Mallows 1973). An estimate of $\sigma^2$ is needed to calculate the $C_p$ statistic. Hastie and Tibshirani (1990, p. 48) recommend estimating $\sigma^2$ from a nonparametric fit with small smoothing parameter $\lambda^{*}$ as

$$\hat{\sigma}^2 = SSR(\lambda^{*})\Big/\Big\{n - \mathrm{tr}\big(2\mathbf{H}^{*} - \mathbf{H}^{*}\mathbf{H}^{*\prime}\big)\Big\}.$$

A different group of bandwidth selection statistics is based on information-theoretical measures which play an important role in likelihood inference. Among them are Akaike's information criterion $AIC(\lambda) = n\log\{AMSE(\lambda)\} + 2\nu$ and variations thereof (see Eubank 1988, pp. 38-41; Härdle 1990, Ch. 5; Hurvich and Simonoff 1998).

Example 4.7 Paclobutrazol Growth Response (continued). A local quadratic polynomial (loess) smooth was obtained for a number of smoothing parameters ranging from 0.2 to 0.8, and $AIC(\lambda)$, $C_p(\lambda)$, and $GCV(\lambda)$ were calculated. Since the measures have different scales, we rescaled them to range between zero and one in Figure 4.23. $AIC(\lambda)$ is minimized for $\lambda = 0.4$; Mallows' $C_p(\lambda)$ and the generalized cross-validation statistic are minimized for $\lambda = 0.35$. The smooth with $\lambda = 0.35$ is shown in Figure 4.24.

The estimate $\hat{f}(x)$ traces the reversal in response trend after treatment application. It appears that approximately two weeks after the third and fourth treatment application the growth regulating effect of Paclobutrazol has disappeared, and there appears to be a growth stimulation relative to untreated plots.



[Figure: rescaled goodness-of-fit measures (Akaike's criterion, Mallows' $C_p$, generalized cross-validation) plotted against $\lambda$ from 0.2 to 0.8.]

Figure 4.23. $AIC(\lambda)$, $C_p(\lambda)$, and $GCV(\lambda)$ for Paclobutrazol response data in Figure 4.20. The goodness-of-fit measures were rescaled to range from 0 to 1.

The quadratic loess fit was obtained in The SAS® System with proc loess. Starting with Release 8.1 the select= option of the model statement in that procedure enables automatic selection of the smoothing parameter by $AIC(\lambda)$ or $GCV(\lambda)$ criteria. For the Paclobutrazol data, the following statements fit local quadratic polynomials and select the smoothing parameter based on the generalized cross-validation criterion ($GCV(\lambda)$, Output 4.16).

proc loess data=paclobutrazol;


model clippct = time / degree=2 dfmethod=exact direct select=GCV;
ods output OutputStatistics=loessFit;
run;
proc print data=loessFit; var Time DepVar Pred Residual LowerCl UpperCl;
run;

[Figure: clipping percentages from Figure 4.20 overlaid with the loess fit.]

Figure 4.24. Local quadratic polynomial fit (loess) with $\lambda = 0.35$.


Output 4.16.
The LOESS Procedure
Selected Smoothing Parameter: 0.35
Dependent Variable: CLIPPCT

Fit Summary
Fit Method Direct
Number of Observations 30
Degree of Local Polynomials 2
Smoothing Parameter 0.35000
Points in Local Neighborhood 10
Residual Sum of Squares 6613.29600
Trace[L] 10.44893
GCV 17.30123
AICC 7.70028
AICC1 233.02349
Delta1 18.62863
Delta2 18.28875
Equivalent Number of Parameters 9.52649
Lookup Degrees of Freedom 18.97483
Residual Standard Error 18.84163

Obs TIME DepVar Pred Residual LowerCL UpperCL

1 0.22998 45.6134 43.0768 2.53664 10.472 75.6815


2 0.32854 29.7255 34.9396 -5.2141 13.4187 56.4605
3 0.45996 28.0945 27.8805 0.21407 9.05546 46.7055
4 0.55852 32.5046 25.5702 6.9344 5.21265 45.9277
5 0.68994 19.9031 26.7627 -6.8596 5.06392 48.4615
6 0.7885 29.1638 31.1516 -1.9878 8.98544 53.3177
7 0.95277 44.3525 40.5377 3.81474 19.8085 61.267
8 1.05133 51.7729 44.6647 7.10817 24.9249 64.4045
9 1.1499 44.1395 49.392 -5.2525 29.4788 69.3052
10 1.24846 50.4982 50.6073 -0.1091 29.9399 71.2747
11 1.37988 50.6601 57.6029 -6.9428 31.5453 83.6605
12 1.83984 150.228 124.482 25.7452 101.39 147.575
13 1.9384 128.217 117.851 10.3662 96.7892 138.912
14 2.06982 101.588 102.656 -1.0678 82.8018 122.509
15 2.16838 73.1811 84.7463 -11.565 63.0274 106.465
16 2.39836 86.4965 94.4899 -7.9934 70.7593 118.221
17 2.52977 110.224 123.13 -12.906 101.038 145.222
18 2.62834 143.489 138.821 4.66873 118.077 159.564
19 2.75975 188.535 142.252 46.2833 120.837 163.666
20 2.85832 90.5399 124.276 -33.736 103.48 145.071
21 2.98973 106.795 87.7587 19.0363 66.5058 109.012
22 3.0883 44.9255 63.638 -18.713 42.3851 84.891
23 3.21971 58.5588 64.5384 -5.9796 43.2855 85.7913
24 3.31828 70.728 80.7667 -10.039 59.5138 102.02
25 3.44969 126.722 104.887 21.8345 83.6342 126.14
26 3.54825 126.386 120.239 6.14743 98.9857 141.492
27 3.67967 111.562 130.674 -19.113 110.239 151.11
28 3.77823 140.074 137.368 2.70567 118.34 156.396
29 3.90965 134.654 144.436 -9.7821 122.914 165.958
30 4.00821 157.226 148.854 8.37196 116.131 181.576



Chapter 5

Nonlinear Models

“Given for one instant an intelligence which could comprehend all the
forces by which nature is animated and the respective situation of the
beings who compose it — an intelligence sufficiently vast to submit these
data to analysis — it would embrace in the same formula the movements
of the greatest bodies of the universe and those of the lightest atom; for it,
nothing would be uncertain and the future, as the past, would be present
to its eyes.” Pierre de LaPlace, Concerning Probability. In Newman, J.R.,
The World of Mathematics. New York: Simon and Schuster, 1965, p.
1325.

5.1 Introduction
5.2 Models as Laws or Tools
5.3 Linear Polynomials Approximate Nonlinear Models
5.4 Fitting a Nonlinear Model to Data
5.4.1 Estimating the Parameters
5.4.2 Tracking Convergence
5.4.3 Starting Values
5.4.4 Goodness-of-Fit
5.5 Hypothesis Tests and Confidence Intervals
5.5.1 Testing the Linear Hypothesis
5.5.2 Confidence and Prediction Intervals
5.6 Transformations
5.6.1 Transformation to Linearity
5.6.2 Transformation to Stabilize the Variance
5.7 Parameterization of Nonlinear Models
5.7.1 Intrinsic and Parameter-Effects Curvature
5.7.2 Reparameterization through Defining Relationships
5.8 Applications
5.8.1 Basic Nonlinear Analysis with The SAS® System — Mitscherlich's
Yield Equation



5.8.2 The Sampling Distribution of Nonlinear Estimators —
the Mitscherlich Equation Revisited
5.8.3 Linear-Plateau Models and Their Relatives — a Study of Corn
Yields from Tennessee
5.8.4 Critical NO$_3$ Concentrations as a Function of Sampling Depth —
Comparing Join-Points in Plateau Models
5.8.5 Factorial Treatment Structure with Nonlinear Response
5.8.6 Modeling Hormetic Dose Response through Switching Functions
5.8.7 Modeling a Yield-Density Relationship
5.8.8 Weighted Nonlinear Least Squares Analysis with
Heteroscedastic Errors



5.1 Introduction
Box 5.1 Nonlinear Models

• Nonlinear models have advantages over linear models in that
  — their origin lies in biological/physical/chemical theory and principles;
  — their parameters reflect quantities important to the user;
  — they typically require fewer parameters than linear models;
  — they require substantial insight into the studied phenomenon.

• Nonlinear models have disadvantages over linear models in that
  — they require iterative fitting algorithms;
  — they require user-supplied starting values (initial estimates) for the parameters;
  — they permit only approximate (rather than exact) inference;
  — they require substantial insight into the studied phenomenon.

Recall from §1.7.2 that nonlinear statistical models are defined as models in which the derivatives of the mean function with respect to the parameters depend on one or more of the parameters. A growing number of researchers in the biological sciences share our sentiment that relationships among biological variables are best described by nonlinear functions. Processes such as growth, decay, birth, mortality, abundance, and yield rarely relate linearly to explanatory variables. Even the most basic relationships between plant yield and nutrient supply, for example, are nonlinear. Liebig's famous law of the minimum, or law of constant returns, has been interpreted to imply that for a single deficient nutrient, crop yield $Y$ is proportional to the addition of a fertilizer $X$ until a point is reached where another nutrient is in the minimum and yield is limited. At this point further additions of the fertilizer show no effect and the yield stays constant unless the deficiency of the limiting nutrient is removed. The proportionality between $Y$ and $X$ prior to reaching the yield limit implies a straight-line relationship that can be modeled with a linear model. As soon as the linear increase is combined with a plateau, the corresponding model is nonlinear. Such models are termed linear-plateau models (Anderson and Nelson 1975), linear response-and-plateau models (Waugh et al. 1973, Black 1993), or broken-stick models (Colwell et al. 1988). The data in Figure 5.1 show relative corn (Zea mays L.) yield percentages as a function of late-spring test nitrate concentrations in the top 30 cm of the soil. The data are a portion of a larger data set discussed and analyzed by Binford et al. (1992). A linear-plateau model has been fitted to these data and is shown as a solid line. Let $Y$ denote the yield percent and $x$ the soil nitrogen concentration. The linear-plateau model can be written as

$$E[Y] = \begin{cases} \beta_0 + \beta_1 x & x \le \alpha \\ \beta_0 + \beta_1\alpha & x > \alpha, \end{cases} \qquad [5.1]$$

where $\alpha$ is the nitrogen concentration at which the two linear segments join. An alternative expression for model [5.1] is




$$E[Y] = (\beta_0 + \beta_1 x)I(x \le \alpha) + (\beta_0 + \beta_1\alpha)I(x > \alpha) = \beta_0 + \beta_1\{xI(x \le \alpha) + \alpha I(x > \alpha)\}.$$

Here, $I(x \le \alpha)$ is the indicator function that returns $1$ if $x \le \alpha$ and $0$ otherwise. Similarly, $I(x > \alpha)$ returns $1$ if $x > \alpha$ and $0$ otherwise. If the concentration $\alpha$ at which the lines intersect is known, the term $z = xI(x \le \alpha) + \alpha I(x > \alpha)$ is known and one can set up an appropriate regressor variable by replacing the concentrations in excess of $\alpha$ with the value of $\alpha$. The resulting model is a linear regression model $E[Y] = \beta_0 + \beta_1 z$ with parameters $\beta_0$ and $\beta_1$. If $\alpha$ is not known and must be estimated from the data, as will usually be the case, this is a nonlinear model since the derivatives

$$\partial E[Y]/\partial\beta_0 = 1, \quad \partial E[Y]/\partial\beta_1 = xI(x \le \alpha) + \alpha I(x > \alpha), \quad \partial E[Y]/\partial\alpha = \beta_1 I(x > \alpha)$$

depend on model parameters.
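When $\alpha$ must be estimated, the segmented mean function can be handed to an iterative nonlinear least squares routine directly. A minimal sketch with proc nlin follows; the data set name PlateauData, its variables yield and no3, and the starting values are placeholders for illustration (models of this kind are fitted in detail in §5.8.3 and §5.8.4):

proc nlin data=PlateauData;                  /* hypothetical data set             */
   parameters b0=30 b1=3 alpha=25;           /* guessed starting values           */
   if no3 <= alpha then mean = b0 + b1*no3;  /* linear segment for x <= alpha     */
   else mean = b0 + b1*alpha;                /* plateau for x > alpha             */
   model yield = mean;
run;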

[Figure 5.1 appears here: scatterplot of relative yield percent (30 to 110) against soil NO3 (0 to 80 mg kg-1), with the fitted linear-plateau model (solid) and the fitted quadratic polynomial (dashed).]

Figure 5.1. Relative corn yield percent as a function of late-spring test soil nitrogen concentration in top 30 cm of soil. Solid line is the fitted linear-plateau model. Dashed line is the fitted quadratic polynomial model. Data kindly made available by Dr. A. Blackmer, Department of Agronomy, Iowa State University. Used with permission. See also Binford, Blackmer, and Cerrato (1992) and the application in §5.8.4.

Should we guess a value for $\alpha$ from a graph of the data, assume it is the true value (without variability), and fit a simple linear regression model, or should we let the data guide us to a best possible estimate of $\alpha$ and fit the model as a nonlinear regression model? As an alternative we can abandon the linear-plateau philosophy and fit a quadratic polynomial to the data, since a polynomial $E[Y] = \beta_0 + \beta_1 x + \beta_2 x^2$ has curvature. That this polynomial fails to fit the data is easily seen from Figure 5.1. It breaks down in numerous places. The initial increase of yield with soil NO3 is steeper than what a quadratic polynomial can accommodate. The maximum yield for the polynomial model occurs at a nitrate concentration that is upwardly biased. Anderson and Nelson (1975) have noticed that these two model breakdowns are rather typical when polynomials are fit to data for which a linear-plateau model is appropriate. In addition, the quadratic polynomial has a maximum at $x_{max} = -\beta_1/(2\beta_2)$ ($x_{max} = 67.4$ in Figure 5.1). The data certainly do not support the conclusion that the maximum yield is achieved at a nitrate concentration that high, nor do they support the idea of decreasing yields with increasing concentration.
This application serves to show that even in rather simple situations such as a linear-plateau model we are led to nonlinear statistical models for which linear models are a poor substitute. The two workarounds that result in a linear model are not satisfactory. To fix a guessed value for $\alpha$ in the analysis ignores the fact that the “guesstimate” is not free of uncertainty. It depends on the observed data, and upon repetition of the experiment we are likely to arrive at a (slightly?) different value for $\alpha$. This uncertainty must be incorporated into the analysis when determining the precision of the slope and intercept estimators, since the three parameters in the plateau model are not independent. Secondly, a visual guesstimate does not compare in accuracy (or precision) to a statistical estimate. Would you have guessed, based only on the data points in Figure 5.1, $\alpha$ to be $23.13$? The second workaround, that of abandoning the linear-plateau model in favor of a polynomial model, is even worse. Not only does the model not fit the data, polynomial models do not incorporate behavior one would expect of the yield percentages. For example, they do not provide a plateau. The linear-plateau model does; it is constructed to exhibit that behavior. If there is theoretical and/or empirical evidence that the response follows a certain characteristic trend, one should resort to statistical models that guarantee that the fitted model shares these characteristics. Cases in point are sigmoidal, convex, hyperbolic, asymptotic, plateau, and other relationships. Almost always, such models will be nonlinear.

In our experience the majority of statistical models fitted by researchers and practitioners to empirical data are nevertheless linear in the parameters. How can this discrepancy be explained? Straightforward inference in linear models contributes to it, as does lack of familiarity with nonlinear fitting methods and software, as does the exaggeration of the following perceived disadvantages of nonlinear models.
• Linear models are simple to fit and parameter estimation is straightforward. In nonlinear models, parameters are estimated iteratively. Initial estimates are successively improved until some convergence criterion is met. These initial estimates, also termed starting values, are supplied by the user. There is no guarantee that the iterative algorithm converges to a unique solution or converges at all. A model may apply to a set of data, but because of poorly chosen starting values one may not obtain any parameter estimates at all.
• Curved trends can be modeled by curvilinear models (see §1.7.2), e.g., polynomial models of the form $Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_k x_i^k + e_i$. One can motivate a polynomial as an approximation to a nonlinear model (see §5.3), but this approximation may be poor (see Figure 5.1).
• Statistical inference for linear models is well established and, when data are Gaussian-distributed, exact. Even for Gaussian-distributed data, inference in nonlinear models is only approximate and relies on asymptotic results.
• Some nonlinear models can be transformed into linear models. $E[Y_i] = \beta_0\exp\{x\beta_1\}$, for example, can be linearized by taking logarithms: $\ln\{E[Y_i]\} = \ln\{\beta_0\} + x\beta_1$, which is a linear regression with intercept $\ln\{\beta_0\}$ and slope $\beta_1$. In §4.5.2 a nonlinear relationship between group standard deviations and group means was linearized to determine a variance-stabilizing transform. Often, non-negligible transformation bias is incurred in this process, and interpretability of the parameters is sacrificed (§5.6).
• Treatment comparisons in linear models are simple. Set up a linear hypothesis $H_0$: $A\beta = 0$ and invoke a sum of squares reduction test. In the worst-case scenario this requires fitting of a full and a reduced model and constructing the sum of squares reduction test statistic by hand (§4.2.3). But usually, the test statistics can be obtained from a fit of the full model alone. Tests of hypotheses in nonlinear models more often require the fitting of a full and a reduced model or ingenious ways of model parameterization (§5.7).

Problems with iterative algorithms usually can be overcome by choosing an optimization method suited to the problem at hand (§5.4) and choosing starting values carefully (§5.4.3). Starting values can often be found by graphical examination of the data, fits of linearized or approximate models, or by simple mathematical techniques (§5.4.3). If the user has limited understanding of the properties of a nonlinear model and the interpretation of the parameters is unclear, choosing starting values can be difficult. Unless sample sizes are very small, nonlinear inference, albeit approximate, is reliable. It is our opinion that approximate inference in a properly specified nonlinear model is worth more than exact inference in a less applicable linear model. If a response is truly nonlinear, transformations to linearity are not without problems (see §5.6), and treatment comparisons are possible if the model is parameterized properly. The advantages of nonlinear modeling outweigh its disadvantages:
The advantages of nonlinear modeling outweigh its disadvantages:
• Nonlinear models are more parsimonious than linear models. To invoke curvature with
inflection with a polynomial model requires at least a cubic term. The linear poly-
nomial ]3 œ "! € "" B3 € "# B#3 € "$ B$3 € /3 has four parameters in the mean function.
Nonlinear models can accommodate inflection points with fewer parameters. For
example, the model ]3 œ "  expe  " B! f € /3 has only two parameters in the mean
function.
• Many outcomes do not develop without bounds but reach upper/lower asymptotes and
plateaus. It is difficult to incorporate limiting behavior into linear models. Many
classes of nonlinear models have been studied and constructed to exhibit behavior
such as asymptotic limits, inflections, and symmetry. Many models are mentioned
throughout the text and §A5.9 lists many more members of the classes of sigmoidal,
concave, and convex models.
• Many nonlinear models are derived from elementary biological, physical, or
chemical principles. For example, if C is the size of an organism at time > and ! is its
maximum size, assuming that the rate of growth `CÎ`> is proportional to the re-
maining growth a+  C b leads to the differential equation: `CÎ`> œ " Ð!  CÑ. Upon
integration one obtains a nonlinear three-parameter model for growth: Ca>b œ ! €
a#  !bexpe  " >f. The parameter # denotes the initial size C a!b.
• Parameters of nonlinear models are typically meaningful quantities and have a direct
interpretation applicable to the problem being studied. In the growth model
C a>b œ ! € a#  !bexpe  " >f, ! is the final size, # is the initial size and " governs a
rate of change that determines how quickly the organism grows from # to !. In the
soil sciences nitrogen mineralization potential is often modeled as a function of time
by an exponential model of form Ec] d œ "! a"  expe  "" >fb where "! is the maxi-

© 2003 by CRC Press LLC


mum amount mineralized and "" is the rate at which mineralization occurs. In the
yield-density model Ec] d œ a! € " Bb" where ] denotes yield per plant and B is the
plant density per unit area, "Î! measures the genetic potential of the species and "Î"
the environmental potential (see §5.8.7 for an application).
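As promised in the third bullet, a brief verification of the integration step (a sketch; the integration constant is fixed by the initial condition $y(0) = \gamma$):

$$\frac{\partial y}{\partial t} = \beta(\alpha - y) \;\Rightarrow\; \int\frac{dy}{\alpha - y} = \int\beta\,dt \;\Rightarrow\; -\ln(\alpha - y) = \beta t + c \;\Rightarrow\; \alpha - y(t) = (\alpha - \gamma)\exp\{-\beta t\},$$

which rearranges to the three-parameter growth model $y(t) = \alpha + (\gamma - \alpha)\exp\{-\beta t\}$ quoted above.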

This chapter is concerned with nonlinear statistical models with a single covariate. In §5.2 we investigate growth models as a particularly important family of nonlinear models to demonstrate how theoretical considerations give rise to nonlinear models through deterministic generating equations, but also to examine how nonlinear models evolved from mathematical equivalents of laws of nature to empirical tools for data summary and analysis. In §5.3 a relationship between linear polynomial and nonlinear models is drawn with the help of Taylor series expansions, and the basic process of fitting a nonlinear model to data is discussed in §5.4. The test of hypotheses and inference about the parameters is covered in §5.5. Even if models can be transformed to a linear scale, we prefer to fit them in their nonlinear form to retain interpretability of the parameters and to avoid transformation bias. Transformations to stabilize the variance have already been discussed in §4.5.2 for linear models. Transformations to linearity (§5.6) are of concern if the modeler does not want to resort to nonlinear fitting methods. Parameterization, the process of changing the mathematical form of a nonlinear model by re-expressing the model in terms of different parameters, greatly impacts the statistical properties of the parameter estimates and the convergence properties of the fitting algorithms. Problems in fitting a particular model can often be overcome by changing its parameterization (§5.7). Through reparameterization one can also make the model depend on parameters it did not contain originally, thereby facilitating statistical inference about these quantities. In §5.8 we discuss various analyses of nonlinear models, from a standard textbook example to complex factorial treatment structures involving a nonlinear response. Since the selection of an appropriate model family is key to successful nonlinear modeling, we present numerous concave, convex, and sigmoidal nonlinear models in §A5.9 (on CD-ROM). Additional mathematical details extending the discussion in the text can be found as Appendix A on the CD-ROM (§A5.10).

5.2 Models as Laws or Tools

Like no other class of statistical models, nonlinear models evolved from mathematical formulations of laws of nature to empirical tools describing pattern in data. Within the large family of nonlinear models, growth models are particularly suited to discuss this evolution. This discourse also exposes the reader to the genesis of some popular models and demonstrates how reliance on fundamental biological relationships naturally leads to nonlinearity. This discussion is adapted in part from a wonderful review article on the history of growth models by Zeger and Harlow (1987).
The origin of many nonlinear models in use today can be traced to scholarly efforts to discover laws of nature, to reveal scales of being, and to understand the forces of life. Assumptions were made about how elementary chemical, anatomical, and physical relationships perpetuate to form a living and growing organism and, by extension, populations (collections of organisms). Robertson (1923), for example, saw a fundamental law of growth in the chemistry of cells. The equation on which he built describes a chemical reaction in which the product $y$ is also a catalyst (an autocatalytic reaction). If $\alpha$ is the initial rate of growth and $\beta$ the upper limit of growth, this relationship can be expressed in form of the differential equation

$$\partial\log\{y\}/\partial t = (\partial y/\partial t)/y = \alpha(1 - y/\beta), \qquad [5.2]$$

which is termed the generating equation of the process. Robertson viewed this relationship as fundamental to describe the increase in size ($y$) over time ($t$) for (all) biological entities. The solution to this differential equation is known as the logistic or autocatalytic model:

$$y(t) = \frac{\beta}{1 + \exp\{-\alpha(t - \gamma)\}}. \qquad [5.3]$$

Pearl and Reed (1924) promoted the autocatalytic concept not only for individual but also for
population growth.
The term $\partial\log\{y\}/\partial t = (\partial y/\partial t)/y$ in [5.2] is known as the specific growth rate, a measure of the rate of change relative to size. Minot (1908) called it the power of growth and defined senescence as a loss in specific growth rate. He argued that $\partial\log\{y\}/\partial t$ is a concave decreasing function of time, since the rate of senescence decreases from birth. A mathematical example of a relationship satisfying Minot's assumptions about aging and death is the differential equation

$$\partial\log\{y\}/\partial t = \alpha\{\log\{\beta\} - \log\{y\}\}, \qquad [5.4]$$

where $\alpha$ is the intrinsic growth rate and $\beta$ is a rate of decay. This model is due to Gompertz (1825) who posited it as a law of human mortality. It assumes that specific growth declines linearly with the logarithm of size. Gompertz (1825) reasoned that

“the average exhaustions of a man's power to avoid death were such that at the end of equal infinitely small intervals of time, he lost equal portions of his remaining power to oppose destruction.”

The Gompertz model, one of the more common growth models and named after him, is the solution to this differential equation:

$$y(t) = \beta\exp\{-\exp\{-\alpha(t - \gamma)\}\}. \qquad [5.5]$$

Like the logistic model it has upper and lower asymptotes and is sigmoidal in shape. Whereas the logistic model is symmetric about the inflection point $t = \gamma$, the Gompertz model is asymmetric.
Whether growth is autocatalytic or captured by the Gompertz model has been the focus
of much debate. The key was whether one believed that specific growth rate is a linear or
concave function of size. Courtis (1937) felt so strongly about the adequacy of the Gompertz
model that he argued any biological growth, whether of an individual organism, its parts, or
populations, can be described by the Gompertz model provided that for the duration of the
study conditions (environments) remained constant.
The Gompertz and logistic models were developed for size-vs.-time relationships. A second developmental track focused on models where the size of one part ($y_1$) is related to the size of another ($y_2$) (so-called size-vs.-size models). Huxley (1932) proposed that specific growth rates of $y_1$ and $y_2$ should be proportional:

$$(\partial y_1/\partial t)/y_1 = \beta(\partial y_2/\partial t)/y_2. \qquad [5.6]$$

The parameter $\beta$ measures the ratio of the specific growth rates of $y_1$ and $y_2$. The isometric case $\beta = 1$ was of special interest because it implies independence of size and shape. Quiring (1941) felt strongly that allometry, the proportionality of sizes, was a fundamental biological law. The study of its regularities, in his words,

“should lead to a knowledge of the fundamental laws of organic growth and explain the scale of being.”

Integrating the differential equation, one obtains the basic allometric equation $\log\{y_1\} = \alpha + \beta\log\{y_2\}$ or, in exponentiated form,

$$y_1 = \alpha y_2^{\beta}. \qquad [5.7]$$

We notice at this point that the models [5.3] through [5.7] are of course nonlinear. Nonlinearity is a result of integrating the underlying differential equations. The allometric model, however, can be linearized by taking logarithms on both sides of [5.7]. Pázman (1993, p. 36) refers to such models as intrinsically linear. Models which cannot be transformed to linearity are then intrinsically nonlinear.
Allometric relationships can be embedded in more complicated models. Von Bertalanffy (1957) postulated that growth is the sum of positive (anabolic) forces that synthesize material and negative (metabolic) forces that reduce material in an organism. Studying the weight of animals, he found that the power $2/3$ for the metabolic rate describes the anabolic forces well. The model derived from the differential equation

$$\partial y/\partial t = \alpha y^{2/3} - \beta y \qquad [5.8]$$

is known as the Von Bertalanffy model. Notice that the first term on the right-hand side is of the allometric form [5.7].
The paradigm shift from nonlinear models as mathematical expressions of laws to nonlinear models as empirical tools for data summary had numerous reasons. Cases that did not seem to fit any of the classical models could only be explained as aberrations in measurement protocol or environment or as new processes for which laws needed to be found. At the same time evidence mounted that the various laws could not necessarily coexist. Zeger and Harlow (1987) elaborate how Lumer (1937) showed that sigmoidal growth (Logistic or Gompertz) in different parts of an organism can disable allometry by permitting only certain parameter values in the allometry equation. The laws could not hold simultaneously. Laird (1965) argued that allometric analyses were consistent with sigmoidal growth provided certain conditions about specific growth rates are met. Despite the inconsistencies between sigmoidal
and allometric growth, Laird highlighted the utility in both types of models. Finally, advances
in computing technology made fitting of nonlinear models less time demanding and allowed
examination of competing models for the same data set. Rather than adopting a single model
family as the law to which a set of data must comply, the empirical nature of the data could be
emphasized and different model families could be tested against a set of data to determine
which described the observations best. Whether one adopts an underlying biological or
chemical relationship as true, there is much to be learned from a model that fits the data well.
Today, we are selecting nonlinear models because they offer certain patterns. If the data suggest a sigmoidal trend with limiting values, we will turn to the Logistic, Gompertz, and other families of models that exhibit the desired behavior. If empirical data suggest monotonic increasing or decreasing relationships, families of concave or convex models are to be considered.
One can argue whether the empiricism in modeling has been carried too far. Study of certain disciplines shows a prevalence of narrow classes of models. In (herbicide) dose-response experiments the logistic model (or log-logistic model if the regressor is log-transformed) is undoubtedly the most frequently used model. This is not the case because the underlying linearity of specific growth (decay) rates is widely adopted as the mechanism of herbicide response, but because in numerous works it was found that logistic functions fit herbicide dose-response data well (e.g., Streibig 1980, Streibig 1981, Lærke and Streibig 1995, Seefeldt et al. 1995, Hsiao et al. 1996, Sandral et al. 1997). As a result, analysts may resist the urge to thoroughly investigate alternative model families. Empirical models are not panaceas, and examples where the logistic family does not describe herbicide dose-response behavior well can be found easily (see, for example, Brain and Cousens 1989, Schabenberger et al. 1999). Sandland and McGilchrist (1979) and Sandland (1983) criticize the widespread application of the Von Bertalanffy model in the fisheries literature. The model's status, according to Sandland (1983), goes “far beyond that accorded to purely empirical models.” Cousens (1985) criticizes the categorical assumption of many weed-crop competition studies that crop yield is related to weed density in sigmoidal fashion (Zimdahl 1980, Utomo 1981, Roberts et al. 1982, Radosevich and Holt 1984). Models for yield loss as a function of weed density are more reasonably related to hyperbolic shapes according to Cousens (1985). The appropriateness of sigmoidal vs. hyperbolic models for yield loss depends on biological assumptions. If it is assumed that there is no competition between weeds and crop at low densities, a sigmoidal model suggests itself. On the other hand, if one assumes that at low weed densities weed plants interact with the crop but not each other and that a weed's influence increases with its size, hyperbolic models with a linear increase of yield loss at low weed densities of the type advocated by Cousens (1985) arise rather naturally. Because the biological explanations for the two model types are different, Cousens concludes that one must be rejected. We believe that much is to be gained from using nonlinear models that differ in their physical, biological, and chemical underpinnings. If a sigmoidal model fits a set of yield data better than the hyperbolic, contrary to the experimenter's expectation, one is led to rethink the nature of the biological process, a most healthy exercise in any circumstance. If one adopts the attitude that models are selected because they describe the data well, not because they comply with a narrow set of biological assumptions, any one of which may be violated in a particular case, the modeler gains considerable freedom. Swinton and Lyford (1996), for example, entertain a reparameterized form of Cousens's rectangular hyperbola to model yield loss as a function of weed density. Their model permits a test whether the yield loss function is indeed hyperbolic or sigmoidal, and the question can be resolved via a statistical test if one is not willing to choose between the two model families on biological grounds alone. Cousens (1985) advocates semi-empirical model building. A biological process is divided into stages and likely properties of each stage are combined to formulate a resulting model “based on biologically sound premises.” His rectangular hyperbola mentioned above is derived on these grounds: (i) yield loss percentage ranges between 0% and 100% as weed density tends toward 0 or infinity, respectively; (ii) effects of individual weed plants on the crop at low density are additive; (iii) the rate at which yield loss increases with increasing density is proportional to the squared yield loss per weed plant. Developing mathematical models in this fashion is highly recommended. Regarding the resulting equation as a biological law and rejecting other models in its favor equates assumptions with knowledge.
At the other end of the modeling spectrum is analysis without any underlying generating equation or mechanism, by fitting linear polynomial functions to the data. An early two-stage method for analyzing growth data from various individuals (clusters) was to fit (orthogonal) polynomials separately to the data from each individual in the first stage and to compare the polynomial coefficients in the second stage with analysis of variance methods (Wishart 1938). This approach is referred to by Sandland and McGilchrist (1979) as "statistical" modeling of growth, while relying on nonlinear models derived from deterministic generating equations is termed "biological" modeling. Since all models examined in this text are statistical/stochastic in nature, we do not abide by this distinction. Using polynomials gives the modeler freedom, since it eliminates the need to develop or justify the theory behind a nonlinear relationship. On the other hand, it is well documented that polynomials are not well suited to describe growth data. Sandland and McGilchrist (1979) expressed desiderata for growth models that strike a balance between adhering to deterministic biological laws and empiricism. In our opinion, these are desiderata for all models applied to biological data:
“Growth models should be flexible and able to fit a range of different shapes. They should be based
on biological considerations bearing in mind the approximate nature of our knowledge of growth.
A biologist should be able to draw some meaningful conclusions from the analysis. The biological
considerations should cover not only the intrinsic growth process but also the random environment
in which it is embedded.”

5.3 Linear Polynomials Approximate Nonlinear Models
The connection between polynomial and nonlinear models is closer than one may think. Polynomial models can be considered approximations to (unknown) nonlinear models. Assume that two variables $Y$ and $x$ are functionally related. For the time being we ignore the possibly stochastic nature of their relationship. If the function $y = f(x)$ is known, $y$ could be predicted for every value of $x$. Expanding $f(x)$ into a Taylor series around some other value $x^*$, and assuming that $f(x)$ is continuous, $f(x)$ can be expressed as a sum:

$$f(x) = f(x^*) + zf'(x^*) + \frac{1}{2!}z^2 f''(x^*) + \dots + \frac{1}{r!}z^r f^{(r)}(x^*) + R. \qquad [5.9]$$

Equation [5.9] is the Taylor series expansion of $f(x)$ around $x^*$. Here, $f'(x^*)$ denotes the first derivative of $f(x)$ with respect to $x$, evaluated at the point $x^*$, and $z = (x - x^*)$. $R$ is the remainder term of the expansion and measures the accuracy of the approximation of $f(x)$ by the series of order $r$. Replace $f(x^*)$ with $\beta_0$, $f'(x^*)$ with $\beta_1$, $f''(x^*)/2!$ with $\beta_2$, and so forth, and [5.9] reveals itself as a polynomial in $z$:

$$y = \beta_0 + \beta_1 z + \beta_2 z^2 + \dots + \beta_r z^r + R = z'\beta + R. \qquad [5.10]$$

The term $\beta_0 + \beta_1 z + \beta_2 z^2 + \dots + \beta_r z^r$ is a linear approximation to $f(x)$ and, depending on the number of terms, can be made arbitrarily close to $f(x)$.


If there are $n$ distinct data points of $x$, a polynomial with degree $r = n - 1$ will fit the data perfectly and the remainder term will be exactly zero. If the degree of the polynomial is less than $n - 1$, $R \ne 0$ and $R$ is a measure for the discrepancy between the true function $f(x)$ and its linear approximation $z'\beta$. When fitting a linear polynomial, the appropriate degree $r$ is of concern. While the flexibility of the polynomial increases with $r$, complexity of the model must be traded against quality of fit and poorer statistical properties of estimated coefficients in high-order, overfit polynomials. Single-covariate nonlinear statistical models, the topic of this chapter, target $f(x)$ directly, rather than its linear approximation.
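As a concrete instance of [5.9] and [5.10], consider expanding the familiar function $f(x) = \exp\{x\}$ around $x^* = 0$, so that $z = x$:

$$\exp\{x\} = 1 + z + \frac{1}{2!}z^2 + \dots + \frac{1}{r!}z^r + R,$$

that is, $\beta_0 = 1$, $\beta_1 = 1$, $\beta_2 = 1/2!$, and so forth; truncating the series at order $r$ leaves the linear polynomial $z'\beta$ with remainder $R$.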

Example 5.1. The data plotted in Figure 5.2 suggest a curved trend between $y$ and $x$ with an inflection point. To incorporate the inflection, a model must be found for which the second derivative of the mean function depends on $x$. A linear polynomial in $x$ must be carried at least to the third order. The four-parameter linear model

$$Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + e_i$$

is a candidate. An alternative one-parameter nonlinear model could be

$$Y_i = 1 - \exp\{-x_i^{\beta}\} + e_i.$$

[Figure 5.2 appears here: data points on $0 \le x \le 2.2$ with the fitted nonlinear curve $E[Y_i] = 1 - \exp\{-x^{2.2}\}$ and the fitted cubic polynomial $E[Y_i] = -0.0107 + 0.1001x + 0.807x^2 - 0.3071x^3$.]

Figure 5.2. Data suggesting mean function with inflection summarized by nonlinear
and linear polynomial models.

The nonlinear model is more parsimonious and also restricts $E[Y_i]$ between zero and one. If the response is a true proportion the linear model does not guarantee predicted values inside the permissible range, whereas the nonlinear model does. The nonlinear function approaches the upper limit of $1.0$ asymptotically as $x$ grows. The fitted polynomial, because of its curvature, does not have an asymptote but achieves extrema at

$$x = \frac{-2\hat\beta_2 \pm \sqrt{4\hat\beta_2^2 - 12\hat\beta_1\hat\beta_3}}{6\hat\beta_3} = \{0.0644, 1.687\}.$$

If a decrease in $E[Y]$ is not reasonable on biological grounds, the polynomial is a deficient model.

Linear polynomials are flexible modeling tools that do not appeal to a generating equation and, for short data series, may be the only possible modeling choice. They are less parsimonious, poor at fitting asymptotic approaches to limiting values, and do not provide a biologically meaningful parameter interpretation. From the scientist's point of view, nonlinear models are certainly superior to polynomials. As an exploratory tool that points the modeler in the direction of appropriate nonlinear models, polynomials are valuable. For complex processes with changes of phase and temporal fluctuations, they may be the only models offering sufficient flexibility unless one resorts to nonparametric methods (see §4.7).

5.4 Fitting a Nonlinear Model to Data


Box 5.2. Model Fitting

• An algorithm for fitting a nonlinear model is iterative and comprises three components:
  • a numerical rule for successively updating iterates (§5.4.1),
  • a method for deciding when to stop the process (§5.4.2),
  • starting values to get the iterative process under way (§5.4.3).

• Commonly used iterative algorithms are the Gauss-Newton and Newton-Raphson methods. Neither should be used in their original, unmodified form.

• Stop criteria for the iterative process should be true convergence, not termination, criteria to distinguish convergence to a global minimum from lack of progress of the iterative algorithm.

• Little constructive theory is available to select starting values. Some ad-hoc procedures have proven particularly useful in practice.

5.4.1 Estimating the Parameters

The least squares principle of parameter estimation has equal importance for nonlinear models as for linear ones. The idea is to minimize the sum of squared deviations between observations and their mean. In the linear model $Y_i = x_i'\beta + e_i$, where $E[Y_i] = x_i'\beta$, $e_i \sim iid\,(0, \sigma^2)$, this requires minimization of the residual sum of squares

$$S(\beta) = \sum_{i=1}^{n}(y_i - x_i'\beta)^2 = (y - X\beta)'(y - X\beta).$$

If $X$ is of full rank, this problem has a closed-form unique solution, the OLS estimator $\hat\beta = (X'X)^{-1}X'y$ (see §4.2.1). If the mean function is nonlinear, the basic model equation is

$$Y_i = f(x_i; \theta) + e_i, \quad e_i \sim iid\,(0, \sigma^2), \quad i = 1, \dots, n, \qquad [5.11]$$

where $\theta$ is the $(p \times 1)$ vector of parameters to be estimated and $f(x_i; \theta)$ is the mean of $Y_i$. The residual sum of squares to be minimized now can be written as

$$S(\theta) = \sum_{i=1}^{n}(y_i - f(x_i; \theta))^2 = (y - f(x, \theta))'(y - f(x, \theta)), \qquad [5.12]$$

with

$$f(x, \theta) = [f(x_1; \theta), f(x_2; \theta), \dots, f(x_n; \theta)]'.$$

This minimization problem is not as straightforward as in the linear case since $f(x, \theta)$ is a nonlinear function of $\theta$. The derivatives of $S(\theta)$ depend on the particular structure of the model, whereas in the linear case with $f(x, \theta) = X\beta$ finding derivatives is easy. One method of minimizing [5.12] is to replace $f(x, \theta)$ with a linear model that approximates $f(x, \theta)$. In §5.3, a nonlinear function $f(x)$ was expanded into a Taylor series of order $r$. Since $f(x, \theta)$ has $p$ unknowns in the parameter vector, we expand it into a first-order Taylor series about each element of $\theta$. Denote by $\theta^0$ a vector of initial guesses of the parameters (a vector of starting values). The first-order Taylor series (see §A5.10.1 for details) of $f(x, \theta)$ around $\theta^0$ is

$$f(x, \theta) \approx f(x, \theta^0) + \left.\frac{\partial f(x, \theta)}{\partial\theta'}\right|_{\theta^0}(\theta - \theta^0) = f(x, \theta^0) + F^0(\theta - \theta^0), \qquad [5.13]$$

where $F^0$ is the $(n \times p)$ matrix of first derivatives of $f(x, \theta)$ with respect to the parameters, evaluated at the initial guess value $\theta^0$. The residual $y - f(x, \theta)$ in [5.12] is then approximated by the residual

$$y - f(x, \theta^0) - F^0(\theta - \theta^0) = y - f(x, \theta^0) + F^0\theta^0 - F^0\theta,$$

which is linear in $\theta$, and minimizing [5.12] can be accomplished by standard linear least squares where the response $y$ is replaced by the pseudo-response $y - f(x, \theta^0) + F^0\theta^0$ and the regressor matrix is given by $F^0$. Since the estimates we obtain from this approximated linear least squares problem depend on our choice of starting values $\theta^0$, the process cannot stop after just one update of the estimates. Call the estimates of this first fit $\theta^1$. Then we recalculate the new pseudo-response as $y - f(x, \theta^1) + F^1\theta^1$ and the new regressor matrix is $F^1$. This process continues until some convergence criterion is met, for example, until the relative change in residual sums of squares between two updates is minor. This approach to least squares fitting of the nonlinear model is termed the Gauss-Newton (GN) method of nonlinear least squares (see §A5.10.2 for more details).
A second, popular method of finding the minimum of [5.12] is the Newton-Raphson (NR) method. It is a generic method in the sense that it can be used to find the minimum of any function, not just a least squares objective function. Applied to the nonlinear least squares problem, the Newton-Raphson method differs from the Gauss-Newton method in the following way. Rather than approximating the model itself and substituting the approximation into the objective function [5.12], we approximate $S(\theta)$ directly by a second-order Taylor series and find the minimum of the resulting approximation (see §A5.10.4 for details). The NR method also requires initial guesses (starting values) of the parameters and is hence also iterative. Successive iterates $\hat\theta^{u}$ are calculated as

$$\text{Gauss-Newton:} \quad \hat\theta^{u+1} = \hat\theta^{u} + \delta_{GN} = \hat\theta^{u} + (F^{u\prime}F^{u})^{-1}F^{u\prime}r(\hat\theta^{u})$$
$$\text{Newton-Raphson:} \quad \hat\theta^{u+1} = \hat\theta^{u} + \delta_{NR} = \hat\theta^{u} + (F^{u\prime}F^{u} + A^{u})^{-1}F^{u\prime}r(\hat\theta^{u}). \qquad [5.14]$$

The matrix $A^{u}$ is defined in §A5.10.4 and $r(\hat\theta^{u}) = y - f(x, \hat\theta^{u})$ is the vector of fitted residuals after the $u$th iteration. When the process has successfully converged we call the converged iterate the nonlinear least squares estimate $\hat\theta$ of $\theta$.
!
The vector of starting values ) is supplied by the user and their determination is impor-
tant (§5.4.3). The closer the starting values are to the least squares estimate that minimizes
[5.12], the faster and more reliable the iterative algorithm will converge. There is no
guarantee that the GN and NR algorithms converge to the same estimates. They may, in fact,
not converge at all. The GN method in particular is notorious for failing to converge if the
starting values are chosen poorly, the residuals are large, and the Fw F matrix is ill-conditioned
(close to singular).
We described the GN and NR methods in their most basic form. Usually they are not implemented without some modifications. The GN algorithm, for example, does not guarantee that residual sums of squares between successive iterations decrease. Hartley (1961) proposed a modification of the basic Gauss-Newton step where the next iterate is calculated as

$$\hat\theta^{u+1} = \hat\theta^{u} + k\,\delta_{GN}, \quad k \in (0, 1), \qquad [5.15]$$

and $k$ is chosen to ensure that the residual sum of squares decreases between iterations. This is known as step-halving or step-shrinking. The GN method is also not a stable estimation method if the columns of $F$ are highly collinear, for the same reasons that ordinary least squares estimates are unstable if the columns of the regressor matrix $X$ are collinear (see §4.4.4 on collinearity) and hence $X'X$ is ill-conditioned. Nonlinear models are notorious for ill-conditioning of the $F'F$ matrix, which plays the role of the $X'X$ matrix in the approximate linear model of the GN algorithm. In particular, when parameters appear in exponents, derivatives with respect to different parameters contain similar functions. Consider the simple two-parameter nonlinear model

$$E[Y] = 1 - \beta\exp\{-x^{\theta}\} \qquad [5.16]$$

with $\beta = 0.5$, $\theta = 0.9$ (Figure 5.3). The derivatives are given by

$$\partial E[Y]/\partial\beta = -\exp\{-x^{\theta}\}$$
$$\partial E[Y]/\partial\theta = \beta\ln\{x\}x^{\theta}\exp\{-x^{\theta}\}.$$

[Figure 5.3 appears here: plot of $E[Y]$ (0.5 to 1.0) against $x$ (0.0 to 2.0).]

Figure 5.3. Nonlinear response function $E[Y] = 1 - \beta\exp\{-x^{\theta}\}$ with $\beta = 0.5$, $\theta = 0.9$.

Assume the covariate vector is $x = [0.2, 0.5, 0.7, 1.8]'$; the $F$ matrix becomes

$$F = \begin{bmatrix} -0.7906 & -0.1495 \\ -0.5815 & -0.1087 \\ -0.4841 & -0.0626 \\ -0.1832 & \phantom{-}0.0914 \end{bmatrix}.$$

The correlation coefficient between the two columns of $F$ is $0.9785$. Ridging (§4.4.5) the $F'F$ matrix is one approach to modifying the basic GN method to obtain more stable estimates. This modification is known as the Levenberg-Marquardt method (Levenberg 1944, Marquardt 1963).
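In proc nlin the Levenberg-Marquardt modification is requested through the method= option, and Hartley-type step shrinking through the smethod= option. A minimal sketch for model [5.16]; the data set MyData and its variables y and x, as well as the starting values, are placeholders:

proc nlin data=MyData method=marquardt smethod=halve;
   parameters beta=0.5 theta=0.9;        /* starting values for illustration */
   model y = 1 - beta*exp(-x**theta);    /* model [5.16]                     */
run;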
Calculating nonlinear parameter estimates by hand is a tedious exercise. Black (1993, p.
65) refers to it as the “drudgery connected with the actual fitting.” Fortunately, we can rely on
statistical computing packages to perform the necessary calculations and manipulations. We
do caution the user, however, that simply because a software package claims to be able to fit
nonlinear models does not imply that it can fit the models well. Among the features of a good
package we expect suitable modifications of several basic algorithms, grid searches over sets
of starting values, efficient step-halving procedures, the ability to apply ridge estimation
when the F matrix is poorly conditioned, explicit control over the type and strictness of the
convergence criterion, and automatic differentiation to free the user from having to specify
derivatives. These are just some of the features found in the nlin procedure of The SAS®
System. We now go through the "drudgery" of fitting a very simple, one-parameter nonlinear
model by hand, and then show how to apply the nlin procedure.



Example 5.2. The nonlinear model we are fitting by the GN method is a special case of model [5.16] with $\beta = 1$:

$$Y_i = 1 - \exp\{-x_i^{\theta}\} + e_i, \quad e_i \sim iid\,(0, \sigma^2), \quad i = 1, \dots, 4.$$

The data set consists of the response vector $y = [0.1, 0.4, 0.6, 0.9]'$ and the covariate vector $x = [0.2, 0.5, 0.7, 1.8]'$. The mean vector and the matrix (vector) of derivatives are given by the following:

$$f(x, \theta) = \begin{bmatrix} 1 - \exp\{-0.2^{\theta}\} \\ 1 - \exp\{-0.5^{\theta}\} \\ 1 - \exp\{-0.7^{\theta}\} \\ 1 - \exp\{-1.8^{\theta}\} \end{bmatrix}; \qquad F = \begin{bmatrix} \ln\{0.2\}\,0.2^{\theta}\exp\{-0.2^{\theta}\} \\ \ln\{0.5\}\,0.5^{\theta}\exp\{-0.5^{\theta}\} \\ \ln\{0.7\}\,0.7^{\theta}\exp\{-0.7^{\theta}\} \\ \ln\{1.8\}\,1.8^{\theta}\exp\{-1.8^{\theta}\} \end{bmatrix}.$$
As a starting value we select $\theta^0 = 1.3$. From [5.14] the first evaluation of the derivative matrix and the residual vector gives

$$r(\theta^0) = y - f(x, \theta^0) = \begin{bmatrix} 0.1 - 0.1161 \\ 0.4 - 0.3338 \\ 0.6 - 0.4669 \\ 0.9 - 0.8832 \end{bmatrix} = \begin{bmatrix} -0.0161 \\ \phantom{-}0.0662 \\ \phantom{-}0.1331 \\ \phantom{-}0.0168 \end{bmatrix}; \qquad F^0 = \begin{bmatrix} -0.1756 \\ -0.1875 \\ -0.1196 \\ \phantom{-}0.1474 \end{bmatrix}.$$

The first correction term is then $(F^{0\prime}F^0)^{-1}F^{0\prime}r(\theta^0) = 9.800 \times (-0.023) \approx -0.2258$, and the next iterate is $\hat\theta^1 = \theta^0 - 0.2258 = 1.0742$. Table 5.1 shows results of successive iterations with the GN method.

Table 5.1. Gauss-Newton iterations, $\theta^0 = 1.3$

 u   theta^u   F^u' (transposed)                     r^u' (transposed)                   (F^u'F^u)^-1   F^u'r^u   delta^u   S(theta^u)
 0   1.3000    [-0.1756, -0.1875, -0.1196, 0.1474]   [-0.0161, 0.0662, 0.1331, 0.0168]   9.8005         -0.0230   -0.2258   0.0226
 1   1.0742    [-0.2392, -0.2047, -0.1230, 0.1686]   [-0.0626, 0.0219, 0.1057, 0.0526]   7.0086          0.0064    0.0445   0.0184
 2   1.1187    [-0.2254, -0.2014, -0.1223, 0.1647]   [-0.0523, 0.0309, 0.1112, 0.0451]   7.4932         -0.0006   -0.0047   0.0181
 3   1.1140    [-0.2268, -0.2018, -0.1224, 0.1651]   [-0.0534, 0.0300, 0.1106, 0.0459]   7.4406          0.0001    0.0006   0.0181
 4   1.1146

(The vectors F^u and r^u are displayed transposed, as row vectors.)



After one iteration the derivative matrix $F$ and the residual vector $r$ have stabilized and exhibit little change in successive iterations. The initial residual sum of squares $S(\theta^0) = 0.0226$ decreases by $18\%$ in the first iteration and does not change after the third iteration. If convergence is measured as the (relative) change in $S(\theta)$, the algorithm is then considered converged.
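The hand iteration is easily scripted. A minimal sketch in SAS/IML (assuming the SAS/IML module is available; the fixed count of ten iterations stands in for a proper convergence test):

proc iml;
   y = {0.1, 0.4, 0.6, 0.9};                  /* response vector          */
   x = {0.2, 0.5, 0.7, 1.8};                  /* covariate vector         */
   theta = 1.3;                               /* starting value theta^0   */
   do u = 1 to 10;                            /* ten GN updates           */
      r = y - (1 - exp(-x##theta));           /* residual vector r(theta) */
      F = log(x)#(x##theta)#exp(-x##theta);   /* derivative vector        */
      theta = theta + inv(F`*F)*F`*r;         /* Gauss-Newton step [5.14] */
   end;
   print theta;                               /* approx. 1.1146, as in Table 5.1 */
quit;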

To fit this model using The SAS® System, we employ proc nlin. Prior to Release 6.12
of SAS® , proc nlin required the user to supply first derivatives for the Gauss-Newton
method and first and second derivatives for the Newton-Raphson method. Since
Release 6.12 of The SAS® System, an automatic differentiator is provided by the
procedure. The user supplies only the starting values and the model expression. The
following statements read the data set and fit the model using the default Gauss-Newton
algorithm. More sophisticated applications of the nlin procedure can be found in the
example applications (§5.8) and a more in-depth discussion of its capabilities and
options in §5.8.1. The statements
data Ex_51;
input y x @@;
datalines;
0.1 0.2 0.4 0.5 0.6 0.7 0.9 1.8
;;
run;
proc nlin data=Ex_51;
parameters theta=1.3;
model y = 1 - exp(-x**theta);
run;

produce Output 5.1. The Gauss-Newton method converged in six iterations to a residual sum of squares of $S(\hat\theta) = 0.018095$ from the starting value $\theta^0 = 1.3$. The converged iterate is $\hat\theta = 1.1146$ with an estimated asymptotic standard error $\mathrm{ese}(\hat\theta) = 0.2119$.

Output 5.1. The NLIN Procedure


Iterative Phase
Dependent Variable y
Method: Gauss-Newton
Sum of
Iter theta Squares
0 1.3000 0.0227
1 1.0742 0.0183
2 1.1187 0.0181
3 1.1140 0.0181
4 1.1146 0.0181
5 1.1146 0.0181
6 1.1146 0.0181
NOTE: Convergence criterion met.

Estimation Summary
Method Gauss-Newton
Iterations 6
R 2.887E-6
PPC(theta) 9.507E-7
RPC(theta) 7.761E-6
Object 4.87E-10
Objective 0.018095
Observations Read 4
Observations Used 4
Observations Missing 0



Output 5.1 (continued).
NOTE: An intercept was not specified for this model

Sum of Mean Asymptotic Approx


Source DF Squares Square F Value Pr > F
Regression 1 1.3219 1.3219 219.16 0.0007
Residual 3 0.0181 0.00603
Uncorrected Total 4 1.3400
Corrected Total 3 0.3400

Asymptotic
Standard Asymptotic 95% Confidence
Parameter Estimate Error Limits
theta 1.1146 0.2119 0.4401 1.7890

The parameters statement defines which quantities are parameters to be estimated and assigns starting values. The model statement defines the mean function $f(x_i; \theta)$ to be fitted to the response variable (y in this example). All quantities not defined in the parameters statement must be either constants defined through SAS® programming statements or variables to be found in the data set. Since x is neither defined as a parameter nor a constant, SAS® will look for a variable by that name in the data set. If, for example, one wants to fit the same model with a square-root transformed x, one can simply put
proc nlin data=Ex_51;
   parameters theta=1.3;
   z = sqrt(x);
   model y = 1 - exp(-z**theta);
run;

As is the case for a linear model, the method of least squares provides estimates for the parameters of the mean function but not for the residual variability. In the model $Y = f(x, \theta) + e$ with $e \sim (0, \sigma^2 I)$, an estimate of $\sigma^2$ is required for evaluating confidence intervals and test statistics. Appealing to linear model theory it is reasonable to utilize the residual sum of squares obtained at convergence. Specifically,

$$\hat\sigma^2 = \frac{1}{n-p}S(\hat\theta) = \frac{1}{n-p}\left(y - f(x, \hat\theta)\right)'\left(y - f(x, \hat\theta)\right) = \frac{1}{n-p}r(\hat\theta)'r(\hat\theta). \qquad [5.17]$$

Here, $p$ is the number of parameters and $\hat\theta$ is the converged iterate of $\theta$. If the model errors are Gaussian, $(n-p)\hat\sigma^2/\sigma^2$ is approximately Chi-square distributed with $n-p$ degrees of freedom. The approximation improves with sample size $n$ and is critical in the formulation of test statistics and confidence intervals. In Output 5.1 this estimate is shown in the analysis of variance table as the Mean Square of the Residual source, $\hat\sigma^2 = 0.00603$.
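The pieces of [5.17] can be read off Output 5.1 directly for Example 5.2: with $n = 4$, $p = 1$, and $S(\hat\theta) = 0.018095$,

$$\hat\sigma^2 = \frac{0.018095}{4 - 1} = 0.00603,$$

which is the Residual Mean Square in the analysis of variance table.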

5.4.2 Tracking Convergence

Since fitting algorithms for nonlinear models are iterative, some criterion must be employed to determine when iterations can be halted. The objective of nonlinear least squares estimation is to minimize a sum of squares criterion, and one can, for example, monitor the residual sum of squares $S(\hat\theta^{u})$ between iterations. In the unmodified Gauss-Newton algorithm it is not guaranteed that the residual sum of squares in the $u$th iteration is less than the residual sum of squares in the previous iteration; therefore, this criterion is dangerous. Furthermore, we note that one should not use absolute convergence criteria such as the change in the parameter estimates between iterations, since they depend on the scale of the estimates. That the largest absolute change in a parameter estimate from one iteration to the next is only $0.0001$ does not imply that changes are sufficiently small. If the current estimate of that parameter is $0.0002$, the parameter estimate has changed by $50\%$ between iterations.
Tracking changes in the residual sum of squares and changes in the parameter estimates monitors different aspects of the algorithm. A small relative change in $S(\hat\theta)$ indicates that the sum of squares surface near the current iterate is relatively flat. A small relative change in the parameter estimates implies that a small increment of the estimates can be tolerated (Himmelblau 1972). Bates and Watts (1981) drew attention to the fact that convergence criteria are not just termination criteria. They should indicate that a global minimum of the sum of squares surface has been found, not merely that the iterative algorithm is lacking progress. Bates and Watts (1981) also point out that computation should not be halted when the relative accuracy of the parameter estimates seems adequate. The variability of the estimates should also be considered. A true measure of convergence according to Bates and Watts (1981) is based on the projection properties of the residual vector. Their criterion is

$$\sqrt{\frac{n-p}{p\,S(\hat\theta^{u})}\;r(\hat\theta)'F(F'F)^{-1}F'r(\hat\theta)}, \qquad [5.18]$$

and iterations are halted when this measure is less than some number $\phi$. The nlin procedure of The SAS® System implements the Bates and Watts criterion as the default convergence criterion with $\phi = 10^{-5}$.
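The benchmark can be tightened or relaxed through the converge= option of proc nlin. A minimal sketch, reusing the data and model of Example 5.2 (the value $10^{-8}$ is arbitrary):

proc nlin data=Ex_51 converge=1e-8;   /* stricter Bates-Watts benchmark */
   parameters theta=1.3;
   model y = 1 - exp(-x**theta);
run;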
The sum of squares surface in linear models with a full-rank $X$ matrix has a unique minimum, the values at the minimum being the least squares estimates (Figure 5.4). In nonlinear models the surface can be considerably more complicated, with long, elongated valleys (Figure 5.5) or multiple local minima (Figure 5.6). When the sum of squares surface has multiple extrema, the iterative algorithm may be trapped in a region from which it cannot escape. If the surface has long, elongated valleys, it may require a large number of iterations to locate the minimum. The sum of squares surface is a function of the model and the data, and reparameterization of the model (§5.7) can have tremendous impact on its shape. Well-chosen starting values (§5.4.3) help in the resolution of convergence problems.

To protect against the possibility that a local rather than a global minimum has been found, we recommend starting the nonlinear algorithm with sufficiently different sets of starting values. If they converge to the same estimates, it is reasonable to assume that the sum of squares surface has a global minimum at these values. Good implementations of modified algorithms, such as Hartley's modified Gauss-Newton method (Hartley 1961), can improve the convergence behavior if starting values are chosen far from the solution but cannot guarantee convergence to a global minimum.
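In practice this amounts to simply refitting from deliberately different starts and comparing the converged iterates. A minimal sketch with the data of Example 5.2 (the two starting values are arbitrary, chosen on either side of the solution):

proc nlin data=Ex_51;
   parameters theta=0.2;              /* start well below the solution */
   model y = 1 - exp(-x**theta);
run;
proc nlin data=Ex_51;
   parameters theta=3.0;              /* start well above the solution */
   model y = 1 - exp(-x**theta);
run;

If both runs converge to the same estimate ($\hat\theta = 1.1146$ here) with the same residual sum of squares, a global minimum at that value is plausible.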



Example 5.3. A simple linear regression model

$$E[Y_i] = \beta_0 + \beta_1 x_i^3, \quad i = 1, \dots, 3,$$

is fitted to responses $y = [-0.1, 5, -0.2]'$ and data matrix

$$X = \begin{bmatrix} 1 & -1.1447 \\ 1 & \phantom{-}0.0 \\ 1 & \phantom{-}1.13388 \end{bmatrix}.$$

A surface contour of the least squares objective function $S(\beta) = \sum_{i=1}^{3}(y_i - \beta_0 - \beta_1 x_i^3)^2$ shows elliptical contours with a single minimum $S(\hat\beta) = 17.295$ achieved by $\hat\beta = [1.63408, -0.22478]'$ (Figure 5.4). If a Gauss-Newton or Newton-Raphson algorithm is used to estimate the parameters of this linear model, either method will find the least squares estimates with a single update, regardless of the choice of starting values.
[Figure 5.4 appears here: elliptical contours of $S(\beta)$ in the $(\beta_0, \beta_1)$ plane, with contour levels from about 17.3 to 20.6 around the minimum $S(\hat\beta)$.]

Figure 5.4. Residual sum of squares surface contours.


[Figure 5.5 appears here: sum of squares contours (levels 11.4 to 11.9) forming a long, elongated valley.]

Figure 5.5. Sum of squares contour of model $1/(\alpha + \beta x)$ with elongated valley.





[Figure 5.6 appears here: sum of squares contours (levels 1.9 to 3.6) in the $(\beta, \alpha)$ plane showing two local minima.]

Figure 5.6. Residual sum of squares contour for model $\alpha\exp\{-\beta x\}$ with two local minima. Adapted from Figure 3.1 in Seber, G.A.F. and Wild, C.J. (1989) Nonlinear Regression. Wiley and Sons, New York. Copyright © 1989 John Wiley and Sons, Inc. Reprinted by permission of John Wiley and Sons, Inc.

5.4.3 Starting Values

A complete algorithm for fitting a nonlinear model to data involves starting values to get the iterative process under way, numerical rules for obtaining a new iterate from previous ones, and a stopping rule indicating convergence of the iterative process. Although starting values initiate the process, we discuss their importance and methods for finding good starting values at this point, after the reader has gained appreciation for the difficulties one may encounter during iterations. These problems are amplified by poorly chosen starting values. When initial values are far from the solution, iterations may diverge, in particular with unmodified Gauss-Newton and Newton-Raphson algorithms. A nonlinear problem may have multiple roots (solutions), and poorly chosen starting values may lead to a local instead of a global minimum. Seber and Wild (1989, p. 665) convey that “the optimization methods themselves tend to be far better and more efficient at seeking a minimum than the various ad hoc procedures that are often suggested for finding starting values.” This having been said, well-chosen starting values will improve convergence, reduce the need for numerical manipulations, and speed up fitting of nonlinear models. While at times only wild guesses can be mustered, several techniques are available to determine starting values. Surprisingly, there are only a few constructive theoretical results about choosing initial values. Most methods have an ad-hoc character; those discussed here have been found to work well in practice.

Graphing Data

One of the simplest methods to determine starting values is to discern reasonable values for $\theta$ from a scatterplot of the data. A popular candidate for fitting growth data, for example, is the four-parameter logistic model. It can be parameterized in the following form,


$$E[Y_i] = \delta + \frac{\alpha}{1 + \exp\{\beta - \gamma x_i\}}, \qquad [5.19]$$

where $\delta$ and $(\alpha + \delta)$ are lower and upper asymptotes, respectively, and the inflection point is located at $x^* = \beta/\gamma$ (Figure 5.7). Furthermore, the slope of the logistic function at the inflection point is a function of $\alpha$ and $\gamma$, $\partial f/\partial x|_{x^*} = \alpha\gamma/2$.

[Figure 5.7 appears here: plot of the logistic function $E[Y_i] = 5 + 10/(1 + \exp\{-6.4 + 1.1x_i\})$ over $0 \le x \le 10$, with annotations marking the upper asymptote $\delta + \alpha = 15$, the lower asymptote $\delta = 5$, and the inflection point at $\beta/\gamma = 5.818$.]

Figure 5.7. Four-parameter logistic model with parameters $\delta = 5$, $\alpha = 10$, $\beta = -6.4$, $\gamma = -1.1$.

Consider having to determine starting values for a logistic model with the data shown in Figure 5.8. The starting values for the lower and upper asymptotes could be $\delta^0 = 5$, $\alpha^0 = 9$. The inflection point occurs approximately at $x = 6$, hence $\beta^0/\gamma^0 = 6$, and the slope at the inflection point is about $-3$. Solving the equations $\alpha^0\gamma^0/2 = -3$ and $\beta^0/\gamma^0 = 6$ for $\gamma^0$ and $\beta^0$ yields the starting values $\gamma^0 = -0.66$ and $\beta^0 = -4.0$.

[Figure 5.8 appears here: scatterplot of $Y$ (about 4 to 15) against $x$ (0 to 10) for the growth study.]

Figure 5.8. Observed data points in growth study.





With proc nlin of The SAS® System starting values are assigned in the parameters
statement. The code
proc nlin data=Fig5_8;
parameters delta=5 alpha=9 beta=-4.0 gamma=-0.66;
model y = delta + alpha/(1+exp(beta-gamma*x));
run;

invokes the modified Gauss-Newton algorithm (the default), and convergence is achieved after seven iterations. Although the converged estimates

$$\hat\theta = [5.3746, 9.1688, -6.8014, -1.1621]'$$

are not too far from the starting values $\theta^0 = [5.0, 9.0, -4.0, -0.66]'$, the initial residual sum of squares $S(\theta^0) = 51.6116$ is more than twice the final sum of squares $S(\hat\theta) = 24.3403$ (Output 5.2). The nlin procedure uses the Bates and Watts (1981) criterion [5.18] to track convergence with a default benchmark of $10^{-5}$. The criterion achieved when iterations were halted is shown as R in the Estimation Summary.

Output 5.2. The NLIN Procedure


Iterative Phase
Dependent Variable y
Method: Gauss-Newton
Sum of
Iter delta alpha beta gamma Squares
0 5.0000 9.0000 -4.0000 -0.6600 51.6116
1 6.7051 7.0023 -7.3311 -1.2605 43.0488
2 5.3846 9.1680 -6.5457 -1.1179 24.4282
3 5.3810 9.1519 -6.8363 -1.1675 24.3412
4 5.3736 9.1713 -6.7944 -1.1610 24.3403
5 5.3748 9.1684 -6.8026 -1.1623 24.3403
6 5.3746 9.1689 -6.8012 -1.1621 24.3403
7 5.3746 9.1688 -6.8014 -1.1621 24.3403
NOTE: Convergence criterion met.

Estimation Summary
Method Gauss-Newton
Iterations 7
R 6.353E-6
PPC(beta) 6.722E-6
RPC(beta) 0.000038
Object 1.04E-9
Objective 24.34029
Observations Read 30
Observations Used 30
Observations Missing 0

Sum of Mean Approx


Source DF Squares Square F Value Pr > F
Regression 4 3670.4 917.6 136.97 <.0001
Residual 26 24.3403 0.9362
Uncorrected Total 30 3694.7
Corrected Total 29 409.0

Approx Approximate 95% Confidence


Parameter Estimate Std Error Limits
delta 5.3746 0.5304 4.2845 6.4648
alpha 9.1688 0.7566 7.6137 10.7240
beta -6.8014 1.4984 -9.8813 -3.7215
gamma -1.1621 0.2592 -1.6950 -0.6293



Grid Search
A grid search allows the numerical optimization method to evaluate S(θ⁰) for more than one
set of starting values. The actual iterative part of model fitting then commences with the set of
starting values that produced the smallest sum of squares among the grid values. Initial grid
searches are no substitute for restarting the iterations with different sets of starting values.
The latter procedure is recommended to ensure that the converged iterate is indeed a global,
not a local, minimum.
The SAS® statements
proc nlin data=Fig5_8;
   parameters delta=3 to 6 by 1          /* 4 grid values for delta */
              alpha=7 to 9 by 1          /* 3 grid values for alpha */
              beta=-6 to -3.0 by 1       /* 4 grid values for beta  */
              gamma=-1.5 to -0.5 by 0.5; /* 3 grid values for gamma */
   model y = delta + alpha/(1+exp(beta-gamma*x));
run;

fit the four-parameter logistic model to the data in Figure 5.8. The residual sum of squares is
evaluated at 4 × 3 × 4 × 3 = 144 parameter combinations. The best initial combination is the
set of values that produces the smallest residual sum of squares. This turns out to be
θ⁰ = [5, 9, −6, −1]′ (Output 5.3). The algorithm converges to the same estimates as above
but this time requires only five iterations.

Output 5.3. The NLIN Procedure


Grid Search
Dependent Variable y
Sum of
delta alpha beta gamma Squares
3.0000 7.0000 -6.0000 -1.5000 867.2
4.0000 7.0000 -6.0000 -1.5000 596.6
5.0000 7.0000 -6.0000 -1.5000 385.9
6.0000 7.0000 -6.0000 -1.5000 235.3
3.0000 8.0000 -6.0000 -1.5000 760.4
... and so forth ...
6.0000 8.0000 -6.0000 -1.0000 36.7745
3.0000 9.0000 -6.0000 -1.0000 189.8
4.0000 9.0000 -6.0000 -1.0000 79.9745
5.0000 9.0000 -6.0000 -1.0000 30.1425
6.0000 9.0000 -6.0000 -1.0000 40.3106
... and so forth ...
5.0000 9.0000 -3.0000 -0.5000 82.4638
6.0000 9.0000 -3.0000 -0.5000 86.0169

The NLIN Procedure


Iterative Phase
Dependent Variable y
Method: Gauss-Newton
Sum of
Iter delta alpha beta gamma Squares
0 5.0000 9.0000 -6.0000 -1.0000 30.1425
1 5.4329 9.0811 -6.8267 -1.1651 24.3816
2 5.3729 9.1719 -6.7932 -1.1607 24.3403
3 5.3748 9.1684 -6.8026 -1.1623 24.3403
4 5.3746 9.1689 -6.8011 -1.1621 24.3403
5 5.3746 9.1688 -6.8014 -1.1621 24.3403
NOTE: Convergence criterion met.
Remainder of Output as in Output 5.2.





Elimination of Linear Parameters


Some statistical models contain both linear and nonlinear parameters, for example

    E[Y_i] = β₀ + β₁x_i + β₂z_i^θ.

Once θ is fixed, the model is linear in β = [β₀, β₁, β₂]′. A common model to relate yield per
plant Y to plant density x is due to Bleasdale and Nelder (1960),

    E[Y] = (α + βx)^{−1/θ}.

A special case of this model is the Shinozaki-Kira model with θ = 1 (Shinozaki and Kira
1956). Starting values for the Bleasdale-Nelder model can be found by setting θ = 1 and ob-
taining initial values for α and β from a simple linear regression 1/Y_i = α + βx_i. Once star-
ting values for all parameters have been found, the model is fit in nonlinear form.
    The Mitscherlich model is popular in agronomy to express crop yield Y as a function of
the availability of a nutrient x. One of the many parameterizations of the Mitscherlich model
(see §5.7, §A5.9.1 on parameterizations of the Mitscherlich model and §5.8.1 for an applica-
tion) is

    E[Y] = α(1 − exp{−b(x − x₀)}),

where α is the upper yield asymptote, b is related to the rate of change, and x₀ is the nutrient
concentration at which mean yield is 0. A starting value α⁰ can be found from a graph of the
data as the plateau yield. Then the relationship can be re-expressed by taking logarithms as

    ln{α⁰ − Y} = ln{α} + bx₀ − bx = β₀ + β₁x,

which is a simple linear regression with response ln{α⁰ − Y}, intercept β₀ = ln{α} + bx₀,
and slope β₁ = −b. The ordinary least squares estimate of b serves as the starting value b⁰.
Finally, if the yield without any addition of nutrient (e.g., the zero fertilizer control) is y*, a
starting value for x₀ is (1/b⁰)ln{1 − y*/α⁰}.
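A minimal sketch of this linearization device in SAS; the data set name mitsch, its variables
x and yield, and the plateau guess alpha0 = 120 are hypothetical placeholders:

    data linearized;
       set mitsch;
       alpha0 = 120;                  /* plateau yield read from a scatterplot */
       if yield < alpha0 then
          lny = log(alpha0 - yield);  /* response of the linearized model      */
    run;
    proc reg data=linearized;
       model lny = x;                 /* slope estimate is -b, so b0 = -slope  */
    run;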

Reparameterization
It can be difficult to find starting values for parameters that have an unrestricted range (−∞,
∞). On occasion, only the sign of the parameter value can be discerned. In these cases it
helps to modify the parameterization of the model. Instead of the unrestricted parameter θ
one can fit, for example, α = 1/(1 + exp{−θ}), which is constrained to range from zero to
one. Specifying a parameter in this range may be simpler. Once a reasonable estimate for α
has been obtained, one can change the parameterization back to the original state and use
θ⁰ = ln{α⁰/(1 − α⁰)} as initial value.
    A reparameterization technique advocated by Ratkowsky (1990, Sec. 2.3.1) makes
finding starting values particularly simple. The idea is to rewrite a given model in terms of its
expected value parameters (see §5.7.1). They correspond to predicted values at selected
values (x*) of the regressor. From a scatterplot of the data one can then estimate E[Y|x*] by
visual inspection. Denote this expected value by μ*. Set the expectation equal to f(x*, θ) and
replace one of the elements of θ. We illustrate with an example.



In biochemical applications the Michaelis-Menten model is popular to describe chemical
reactions in enzyme systems. It can be written in the form

    E[Y] = Vx/(x + K),                                                 [5.20]

where Y is the velocity of the chemical reaction and x is the substrate concentration. V and
K are parameters of the model, measuring the theoretical maximum velocity (V) and the sub-
strate concentration at which velocity V/2 is attained (K). Assume that no prior knowledge
of the potential maximum velocity is available. From a scatterplot of Y vs. X the average
velocity is estimated to be μ* at substrate concentration x*. Hence, if X = x*, the expected
reaction velocity is

    μ* = Vx*/(x* + K).

Solving this expression for V, the parameter which is difficult to specify initially, leads to

    V = μ*(x* + K)/x*.

After substituting this expression back into [5.20] and some algebraic manipulations the re-
parameterized model becomes

    E[Y] = [μ*(x* + K)/x*] x/(x + K) = μ* x(x* + K) / (x*(x + K)).     [5.21]

This is a two-parameter model (μ*, K) as was the original model. The process can be repeated by
choosing another value x**, its expected value parameter μ**, and replacing the parameter K.
    Changing the parameterization of a nonlinear model to expected value parameters not
only simplifies finding starting values, but also improves the statistical properties of the esti-
mators (see §5.7.1 and the monographs by Ratkowsky 1983, 1990). A drawback of working
with expected value parameters is that not all parameters can be replaced with their expected
value equivalents, since the resulting system of equations may not have analytic solutions.
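A sketch of how the reparameterized model [5.21] is coded in proc nlin; the data set enzyme,
its variables x and y, the chosen value xstar = 2, and the starting values are all hypothetical:

    proc nlin data=enzyme;
       parameters mustar=8 K=3;   /* mustar = guessed E[Y] at x = xstar */
       xstar = 2;                 /* selected regressor value           */
       model y = mustar*x*(xstar + K)/(xstar*(x + K));
    run;

Because mustar is read directly off the scatterplot, its starting value comes essentially for free.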

Finding Numeric Solutions


This method is similar in spirit to the method of expected value parameters and was suggested
by Gallant (1975). For each of the p parameters of the model choose a point (x_j, y_j). This can
be an observed data point or an estimate of the average of Y at x_j. Equate
y_j = f(x_j, θ); j = 1, ..., p and solve the system of nonlinear equations. For the Michaelis-Menten
model [5.20] the system comprises two equations:

    y₁ = Vx₁/(x₁ + K)  ⟺  V = y₁(x₁ + K)/x₁
    y₂ = Vx₂/(x₂ + K)  ⟺  K = Vx₂/y₂ − x₂.

Substituting the expression for K into the first equation and simplifying yields

    V = y₁(1 − x₂/x₁) / (1 − y₁x₂/(y₂x₁))

and substituting this expression into K = Vx₂/y₂ − x₂ yields

    K = (y₁/y₂)(1 − x₂/x₁) / (1/x₂ − y₁/(y₂x₁)) − x₂.

This technique requires that the system of nonlinear equations can be solved and is a special
case of a more general method proposed by Hartley and Booker (1965). They divide the n
observations into p sets of m observations each, where p is the number of parameters and x_hk
(h = 1, ..., p; k = 1, ..., m) are the covariate values. Then the system of nonlinear equations

    ȳ_h = (1/m) Σ_{k=1}^{m} f(x_hk, θ)

is solved, where ȳ_h = (1/m) Σ_{k=1}^{m} y_hk. Gallant's method is a special case where m = 1 and one
selects p representative points from the data.
In practical applications, the various techniques for finding starting values are often
combined. Initial values for some parameters are determined graphically, others are derived
from expected value parameterization or subject-matter considerations, yet others are entirely
guessed.
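A minimal sketch of Gallant's device for the Michaelis-Menten model, computing starting
values from two hand-picked points; the numeric values of the points are hypothetical:

    data _null_;
       x1 = 0.5;  y1 = 4.2;       /* first representative point (x1, y1)  */
       x2 = 4.0;  y2 = 9.1;       /* second representative point (x2, y2) */
       V = y1*(1 - x2/x1)/(1 - y1*x2/(y2*x1));  /* solved 2x2 system      */
       K = V*x2/y2 - x2;
       put V= K=;                 /* starting values V0 and K0            */
    run;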

Example 5.4. Gregoire and Schabenberger (1996b) model stem volume in 336 yellow
poplar (Liriodendron tulipifera L.) trees as a function of the relative diameter

    t_ij = d_ij/D_j,

where D_j is the stump diameter of the jth tree and d_ij is the diameter of tree j measured
at the ith location along the bole (these data are visited in §8.4). At the tip of the tree
t_ij = 0 and directly above ground t_ij = 1. The measurements along the bole were
spaced 1.2 meters apart. The authors selected a volume-ratio model to describe the
accumulation of volume with decreasing diameter (increasing height above ground):

    V_ij = (β₀ + β₁x_j) exp{−β₂ t_ij e^{β₃t_ij}/1000} + e_ij.          [5.22]

Here x_j = diameter at breast height squared times total tree height for the jth tree. This
model consists of a linear part (β₀ + β₁x_j) representing the total volume of a tree and a
multiplicative reduction term

    R(β₂, β₃, t_ij) = exp{−β₂ t_ij e^{β₃t_ij}/1000},

which by virtue of the parameterization and 0 ≤ t_ij ≤ 1 is constrained between

    exp{−β₂ e^{β₃}/1000} ≤ R(β₂, β₃, t) ≤ 1.

To find starting values for β = [β₀, β₁, β₂, β₃]′, the linear component was fit to the total
tree volumes of the 336 trees,

    V_j = β₀ + β₁x_j + e_j    (j = 1, ..., 336),

and the linear least squares estimates β̂₀ and β̂₁ were used as starting values for β₀ and
β₁. For the reduction term R(β₂, β₃, t) one must have β₂ > 0, β₃ > 0; otherwise the
term does not shrink toward 0 with increasing relative diameter t. Plotting R(β₂, β₃, t)
for a variety of parameter values indicated that especially β₂ is driving the shape of the
reduction term and that values of β₃ between 5 and 8 had little effect on the reduction
term. To derive a starting value for β₂, its expected value parameter can be computed.
Solving

    μ* = exp{−β₂ t* e^{β₃t*}/1000}

for β₂ yields

    β₂ = −(1000/t*) ln{μ*} e^{−β₃t*}.

Examining graphs of the tree profiles, it appeared that about 90% of the total tree
volume was accrued on average at a height where the trunk diameter had decreased by
50%, i.e., at t = 0.5. With a guesstimate of β₃ = 7.5, the starting value for β₂ with
t* = 0.5 and μ* = 0.9 is β₂⁰ = 4.96.
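The last computation is a one-liner in a data step, with the values as read from the tree
profile graphs:

    data _null_;
       tstar = 0.5;  mustar = 0.9;  beta3 = 7.5;  /* graphical guesses */
       beta2 = -(1000/tstar)*log(mustar)*exp(-beta3*tstar);
       put beta2=;                                /* prints beta2=4.96 */
    run;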

5.4.4 Goodness-of-Fit
The most frequently used goodness-of-fit (g-o-f) measure in the classical linear model

    Y_i = β₀ + Σ_{j=1}^{k−1} β_j x_ji + e_i

is the simple (or multiple) coefficient of determination,

    R² = SSM_m/SST_m = Σ_{i=1}^{n}(ŷ_i − ȳ)² / Σ_{i=1}^{n}(y_i − ȳ)².  [5.23]

The appeal of the R² statistic is that it ranges between 0 and 1 and has an immediate interpre-
tation as the proportion of variability in Y jointly explained by the regressor variables. Notice
that this is a proportion of variability in Y about its mean Ȳ since in the absence of any
regressor information one would naturally predict E[Y_i] with the sample mean. Kvålseth
(1985) reviews alternative definitions of R². For example,

    R² = 1 − Σ_{i=1}^{n}(y_i − ŷ_i)² / Σ_{i=1}^{n}(y_i − ȳ)² = 1 − SSR/SST_m = (SST_m − SSR)/SST_m.  [5.24]

In the linear model with intercept term, the two R² statistics [5.23] and [5.24] are identical.
[5.23] contains mean adjusted quantities, and for models not containing an intercept other R²-
type measures have been proposed. Kvålseth (1985) mentions

    R²_noint = 1 − Σ_{i=1}^{n}(y_i − ŷ_i)² / Σ_{i=1}^{n}y_i²    and    R*²_noint = Σ_{i=1}^{n}ŷ_i² / Σ_{i=1}^{n}y_i².

The R²_noint statistics are appropriate only if the model does not contain an intercept and ȳ is
zero. Otherwise one may obtain misleading results. Nonlinear models do not contain an inter-
cept in the typical sense and care must be exercised to select an appropriate goodness-of-fit
measure. Also, [5.23] and [5.24] do not give identical results in the nonlinear case. [5.23] can
easily exceed 1 and [5.24] can conceivably be negative. The key difficulty is that the decom-
position

    Σ_{i=1}^{n}(y_i − ȳ)² = Σ_{i=1}^{n}(ŷ_i − ȳ)² + Σ_{i=1}^{n}(y_i − ŷ_i)²

no longer holds in nonlinear models. Ratkowsky (1990, p. 44) feels strongly that the danger
of misinterpreting R² in nonlinear models is too great to rely on such measures and recom-
mends basing goodness-of-fit decisions in a nonlinear model with p parameters on the mean
square error

    s² = (1/(n − p)) Σ_{i=1}^{n}(y_i − ŷ_i)² = SSR/(n − p).            [5.25]

The mean square error is a useful goodness-of-fit statistic because it combines a measure of
closeness between data and fit (SSR) with a penalty term (n − p) to prevent overfitting.
Including additional parameters will decrease SSR and the denominator; s² may increase if
the added parameter does not improve the model fit. But a decrease of s² is not necessarily
indicative of a statistically significant improvement of the model. If the models with and
without the additional parameter(s) are nested, the sum of squares reduction test addresses the
level of significance of the improvement. The usefulness of s² notwithstanding, we feel that a
reasonable R²-type statistic can be applied and recommend the statistic [5.24] as a goodness-
of-fit measure in linear and nonlinear models. In the former, it yields the coefficient of deter-
mination provided the model contains an intercept term, and is also meaningful in the no-
intercept model. In nonlinear models it cannot exceed 1 and avoids a serious pitfall of [5.23].
Although this statistic can take on negative values in nonlinear models, this has never
happened in our experience and usually the statistic is bounded between 0 and 1. A negative
value would indicate a serious problem with the considered model. Since the possibility of
negative values exists theoretically, we do not term it an R² statistic in nonlinear models to
avoid interpretation in terms of the proportion of total variation of Y about its mean
accounted for by the model. We term it instead

    Pseudo-R² = 1 − Σ_{i=1}^{n}(y_i − ŷ_i)² / Σ_{i=1}^{n}(y_i − ȳ)² = 1 − SSR/SST_m.  [5.26]

Because the additive sum of squares decomposition does not hold in nonlinear models,
Pseudo-R² should not be interpreted as the proportion of variability explained by the model.
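proc nlin does not print Pseudo-R², but it is easily assembled from the quantities in [5.26].
A sketch that computes it from an output data set containing the response y and the predicted
values pred (such as the nlinout data set created in §5.5.2):

    proc means data=nlinout noprint;
       var y;
       output out=stats css=sstm;   /* corrected total SS of y */
    run;
    data _null_;
       if _n_ = 1 then set stats;
       set nlinout end=last;
       ssr + (y - pred)**2;         /* accumulate residual SS  */
       if last then do;
          pseudo_r2 = 1 - ssr/sstm;
          put pseudo_r2=;
       end;
    run;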

5.5 Hypothesis Tests and Confidence Intervals

5.5.1 Testing the Linear Hypothesis


Statistical inference in nonlinear models is based on the asymptotic distribution of the param-
eter estimators θ̂ and σ̂². For any fixed sample size inferences are thus only approximate,
even if the model errors are perfectly Gaussian-distributed. Exact nonlinear inferences (e.g.,
Hartley 1964) are typically not implemented in statistical software packages. In this section
we discuss various approaches to the testing of hypotheses about θ and the construction of
confidence intervals for parameters, as well as confidence and prediction intervals for the
mean response and new data points. The tests are straightforward extensions of similar tests
and procedures for linear models (§A4.8.2), but in contrast to the linear case, where the same
test can be formulated in different ways, in nonlinear models the equivalent expressions will
lead to tests with different properties. For example, a Wald test of H₀: Aβ = d in the linear
model and the sum of squares reduction test of this hypothesis are equivalent. The sum of
squares reduction test equivalent in the nonlinear model does not produce the same test
statistic as the Wald test.
    Consider the general nonlinear model Y = f(x, θ) + e, where errors are uncorrelated with
constant variance, e ~ (0, σ²I). The nonlinear least squares estimator θ̂ has an asymptotic
Gaussian distribution with mean θ and variance-covariance matrix σ²(F′F)⁻¹. Notice that
even if the errors e are Gaussian, the least squares estimator is not. Contrast this with the
linear model Y = Xβ + e, e ~ G(0, σ²I), where the ordinary least squares estimator

    β̂ = (X′X)⁻¹X′Y

is Gaussian with mean β and variance-covariance matrix σ²(X′X)⁻¹. The derivative matrix F
in the nonlinear model plays a role akin to the X matrix in the linear model, but in contrast to
the linear model, the F matrix is not known, unless θ is known. The derivatives of a nonlinear
model depend, by definition, on the unknown parameters. To estimate the standard error of
the nonlinear parameter θ_j, we extract c^j, the jth diagonal element of the (F′F)⁻¹ matrix.
Similarly, in the linear model we extract d^j, the jth diagonal element of (X′X)⁻¹. The
estimated standard errors for θ_j and β_j, respectively, are

    Linear model:     se(β̂_j) = √(σ²d^j)      ese(β̂_j) = √(σ̂²d^j)       d^j known

    Nonlinear model:  ase(θ̂_j) = √(σ²c^j)     ease(θ̂_j) = √(σ̂²ĉ^j)      c^j unknown.

The differences between the two cases are subtle. √(σ²d^j) is the standard error of β̂_j in the
linear model, but √(σ²c^j) is only the asymptotic standard error (ase) of θ̂_j in the nonlinear
case. Calculating estimates of these quantities requires substituting an estimate of σ² in the
linear model, whereas in the nonlinear case we also need to estimate the unknown c^j. This
estimate is found by evaluating F at the converged iterate and extracting ĉ^j as the jth diagonal
element of the (F̂′F̂)⁻¹ matrix. We use ease(θ̂_j) to denote the estimated asymptotic standard
error of the parameter estimator.
    Nonlinearity also affects the distributional properties of σ̂², not only those of the θ̂_j. In
the linear model with Gaussian errors where X has rank p, (n − p)σ̂²/σ² is a Chi-squared
random variable with n − p degrees of freedom. In a nonlinear model with p parameters,
(n − p)σ̂²/σ² is only approximately a Chi-squared random variable.
    If sample size is sufficiently large we can rely on the asymptotic results and an approxi-
mate (1 − α)100% confidence interval for θ_j can be calculated as

    θ̂_j ± z_{α/2} ease(θ̂_j) = θ̂_j ± z_{α/2} σ̂√ĉ^j.

Because we do not use ase(θ̂_j), but the estimated ase(θ̂_j), it is reasonable to use instead confi-
dence intervals based on a t-distribution with n − p degrees of freedom, rather than intervals
based on the Gaussian distribution. If

    (θ̂_j − θ_j)/ase(θ̂_j)

is a standard Gaussian variable, then

    (θ̂_j − θ_j)/ease(θ̂_j)

can be treated as a t_{n−p} variable. As a consequence, an α-level test for H₀: θ_j = d vs.
H₁: θ_j ≠ d compares

    t_obs = (θ̂_j − d)/ease(θ̂_j)                                       [5.27]

against the α/2 cutoff of a t_{n−p} distribution. H₀ is rejected if |t_obs| > t_{α/2,n−p} or,
equivalently, if the (1 − α)100% confidence interval

    θ̂_j ± t_{α/2,n−p} ease(θ̂_j) = θ̂_j ± t_{α/2,n−p} σ̂√ĉ^j            [5.28]

does not contain d. These intervals are calculated by the nlin procedure with α = 0.05 for
each parameter of the model by default.
    The simple hypothesis H₀: θ_j = d is a special case of a linear hypothesis (see also
§A4.8.2). Consider we wish to test whether



    Aθ = d.

In this constraint on the (p × 1) parameter vector θ, A is a (q × p) matrix of rank q. The non-
linear equivalent of the sum of squares reduction test consists of fitting the full model to
obtain its residual sum of squares S(θ̂)_f. Then the constraint Aθ = d is imposed on the
model and the sum of squares of this reduced model, S(θ̂)_r, is obtained. The test statistic

    F_obs^(2) = [{S(θ̂)_r − S(θ̂)_f}/q] / [S(θ̂)_f/(n − p)]             [5.29]

has an approximate F distribution with q numerator and n − p denominator degrees of free-
dom. If the model errors are Gaussian-distributed, F^(2) is also a likelihood ratio test statistic.
We designate the test statistic as F^(2) since there is another statistic that could be used to test
the same hypothesis. It is based on the Wald statistic

    F_obs^(1) = (Aθ̂ − d)′[A(F′F)⁻¹A′]⁻¹(Aθ̂ − d)/(qσ̂²).               [5.30]

The asymptotic distribution of [5.30] is that of a Chi-squared random variable since
asymptotically σ̂² converges to a constant, not a random variable. For fixed sample size it has
been found that the distribution of [5.30] is better approximated by an F distribution with q
numerator and n − p denominator degrees of freedom, however. We thus use the same
approximate distribution for [5.29] and [5.30]. The fact that the F distribution is a better
approximate distribution than the asymptotic Chi-squared is also justification for using t-
based confidence intervals rather than Gaussian confidence intervals. In contrast to the linear
model, where F_obs^(1) and F_obs^(2) are identical, the statistics differ in a nonlinear model. Their
relative merits are discussed in §A5.10.6. In the special case H₀: θ_j = 0, where A is a (1 × p)
vector of zeros with a 1 in the jth position and d = 0, F_obs^(1) reduces to

    F_obs^(1) = θ̂_j² / (σ̂²ĉ^j) = t_obs².

Any demerits of the Wald statistic are also demerits of the t-test and t-based confidence
intervals shown earlier.

Example 5.5. Velvetleaf Growth Response. Two herbicides (H₁, H₂) are applied at
seven different rates and the dry weight percentages (relative to a no-herbicide control
treatment) of velvetleaf (Abutilon theophrasti Medikus) plants are recorded (Table 5.2).

A graph of the data shows no clear inflection point in the dose response (Figure 5.9).
The logistic model, although popular for modeling dose-response data, is not
appropriate in this instance. Instead, we select a hyperbolic function, the three-param-
eter extended Langmuir model (Ratkowsky 1990).



Table 5.2. Velvetleaf herbicide dose response data
(Y_ij is the average across the eight replications for rate x_ij)

                Herbicide 1                       Herbicide 2
    i   Rate x_i1 (lbs ae/acre)    Y_i1   Rate x_i2 (lbs ae/acre)    Y_i2
    1   1E−8                     100.00   1E−8                     100.000
    2   0.0180                    95.988  0.0468                    75.800
    3   0.0360                    91.750  0.0938                    54.925
    4   0.0710                    82.288  0.1875                    33.725
    5   0.1430                    55.688  0.3750                    16.250
    6   0.2860                    18.163  0.7500                    10.738
    7   0.5720                     7.350  1.5000                     8.713

Data kindly provided by Dr. James J. Kells, Department of Crop and Soil Sciences,
Michigan State University. Used with permission.

[Figure: dry weight percentage (20 to 100) plotted against rate of application (0.05 to
1.55 lbs ae/acre) for the two herbicides.]

Figure 5.9. Velvetleaf dry weight percentages as a function of the amount of active
ingredient applied. Closed circles correspond to herbicide 1, open circles to herbicide 2.

The fullest version of this model for the two-herbicide comparison is

    E[Y_ij] = α_j β_j x_ij^{γ_j} / (1 + β_j x_ij^{γ_j}).               [5.31]

Y_ij denotes the observation made at rate x_ij for herbicide j = 1, 2. α₁ is the asymptote
for herbicide 1 and α₂ the asymptote for herbicide 2. The model states that the two
herbicides differ in the parameters; hence θ is a (6 × 1) vector, θ =
[α₁, α₂, β₁, β₂, γ₁, γ₂]′. To test the hypothesis that the herbicides have the same mean
response function, we let

        ⎡ 1  −1   0   0   0   0 ⎤
    A = ⎢ 0   0   1  −1   0   0 ⎥.
        ⎣ 0   0   0   0   1  −1 ⎦



The constraint Aθ = 0 implies that [α₁ − α₂, β₁ − β₂, γ₁ − γ₂]′ = [0, 0, 0]′ and the
reduced model becomes

    E[Y_ij] = αβx_ij^γ / (1 + βx_ij^γ).

To test whether the herbicides share the same β parameter, H₀: β₁ = β₂, we let
A = [0, 0, 1, −1, 0, 0]. The reduced model corresponding to this hypothesis is

    E[Y_ij] = α_j βx_ij^{γ_j} / (1 + βx_ij^{γ_j}).

Since the dry weights are expressed relative to a no-treatment control we consider as
the full model for analysis

    Y_ij = 100 β_j x_ij^{γ_j} / (1 + β_j x_ij^{γ_j}) + e_ij            [5.32]

instead of [5.31]. To fit this model with the nlin procedure, it is helpful to rewrite it as

    Y_ij = 100 {β₁x_i1^{γ₁}/(1 + β₁x_i1^{γ₁})} I{j = 1}
         + 100 {β₂x_i2^{γ₂}/(1 + β₂x_i2^{γ₂})} I{j = 2} + e_ij.        [5.33]

I{} is the indicator function that returns the value 1 if the condition inside the curly
braces is true, and 0 otherwise. I{j = 1}, for example, takes on value 1 for observations
receiving herbicide 1 and 0 for observations receiving herbicide 2. The SAS®
statements to accomplish the fit (Output 5.4) are
proc nlin data=herbicide noitprint;
   parameters beta1=0.049 gamma1=-1.570   /* starting values for j = 1 */
              beta2=0.049 gamma2=-1.570;  /* starting values for j = 2 */
   alpha = 100;                           /* constrain alpha to 100    */
   term1 = beta1*(rate**gamma1);
   term2 = beta2*(rate**gamma2);
   model drypct = alpha * (term1 / (1 + term1))*(herb=1) +
                  alpha * (term2 / (1 + term2))*(herb=2);
run;

Output 5.4. The NLIN Procedure

Estimation Summary
Method Gauss-Newton
Iterations 10
Subiterations 5
Average Subiterations 0.5
R 6.518E-6
Objective 71.39743
Observations Read 14
Observations Used 14
Observations Missing 0
NOTE: An intercept was not specified for this model.

Sum of Mean Asymptotic Approx


Source DF Squares Square F Value Pr > F
Regression 4 58171.5 14542.9 2036.89 <.0001
Residual 10 71.3974 7.1397
Uncorrected Total 14 58242.9
Corrected Total 13 17916.9



Output 5.4 (continued).
Asymptotic
Standard
Parameter Estimate Error Asymptotic 95% Confidence Limits
beta1 0.0213 0.00596 0.00799 0.0346
gamma1 -2.0394 0.1422 -2.3563 -1.7225
beta2 0.0685 0.0122 0.0412 0.0958
gamma2 -1.2218 0.0821 -1.4047 -1.0389

We note S(θ̂) = 71.3974, n − p = 10, θ̂ = [0.0213, −2.0394, 0.0685, −1.2218]′, and
the estimated asymptotic standard errors 0.00596, 0.1422, 0.0122, and 0.0821.
Asymptotic 95% confidence limits are calculated from [5.28]. For example, a 95%
confidence interval for θ₂ = γ₁ is

    γ̂₁ ± t_{0.025,10} ease(γ̂₁) = −2.0394 ± 2.228 × 0.1422 = [−2.3563, −1.7225].

To test the hypothesis

    H₀: [β₁ − β₂, γ₁ − γ₂]′ = [0, 0]′,

that the two herbicides coincide in growth response, we fit the constrained model

    Y_ij = 100 βx_ij^γ / (1 + βx_ij^γ) + e_ij                          [5.34]

with the statements


proc nlin data=herbicide noitprint ;
parameters beta=0.049 gamma=-1.570;
alpha = 100;
term1 = beta*(rate**gamma);
model drypct = alpha*term1 / (1 + term1);
run;

The abridged output follows.

Output 5.5.
The NLIN Procedure

NOTE: Convergence criterion met.

Sum of Mean Asymptotic Approx


Source DF Squares Square F Value Pr > F

Regression 2 57835.3 28917.6 851.27 <.0001


Residual 12 407.6 33.9699
Uncorrected Total 14 58242.9
Corrected Total 13 17916.9

Asymptotic
Standard
Parameter Estimate Error Asymptotic 95% Confidence Limits
beta 0.0419 0.0139 0.0117 0.0722
gamma -1.5706 0.1564 -1.9114 -1.2297



The fit of the reduced model is shown in Figure 5.10 along with the raw data (full
circles denote herbicide 1, open circles denote herbicide 2).

[Figure: the fitted curve of model [5.34] overlaid on the dry weight percentages (20 to
100) plotted against rate of application (0.05 to 1.55 lbs ae/acre).]

Figure 5.10. Observed dry weight percentages for velvetleaf weed plants treated with
two herbicides at various rates. Rates are measured in pounds of acid equivalent per
acre. Solid line is model [5.34] fit to the combined data in Table 5.2.

The sum of squares reduction test yields

    F_obs^(2) = [{S(θ̂)_r − S(θ̂)_f}/q] / [S(θ̂)_f/(n − p)]
              = [{407.6 − 71.3974}/2] / [71.3974/10] = 23.54    (p = 0.0002).

There is sufficient evidence at the 5% level to conclude that the growth responses of the
two herbicides are different. At this point it is interesting to find out whether the herbi-
cides differ in the parameter β, γ, or both. It is here, in the comparison of individual
parameters across groups, that the Wald-type test can be implemented easily. To test
H₀: β₁ = β₂, for example, we can fit the reduced model

    Y_ij = 100 βx_ij^{γ_j} / (1 + βx_ij^{γ_j}) + e_ij                  [5.35]

and use the sum of squares reduction test to compare with the full model [5.32]. Using
the statements (output not shown)

proc nlin data=herbicide noitprint;


parameters beta=0.049 gamma1=-1.570 gamma2=-1.570;
alpha = 100;
term1 = beta*(rate**gamma1);
term2 = beta*(rate**gamma2);
model drypct = alpha * (term1 / (1 + term1))*(herb=1) +
alpha * (term2 / (1 + term2))*(herb=2);
run;

one obtains S(θ̂_r) = 159.8 and F_obs^(2) = (159.8 − 71.3974)/7.13974 = 12.38 with p-
value 0.0055. At the 5% level the hypothesis of equal β parameters is rejected.



This test could have been calculated as a Wald-type test from the fit of the full model
by observing that β̂₁ and β̂₂ are uncorrelated and the appropriate t statistic is

    t_obs = (β̂₁ − β̂₂)/ease(β̂₁ − β̂₂) = (0.0213 − 0.0685)/√(0.00596² + 0.0122²)
          = −0.0472/0.01358 = −3.476.

Since the square of a t random variable with ν degrees of freedom is identical to an F
random variable with 1, ν degrees of freedom, we can compare t_obs² = 3.476² =
12.08 to the F statistic of the sum of squares reduction test. The two tests are asymp-
totically equivalent, but differ for any finite sample size.
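A data step verifying the Wald t statistic and its two-sided p-value (10 residual degrees
of freedom):

    data _null_;
       t = (0.0213 - 0.0685)/sqrt(0.00596**2 + 0.0122**2);
       p = 2*(1 - probt(abs(t), 10));  /* two-sided p-value */
       put t= p=;
    run;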

An alternative implementation of the t test that estimates the difference between β₁ and
β₂ directly is as follows. Let β₂ = β₁ + δ and write [5.32] as

    Y_ij = 100 {β₁x_i1^{γ₁}/(1 + β₁x_i1^{γ₁})} I{j = 1}
         + 100 {(β₁ + δ)x_i2^{γ₂}/(1 + (β₁ + δ)x_i2^{γ₂})} I{j = 2} + e_ij,

so that δ measures the difference between β₁ and β₂. The SAS® statements fitting this
model are
model are

proc nlin data=herbicide noitprint;


parameters beta1=0.049 gamma1=-1.570 gamma2=-1.570 delta=0 ;
alpha = 100;
beta = beta1 + delta*(herb=2);
term1 = beta*(rate**gamma1);
term2 = beta*(rate**gamma2);
model drypct = (alpha*term1 / (1 + term1))*(herb=1) +
(alpha*term2 / (1 + term2))*(herb=2);
run;

The estimate δ̂ = 0.0472 (Output 5.6, abridged) agrees with the numerator of t_obs above
(apart from sign) and ease(δ̂) = 0.0136 is identical to the denominator of t_obs.

Output 5.6. The NLIN Procedure

Sum of Mean Asymptotic Approx


Source DF Squares Square F Value Pr > F
Regression 4 58171.5 14542.9 2036.89 <.0001
Residual 10 71.3974 7.1397
Uncorrected Total 14 58242.9
Corrected Total 13 17916.9

Asymptotic
Standard
Parameter Estimate Error Asymptotic 95% Confidence Limits
beta1 0.0213 0.00596 0.00799 0.0346
gamma1 -2.0394 0.1422 -2.3563 -1.7225
gamma2 -1.2218 0.0821 -1.4047 -1.0389
delta 0.0472 0.0136 0.0169 0.0776



This parameterization allows a convenient method to test for differences in parameters
among groups. If the asymptotic 95% confidence interval for δ does not contain the
value zero, the hypothesis H₀: δ = 0 ⟺ β₁ = β₂ is rejected. The sign of δ̂ informs us
then which parameter is significantly larger. In this example H₀: δ = 0 is rejected at the
5% level since the asymptotic 95% confidence interval [0.0169, 0.0776] does not
contain 0.

5.5.2 Confidence and Prediction Intervals


Once estimates of the model parameters have been obtained, the mean response at x_i is
predicted by substituting θ̂ for θ. To calculate a (1 − α)100% confidence interval for the
prediction f(x_i, θ̂) we rely again on the asymptotic Gaussian distribution of θ̂. Details can be
found in §A5.10.1. Briefly, the asymptotic standard error of f(x_i, θ̂) is

    ase(f(x_i, θ̂)) = √(σ² F_i(F′F)⁻¹F_i′),

where F_i denotes the ith row of F. To estimate this quantity, σ² is replaced by its estimate σ̂²
and F, F_i are evaluated at the converged iterate. The t-based confidence interval for E[Y] at x
is

    f(x_i, θ̂) ± t_{α/2,n−p} ease(f(x_i, θ̂)).

These confidence intervals can be calculated with proc nlin of The SAS® System as we now
illustrate with the small example visited earlier.

Example 5.2 Continued. Recall the simple one-parameter model Y_i = 1 −
exp{−x_i^θ} + e_i, i = 1, ..., 4, that was fit to a small data set on page 199. The
converged iterate was θ̂ = 1.1146 and the estimate of residual variability σ̂² = 0.006.
The gradient matrix at θ̂ evaluates to

        ⎡ ln{0.2} 0.2^θ̂ exp{−0.2^θ̂} ⎤   ⎡ −0.2267 ⎤
    F̂ = ⎢ ln{0.5} 0.5^θ̂ exp{−0.5^θ̂} ⎥ = ⎢ −0.2017 ⎥
        ⎢ ln{0.7} 0.7^θ̂ exp{−0.7^θ̂} ⎥   ⎢ −0.1224 ⎥
        ⎣ ln{1.8} 1.8^θ̂ exp{−1.8^θ̂} ⎦   ⎣  0.1650 ⎦

and F̂′F̂ = 0.13428. The confidence intervals for f(x, θ) at the four observed data
points are shown in the following table.





Table 5.3. Asymptotic 95% confidence intervals for f(x, θ);
t_{0.025,3} = 3.1824, σ̂² = 0.006

                                                           Asymptotic
    x_i     F̂_i     F̂_i(F̂′F̂)⁻¹F̂_i′   ease(f(x_i, θ̂))   Conf. Interval
    0.2   −0.2267      0.3826            0.0480          0.0004, 0.3061
    0.5   −0.2017      0.3030            0.0427          0.2338, 0.5092
    0.7   −0.1224      0.1116            0.0259          0.4067, 0.5718
    1.8    0.1650      0.2028            0.0349          0.7428, 0.9655

Asymptotic 95% confidence intervals are obtained with the l95m and u95m options of
the output statement in proc nlin. The output statement in the code segment that
follows creates a new data set (here termed nlinout) that contains the predicted values
(variable pred), the estimated asymptotic standard errors of the predicted values
(ease(f(x_i, θ̂)), variable stdp), and the lower and upper 95% confidence limits for
f(x_i, θ) (variables l95m and u95m).
ods listing close;
proc nlin data=Ex_52;
parameters theta=1.3;
model y = 1 - exp(-x**theta);
output out=nlinout pred=pred stdp=stdp l95m=l95m u95m=u95m;
run;
ods listing;
proc print data=nlinout; run;

Output 5.7.
Obs y x PRED STDP L95M U95M

1 0.1 0.2 0.15323 0.048040 0.00035 0.30611


2 0.4 0.5 0.36987 0.042751 0.23382 0.50592
3 0.6 0.7 0.48930 0.025941 0.40674 0.57186
4 0.9 1.8 0.85418 0.034975 0.74287 0.96549

Confidence intervals are intervals for the mean response f(x_i, θ) at x_i. Prediction intervals,
on the contrary, have (1 − α)100% coverage probability for Y_i, which is a random variable.
Prediction intervals are wider than confidence intervals for the mean because the estimated
standard error is that of the difference Y_i − f(x_i, θ̂) rather than that of f(x_i, θ̂):

    f(x_i, θ̂) ± t_{α/2,n−p} ease(Y_i − f(x_i, θ̂)).                    [5.36]

In SAS®, prediction intervals are also obtained with the output statement of the nlin
procedure, but instead of l95m= and u95m= use l95= and u95= to save prediction limits to the
output data set.
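For the small example above, the corresponding statements differ only in the output
keywords; a sketch (the output data set name nlinout2 is arbitrary):

    proc nlin data=Ex_52;
       parameters theta=1.3;
       model y = 1 - exp(-x**theta);
       output out=nlinout2 pred=pred l95=l95 u95=u95;  /* prediction limits for Y */
    run;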



5.6 Transformations
Box 5.3 Transformations

• Transformations of the model are supposed to remedy a model breakdown


such as
— nonhomogeneity of the variance;
— non-Gaussian error distribution;
— nonlinearity.

• Some nonlinear models can be transformed into linear models. However, de-
transforming parameter estimates and predictions leads to bias
(transformation bias).

• The transformation that linearizes the model may be different from the
transformation that stabilizes the variance or makes the errors more
Gaussian-like.

5.6.1 Transformation to Linearity


Since linear models are arguably easier to work with, transforming a nonlinear model into a
linear model is a frequently used device in data analysis. The apparent advantages are un-
biased minimum variance estimation in the linearized form of the model and simplified calcu-
lations. Transformation to linearity obviously applies only to nonlinear models which can be
linearized (intrinsically linear models in the sense of Pázman 1993, p. 36).

Example 5.6. Studies of allometry relate differences in anatomical shape to differences
in size. Relating two size measures under the assumption of a constant ratio of their
relative growth rates gives rise to a nonlinear model of form

    E[Y_i] = β₀ x_i^{β₁},                                              [5.37]

where Y is one size measurement (e.g., length of fibula) and x is the other size measure
(e.g., length of sternum). β₁ is the ratio of the relative growth rates, β₁ = (x/y)∂y/∂x.
This model can be transformed to linearity by taking logarithms on both sides,

    ln{E[Y_i]} = U_i = ln{β₀} + β₁ ln{x_i},

which is a straight-line regression of U_i on the logarithm of x. If, however, the under-
lying allometric relationship were given by

    E[Y_i] = α + β₀ x_i^{β₁},

taking logarithms would not transform the model to a linear one.





Whether a nonlinear model which can be linearized should be linearized depends on a
number of issues. We consider the nature of the model residuals first. A model popular in
forestry is Schumacher's height-age relationship which predicts the height of the socially
dominant trees in an even-aged stand (Y) from the reciprocal of the stand age (x = 1/age)
(Schumacher 1939, Clutter et al. 1992). One parameterization of the mean function is

    E[Y] = α exp{βx},                                                  [5.38]

and E[Y] is linearized by taking logarithms,

    ln{E[Y]} = ln{α} + βx = γ + βx.

To obtain an observational model we need to add stochastic error terms. Residuals propor-
tional to the expected value of the response give rise to

    Y = α exp{βx}(1 + e*) = α exp{βx} + α exp{βx}e* = α exp{βx} + e**. [5.39]

This is called a constant relative error model (Seber and Wild 1989, p. 15). Additive
residuals, on the other hand, lead to a constant absolute error model

    Y = α exp{βx} + ε*.                                                [5.40]

A linearizing transform is not applied to the mean values but to the observables Y, so that
the two error assumptions will lead to different properties for the transformed residuals. In the
case of [5.39] the transformed observational model becomes

    ln{Y} = γ + βx + ln{1 + e*} = γ + βx + e.

If e* has zero mean and constant variance, then ln{1 + e*} will have approximately zero
mean and constant variance (which can be seen easily from a Taylor series of ln{1 + e*}).
For constant absolute error the logarithm of the observational model leads to

    ln{Y} = ln{α e^{βx} + ε*}
          = ln{α e^{βx}(1 + ε*/(α exp{βx}))}
          = γ + βx + ln{1 + ε*/(α exp{βx})} = γ + βx + ε.

The error term of this model should have expectation close to zero if ε* is small on average
compared to E[Y] (Seber and Wild, 1989, p. 15). Var[ε] depends on the mean of
the model, however. In case of [5.39] the linearization transform is also a variance-stabilizing
transform, but in the constant absolute error model linearization has created heterogeneous
error variances.
Nonlinear models are parameterized so that the parameters represent important physical
and biological measures (see §5.7 on parameterizations) such as rates of change, survival and
mortality, upper and lower yield and growth asymptotes, densities, and so forth. Linearization
destroys the natural interpretation of the parameters. Sums of squares and variability esti-
mates are not reckoned on the original scale of measurement, but on the transformed scale.
The variability of plant yield is best understood in yield units (bushels/acre, e.g.), not in
square roots or logarithms of bushels/acre. Another problematic aspect of linearizing transfor-
mations relates to their transformation bias (also called prediction bias). Taking expec-
tations of random variables is a linear operation and does not apply to nonlinear transforms.
To obtain predicted values on the original scale after a linearized version of the model has
been fit, the predictions obtained in the linearized model need to be detransformed. This
process introduces bias.

Example 5.7. Consider a height-age equation of Schumacher form

    E[H_i] = α exp{βx_i},

where H_i is the height of the ith plant (or stand of plants) and x_i is the reciprocal of
plant (stand) age. As before, the mean function is linearized by taking logarithms on
both sides:

    ln{E[H_i]} = ln{α} + βx_i = x_i′β.

Here x_i′ = [1, x_i], β = [ln{α}, β]′, and exp{x_i′β} = E[H_i]. We can fit the linearized
model

    U_i = x_i′β + e_i                                                  [5.41]

where the e_i are assumed independent Gaussian with mean 0 and variance σ². A
predicted value in this model on the linear scale is û = x′β̂, which has expectation

    E[û] = E[x′β̂] = x′β = ln{E[H]}.

If the linearized model is correct, the predictions are unbiased for the logarithm of
height. Are they also unbiased for height? To this end we detransform the linear predic-
tions and evaluate E[exp{û}]. In general, if ln{Y} is Gaussian with mean μ and
variance φ, then Y has a Log-Gaussian distribution with

    E[Y]   = exp{μ + φ/2}
    Var[Y] = exp{2μ + φ}(exp{φ} − 1).

We note that E[Y] is exp{φ/2} times larger than exp{μ}, since φ is a positive quantity.
Under the assumption about the distribution of e_i made earlier, the predicted value û is
also Gaussian with mean x′β and variance σ²x′(X′X)⁻¹x. The mean of a predicted
height value is then

    E[exp{û}] = exp{x′β + ½σ²x′(X′X)⁻¹x} = E[H] exp{½σ²x′(X′X)⁻¹x}.

This is an overestimate of the average height.
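A sketch of the detransformation and the corresponding bias adjustment in SAS. The stdp
keyword of proc reg's output statement returns √(σ̂²x′(X′X)⁻¹x), so the multiplicative bias
factor can be estimated and divided out; the data set linheights and its variables lnh and x
are hypothetical:

    proc reg data=linheights;
       model lnh = x;
       output out=linout p=uhat stdp=stdp;  /* uhat and its standard error   */
    run;
    data detrans;
       set linout;
       naive    = exp(uhat);                /* biased upward                 */
       adjusted = exp(uhat - 0.5*stdp**2);  /* divides out exp(phi-hat/2)    */
    run;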

Transformation to achieve linearity can negatively affect other desirable properties of the
model. The most desirable transformation is one which linearizes the model, stabilizes the
variance, and makes the residuals Gaussian-distributed. The class of Box-Cox transforma-
tions (after Box and Cox 1964, see §5.6.2 below for Box-Cox transformations in the context
of variance heterogeneity) is heralded to accomplish these multiple goals but is not without
controversy. Transformations are typically more suited to rectify a particular problematic
aspect of the model, such as variance heterogeneity or residuals which are far from a
Gaussian distribution. The transformation that stabilizes the variance may be different from
the transformation that linearizes the model.

Example 5.8. Yield density models are used in agronomy to describe agricultural
output (yield) as a function of plant density (see §5.8.7 for an application). Two
particularly simple representatives of yield-density models are due to Shinozaki and
Kira (1956),

    Y_i = 1/(β₀ + β₁x_i) + e_i,

and Holliday (1960),

    Y_i = 1/(β₀ + β₁x_i + β₂x_i²) + e_i.

Obviously the linearizing transform is the reciprocal 1/Y. It turns out, however, that for
many data sets to which these models are applied the appropriate transform to stabilize
the variance is the logarithmic transform.

Nonlinearity of the model, given the reliability and speed of today's computer algorithms,
is not considered a shortcoming of a statistical model. If estimation and inferences can be
carried out on the original scale, there is no need to transform a model to linearity simply to
invoke a linear regression routine and then to incur transformation bias upon detransfor-
mation of parameter estimates and predictions. The lack of Gaussianity of the model residuals
is less critical for nonlinear models than it is for linear ones, since less is lost if the errors are
non-Gaussian. Statistical inference in nonlinear models requires asymptotic results and the
nonlinear least squares estimates are asymptotically Gaussian-distributed regardless of the
distribution of the model errors. In linear models the difference between Gaussian and non-
Gaussian errors is the difference between exact and approximate inference. In nonlinear
models, inference is approximate anyway.

5.6.2 Transformation to Stabilize the Variance


Variance heterogeneity (heteroscedasticity), the case when the variance-covariance matrix of
the model residuals is not given by

    Var[e] = σ²I,

but is a diagonal matrix whose entries are of different magnitude, is quite common in biological
data. It is related to the intuitive observation that large entities vary more than small entities.
If error variance is a function of the regressor x, two approaches can be used to remedy
heteroscedasticity. One can apply a power transformation of the model or fit the model by
nonlinear weighted least squares. The former approach relies on a transformation that
stabilizes the variance. The latter approach accounts for changes in variability with x in the
process of model fitting. Box and Cox (1964) made popular the family of power transfor-
mations defined by

    U = (Y^λ − 1)/λ    λ ≠ 0
    U = ln{Y}          λ = 0,                                          [5.42]

which apply when the response is non-negative (Y > 0). Expanding U into a first-order
Taylor series around E[Y] we find the variance of the Box-Cox transformed variable to be
approximately

    Var[U] = Var[Y]{E[Y]}^{2(λ−1)}.                                    [5.43]

By choosing λ properly the variance of U can be made constant, which was the motivation
behind finding a variance-stabilizing transform in §4.5.2 for linear models. There we were
concerned with the comparison of groups where replicate values were available for each
group. As a consequence we could estimate the mean and variance in each group, linearize
the relationship between variances and means, and find a numerical estimate for the para-
meter λ. If replicate values are not available, λ can be determined by trial and error: choosing
a value for λ, fitting the model, and examining the fitted residuals ê_i until a suitable value of λ
has been found. With some additional programming effort, λ can be estimated from the data.
Seber and Wild (1989) discuss maximum likelihood estimation of the parameters of the mean
function and λ jointly. To combat variance heterogeneity, [5.43] suggests two approaches:
    • transform both sides of the model according to [5.42] and fit a nonlinear model with
      response U_i;
    • leave the response Y_i unchanged and allow for variance heterogeneity. The model is fit
      by nonlinear weighted least squares where the variance of the response is proportional
      to E[Y_i]^{2(λ−1)}. A sketch of this approach follows at the end of this section.

Carroll and Ruppert (1984) call these approaches power-transform both sides and power-
transformed weighted least squares models. If the original, untransformed model is

    Y_i = f(x_i, θ) + e_i,

then the former is

    U_i = f*(x_i, θ) + e_i*

where

    f*(x_i, θ) = (f(x_i, θ)^λ − 1)/λ    λ ≠ 0
    f*(x_i, θ) = ln{f(x_i, θ)}          λ = 0

and e_i* ~ iid G(0, σ²). The second approach uses the original response and

    Y_i = f(x_i, θ) + e_i,    e_i ~ iid G(0, σ²f(x_i, θ)^λ).

For extensions of the Box-Cox method such as power transformations for negative responses
see Box and Cox (1964), Carroll and Ruppert (1984), Seber and Wild (1989, Ch. 2.8) and
references therein. Weighted nonlinear least squares is applied in §5.8.8.

5.7 Parameterization of Nonlinear Models


Box 5.4 Parameterization

• Parameterization is the process of expressing the mean function of a


statistical model in terms of parameters to be estimated.

• A nonlinear model can be parameterized in different ways. A particular


parameterization is chosen so that the parameter estimators have a certain
interpretation, desirable statistical properties, and allow embedding of
hypotheses (see §1.5).

• Nonlinear models have two curvature components, called intrinsic and


parameter-effects curvature. Changing the parameterization alters the
degree of parameter-effects curvature.

• The smaller the curvature of a model, the more estimators behave like the
efficient estimators in a linear model and the more reliable are inferential
and diagnostic procedures.

Parameterization is the process of expressing the mean function of a statistical model in terms
of unknown constants (parameters) to be estimated. With nonlinear models the same model
can usually be expressed (parameterized) in a number of ways.
    Consider the basic differential equation ∂y/∂t = b(α − y) where y(t) is the size of an
organism at time t. According to this model, growth is proportional to the remaining size of
the organism, α is the final size (total growth), and b is the proportionality constant. Upon
integration of this generating equation one obtains

    y(t) = α + (f − α)exp{−bt}                                         [5.44]

where α, b, f, t > 0 and f is the initial size, y(0). A simple reparameterization is obtained by
setting θ = f − α in [5.44]. The equation now becomes

    y(t) = α + θ exp{−bt},                                             [5.45]

a form in which it is known as the monomolecular growth model. If furthermore exp{−b}
is replaced by r, the resulting equation

    y(t) = α + θr^t                                                    [5.46]

is known as the asymptotic regression model. Finally, one can put f = α(1 − exp{bt₀}),
where t₀ is the time at which the size (yield) is zero (y(t₀) = 0), to obtain the equation

    y(t) = α(1 − exp{−b(t − t₀)}).                                     [5.47]

In this form the equation is known as the Mitscherlich law (or Mitscherlich equation),
popular in agronomy to model crop yield as a function of fertilizer input (t) (see §5.8.1 for a
basic application of the Mitscherlich model). In fisheries and wildlife research [5.46] is
known as the Von Bertalanffy model, and in physics it appears as Newton's law of cooling of
a body over time. Ratkowsky (1990) discusses it as Mitscherlich's law and the Von
Bertalanffy model, but Seber and Wild (1989) argue that the Von Bertalanffy model is derived
from a different generating equation; see §5.2. Equations [5.44] through [5.47] are four para-
meterizations of the same basic relationship. When fitting either parameterization to data,
they yield the same goodness of fit, the same residual error sum of squares, and the same
vector of fitted residuals. The interpretation of their parameters is different, however, as are
the statistical properties of the parameter estimators. For example, the correlations between
the parameter estimates can be quite different. In §4.4.4 it was discussed how correlations and
dependencies can negatively affect the least squares estimate. This problem is compounded in
nonlinear models, which usually exhibit considerable correlations among the columns in the
regressor matrix. A parameterization that reduces these correlations will lead to more reliable
convergence of the iterative algorithm. To understand the effects of the parameterization on
statistical inference, we need to consider the concept of the curvature of a nonlinear model.

5.7.1 Intrinsic and Parameter-Effects Curvature


The statistical properties of nonlinear models, their numerical behavior in the estimation
process (convergence behavior), and the reliability of asymptotic inference for finite sample
size are functions of a model's curvature. The curvature consists of two components, termed
intrinsic curvature and parameter-effects curvature (Ratkowsky 1983, 1990). Intrinsic
curvature measures how much the nonlinear model bends if the values of the parameters are
changed by a small amount. This is not the same notion as the bending of the mean function
as the regressor x is changed. Models with curved mean functions can have low intrinsic
curvature. A curvilinear model such as Y = β₀ + β₂x² + e has a mean function that curves
when E[Y] is plotted against x but has no intrinsic curvature. To understand these curvature
components better, we consider how the mean function f(x, θ) varies for different values of
θ for given values of the regressor. The result is called the expectation surface of the model.
With only two data points x₁ and x₂, the expectation surface can be displayed in a two-
dimensional coordinate system.

Example 5.9. Consider the linear mean function E[Y_i] = θx_i² and design points at
x₁ = 1 and x₂ = 2. The expectation surface is obtained by varying

    E[Y] = [θ, 4θ]′ = [f₁(θ), f₂(θ)]′

for different values of θ. This surface can be plotted in a two-dimensional coordinate
system as a straight line (Figure 5.11).





[Figure: a straight line in the (f₁(θ), f₂(θ)) plane with equally spaced asterisks along it.]

Figure 5.11. Expectation surface of E[Y_i] = θx_i². Asterisks mark points on the surface
for equally spaced values of θ = 0.2, 0.6, 1.0, 1.4, and 1.8.

Since the expectation surface does not bend as θ is changed, the model has no intrinsic
curvature.

Nonlinear estimation rests on linearization techniques. The linear approximation

    f(x, θ) ≈ f(x, θ̂) + F̂(θ − θ̂)

amounts to approximating the expectation surface in the vicinity of θ̂ by a tangent plane. To
locally replace the expectation surface with a plane is not justifiable if the intrinsic curvature
is large, since then changes in θ will cause considerable bending in f(x, θ) while the linear
approximation assumes that the surface is flat in the neighborhood of θ̂. Seber and Wild
(1989, p. 137) call this the planar assumption and point out that it is invalidated when the
intrinsic curvature component is large. A second assumption plays into the quality of the
linear approximation; when a regular grid of θ values centered at the estimate θ̂ is distorted
upon projection onto f(x, θ), the parameter-effects curvature is large. Straight, parallel, equi-
spaced lines in the parameter space then do not map into straight, parallel, equispaced lines on
the expectation surface. What Seber and Wild (1989, p. 137) call the uniform-coordinate
assumption is invalid if the parameter-effects curvature is large. The linear model whose
expectation surface is shown in Figure 5.11 has no parameter-effects curvature. Equally
spaced values of θ, for which the corresponding points on the expectation surface are marked
with asterisks, are equally spaced on the expectation surface.
Ratkowsky (1990) argues that intrinsic curvature is typically the smaller component and
advocates focusing on parameter-effects curvature because it can be influenced by model
parameterization. No matter how a model is parameterized, the intrinsic curvature remains the
same. But the effect of parameterization is generally data-dependent. A parameterization with
low parameter-effects curvature for one data set may produce high curvature for another data
set. After these rather abstract definitions of curvature components and the sobering finding
that it is hard to say how strong these components are for a particular model/data combination



without knowing precisely the design points, one can rightfully ask, why worry? The answer
lies in two facts. First, the effects of strong curvature on many aspects of statistical inference
can be so damaging that one should be aware of the liabilities when choosing a parameteri-
zation that produces high curvature. Second, parameterizations are known for many popular
models that typically result in models with low parameter-effects curvature and desirable
statistical properties of the estimators. These are expected value parameterizations
(Ratkowsky 1983, 1989).
Before we examine some details of the effects of curvature, we demonstrate the
relationship of the two curvature components in nonlinear models for a simple case that is
discussed in Seber and Wild (1989, pp. 98-102) and was inspired by Ratkowsky (1983).

Example 5.10. The nonlinear mean function E[Y_i] = exp{θx_i} is investigated, where
the only design points are x₁ = 1 and x₂ = 2. The expectation surface is then

    E[Y] = [exp{θ}, exp{2θ}]′ = [f₁(θ), f₂(θ)]′.

A reparameterization of the model is achieved by letting r^{x_i} = exp{θx_i}. In this form
the expectation surface is

    E[Y] = [r, r²]′ = [g₁(r) = f₁(θ), g₂(r) = f₂(θ)]′.

This surface is graphed in Figure 5.12 along with points on the surface corresponding to
equally spaced sets of θ and r. Asterisks mark points on the surface for equally spaced
values of θ = 0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, and 3.5, and circles mark points for equally
spaced values of r = 2, 4, ..., 32.

[Figure: the curved expectation surface in the (f₁(θ), f₂(θ)) plane, with 0 ≤ f₁ ≤ 40 and
0 ≤ f₂ ≤ 1100; the equally spaced θ values (labeled θ = 2.0, 2.5, 3.0, 3.5) lie increasingly
far apart along the surface, while the circles for equally spaced r are nearly equally
spaced.]

Figure 5.12. Expectation surface of E[Y_i] = exp{θx_i} = r^{x_i} for x₁ = 1, x₂ = 2.
Adapted from Figures 3.5a) and b) in Seber, G.A.F. and Wild, C.J. (1989) Nonlinear
Regression. Wiley and Sons, New York. Copyright © 1989 John Wiley and Sons, Inc.
Reprinted by permission of John Wiley and Sons, Inc.





The two parameterizations of the model produce identical expectation surfaces; hence
their intrinsic curvatures are the same. The parameter-effects curvatures are quite
different, however. Equally spaced values of θ are no longer equally spaced on the sur-
face as in the linear case (Figure 5.11). Mapping onto the expectation surface results in
considerable distortion, and parameter-effects curvature is large. But equally spaced
values of r create almost equally spaced points on the surface. The parameter-effects
curvature of the model E[Y_i] = r^{x_i} is small.

What are the effects of strong curvature in a nonlinear model? Consider using the fitted
residuals

s/3 œ C3  0 Ðx3 ß )
as a diagnostic tool similar to diagnostics applicable in linear models (§4.4.1). There we can
calculate studentized residuals (see §4.8.3) ,

s È"  233 ‹,
<3 œ s/3 Ί5

for example, that behave similar to the unobservable model errors. They have mean zero and
constant variance. Here, 233 is the 3th diagonal element of the projection matrix XaXw Xb" Xw .
If the intrinsic curvature component is large, a similar residual obtained from a nonlinear
model fit will not behave as expected. Let

C3  0 Ð x 3 ß s

<3‡ œ ,
s # a"  233‡ b
É5

where h_{ii}^* is the ith diagonal element of \hat{\mathbf{F}}(\hat{\mathbf{F}}'\hat{\mathbf{F}})^{-1}\hat{\mathbf{F}}'. If intrinsic curvature is pronounced, the
r_i^* will not have mean 0, will not have constant variance, and, as shown by Seber and Wild
(1989, p. 178), r_i^* is negatively correlated with the predicted values ŷ_i. The diagnostic plot of
r_i^* versus ŷ_i, a tool commonly borrowed from linear model analysis, will show a negative
slope rather than a band of random scatter about 0. Seber and Wild (1989, p. 179) give ex-
pressions for what these authors call projected residuals that do behave as their counterparts
in the linear model (see also Cook and Tsai, 1985), but none of the statistical packages we are
aware of can calculate these. The upshot is that if intrinsic curvature is large, one may
diagnose a nonlinear model as deficient based on a residual plot when in fact the model is
adequate, or find a model to be adequate when in fact it is not. It is for this reason that we shy
away from standard residual plots in nonlinear regression models in this chapter.
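
Although the projected residuals themselves are not available in standard software, the tangent-plane quantities above are easy to compute in a matrix language. The following proc iml sketch (ours) evaluates the leverages h*_ii and the residuals r*_i for the one-parameter model E[Y_i] = exp{θx_i} of Example 5.10, using y_1 = 12, y_2 = 221.5 and θ̂ = 2.699 from Figure 5.13; with only two observations it is purely illustrative.

proc iml;
  x = {1, 2};                            /* design points from Example 5.10 */
  y = {12, 221.5};                       /* responses from Figure 5.13 */
  theta = 2.699;                         /* converged estimate */
  Fhat  = x # exp(theta # x);            /* derivative matrix F-hat, here 2 x 1 */
  H     = Fhat * inv(Fhat` * Fhat) * Fhat`;
  h     = vecdiag(H);                    /* tangent-plane leverages h*_ii */
  e     = y - exp(theta # x);            /* fitted residuals */
  s2    = ssq(e) / (nrow(x) - 1);        /* sigma-hat squared on n - p = 1 df */
  rstar = e / sqrt(s2 # (1 - h));        /* residuals r*_i */
  print h rstar;
quit;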
The residuals are not affected by the parameter-effects curvature, only by intrinsic curva-
ture. Fitting a model in different parameterizations will produce the same set of fitted
residuals (see §5.8.2 for a demonstration). The parameter estimates, on the other hand, are
very much affected by parameter-effects curvature. The key properties of the nonlinear least
squares estimator θ̂, namely
• asymptotic unbiasedness
• asymptotic minimum variance
• asymptotic Gaussianity



all depend on the validity of the tangent-plane approximation (the planar assumption). With
increasing curvature this assumption becomes less tenable. The estimators will be
increasingly biased; their sampling distribution will deviate from a Gaussian distribution and,
perhaps most damagingly, their variance will exceed \sigma^2(\mathbf{F}'\mathbf{F})^{-1}. Since this is the expression
on which nonlinear statistical routines base the estimated asymptotic standard errors of the
estimators, these packages will underestimate the uncertainty in θ̂, test statistics will be
inflated, and p-values will be too small. In §5.8.2 we examine how the sampling distribution
of the parameter estimates depends on the curvature and how one can diagnose a potential
problem.
Having outlined the damage incurred by pronounced curvature of the nonlinear model,
we are challenged to remedy the problem. In Example 5.10 the parameterization E[Y_i] = \psi^{x_i}
was seen to be preferable because it leads to smaller parameter-effects curvature. But as the
design points are changed, the expectation surface changes and the curvature components
are altered. One approach to reducing curvature is thus to choose the design points (regressor
values) accordingly. But the design points that minimize the parameter-effects curvature
may not be the levels of interest to the experimenter. In a rate application trial, fertilizers are
often applied at equally spaced levels, e.g., 0 lbs/acre, 30 lbs/acre, 60 lbs/acre. If the response
(crop yield, for example) is modeled nonlinearly, the fertilizer levels 4.3 lbs/acre, 10.3
lbs/acre, 34.9 lbs/acre may minimize the parameter-effects curvature but are of little interest
to the experimenter. In observational studies the regressor values cannot be controlled and
parameter-effects curvature cannot be influenced by choosing design points. In addition, the
appropriate nonlinear model may not be known in advance. The same set of regressor values
will produce different degrees of curvature depending on the model. If alternative models are
to be examined, fixing the design points to reduce curvature effects is not a useful solution in
our opinion.
Ratkowsky (1983) termed nonlinear models with low curvature close-to-linear models,
meaning that their estimators behave similarly to estimators in linear models. They should be
close to Gaussian-distributed, almost unbiased, and the estimates of their precision based on
the linearization should be reliable. Reducing the parameter-effects curvature component as
much as possible will make nonlinear parameter estimates behave as closely to linear as
possible for a given model and data set. Also, the convergence properties of the iterative
fitting process should thereby be improved, because models with high parameter-effects
curvature have a shallow residual sum of squares surface. Subsequent reductions in S(θ̂)
between iterations will be small and the algorithm requires many iterations to achieve conver-
gence.
For the two parameterizations in Figure 5.12 the residual sum of squares surface can be
shown as a line (Figure 5.13), since there is only one parameter (θ or ψ). The sum of squares
surface of the model with higher parameter-effects curvature (θ) is more shallow over a wide
range of θ values. Only when θ̂ gets close to about 2.2 does a more rapid dip occur, leading to
the final estimate of θ̂ = 2.699. The minimum of the surface for E[Y_i] = \psi^{x_i} is more
pronounced, and regardless of the starting value, convergence will occur more quickly. The
final estimate in this parameterization is ψ̂ = 14.879. In either case, the residual sum of
squares finally achieved is 8.30.





[Figure 5.13 here: residual sum of squares (in 1,000) plotted against θ (with the corresponding ψ scale), with minima marked at θ̂ and ψ̂.]

Figure 5.13. Residual sum of squares surfaces for E[Y_i] = \exp\{\theta x_i\} = \psi^{x_i}; y_1 = 12,
y_2 = 221.5. Other values as shown in Figure 5.12. Solid line: sum of squares surface for
E[Y_i] = \exp\{\theta x_i\}. Dashed line: sum of squares surface for E[Y_i] = \psi^{x_i}.

Ratkowsky's technique for reducing the parameter-effects curvature relies on rewriting
the model in its expected value parameterization. This method was introduced in §5.4.3 as a
vehicle for determining starting values. In §5.8.2 we show how the parameters of the
Mitscherlich equation have much improved properties in its expected value parameterization
compared to some of the standard parameterizations of the model. Not only are they more
Gaussian-distributed, they are also less biased and their standard error estimates are more
reliable. Furthermore, the model in expected value parameterization converges more quickly
and the estimators are less correlated (§5.8.1). As a consequence, the (F'F) matrix is better
conditioned and the iterative algorithm is more stable. Expected value parameterizations can
also be used to our advantage to rewrite a nonlinear model in terms of parameters that are of
more immediate interest to the researcher. Schabenberger et al. (1999) used this method to
reparameterize log-logistic herbicide dose-response models and termed the technique repara-
meterization through defining relationships.

5.7.2 Reparameterization through Defining Relationships


An expected value parameter corresponds to a predicted value at a given value x* of the
regressor. Consider, for example, the log-logistic model

    E[Y] = \delta + \frac{\alpha - \delta}{1 + \psi\exp\{\beta\ln(x)\}},    [5.48]

popular in modeling the relationship between a dosage x and a response Y. The response
changes in a sigmoidal fashion between the asymptotes α and δ. The rate of change, and
whether the response increases or decreases in x, depend on the parameters ψ and β. Assume
we wish to replace a parameter, ψ say, by its expected value equivalent:



!$
.‡ œ $ € .
" € <expe" lnaB‡ bf

Solving for < leads to


!$
<œŒ  "expe  " lnÐB‡ ÑfÞ
.‡  $

After substituting back into [5.48] and simplifying, one arrives at


!$
Ec] d œ $ € .

"€ Š !.
‡ $  "‹expe" lnaBÎB‡ bf

This model should have less parameter-effects curvature than [5.48], since ψ was replaced by
its expected value parameter. It has not become more interpretable, however. Expected value
parameterization rests on estimating the unknown μ* for a known value of x*. This process
can be reversed if we want to express the model as a function of an unknown value x* for a
known value μ*. A common task in dose-response studies is to find the dosage which reduc-
es/increases the response by a certain amount or percentage. In a study of insect mortality as a
function of insecticide concentration, for example, we might be interested in the dosage at
which 50% of the treated insects die (the so-called LD_50 value). In biomedical studies, one is
often interested in the dosage that cures 90% of the subjects. The idea is as follows. Consider
the model E[Y|x] = f(x, \boldsymbol{\theta}) with parameter vector \boldsymbol{\theta} = [\theta_1, \theta_2, \ldots, \theta_p]'. Find a value x* which
is of interest to the investigation, for example LD_50 or GR_50, the dosage that reduc-
es/increases growth by 50%. Our goal is to estimate x*. Now set

    E[Y|x^*] = f(x^*, \boldsymbol{\theta}),

termed the defining relationship. Solve for one of the original parameters, θ_1 say. Substitute
the expression obtained for θ_1 into the original model f(x, θ) and one obtains a model in which
θ_1 has been replaced by x*. Schabenberger et al. (1999) apply these ideas in the study of
herbicide dose response, where the log-logistic function has negative slope and represents the
growth of treated plants relative to an untreated control. Let λ_K be the value which reduces
growth by K%. In a model with lower and upper asymptotes such as model [5.48] one
needs to define carefully whether this is a reduction of the maximum response α or of the dif-
ference between the maximum and minimum response. Here, we define λ_K as the value for
which

    E[Y|\lambda_K] = \delta + \left(\frac{100 - K}{100}\right)(\alpha - \delta).

The defining relationship to parameterize a log-logistic model in terms of λ_K is then

    \delta + \left(\frac{100 - K}{100}\right)(\alpha - \delta) = \delta + \frac{\alpha - \delta}{1 + \psi\exp\{\beta\ln(\lambda_K)\}}.

Schabenberger et al. (1999) chose to solve for ψ. Upon substituting the result back into [5.48]
one obtains the reparameterized log-logistic equation





!$
Ec] lBd œ $ € . [5.49]
" € OÎa"!!  O bexpe" lnaBÎ-O bf

In the special case where O œ &!, the term OÎa"!!  O b in the denominator vanishes.
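
As a quick check of the algebra (our addition), substituting x = λ_K into [5.49] recovers the defining relationship, since the exponential term then equals one:

    E[Y|\lambda_K] = \delta + \frac{\alpha - \delta}{1 + \frac{K}{100-K}}
                   = \delta + \frac{\alpha - \delta}{\frac{100}{100-K}}
                   = \delta + \left(\frac{100 - K}{100}\right)(\alpha - \delta).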
Two popular ways of expressing the Mitscherlich equation can also be developed by this
method. Our [5.44] shows the Mitscherlich equation for crop yield y at nutrient level x as

    E[Y|x] = \alpha + (\xi - \alpha)e^{-\kappa x},

where α is the yield asymptote and ξ is the yield at nutrient concentration x = 0. Suppose we
are interested in estimating what Black (1993, p. 273) calls the availability index, the nutrient
level x_0 at which the average yield is zero, and want to replace ξ in the process. The defining
relationship is

    E[Y|x_0] = 0 = \alpha + (\xi - \alpha)e^{-\kappa x_0}.

Solving for ξ gives \xi = \alpha(1 - \exp\{\kappa x_0\}). Substituting back into the original equation and
simplifying, one obtains the other popular parameterization of the Mitscherlich equation
(compare to [5.47]):

    E[Y] = \alpha + \left(\alpha(1 - e^{\kappa x_0}) - \alpha\right)e^{-\kappa x}
         = \alpha - \alpha e^{\kappa x_0}e^{-\kappa x}
         = \alpha\left(1 - e^{-\kappa(x - x_0)}\right).

5.8 Applications
In this section we present applications involving nonlinear statistical models and discuss their
implementation with The SAS® System. A good computer program for nonlinear modeling
should provide simple commands to generate the standard results of any nonlinear analysis,
such as parameter estimates, their standard errors, hypothesis tests or confidence intervals for
the parameters, confidence and prediction intervals for the response, and residuals. It should
also allow different fitting methods such as the Gauss-Newton, Newton-Raphson, and Levenberg-
Marquardt methods in their appropriately modified forms. An automatic differentiator helps
the user avoid having to code first (Gauss-Newton) and second (Newton-Raphson)
derivatives of the mean function with respect to the parameters. The nlin procedure of The
SAS® System meets these requirements.
In §5.8.1 we analyze a simple data set on sugar cane yields with Mitscherlich's law of
physiological relationships. The primary purpose is to illustrate a standard nonlinear
regression analysis with The SAS® System. But we also provide some additional details on
the genesis of this model that is key in agricultural investigations of crop yields as a function
of nutrient availability and fit the model in different parameterizations. As discussed in
§5.7.1, changing the parameterization of a model changes the statistical properties of the esti-
mators. Ratkowsky's simulation method (Ratkowsky 1983) is implemented for the sugar cane
yield data in §5.8.2 to compare the sampling distributions, bias, and excess variance of
estimators in different parameterizations of the Mitscherlich equation.



Linear-plateau models play an important role in agronomy. §5.8.3 shows some of the
basic operations in fitting a linear-plateau model and its relatives. In §5.8.4 we compare pa-
rameters among linear-plateau models corresponding to treatment groups.
Many nonlinear problems require additional programming. A shortcoming of many non-
linear regression packages, proc nlin being no exception, is their inability to test linear and
nonlinear hypotheses. This is not too surprising since the meaningful set of hypotheses in a
nonlinear model depends on the context (model, data, parameterization). For example, the
Wald-type F-test of a linear hypothesis H_0: \mathbf{A}\boldsymbol{\theta} = \mathbf{d} in general requires coding the A matrix
and computing the Wald statistic

    F_{obs} = \left(\mathbf{A}\hat{\boldsymbol{\theta}} - \mathbf{d}\right)'\left[\mathbf{A}(\mathbf{F}'\mathbf{F})^{-1}\mathbf{A}'\right]^{-1}\left(\mathbf{A}\hat{\boldsymbol{\theta}} - \mathbf{d}\right)\Big/\left(q\hat{\sigma}^2\right)

in a matrix programming language. Only when the A matrix has a simple form can the nlin
procedure be tricked into calculating the Wald test directly. The SAS® System provides proc
iml, an interactive matrix language, to perform these tasks as part of the SAS/IML® module.
The estimated asymptotic variance-covariance matrix \hat{\sigma}^2(\mathbf{F}'\mathbf{F})^{-1} can be output by proc nlin
and read into proc iml. Fortunately, the nlmixed procedure, which was added in Release 8.0
of The SAS® System, has the ability to perform tests of linear and nonlinear combinations of
the model parameters, eliminating the need for additional matrix programming. We
demonstrate the Wald test for treatment comparisons in §5.8.5, where a nonlinear response is
analyzed in a 2 × 3 × 6 factorial design. We analyze the factorial, testing for main effects and
interactions, and perform pairwise treatment comparisons based on the nonlinear parameters,
akin to multiple comparisons in a linear analysis of variance model. The nlmixed procedure
can be used there to formulate contrasts efficiently.
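
To make the matrix programming route concrete, the following proc iml sketch (ours) computes F_obs for a single-row A matrix. The estimates and covariance entries are assembled from Output 5.9 of §5.8.1 rather than passed automatically, and the hypothesis H0: alpha = 250 is an illustrative choice.

proc iml;
  /* Estimates and asymptotic covariance matrix assembled from Output 5.9 */
  theta = {205.8, 0.0112, -38.7728};
  C = {  79.95   -0.0155   -44.03,
         -0.0155  3.46E-6    0.0111,
        -44.03    0.0111    42.98  };    /* sigma2-hat * inv(F`*F) */
  A = {1 0 0};                           /* H0: alpha = 250 */
  d = {250};
  q = nrow(A);                           /* rank of A */
  diff = A*theta - d;
  Fobs = diff` * inv(A*C*A`) * diff / q;
  pval = 1 - probf(Fobs, q, 3);          /* 3 residual df in this example */
  print Fobs pval;
quit;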
Dose-response models such as the logistic or log-logistic models are among the most
frequently used nonlinear equations. Although they offer a great deal of flexibility, they are
no panacea for every data set of dose responses. One limitation of the logistic-type models,
for example, is that the response monotonically increases or decreases. Hormetic effects,
where small dosages of an otherwise toxic substance can have beneficial effects, can throw
off a dose-response investigation with a logistic model considerably. In §5.8.6 we provide
details on how to construct hormetic models and examine a data set used by Schabenberger et
al. (1999) to compare effective dosages between two herbicides, where for a certain weed
species one herbicide induces a hormetic response while the other does not.
Yield-density models are a special class of models closely related to linear models. Most
yield-density models can be linearized. In §5.8.7 we fit yield-density models to a data set by
Mead (1970) on the yield of different onion varieties. Tests of hypotheses comparing the
varieties as well as estimation of the genetic and environmental potentials are key in this
investigation, which we carry out using the nlmixed procedure.
The homogeneous variance assumption is not always tenable in nonlinear regression
analyses just as it is not tenable for many linear models. Transformations that stabilize the
variance may destroy other desirable properties of the model and transformations that
linearize the model do not necessarily stabilize the variance (§5.6). In the case of hetero-
geneous error variances we prefer to use weighted nonlinear least squares instead of transfor-
mations. In §5.8.8 we apply weighted nonlinear least squares to a growth modeling problem
and employ a grouping approach to determine appropriate weights.



5.8.1 Basic Nonlinear Analysis with The SAS® System —
Mitscherlich's Yield Equation
In this section we analyze a small data set on the yields of sugar cane as a function of nitrogen
fertilization. The purpose here is not to draw precise inferences and conclusions about the
relationship between crop yield and fertilization, but to demonstrate the steps in fitting a
nonlinear model with proc nlin of The SAS® System, from the derivation of starting values
to the actual fitting, the testing of hypotheses, and the examination of the effects of
parameterizations. The data for this exercise are shown in Table 5.4.

Table 5.4. Sugar cane yield in randomized complete block design with five blocks
and six levels of nitrogen fertilization

  Nitrogen                        Block                            Treatment
  (kg/ha)       1        2        3        4        5            sample mean
      0       89.49    54.56    74.33    78.20    61.51             71.62
     25      108.78   102.01   105.04   105.23   106.52            105.51
     50      136.28   129.51   132.54   132.73   134.02            133.01
    100      157.63   167.39   155.39   146.85   155.81            156.57
    150      185.96   176.66   178.53   195.34   185.56            184.41
    200      195.09   190.43   183.52   180.99   205.69            191.14

[Figure 5.14 here: treatment sample means across blocks plotted against N (kg/ha).]

Figure 5.14. Treatment average sugar cane yield vs. nitrogen level applied.

The averages for the nitrogen levels calculated across blocks increase monotonically with
the amount of N applied (Figure 5.14). The data do not indicate a decline or a maximum yield
at some N level within the range of fertilizer applied, nor do they indicate a linear-plateau
relationship, since the yield does not appear constant for any level of nitrogen fertilization.
Rather, the maximum yield is approached asymptotically. At the control level (0 N), the
average yield is of course not 0; it corresponds to the natural fertility of the soil.



Liebig's law of constant returns, which implies a linear-plateau relationship (see §5.1),
certainly does not apply in this situation. Mitscherlich (1909) noticed by studying experimen-
tal data that the rate of yield (y) increase is often not constant but changes with the amount
of fertilizer applied (x). Since the yield relationships he examined appeared to approach some
maximum value α asymptotically, he postulated that the rate of change \partial y/\partial x is proportional
to the difference between the maximum (α) and the current yield. Consequently, as yield
approaches its maximum, the rate of change with fertilizer application approaches zero.
Mathematically this relationship between yield y and fertilizer amount x is expressed through
the generating equation

    \partial y/\partial x = \kappa(\alpha - y).    [5.50]

The parameter κ is the proportionality constant that Mitscherlich called the effect-factor
(Wirkungsfaktor in the original publication, which appeared in German). The larger κ, the
faster yield approaches its asymptote. Solving this generating equation leads to various
mathematical forms (parameterizations) of the Mitscherlich equation that are known under
different names. Four of them are given in §5.7; a total of eight parameterizations are
presented in §A5.9.1. We prefer to call simply the Mitscherlich equation what has been
termed Mitscherlich's law of physiological relationships. Two common forms of the equation
are

    y(x) = \alpha\left(1 - \exp\{-\kappa(x - x_0)\}\right)
    y(x) = \alpha + (\xi - \alpha)\exp\{-\kappa x\}.    [5.51]
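
To fill in the intermediate step (a sketch of ours): separating variables in [5.50] gives

    \frac{dy}{\alpha - y} = \kappa\,dx \quad\Rightarrow\quad -\ln(\alpha - y) = \kappa x + c \quad\Rightarrow\quad y(x) = \alpha - Ce^{-\kappa x}.

Imposing y(x_0) = 0 yields C = \alpha e^{\kappa x_0} and the first form of [5.51]; imposing y(0) = \xi yields C = \alpha - \xi and the second form.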

[Figure 5.15 here. Legend: Model 1: κ = 0.03, x_0 = -20, ξ = 45.12; Model 2: κ = 0.04, x_0 = -20, ξ = 55.07; Model 3: κ = 0.05, x_0 = -10, ξ = 39.35.]

Figure 5.15. Mitscherlich equations for different values of κ, x_0, and ξ; α = 100.

Both equations are three-parameter models (Figure 5.15). In the first, the parameters are
α, κ, and x_0; in the second equation the parameters are α, ξ, and κ. α represents the
asymptotic yield and x_0 is the fertilizer concentration at which the yield is 0, i.e., y(x_0) = 0.
Black (1993, p. 273) calls -x_0 the availability index of the nutrient in the soil (and seed)
when none is added in the fertilizer or, as Mead et al. (1993, p. 264) put it, "the amount of
fertilizer already in the soil." Since x = 0 fertilizer is the minimum that can be applied, x_0 is
obtained by extrapolating the yield-nutrient relationship below the lowest rate to the point
where yield is exactly zero (Figure 5.15). This assumes that the Mitscherlich equation extends
past the lowest level applied, which may not be the case. We therefore caution against
attaching too much validity to the parameter x_0. The second parameterization replaces the
parameter x_0 with the yield that is obtained if no fertilizer is added, \xi = y(0). The relation-
ship between the two parameterizations is

    \xi = \alpha(1 - \exp\{\kappa x_0\}).

In both model formulas, the parameter κ is a scale parameter that governs the rate of
change. It is not the rate of change, as is sometimes stated. Figure 5.15 shows three
Mitscherlich equations with asymptote α = 100 that vary in κ and x_0. With increasing κ, the
asymptote is reached more quickly (compare Models 1 and 2 in Figure 5.15). It is also clear
from the figure that x_0 is an extrapolated value.
One of the methods for finding starting values in a nonlinear model that was outlined in
§5.4.3 relies on the expected value parameterization of the model (Ratkowsky 1990, Ch.
2.3.1). Here we choose values of the regressor variables and rewrite the model in terms of the
mean response at those values. We call these expected value parameters since they corres-
pond to the means at the particular values of the regressors that were chosen. For each
regressor value for which an expected value parameter is obtained, one parameter of the
original model is replaced. An expected value parameterization for the Mitscherlich model
due to Schnute and Fournier (1980) is

    y(x) = \mu^* + (\mu^{**} - \mu^*)\left(1 - \theta^{m-1}\right)\big/\left(1 - \theta^{n-1}\right)
    m - 1 = (n - 1)(x - x^*)/(x^{**} - x^*)    [5.52]
    n = number of observations.

Here, μ* and μ** are the expected value parameters for the yield at nutrient levels x* and x**,
respectively. Expected value parameterizations have advantages and disadvantages. Finding
starting values is particularly simple if a model is written in terms of expected value param-
eters. If, for example, x* = 25 and x** = 150 are chosen for the sugar cane data, reasonable
starting values (from Figure 5.14) are identified as μ*_0 = 120 and μ**_0 = 175. Models in expec-
ted value parameterization are also closer to linear models in terms of the statistical properties
of the estimators because of low parameter-effects curvature. Ratkowsky (1990) notes that the
Mitscherlich model is notorious for high parameter-effects curvature, which gives particular
relevance to [5.52]. A disadvantage is that the interpretation of parameters in terms of
physical or biological quantities is lost compared to other parameterizations.
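
As a quick check of [5.52] (our addition): at x = x* we have m - 1 = 0, and at x = x** we have m - 1 = n - 1, so

    y(x^*) = \mu^* + (\mu^{**} - \mu^*)\frac{1 - \theta^0}{1 - \theta^{n-1}} = \mu^*,
    \qquad
    y(x^{**}) = \mu^* + (\mu^{**} - \mu^*)\frac{1 - \theta^{n-1}}{1 - \theta^{n-1}} = \mu^{**},

confirming that μ* and μ** are the mean yields at x* and x**.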
Starting values for the standard forms of the Mitscherlich model can also be found
relatively easily by using the various devices described in §5.4.3. Consider the model

    y(x) = \alpha\left(1 - \exp\{-\kappa(x - x_0)\}\right).

Since α is the upper asymptote, Figure 5.14 would suggest a starting value of α_0 = 200.
Once α is fixed we can rewrite the model as

    \ln\{\alpha_0 - y\} = \ln\{\alpha\} + \kappa x_0 - \kappa x.

This is a linear regression with response \ln\{\alpha_0 - y\}, intercept \ln\{\alpha\} + \kappa x_0, and slope -\kappa. For
the averages of the sugar cane yield data listed in Table 5.4 and graphed in Figure 5.14 we
can use proc reg in The SAS® System to find a starting value for κ by fitting the linear
regression.
data CaneMeans; set CaneMeans;
y2 = log(200-yield);
run;
proc reg data=CaneMeans; model y2 = nitro; run; quit;

From Output 5.8 we gather that a reasonable starting value is κ_0 = 0.01356. We delib-
erately ignore all other results from this linear regression since the value α_0 = 200 that was
substituted to enable a linearization by taking logarithms was only a guess. Finally, we need a
starting value for x_0, the nutrient concentration at which the yield is 0. Visually extrapolating
the response trend in Figure 5.14, a value of x_{0,0} = -25 seems not unreasonable as a first
guess.

Output 5.8.
The REG Procedure
Model: MODEL1
Dependent Variable: y2

Analysis of Variance

Sum of Mean
Source DF Squares Square F Value Pr > F

Model 1 5.45931 5.45931 320.02 <.0001


Error 4 0.06824 0.01706
Corrected Total 5 5.52755

Root MSE 0.13061 R-Square 0.9877


Dependent Mean 3.71778 Adj R-Sq 0.9846
Coeff Var 3.51315

Parameter Estimates

Parameter Standard
Variable DF Estimate Error t Value Pr > |t|

Intercept 1 4.90434 0.08510 57.63 <.0001


nitro 1 -0.01356 0.00075804 -17.89 <.0001

Now that starting values have been assembled we fit the nonlinear regression model with
proc nlin. The statements that accomplish this in the parameterization for which the starting
values were obtained are
proc nlin data=CaneMeans method=newton;
parameters alpha=200 kappa=0.0136 nitro0=-25;
Mitscherlich = alpha*(1-exp(-kappa*(nitro-nitro0)));
model yield = Mitscherlich;
run;

The method=newton option of the proc nlin statement selects the Newton-Raphson algo-
rithm. If the method= option is omitted, the procedure defaults to the Gauss-Newton method.
In either case proc nlin does not implement the unmodified algorithms; it provides internally
the necessary modifications, such as step-halving, that stabilize the algorithm. Among the other
fitting methods that can be chosen is the Marquardt-Levenberg algorithm (method=marquardt),
which is appropriate if the columns of the derivative matrix are highly correlated (Marquardt 1963).
Prior to Release 6.12 of The SAS® System the user had to specify first derivatives of the
mean function with respect to all parameters for the Gauss-Newton method, and first and
second derivatives for the Newton-Raphson method. To circumvent the specification of
derivatives, one could use method=dud, which invoked a derivative-free method (Ralston and
Jennrich 1978). The acronym stands for Does not Use Derivatives and the method enjoyed
popularity because of this feature. The numerical properties of this algorithm are typically
poor, however, and the algorithm is not efficient in terms of computing time. Since The SAS® System
calculates derivatives automatically starting with Release 6.12, the DUD method should no
longer be used. There is no justification in our opinion for using a method that approximates
derivatives over one that determines the actual derivatives. Even in newer releases the user
can still enter derivatives through der. statements of proc nlin. For the Mitscherlich model
above one would code, for example,
proc nlin data=CaneMeans method=gauss;
parameters alpha=200 kappa=0.0136 nitro0=-25;
Mitscherlich = alpha*(1-exp(-kappa*(nitro-nitro0)));
model yield = Mitscherlich;
der.alpha = 1 - exp(-kappa * (nitro - nitro0));
der.kappa = alpha * ((nitro - nitro0) * exp(-kappa * (nitro - nitro0)));
der.nitro0 = alpha * (-kappa * exp(-kappa * (nitro - nitro0)));
run;

to obtain a Gauss-Newton fit of the model. The added programming is not worth the trouble,
and mistakes in coding the derivatives are costly; moreover, the built-in differentiator of proc
nlin is of such high quality that we recommend allowing The SAS® System to determine the
derivatives. If the user wants to examine the derivatives used by SAS®, add the two options
list listder to the proc nlin statement.

Finally, the results of fitting the Mitscherlich model by the Newton-Raphson method in
the parameterization

    y_i = \alpha\left(1 - \exp\{-\kappa(x_i - x_0)\}\right) + e_i, \quad i = 1, \ldots, 6,

where the e_i are uncorrelated random errors with mean 0 and variance \sigma^2, with the statements
proc nlin data=CaneMeans method=newton noitprint;
parameters alpha=200 kappa=0.0136 nitro0=-25;
Mitscherlich = alpha*(1-exp(-kappa*(nitro-nitro0)));
model yield = Mitscherlich;
run;

are shown as Output 5.9. The procedure converges after six iterations with a residual sum of
squares of S(θ̂) = 57.2631. The model fits the data very well as measured by

    \text{Pseudo-}R^2 = 1 - \frac{57.2631}{10{,}775.8} = 0.9947.
The converged iterates (the parameter estimates) are

    \hat{\boldsymbol{\theta}} = [\hat{\alpha}, \hat{\kappa}, \hat{x}_0]' = [205.8,\; 0.0112,\; -38.7728]',

from which a prediction of the mean yield at fertilizer level 60 kg ha^{-1}, for example, can
be obtained as

    \hat{y}(60) = 205.8 \times \left(1 - \exp\{-0.0112 \times (60 + 38.7728)\}\right) = 137.722.



Output 5.9.
The NLIN Procedure
Iterative Phase
Dependent Variable yield
Method: Newton

NOTE: Convergence criterion met.

Estimation Summary

Method Newton
Iterations 6
R 8.93E-10
PPC 6.4E-11
RPC(kappa) 7.031E-6
Object 6.06E-10
Objective 57.26315
Observations Read 6
Observations Used 6
Observations Missing 0

Sum of Mean Approx


Source DF Squares Square F Value Pr > F

Regression 3 128957 42985.6 280.77 0.0004


Residual 3 57.2631 19.0877
Uncorrected Total 6 129014
Corrected Total 5 10775.8

Approx Approximate 95% Confidence


Parameter Estimate Std Error Limits

alpha 205.8 8.9415 177.3 234.2


kappa 0.0112 0.00186 0.00529 0.0171
nitro0 -38.7728 6.5556 -59.6360 -17.9095

Approximate Correlation Matrix

alpha kappa nitro0


alpha 1.0000000 -0.9300182 -0.7512175
kappa -0.9300182 1.0000000 0.9124706
nitro0 -0.7512175 0.9124706 1.0000000

For each parameter in the parameters statement proc nlin lists its estimate, (asymptotic)
estimated standard error, and (asymptotic) 95% confidence interval. For example, α̂ = 205.8
with ease(α̂) = 8.9415. The asymptotic 95% confidence interval for α is calculated as

    \hat{\alpha} \pm t_{0.025,3} \times \text{ease}(\hat{\alpha}) = 205.8 \pm 3.182 \times 8.9415 = [177.3,\; 234.2].

Based on this interval one would, for example, reject the hypothesis that the upper yield
asymptote is 250 and fail to reject the hypothesis that the asymptote is 200.
The printout of the Approximate Correlation Matrix lists the estimated correlation
coefficients between the parameter estimates,

    \widehat{\text{Corr}}\left[\hat{\theta}_j, \hat{\theta}_k\right] = \frac{\text{Cov}\left[\hat{\theta}_j, \hat{\theta}_k\right]}{\sqrt{\text{Var}\left[\hat{\theta}_j\right]\text{Var}\left[\hat{\theta}_k\right]}}.

The parameter estimators are fairly highly correlated: Corr[α̂, κ̂] = -0.93, Corr[α̂, x̂_0] =
-0.75, Corr[κ̂, x̂_0] = 0.912. Studying the derivatives of the Mitscherlich model in this
parameterization, this is not surprising. They all involve the same term

    \exp\{-\kappa(x - x_0)\}.

Highly correlated parameter estimators are indicative of poor conditioning of the F'F matrix
and can cause instabilities during the iterations. In the presence of large correlations one should
switch to the Marquardt-Levenberg algorithm or change the parameterization. Below we will
see how the expected value parameterization leads to considerably smaller correlations.
If the availability index -x_0 is of lesser interest than the control yield, one can obtain an
estimate of ŷ(0) = ξ̂ from the parameter estimates in Output 5.9. Since \xi =
\alpha(1 - \exp\{\kappa x_0\}), we simply substitute estimates for the unknowns and obtain

    \hat{\xi} = \hat{\alpha}\left(1 - \exp\{\hat{\kappa}\hat{x}_0\}\right) = 205.8 \times \left(1 - \exp\{-0.0112 \times 38.7728\}\right) = 72.5.

Although it is easy to obtain the point estimate of ξ, it is not a simple task to calculate the
standard error of this estimate, needed for a confidence interval, for example. Two possibi-
lities exist to accomplish that. One can refit the model in the parameterization

    y_i = \alpha + (\xi - \alpha)\exp\{-\kappa x_i\} + e_i,

that explicitly involves ξ. proc nlin calculates approximate standard errors and 95%
confidence intervals for each parameter. The second method uses the capabilities of proc
nlmixed to estimate the standard error of nonlinear functions of the parameters by the delta
method. We demonstrate both approaches.
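
For reference (our addition), the delta method approximates the variance of a differentiable function g(θ̂) of the estimates by a first-order Taylor expansion; for the control yield g(\boldsymbol{\theta}) = \alpha(1 - \exp\{\kappa x_0\}),

    \widehat{\text{Var}}\left[g(\hat{\boldsymbol{\theta}})\right] \approx
    \left(\frac{\partial g}{\partial \boldsymbol{\theta}}\bigg|_{\hat{\boldsymbol{\theta}}}\right)'
    \widehat{\text{Var}}\left[\hat{\boldsymbol{\theta}}\right]
    \left(\frac{\partial g}{\partial \boldsymbol{\theta}}\bigg|_{\hat{\boldsymbol{\theta}}}\right),

where the gradient is evaluated at the estimates. This is the standard first-order result; proc nlmixed applies it to expressions placed in estimate statements.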
The statements
proc nlin data=CaneMeans method=newton noitprint;
parameters alpha=200 kappa=0.0136 ycontrol=72;
Mitscherlich = alpha + (ycontrol - alpha)*(exp(-kappa*nitro));
model yield = Mitscherlich;
run;

fit the model in the new parameterization (Output 5.10). The quality of the model fit has not
changed from the first parameterization in terms of x_0. The analysis of variance tables in
Outputs 5.9 and 5.10 are identical. Furthermore, the estimates of α and κ and their standard
errors have not changed. The parameter labeled ycontrol now replaces the term nitro0, and
its estimate agrees with the calculation based on the estimates of the model in the first param-
eterization. From Output 5.10 we are able to state that with (approximately) 95% confidence
the interval [59.625, 85.440] contains the control yield.
Using proc nlmixed, the parameterization of the model need not be changed in order to
obtain an estimate and a confidence interval of the control yield:
proc nlmixed data=CaneMeans df=3 technique=NewRap;
parameters alpha=200 kappa=0.0136 nitro0=-25;
s2 = 19.0877;
Mitscherlich = alpha*(1-exp(-kappa*(nitro-nitro0)));
model yield ~ normal(Mitscherlich,s2);
estimate 'ycontrol' alpha*(1-exp(kappa*nitro0));
run;

The variance of the error distribution is fixed at the residual mean square from the earlier
fits (see Output 5.10). Otherwise proc nlmixed will estimate the residual variance by maxi-
mum likelihood, which would not correspond to an analysis equivalent to what is shown in
Outputs 5.9 and 5.10. The df=3 option in the proc nlmixed statement ensures that the proce-
dure uses the same residual degrees of freedom as proc nlin. The procedure converged in six
iterations to parameter estimates identical to those in Output 5.9 (see Output 5.11). The esti-
mate of the control yield shown under Additional Estimates is identical to that in Output
5.10.

Output 5.10.
The NLIN Procedure
Iterative Phase
Dependent Variable yield
Method: Newton

NOTE: Convergence criterion met.

Estimation Summary

Method Newton
Iterations 5
R 4.722E-8
PPC(kappa) 1.355E-8
RPC(alpha) 0.000014
Object 2.864E-7
Objective 57.26315
Observations Read 6
Observations Used 6
Observations Missing 0

Sum of Mean Approx


Source DF Squares Square F Value Pr > F

Regression 3 128957 42985.6 280.77 0.0004


Residual 3 57.2631 19.0877
Uncorrected Total 6 129014
Corrected Total 5 10775.8

Approx Approximate 95% Confidence


Parameter Estimate Std Error Limits

alpha 205.8 8.9415 177.3 234.2


kappa 0.0112 0.00186 0.00529 0.0171
ycontrol 72.5329 4.0558 59.6253 85.4405

Approximate Correlation Matrix


alpha kappa initial
alpha 1.0000000 -0.9300182 0.3861973
kappa -0.9300182 1.0000000 -0.5552247
initial 0.3861973 -0.5552247 1.0000000

Finally, we fit the Mitscherlich model in the expected value parameterization [5.52] and
choose x* = 25 and x** = 150.
proc nlin data=CaneMeans method=newton;
parameters mustar=125 mu2star=175 theta=0.75;
n = 6; xstar = 25; x2star = 150;
m = (n-1)*(nitro-xstar)/(x2star-xstar) + 1;
Mitscherlich = mustar + (mu2star-mustar)*(1-theta**(m-1))/
(1-theta**(n-1));
model yield = Mitscherlich;
run;





The model fit is identical to the preceding parameterizations (Output 5.12). Notice that
the correlations among the parameters are markedly reduced. The estimators of μ* and μ** are
almost orthogonal.

Output 5.11. The NLMIXED Procedure

Specifications
Data Set WORK.CANEMEANS
Dependent Variable yield
Distribution for Dependent Variable Normal
Optimization Technique Newton-Raphson
Integration Method None

Dimensions
Observations Used 6
Observations Not Used 0
Total Observations 6
Parameters 3
Parameters
alpha kappa nitro0 NegLogLike
200 0.0136 -25 22.8642468

Iteration History
Iter Calls NegLogLike Diff MaxGrad Slope
1 10 16.7740728 6.090174 24.52412 -11.0643
2 15 16.1072166 0.666856 1767.335 -1.48029
3 20 15.8684503 0.238766 21.15303 -0.46224
4 25 15.8608402 0.00761 36.2901 -0.01514
5 30 15.8607649 0.000075 0.008537 -0.00015
6 35 15.8607649 6.02E-10 0.000046 -1.21E-9

NOTE: GCONV convergence criterion satisfied.

Fit Statistics
-2 Log Likelihood 31.7
AIC (smaller is better) 37.7
AICC (smaller is better) 49.7
BIC (smaller is better) 37.1

Parameter Estimates
Standard
Parameter Estimate Error DF t Value Pr>|t| Lower Upper
alpha 205.78 8.9496 3 22.99 0.0002 177.30 234.26
kappa 0.01121 0.001863 3 6.02 0.0092 0.005281 0.01714
nitro0 -38.7728 6.5613 3 -5.91 0.0097 -59.6539 -17.8917

Additional Estimates
Standard
Label Estimate Error DF t Value Pr > |t| Lower Upper
ycontrol 72.5329 4.0571 3 17.88 0.0004 59.6215 85.4443

To obtain a smooth graph of the response function at a larger number of N concentrations
than were applied, and approximate 95% confidence limits for the mean predictions, we can
use a simple trick. To the data set we append a filler data set containing the concentrations at
which the mean sugar cane yield is to be predicted. The response variable is set to missing
values in this data set. SAS® will ignore the observations with missing response in fitting the
model, but use the observations that have regressor information to calculate predictions. To
obtain predictions at nitrogen concentrations between 0 and 200 kg ha^{-1} in steps of 2
kg ha^{-1} we use the following code.



data filler; do nitro=0 to 200 by 2; yield=.; pred=1; output; end; run;
data FitThis; set CaneMeans filler; run;

proc nlin data=FitThis method=newton;


parameters alpha=200 kappa=0.0136 nitro0=-25;
Mitscherlich = alpha*(1-exp(-kappa*(nitro-nitro0)));
model yield = Mitscherlich;
output out=nlinout predicted=predicted u95m=upperM l95m=lowerM;
run;
proc print data=nlinout(obs=15); run;

Output 5.12. The NLIN Procedure


Iterative Phase
Dependent Variable yield
Method: Newton
Sum of
Iter mustar mu2star theta Squares
0 125.0 175.0 0.7500 1659.6
1 105.3 180.9 0.7500 57.7504
2 105.1 181.0 0.7556 57.2632
3 105.1 181.0 0.7556 57.2631
NOTE: Convergence criterion met.

Estimation Summary
Method Newton
Iterations 3
R 4.211E-8
PPC 3.393E-9
RPC(mustar) 0.000032
Object 1.534E-6
Objective 57.26315
Observations Read 6
Observations Used 6
Observations Missing 0

Sum of Mean Approx


Source DF Squares Square F Value Pr > F
Regression 3 128957 42985.6 280.77 0.0004
Residual 3 57.2631 19.0877
Uncorrected Total 6 129014
Corrected Total 5 10775.8

Approx Approximate 95% Confidence


Parameter Estimate Std Error Limits

mustar 105.1 2.5101 97.1083 113.1


mu2star 181.0 2.4874 173.1 188.9
theta 0.7556 0.0352 0.6437 0.8675

Approximate Correlation Matrix

mustar mu2star theta


mustar 1.0000000 0.0606554 -0.3784466
mu2star 0.0606554 1.0000000 0.1064661
theta -0.3784466 0.1064661 1.0000000

The variable pred was set to one for observations in the filler data set to distinguish
actual observations from filler data. The output out= statement saves predicted values and 95%
confidence limits for the mean yield in the data set nlinout. The first fifteen observations of
the output data set are shown below, and a graph of the predictions is illustrated in Figure
5.16. Observations for which pred=. are the observations to which the model is fitted;
observations with pred=1 are the filler data.





Output 5.13.
Obs nitro yield pred PREDICTED LOWERM UPPERM

1 0 71.62 . 72.533 59.626 85.440


2 25 105.51 . 105.097 97.108 113.085
3 50 133.01 . 129.702 120.515 138.889
4 100 156.57 . 162.343 153.875 170.812
5 150 184.41 . 180.980 173.064 188.896
6 200 191.14 . 191.621 179.874 203.368
7 0 . 1 72.533 59.626 85.440
8 2 . 1 75.487 63.454 87.520
9 4 . 1 78.375 67.128 89.623
10 6 . 1 81.200 70.647 91.752
11 8 . 1 83.961 74.015 93.908
12 10 . 1 86.662 77.232 96.092
13 12 . 1 89.303 80.302 98.303
14 14 . 1 91.885 83.229 100.540
15 16 . 1 94.410 86.020 102.799

[Figure 5.16 here: predicted sugar cane yield plotted against nitrogen (kg/ha).]

Figure 5.16. Predicted sugar cane yields and approximate 95% confidence limits.

5.8.2 The Sampling Distribution of Nonlinear Estimators —
the Mitscherlich Equation Revisited
In §5.7.1 we discussed that strong intrinsic and parameter-effects curvature in a nonlinear
model can have serious detrimental effects on parameter estimators, statistical inference, the
behavior of residuals, etc. To recall the main results of that section: intrinsic curvature does
not affect the fit of the model, in the sense that different parameterizations produce the same
set of residuals and leave the analysis of variance decomposition unchanged. Changing the
parameterization of the model does affect the parameter-effects curvature of the model,
however. Parameter estimators in nonlinear models with low curvature behave like their
counterparts in the linear model. They are approximately Gaussian-distributed even if the
model errors are not, they are asymptotically unbiased, and they have minimum variance. In non-
linear models some bias will always be incurred due to the nature of the process. Ratkowsky
(1983) focuses on the parameter-effects curvature of nonlinear models in particular because
(i) he contends that it is the larger of the two curvature components, and (ii) it can be
influenced by the modeler through changing the parameterization. The expected value para-
meterizations he developed supposedly lead to models with low parameter-effects curvature,
and we expect the estimators in these models to exhibit close-to-linear behavior.

Measures of curvature are not easily computed, as they depend on second derivatives of
the mean function (Beale 1960, Bates and Watts 1980, Seber and Wild 1989, Ch. 4). A
relatively simple approach that allows one to examine the effects of curvature on the distri-
bution of the parameter estimators was given in Ratkowsky (1983). It relies on simulating the
sampling distribution of θ̂ and calculating test statistics to examine bias, excess variability,
and non-Gaussianity of the estimators. We now discuss and apply these ideas to the sugar
cane yield data of §5.8.1. The two parameterizations we compare are one of the standard
equations,

    E[Y_i] = \alpha\left(1 - \exp\{-\kappa(x_i - x_0)\}\right),

and the expected value parameterization [5.52].


To outline the method we focus on the first parameterization, but similar steps are taken
for any parameterization. First, the model is fit to the data at hand (Table 5.4, Figure 5.14)
and the nonlinear least squares estimates \hat{\boldsymbol{\theta}} = [\hat{\theta}_1, \ldots, \hat{\theta}_p]' and their estimated asymptotic
standard errors are obtained along with the residual mean square. These values are shown in
Output 5.9 (p. 243). For example, MSR = 19.0877. Then simulate K data sets, keeping the
regressor values as in Table 5.4, and set the parameters of the model equal to the estimates θ̂.
The error distribution is chosen to be Gaussian with mean zero and variance MSR.
Ratkowsky (1983) recommends selecting K fairly large, K = 1000, say. For each of these K
data sets the model is fit by nonlinear least squares and we obtain parameter vectors
\hat{\boldsymbol{\theta}}_1, \ldots, \hat{\boldsymbol{\theta}}_K. Denote by \hat{\theta}_{ij} the jth (j = 1, \ldots, K) estimate of the ith element of \boldsymbol{\theta} and by \hat{\theta}_i the ith
element of θ̂ from the fit of the model to the original (nonsimulated) data. The relative bias in
estimating \theta_i is calculated as

    \text{RelativeBias\%}_i = 100 \times \left(\frac{\hat{\theta}_{i\cdot} - \hat{\theta}_i}{\hat{\theta}_i}\right),    [5.53]

where \hat{\theta}_{i\cdot} is the average of the estimates \hat{\theta}_{ij} across the K simulations, i.e.,

    \hat{\theta}_{i\cdot} = \frac{1}{K}\sum_{j=1}^{K}\hat{\theta}_{ij}.

If the curvature is strong, the estimated variance of the parameter estimators is an under-
estimate. Similar to the relative bias we calculate the relative excess variance as

    \text{RelativeExcessVariance\%}_i = 100 \times \left(\frac{s_i^2 - \widehat{\text{Var}}\left[\hat{\theta}_i\right]}{\widehat{\text{Var}}\left[\hat{\theta}_i\right]}\right).    [5.54]

Here, s_i^2 is the sample variance of the estimates of \theta_i in the simulations,

    s_i^2 = \frac{1}{K-1}\sum_{j=1}^{K}\left(\hat{\theta}_{ij} - \hat{\theta}_{i\cdot}\right)^2,

and \widehat{\text{Var}}[\hat{\theta}_i] is the estimated asymptotic variance of \hat{\theta}_i from the original fit. Whether the
relative bias and the excess variance are significant can be tested by calculating the test statistics

    Z^{(1)} = \sqrt{K}\,\frac{\hat{\theta}_{i\cdot} - \hat{\theta}_i}{\widehat{\text{Var}}\left[\hat{\theta}_i\right]^{1/2}}

    Z^{(2)} = \sqrt{2Ks_i^2\big/\widehat{\text{Var}}\left[\hat{\theta}_i\right]} - \sqrt{2(K-1) - 1}.

Z^{(1)} and Z^{(2)} are compared against cutoffs of a standard Gaussian distribution to determine
the significance of the bias (Z^{(1)}) or the variance excess (Z^{(2)}). This seems like a lot of
trouble to determine whether model curvature induces bias and excess variability of the
coefficients, but it is fairly straightforward to implement the process with The SAS® System.
The complete code, including tests for Gaussianity, histograms of the parameter estimates in
the simulations, and statistical tests for excess variance and bias, can be found on the CD-
ROM.
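
A minimal sketch of the simulation step follows (ours; the seed, data set names, and ODS table name are our choices), using the estimates and residual mean square from Output 5.9. The full program on the CD-ROM adds the bias, excess-variance, and normality computations.

/* Generate K = 1000 data sets from the fitted model (estimates from Output 5.9) */
data sim;
  set CaneMeans;                          /* the six nitrogen levels */
  do k = 1 to 1000;
    yield = 205.8*(1 - exp(-0.0112*(nitro + 38.7728)))
            + sqrt(19.0877)*rannor(20010);
    output;
  end;
run;
proc sort data=sim; by k; run;

/* Refit the model to every simulated data set, collecting the estimates */
proc nlin data=sim method=newton noitprint;
  parameters alpha=205.8 kappa=0.0112 nitro0=-38.7728;
  model yield = alpha*(1-exp(-kappa*(nitro-nitro0)));
  by k;
  ods output ParameterEstimates=simest;
run;

/* Means and variances of the simulated sampling distributions */
proc means data=simest mean var;
  class parameter;
  var estimate;
run;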

[Figure 5.17 here: smoothed sampling distributions of α̂, κ̂, and x̂_0.]

Figure 5.17. Sampling distribution of parameter estimates when fitting the Mitscherlich equation
in parameterization E[Y_i] = \alpha(1 - \exp\{-\kappa(x_i - x_0)\}) to the data in Table 5.4.



We prefer to smooth the sample histograms of the estimates with a nonparametric kernel
estimator (§4.7). Figure 5.17 shows the smoothed sampling distributions of the estimates in
the standard parameterization of the Mitscherlich equation for K = 1000, and Figure 5.18
displays the expected value parameterization. In Figure 5.17 the parameter estimates of α, the
yield asymptote, and x_0, the index of availability, appear particularly skewed. The
distribution of the former is skewed to the right, that of the latter is skewed to the left. The
distributions of the estimates in the expected value parameterization are much less skewed
(Figure 5.18). Whether the deviation from a Gaussian distribution is significant can be
assessed with a test for normality (Table 5.5). In the standard parameterization all parameter
estimates deviate significantly (at the 5% level) from a Gaussian distribution, even κ̂, which
appears quite symmetric in Figure 5.17. In the expected value parameterization the null
hypothesis of a Gaussian distribution cannot be rejected for any of the three parameters.

[Figure 5.18 here: smoothed sampling distributions of μ̂*, μ̂**, and θ̂.]

Figure 5.18. Sampling distribution of parameter estimates when fitting the Mitscherlich equation
in expected value parameterization to the data in Table 5.4.

Table 5.5. P-values of tests of Gaussianity for standard and expected value parameterization
of Mitscherlich equation fitted to data in Table 5.4

                              Parameter Estimate
             α̂         κ̂         x̂_0        μ̂*       μ̂**      θ̂
  P-value  <0.0001    0.027    <0.0001     0.432    0.320    0.713

Asymptotic inferences relying on the Gaussian distribution of the estimators seem
questionable for this data set based on the normality tests. How about the bias and the
variance excesses? From Table 5.6 it is seen that in the standard parameterization the param-
eters α and x_0 are estimated rather poorly, and their bias is highly significant. Not only are
the distributions of these two estimators not Gaussian (Table 5.5), they are also centered at
the wrong values. Particularly concerning is the large excess variance in the estimate of the
asymptotic yield α. Nonlinear least squares estimation underestimates the variance of this
parameter by 15.1%. Significance tests about the yield asymptote should be interpreted with
the utmost caution. The expected value parameterization fares much better. Its relative biases
are an order of magnitude smaller than the biases in the standard parameterization and not
significant. The variance estimates of μ̂*, μ̂**, and θ̂ are reliable, as shown by the small excess
variances and the large p-values.

Table 5.6. Bias [5.53] and variance [5.54] excesses for standard and expected value
parameterization of Mitscherlich equation fitted to data in Table 5.4

                          Statistic and P-Value
          θ̂_{i·}    RelativeBias%      P       RelativeExcessVariance%      P
  α̂      207.4         0.80        <0.0001            15.10              0.0009
  κ̂      0.011        -0.02          0.98              0.48              0.89
  x̂_0   -39.5          1.96          0.0002            5.12              0.24
  μ̂*     105.2         0.09          0.23              0.10              0.95
  μ̂**    180.9        -0.002         0.97             -2.00              0.66
  θ̂      0.756         0.113         0.44             -0.16              0.99

5.8.3 Linear-Plateau Models and Their Relatives — a Study of
Corn Yields from Tennessee
In the introduction to this chapter we mentioned the linear-plateau model (see Figure 5.1) as a
manifestation of Liebig's law of the minimum, where the rate of change in plant response to
changes in the availability of a nutrient is constant until some concentration is reached at
which other nutrients become limiting and the response attains a plateau. Plateau-type models
are not only applicable in studies of plant nutrition; such relationships can be found in many
other situations.

Watts and Bacon (1974) present data from an experiment where sediment was agitated in
a tank of fluid. After agitation stopped (time t = 0) the height of the clear zone above the
sediment was measured for the next 150 minutes (Figure 5.19). The height of the clear zone
could be modeled as a single function of the time after agitation, or two separate functions
could be combined, one describing the initial upward trend, the other the flattened behavior to the
right. The point at which the switch between the two functions occurs is generally called a
change-point. If the two functions connect, it is also termed a join-point. Watts and Bacon
model the relationship between height of the clear zone and time after agitation ceased with a
variation of the hyperbolic model,

    Y_i = \beta_0 + \beta_1(t_i - \alpha) + \beta_2(t_i - \alpha)\tanh\left\{\frac{t_i - \alpha}{\gamma}\right\} + e_i.

Here, α is the change-point parameter and γ determines the radius of curvature at the change-
point. The two functions connect smoothly.



[Figure 5.19 here: height of clear zone (in) against time in minutes since agitation stopped, with the change point marked.]

Figure 5.19. Sediment settling data based on Table 2 in Watts and Bacon (1974). Reprinted
with permission from Technometrics. Copyright © 1974 by the American Statistical Associa-
tion. All rights reserved.

[Figure 5.20 here: ln(number of primordia) against days since sowing, with a segmented linear trend.]

Figure 5.20. Segmented linear trend for Kirby's wheat shoot apex data (Kirby 1974 and
Lerman 1980). Adapted with permission from estimates reported by Lerman (1980).
Copyright © 1980 by the Royal Statistical Society.

Linear-plateau models are special cases of these segmented models in which the transition
between the segments is not smooth; there is a kink at the join-point. They are in fact special cases
of the linear-slope models that connect two linear segments. Kirby (1974) examined the
shoot-apex development in wheat, studying the natural logarithm of the number of
primordia as a function of days since sowing (Figure 5.20). Arguing on biological grounds, it
was believed that the increase in ln(# primordia) slows down sharply (abruptly) at the end of
spikelet initiation, which can be estimated from mature plants. The kink this creates in the
response is obvious in the model graphed in Figure 5.20, which was considered by Lerman
(1980) for Kirby's data.

If the linear segment on the right-hand side has zero slope, we obtain the linear-plateau
model. Anderson and Nelson (1975) studied various segmented models for crop yield as a
function of fertilizer, the linear-plateau model being a special case. We show some of these
models in Figure 5.21 along with the terminology used in the sequel.

[Figure 5.21 here: E[Y] against X for the linear, linear-plateau, linear-slope, and linear-slope-plateau models.]

Figure 5.21. Some members of the family of linear segmented models. The linear-slope
model (LS) joins two line segments with non-zero slopes, the linear-plateau model (LP) has
two line segments, the second of which has zero slope, and the linear-slope-plateau model
(LSP) has two line segments that connect to a plateau.

Anderson and Nelson (1975) consider the fit of these models to two data sets of corn
yields from twenty-two locations in North Carolina and ten site-years in Tennessee. We
repeat part of the Tennessee data in Table 5.7 (see also Figure 5.22).

Table 5.7. Tennessee average corn yields for two locations and three years as a function of
nitrogen (kg ha^{-1}) based on experiments of Engelstad and Parks (1971)

                        Knoxville                      Jackson
  N (kg ha^{-1})   1962    1963    1964        1962    1963    1964
        0          44.6    45.1    60.9        46.5    29.3    28.8
       67          73.0    73.2    75.9        59.0    55.2    37.6
      134          75.2    89.3    83.7        71.9    77.3    55.2
      201          83.3    91.2    84.3        73.1    88.0    66.8
      268          78.4    91.4    81.8        74.5    89.4    67.0
      335          80.9    88.0    84.5        75.5    87.0    67.8

  Data appeared in Anderson and Nelson (1975). Used with permission of the International
  Biometric Society.



Graphs of the yields for the two locations and three years are shown in Figure 5.22.
Anderson and Nelson (1975) make a very strong case for linear-plateau models and their
relatives in studies relating crop yields to nutrients or fertilizers. They show that using a
quadratic polynomial when a linear-plateau model is appropriate can lead to seriously flawed
inferences (see Figure 5.1, p. 186). The quadratic response model implies a maximum (or
minimum) at some concentration, while the linear-plateau model asserts that yields remain
constant past a critical concentration. Furthermore, the maximum yield achieved under the
quadratic model tends to be too large, with positive bias. The quadratic models also tend to fit
a larger slope at low concentrations than is supported by the data.

The models distinguished by Anderson and Nelson (1975) are the variations of the
linear-plateau model shown in Figure 5.21. Because they do not apply nonlinear fitting
techniques, they advocate fixing the critical concentrations (N levels of the join-points) and fitting
the resulting linear model by standard linear regression techniques. The possible values of the
join-points are varied only to be among the interior nitrogen concentrations (67, 134, ..., 268)
or averages of adjacent points (100.5, 167.5, ..., 301.5).

[Figure 5.22 here: average corn yield (q/ha) against N (kg/ha), in six panels by year (1962 to 1964) and location (Jackson, Knoxville).]

Figure 5.22. Corn yields from two locations and three years, according to Anderson and
Nelson (1975).

Table 5.8 shows the models, their residual sums of squares, and the join-points that
Anderson and Nelson (1975) determined to best fit the particular subsets of the data. Because
we can fit these models as nonlinear models, we can estimate the join-points from the data in
most cases. As we will see, convergence difficulties can be encountered if, for example, a
linear-slope-plateau model is fit in nonlinear form to a data set with only six points. The non-
linear version of this model has five parameters and there may not be sufficient information in
the data to estimate the parameters.





Table 5.8. Results of fitting the models selected by Anderson and Nelson (1975) for the
Tennessee corn yield data of Table 5.7 (The join-points were fixed and the
resulting models fit by linear regression)

  Location    Year   Model Type    SSR    Join-Point 1   Join-Point 2   k†
  Knoxville   1962   LSP‡         14.26        67            201        3
  Knoxville   1963   LP           10.59       100.5                     2
  Knoxville   1964   LP            4.56       100.5                     2
  Jackson     1962   LP           11.06       167.5                     2
  Jackson     1963   LP            7.12       167.5                     2
  Jackson     1964   LP           13.49       201                       2

  † k = Number of parameters estimated in the linear regression model with fixed join-points
  ‡ LSP: Linear-Slope-Plateau model, LP: Linear-Plateau model, LS: Linear-Slope model

Before fitting the models in nonlinear form, we give the mathematical expressions for the
LP, LS, and LSP models from which the model statements in proc nlin will be built. For
completeness we include the simple linear regression model (SLR), too. Let x denote the N con-
centration and \alpha_1, \alpha_2 the two join-points. Recall that I(x > \alpha_1), for example, is the indicator
function that takes on the value 1 if x > \alpha_1 and 0 otherwise. Furthermore, define the following
four quantities:

    \theta_1 = \beta_0 + \beta_1 x
    \theta_2 = \beta_0 + \beta_1\alpha_1
    \theta_3 = \beta_0 + \beta_1\alpha_1 + \beta_2(x - \alpha_1)    [5.55]
    \theta_4 = \beta_0 + \beta_1\alpha_1 + \beta_2(\alpha_2 - \alpha_1).

\theta_1 is the linear trend of the first segment, \theta_2 the yield achieved when the first segment reaches
concentration x = \alpha_1, and so forth. The three models (together with SLR) can now be written as

    SLR: E[Y] = \theta_1
    LP:  E[Y] = \theta_1 I(x \le \alpha_1) + \theta_2 I(x > \alpha_1)    [5.56]
    LS:  E[Y] = \theta_1 I(x \le \alpha_1) + \theta_3 I(x > \alpha_1)
    LSP: E[Y] = \theta_1 I(x \le \alpha_1) + \theta_3 I(\alpha_1 < x \le \alpha_2) + \theta_4 I(x > \alpha_2).

We find this representation of the linear-plateau family of models useful because it suggests
how to test certain hypotheses. Take the LS model, for example. Comparing \theta_3 and \theta_2 we see
that the linear-slope model reduces to a linear-plateau model if \beta_2 = 0, since then \theta_2 = \theta_3.
Furthermore, if \beta_1 = \beta_2, an LS model reduces to the simple linear regression model (SLR).
The proc nlin statements to fit the various models follow.
proc sort data=tennessee; by location year; run;

title 'Linear Plateau (LP) Model';


proc nlin data=tennessee method=newton noitprint;
parameters b0=45 b1=0.43 a1=67;
firstterm = b0+b1*n;
secondterm = b0+b1*a1;
model yield = firstterm*(n <= a1) + secondterm*(n > a1);
by location year;
run;



title 'Linear-Slope (LS) Model';
proc nlin data=tennessee method=newton noitprint;
parameters b0=45 b1=0.43 b2=0 a1=67;
bounds b2 >= 0;
firstterm = b0+b1*n;
secondterm = b0+b1*a1+b2*(n-a1);
model yield = firstterm*(n <= a1) + secondterm*(n > a1);
by location year;
run;

title 'Linear-Slope-Plateau (LSP) Model for Knoxville 1962';


proc nlin data=tennessee method=newton;
parameters b0=45 b1=0.43 b2=0 a1=67 a2=150;
bounds b1 > 0, b2 >= 0;
firstterm = b0+b1*n;
secondterm = b0+b1*a1+b2*(n-a1);
thirdterm = b0+b1*a1+b2*(a2-a1);
model yield = firstterm*(n <= a1) + secondterm*((n > a1) and (n <= a2)) +
thirdterm*(n > a2);
run;

With proc nlmixed we can fit the various models and perform the necessary hypothesis
tests through the contrast or estimate statements of the procedure. To fit linear-slope
models and compare them to the LP and SLR models use the following statements.
proc nlmixed data=tennessee df=3;
parameters b0=45 b1=0.43 b2=0 a1=67 s=2;
firstterm = b0+b1*n;
secondterm = b0+b1*a1+b2*(n-a1);
model yield ~ normal(firstterm*(n <= a1) + secondterm*(n > a1),s*s);
estimate 'Test against SLR' b1-b2;
estimate 'Test against LP ' b2;
contrast 'Test against SLR' b1-b2;
contrast 'Test against LP ' b2;
by location year;
run;

Table 5.9. Results of fitting the linear-plateau type models to the Tennessee corn yield data
of Table 5.7 (The join-points were estimated from the data)

Location    Year   Model Type   SSR     Join-Point 1   k†
Knoxville   1962   LP‡          36.10    82.2          3
Knoxville   1963   LP            7.89   107.0          3
Knoxville   1964   LP            4.54   101.3          3
Jackson     1962   LS            0.05   134.8          4
Jackson     1963   LP            5.31   162.5          3
Jackson     1964   LP           13.23   203.9          3

† k = No. of parameters estimated in the nonlinear regression model with estimated join-points
‡ LP: Linear-Plateau model; LS: Linear-Slope model

In Table 5.9 we show the results of the nonlinear models that best fit the six site-years.
The linear-slope-plateau model for Knoxville in 1962 did not converge with proc nlin. This
is not too surprising since this model has 5 parameters (β₀, β₁, β₂, α₁, α₂) and only six obser-
vations. Not enough information is provided by the data to determine all nonlinear param-
eters. Instead we determined that a linear-plateau model best fits these data if the join-point is
estimated.

Comparing Tables 5.8 and 5.9, several interesting facts emerge. The models selected as
best by Anderson and Nelson (1975) based on fitting linear regression models with fixed
join-points are not necessarily the best models selected when the join-points are estimated.
For data from Knoxville 1963 and 1964 as well as Jackson 1963 and 1964, both approaches
arrive at the same basic model, a linear-plateau relationship. The residual sums of squares
of the two approaches must then be close if the join-point in Anderson and Nelson's
approach was fixed at a value close to the nonlinear least squares iterate of α₁. This is the
case for Knoxville 1964 and Jackson 1964. As the estimated join-point moves further away
from the fixed join-point (e.g., Knoxville 1963), the residual sum of squares in the nonlinear
model is considerably lower than that of the linear model fit.
   For the data from the Jackson location in 1962, the nonlinear method selected a different
model. Whereas Anderson and Nelson (1975) select an LP model with join-point at 167.5,
fitting a series of nonlinear models leads one to a linear-slope (LS) model with join-point at
134.8 kg ha⁻¹. The residual sum of squares of the nonlinear model is more than 200 times
smaller than that of the linear model. Although the slope (β₂) estimate of the LS model is
close to zero (Output 5.14), so is its standard error, and the approximate 95% confidence
interval for β₂ does not include zero ([0.0105, 0.0253]). Not restricting the second segment of
the model to be a flat line significantly improves the model fit (not only over a model with
fixed join-point, but also over a model with estimated join-point).

Output 5.14.
------------------- location=Jackson year=1962 -----------------------

The NLIN Procedure


NOTE: Convergence criterion met.

Estimation Summary
Method Newton
Iterations 7
R 1.476E-6
PPC(b2) 1.799E-8
RPC(b2) 0.000853
Object 0.000074
Objective 0.053333
Observations Read 6
Observations Used 6
Observations Missing 0

Sum of Mean Approx


Source DF Squares Square F Value Pr > F
Regression 4 27406.9 6851.7 8419.27 0.0001
Residual 2 0.0533 0.0267
Uncorrected Total 6 27407.0
Corrected Total 5 673.6

Approx Approximate 95% Confidence


Parameter Estimate Std Error Limits
b0 46.4333 0.1491 45.7919 47.0747
b1 0.1896 0.00172 0.1821 0.1970
b2 0.0179 0.00172 0.0105 0.0253
a1 134.8 1.6900 127.5 142.0

The data from Knoxville in 1962 are a somewhat troubling case. The linear-slope-plateau
model that Anderson and Nelson (1975) selected does not converge when the join-points are
estimated from the data. Between the LS and LP models, the former does not provide a
significant improvement over the latter, and we are led in the nonlinear analysis to select a
linear-plateau model for these data. From the nonlinear predictions shown in Figure 5.23, the
linear-plateau model certainly fits the Knoxville 1962 data adequately. Its Pseudo-R² is 0.96.

[Figure 5.23 appears here: a 3 × 2 trellis of average corn yield (q/ha, roughly 40 to 80)
against N (kg/ha, 0 to 300) for the Jackson and Knoxville locations in 1962, 1963, and 1964.]

Figure 5.23. Predicted corn yields for Tennessee corn yield data (Model for Jackson 1962 is
a linear-slope model; all others are linear-plateau models).

5.8.4 Critical NO₃ Concentrations as a Function of Sampling
      Depth — Comparing Join-Points in Plateau Models

It has been generally accepted that soil samples for soil NO₃ testing must be collected to
depths greater than 30 cm because rainfall moves NO₃ from surface layers quickly to deeper
portions of the root zone. Blackmer et al. (1989) suggested that for the late-spring test, nitrate
concentrations in the top 30 cm of the soil are indicative of the amounts in the rooting zone
because marked dispersion of nitrogen can occur as water moves through macropores (see
also Priebe and Blackmer, 1989). Binford, Blackmer, and Cerrato (1992) analyze extensive
data from 45 site-years (1346 plot years) collected between 1987 and 1989 in Iowa. When
corn plants were 15 to 30 cm tall, samples representing 0 to 30 cm and 30 to 60 cm soil layers
were collected. Each site-year included seven to ten rates of N applied before planting. For
the N-responsive site-years, Figure 5.24 shows relative corn yield for the N rates 0, 112, 224,
and 336 kg ha⁻¹.
A site was labeled N-responsive if a plateau model fit the data, and relative yield was
determined as a percentage of the plateau yield (Cerrato and Blackmer 1990). The data in
Figure 5.24 strongly suggest a linear-plateau model with join-point. For a given sampling
depth j this linear-response-plateau model becomes

    E[Yᵢⱼ] = β₀ⱼ + β₁ⱼNO₃ᵢⱼ    if NO₃ᵢⱼ ≤ αⱼ                     [5.57]
           = β₀ⱼ + β₁ⱼαⱼ       if NO₃ᵢⱼ > αⱼ

or

    E[Yᵢⱼ] = (β₀ⱼ + β₁ⱼNO₃ᵢⱼ)I{NO₃ᵢⱼ ≤ αⱼ} + (β₀ⱼ + β₁ⱼαⱼ)I{NO₃ᵢⱼ > αⱼ}.

[Figure 5.24 appears here: relative yield percent (30 to 110) against soil NO₃ (mg kg⁻¹,
0 to 80), with separate symbols for the 30 cm and 60 cm sampling depths.]

Figure 5.24. Relative yields as a function of soil NO₃ for 30 and 60 cm sampling depths.
Data from Binford, Blackmer, and Cerrato (1992, Figures 2c, 3c) containing only N-respon-
sive site-years on sites that received 0, 112, 224, or 336 kg ha⁻¹ N. Data kindly made
available by Dr. A. Blackmer, Department of Agronomy, Iowa State University. Used with
permission.

If soil samples from the top 30 cm (j = 1) are indicative of the amount of N in the rooting
zone and movement through macropores causes marked dispersion as suggested by
Blackmer et al. (1989), one would expect the 0 to 30 cm data to yield a larger intercept
(β₀₁ > β₀₂), smaller slope (β₁₁ < β₁₂), and larger critical concentration (α₁ > α₂). The
plateaus, however, should not be significantly different. Before testing

    H₀: β₀₁ = β₀₂   vs.   H₁: β₀₁ > β₀₂
    H₀: β₁₁ = β₁₂   vs.   H₁: β₁₁ < β₁₂                           [5.58]
    H₀: α₁ = α₂     vs.   H₁: α₁ > α₂,

we examine whether there are any differences in the plateau models between the two sampling
depths. To this end we fit the full model [5.57] and compare it to the reduced model

    E[Yᵢⱼ] = β₀ + β₁NO₃ᵢⱼ    if NO₃ᵢⱼ ≤ α                         [5.59]
           = β₀ + β₁α        if NO₃ᵢⱼ > α,

which does not vary parameters by sampling depth, with a sum of squares reduction test.
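For reference, the general form of the sum of squares reduction statistic used below can be restated as follows (this display is added here as a reminder; it is the same test applied throughout this chapter). With SS(θ̂)r and SS(θ̂)f the residual sums of squares of the reduced and full models, q the number of parameters constrained under H₀, and νf the residual degrees of freedom of the full model,

\[
F_{obs} \;=\; \frac{\left( SS(\hat{\theta})_r - SS(\hat{\theta})_f \right)/q}{SS(\hat{\theta})_f/\nu_f},
\]

and F_obs is compared against an F distribution with q numerator and νf denominator degrees of freedom.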
%include 'DriveLetterOfCDROM:\Data\SAS\BlackmerData.txt';

/* Reduced model [5.59] */
proc nlin data=blackmer noitprint;
parameters b0=24 b1=24 alpha=25 ;
model ryp = (b0 + b1*no3)*(no3 <= alpha) + (b0 + b1*alpha)*(no3 > alpha);
run;

/* Full model [5.57] */


proc nlin data=blackmer method=marquardt noitprint;
parms b01=24 b11=4 alp1=25 del_b0=0 del_b1=0 del_alp=0;
b02 = b01 + del_b0;
b12 = b11 + del_b1;
alp2 = alp1 + del_alp;
model30 = (b01 + b11*no3)*(no3 <= alp1) + (b01 + b11*alp1)*(no3 > alp1);
model60 = (b02 + b12*no3)*(no3 <= alp2) + (b02 + b12*alp2)*(no3 > alp2);
model ryp = model30*(depth=30) + model60*(depth=60);
run;

The full model parameterizes the responses for the 60 cm sampling depth (β₀₂, β₁₂, α₂) as
β₀₂ = β₀₁ + Δβ₀, β₁₂ = β₁₁ + Δβ₁, α₂ = α₁ + Δα, so that differences between the sampling
depths in the parameters can be assessed immediately on the output. The reduced model has a
residual sum of squares of SS(θ̂)r = 39,761.6 on 477 degrees of freedom (Output 5.15).

Output 5.15.
The NLIN Procedure

NOTE: Convergence criterion met.

Estimation Summary

Method Gauss-Newton
Iterations 8
R 0
PPC 0
RPC(alpha) 0.000087
Object 4.854E-7
Objective 39761.57
Observations Read 480
Observations Used 480
Observations Missing 0

Sum of Mean Approx


Source DF Squares Square F Value Pr > F

Regression 3 3839878 1279959 774.74 <.0001


Residual 477 39761.6 83.3576
Uncorrected Total 480 3879639
Corrected Total 479 168923

Approx Approximate 95% Confidence


Parameter Estimate Std Error Limits

b0 8.7901 2.7688 3.3495 14.2308


b1 4.8995 0.2207 4.4659 5.3332
alpha 18.0333 0.3242 17.3963 18.6703

The full model's residual sum of squares is SS(θ̂)f = 29,236.5 on 474 degrees of
freedom (Output 5.16). The test statistic for the three degree of freedom hypothesis

    H₀: (β₀₁ − β₀₂, β₁₁ − β₁₂, α₁ − α₂)′ = (0, 0, 0)′

is

    F_obs = [(39,761.6 − 29,236.5)/3] / (29,236.5/474) = 56.879

with p-value less than 0.0001.
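This computation can be checked with a short data step, patterned after the p-value data steps used later in this section (an illustrative addition; the sums of squares are those reported in Outputs 5.15 and 5.16):

data ssred;
   /* reduced vs. full model residual sums of squares */
   Fobs = ((39761.6 - 29236.5)/3) / (29236.5/474);
   pval = 1 - ProbF(Fobs, 3, 474);   /* upper-tail F(3,474) probability */
run;
proc print data=ssred; run;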


Since the plateau models are significantly different for the two sampling depths, we now
proceed to examine the individual hypotheses [5.58]. From Output 5.16 we see that the esti-
mates of the parameters Δβ₀, Δβ₁, and Δα have signs consistent with the alternative hypoth-
eses. The asymptotic 95% confidence intervals are two-sided intervals, but the alternative
hypotheses are one-sided. We thus calculate the one-sided p-values for the three tests with
data pvalues;
tb0 = -9.7424/4.2357; pb0 = ProbT(tb0,474);
tb1 = 2.1060/0.3203; pb1 = 1-ProbT(tb1,474);
talpha = -6.8461/0.5691; palpha = ProbT(talpha,474);
run;
proc print data=pvalues; run;

Table 5.10. Test statistics and p-values for hypotheses [5.58]

Hypothesis          t_obs                           p-value
H₁: β₀₁ > β₀₂       −9.7424/4.2357 = −2.300         0.0109
H₁: β₁₁ < β₁₂        2.1060/0.3203 =  6.575        <0.0001
H₁: α₁ > α₂         −6.8461/0.5691 = −12.029       <0.0001

Output 5.16. The NLIN Procedure

NOTE: Convergence criterion met.

Estimation Summary
Method Marquardt
Iterations 6
R 0
PPC 0
RPC(del_alp) 3.92E-6
Object 1.23E-10
Objective 29236.55
Observations Read 480
Observations Used 480
Observations Missing 0

Sum of Mean Approx


Source DF Squares Square F Value Pr > F
Regression 6 3850403 641734 452.94 <.0001
Residual 474 29236.5 61.6805
Uncorrected Total 480 3879639
Corrected Total 479 168923

Output 5.16 (continued).
Approx Approximate 95% Confidence
Parameter Estimate Std Error Limits

b01 15.1943 2.8322 9.6290 20.7596


b11 3.5760 0.1762 3.2297 3.9223
alp1 23.1324 0.4848 22.1797 24.0851
del_b0 -9.7424 4.2357 -18.0657 -1.4191
del_b1 2.1060 0.3203 1.4766 2.7354
del_alp -6.8461 0.5691 -7.9643 -5.7278

Even if a Bonferroni adjustment is made to protect the experimentwise Type-I error rate
in this series of three tests, at the experimentwise 5% error level all three tests lead to rejec-
tion of their respective null hypotheses. Table 5.11 shows the estimates for the full model.
The two plateau values of 97.92 and 97.99 are very close and probably do not warrant a
statistical comparison. To demonstrate how a statistical test for β₀₁ + β₁₁α₁ = β₀₂ + β₁₂α₂
can be performed, we carry it out.

Table 5.11. Parameter estimates in final plateau models

Parameter      Meaning                         Estimate
β₀₁            Intercept 30 cm                 15.1943
β₁₁            Slope 30 cm                      3.5760
α₁             Critical concentration 30 cm    23.1324
β₀₁ + β₁₁α₁    Plateau 30 cm                   97.9158
β₀₂            Intercept 60 cm                 15.1943 − 9.7424 = 5.4519
β₁₂            Slope 60 cm                      3.5760 + 2.1060 = 5.6820
α₂             Critical concentration 60 cm    23.1324 − 6.8461 = 16.2863
β₀₂ + β₁₂α₂    Plateau 60 cm                   97.9900

The first method relies on reparameterizing the model. Let β₀ⱼ + β₁ⱼαⱼ = Tⱼ denote the
plateau for sampling depth j and notice that the model becomes

    E[Yᵢⱼ] = (Tⱼ + β₁ⱼ(NO₃ᵢⱼ − αⱼ))I{NO₃ᵢⱼ ≤ αⱼ} + TⱼI{NO₃ᵢⱼ > αⱼ}
           = β₁ⱼ(NO₃ᵢⱼ − αⱼ)I{NO₃ᵢⱼ ≤ αⱼ} + Tⱼ.

The intercepts β₀₁ and β₀₂ were eliminated from the model, which now contains T₁ and
T₂ = T₁ + ΔT as parameters. The SAS® statements
proc nlin data=blackmer method=marquardt noitprint;
parms b11=3.56 alp1=23.13 T1=97.91 b12=5.682 alp2=16.28 del_T=0;
T2 = T1 + del_T;
model30 = b11*(no3-alp1)*(no3 <= alp1) + T1;
model60 = b12*(no3-alp2)*(no3 <= alp2) + T2;
model ryp = model30*(depth=30) + model60*(depth=60);
run;

yield Output 5.17. The approximate 95% confidence interval for ΔT ([−1.683, 1.834]) con-
tains zero, and there is insufficient evidence at the 5% significance level to conclude that the
relative yield plateaus differ among the sampling depths. The second method of comparing
the plateau values relies on the capabilities of proc nlmixed to estimate linear and nonlinear
functions of the model parameters. Any of the parameterizations of the plateau model will do
for this purpose. For example, the statements
proc nlmixed data=blackmer df=474;
parms b01=24 b11=4 alp1=25 del_b0=0 del_b1=0 del_alp=0;
s2 = 61.6805;
b02 = b01 + del_b0;
b12 = b11 + del_b1;
alp2 = alp1 + del_alp;
model30 = (b01 + b11*no3)*(no3 <= alp1) + (b01 + b11*alp1)*(no3 > alp1);
model60 = (b02 + b12*no3)*(no3 <= alp2) + (b02 + b12*alp2)*(no3 > alp2);
model ryp ~ normal(model30*(depth=30) + model60*(depth=60),s2);
estimate 'Difference in Plateaus' b01+b11*alp1 - (b02+b12*alp2);
run;

will do the trick (Output 5.18). Since the nlmixed procedure approximates a likelihood and
estimates all distribution parameters, it would iteratively determine the variance of the
Gaussian error distribution. To prevent this we fix the variance with the s2 = 61.6805; state-
ment. This is the residual mean square estimate obtained from fitting the full model in proc
nlin (see Output 5.16 or Output 5.17). Also, because proc nlmixed determines residual
degrees of freedom by a method different from proc nlin we fix the residual degrees of
freedom with the df= option of the proc nlmixed statement.

Output 5.17.
The NLIN Procedure

NOTE: Convergence criterion met.

Estimation Summary

Method Marquardt
Iterations 1
R 2.088E-6
PPC(alp1) 4.655E-7
RPC(del_T) 75324.68
Object 0.000106
Objective 29236.55
Observations Read 480
Observations Used 480
Observations Missing 0

Sum of Mean Approx


Source DF Squares Square F Value Pr > F

Regression 6 3850403 641734 452.94 <.0001


Residual 474 29236.5 61.6805
Uncorrected Total 480 3879639
Corrected Total 479 168923

Approx Approximate 95% Confidence


Parameter Estimate Std Error Limits

b11 3.5760 0.1762 3.2297 3.9223


alp1 23.1324 0.4848 22.1797 24.0851
T1 97.9156 0.6329 96.6720 99.1592
b12 5.6820 0.2675 5.1564 6.2076
alp2 16.2863 0.2980 15.7008 16.8719
del_T 0.0753 0.8950 -1.6834 1.8340

Output 5.18. The NLMIXED Procedure

Specifications
Data Set WORK.BLACKMER
Dependent Variable ryp
Distribution for Dependent Variable Normal
Optimization Technique Dual Quasi-Newton
Integration Method None

Dimensions
Observations Used 480
Observations Not Used 0
Total Observations 480
Parameters 6

NOTE: GCONV convergence criterion satisfied.

Fit Statistics
-2 Log Likelihood 3334.7
AIC (smaller is better) 3346.7
AICC (smaller is better) 3346.9
BIC (smaller is better) 3371.8

Parameter Estimates
Standard
Parameter Estimate Error DF t Value Pr > |t| Lower Upper
b01 15.1943 2.8322 474 5.36 <.0001 9.6290 20.7595
b11 3.5760 0.1762 474 20.29 <.0001 3.2297 3.9223
alp1 23.1324 0.4848 474 47.71 <.0001 22.1797 24.0851
del_b0 -9.7424 4.2357 474 -2.30 0.0219 -18.0656 -1.4192
del_b1 2.1060 0.3203 474 6.57 <.0001 1.4766 2.7354
del_alp -6.8461 0.5691 474 -12.03 <.0001 -7.9643 -5.7278

Additional Estimates
Standard
Label Estimate Error DF t Value Pr > |t|
Difference in Plateaus -0.07532 0.8950 474 -0.08 0.9330

Figure 5.25 shows the predicted response functions for the 0 to 30 cm and 0 to 60 cm
sampling depths.

[Figure 5.25 appears here: relative yield percent (30 to 110) against soil NO₃ (mg kg⁻¹,
0 to 80), with the fitted linear-plateau curves for the 30 cm and 60 cm sampling depths.]

Figure 5.25. Fitted linear-plateau models.

5.8.5 Factorial Treatment Structure With Nonlinear Response
Many agronomic studies involve a treatment structure with more than one factor. Typically,
the factors are crossed so that replicates of experimental units are exposed to all possible
combinations of the factors. When modeling the mean function of such data with linear com-
binations of the main effects and interactions of the factors, one naturally arrives at analysis
of variance as the tool for statistical inference. If the mean function is nonlinear, it is less
clear how to compare treatments, test for main effects, and investigate interactions. As an
example we consider a velvetleaf (Abutilon theophrasti Medicus) multiple growth stage
experiment. Velvetleaf was grown to 5 to 6 cm (2 to 3 leaves), 8 to 13 cm (3 to 4 leaves), and
15 to 20 cm (5 to 6 leaves) in a commercial potting mixture in 1-L plastic pots. Plants were
grown in a 16-h photoperiod of natural lighting supplemented with sodium halide lights
providing a midday photosynthetic photon flux density of 1,000 μmol/m²/s. Weeds were
treated with two herbicides (glufosinate and glyphosate) at six different rates of application.
Two separate runs of the experiment were conducted with four replicates each in a random-
ized complete block design. The above-ground biomass was harvested 14 days after treatment
and oven dried. The outcome of interest is the dry weight percentage relative to an untreated
control. This and a second multistage growth experiment for common lambsquarter
(Chenopodium album L.) and several single growth stage experiments are explained in more
detail in Tharp, Schabenberger, and Kells (1999).
Considering an analysis of variance model for this experiment, one could arrive at

    Yᵢⱼₖₗₘ = μ + bᵢ + eᵢⱼ + αₖ + βₗ + γₘ
             + (αβ)ₖₗ + (αγ)ₖₘ + (βγ)ₗₘ + (αβγ)ₖₗₘ + eᵢⱼₖₗₘ,      [5.60]

where Yᵢⱼₖₗₘ denotes the dry weight percentage, bᵢ denotes the ith run (i = 1, 2), eᵢⱼ the jth
replicate within run i (j = 1, …, 4), αₖ the effect of the kth herbicide (k = 1, 2), βₗ the effect
of the lth size class (l = 1, …, 3), and γₘ the effect of the mth rate (m = 1, …, 6). One can
include additional interaction terms in model [5.60], but for expository purposes we will not
pursue this issue here. The analysis of variance table (Table 5.12, SAS® output not shown) is
produced in SAS® with the statements
proc glm data=VelvetFactorial;
class run rep herb size rate;
model drywtpct = run rep(run) herb size rate herb*size herb*rate
size*rate herb*size*rate;
run; quit;

The analysis of variance table shows significant Herb × Rate and Size × Rate inter-
actions (at the 5% level) and significant Rate and Size main effects.
   By declaring the rate of application a factor in model [5.60], rates are essentially discret-
ized and the continuity of rates of application is lost. Some information can be recovered by
testing for linear, quadratic, up to quintic trends of dry weight percentages. Because of the
interactions of rate with the size and herbicide factors, great care should be exercised, since
these trends may differ for the two herbicides or the three size classes. Unequal spacing of the
rates of application is a further hindrance, since published tables of contrast coefficients re-
quire a balanced design with equal spacing of the levels of the quantitative factor. From
Figure 5.26 it is seen that the dose response curves cannot be described by simple linear or
quadratic polynomials, although their general shape appears to vary little by herbicide or size
class.

Table 5.12. Analysis of variance for model [5.60]

Effect                          DF    SS           MS          F_obs    p-value
Run bᵢ                           1        541.75      541.75     2.48    0.117
Rep w/in Run eᵢⱼ                 6       1453.29      242.22     1.11    0.358
Herbicide αₖ                     1        472.78      472.78     2.16    0.143
Size βₗ                          2      20391.13    10195.57    46.65   <0.0001
Rate γₘ                          5     341783.93    68356.79   312.77   <0.0001
Herb × Size (αβ)ₖₗ               2        285.39      142.69     0.65    0.521
Herb × Rate (αγ)ₖₘ               5       4811.32      962.26     4.40    0.0007
Size × Rate (βγ)ₗₘ              10       6168.70      616.87     2.82    0.003
Herb × Size × Rate (αβγ)ₖₗₘ     10       3576.70      357.67     1.64    0.097
Error eᵢⱼₖₗₘ                   245      53545.58      218.55
Total                          287     433030.58

[Figure 5.26 appears here: a 2 × 3 trellis of dry weight % (20 to 100) against rate of
application (kg ai/ha, 0 to 1.0) for the Glufosinate and Glyphosate treatments at the
3, 4, and 5 leaf stages.]

Figure 5.26. Herbicide × Size class (leaf stage) sample means as a function of application
rate in the factorial velvetleaf dose-response experiment. Data kindly made available by Dr.
James J. Kells, Department of Crop and Soil Sciences, Michigan State University. Used with
permission.

As an alternative approach to the analysis of variance, we consider the data in Figure
5.26 as the raw data for a nonlinear modeling problem. Tharp et al. (1999) analyze these data
with a four-parameter log-logistic model. Plotted against log{rate}, the mean dry weight per-
centages exhibit a definite sigmoidal trend. But graphed against the actual rates, the response
appears hyperbolic (Figure 5.26), and for any given Herbicide × Size Class combination the
hyperbolic Langmuir model αβx^γ/(1 + βx^γ), where x is dosage in kg ai/ha, appears appro-
priate. Since the response is expressed as a percentage of the control dry weight, we can fix
the parameter α at 100. This reduces the problem to fitting a two-parameter model for each
Herbicide × Size combination, compared to a three- or four-parameter log-logistic model.
Thus, the full model that varies the two Langmuir parameters for the 12 herbicide × size
combinations has 2·12 = 24 parameters (compared to 43 parameters in model [5.60]). The
full model we postulate here is

    E[Yₖₗₘ] = 100 βₖₗxₘ^γ / (1 + βₖₗxₘ^γ),                        [5.61]

where xₘ is the mth application rate common to the combination of herbicide k and size class
l. For expository purposes we assume here that only the β parameter varies across treatments,
so that [5.61] has thirteen parameters. Notice that an additional advantage of model [5.61]
over [5.60] is that it entails at most a single two-way interaction, compared to three two-way
and one three-way interactions in [5.60].
The first hypothesis being tested in [5.61] is that all herbicides and size classes share the
same β parameter,

    H₀: βₖₗ = βₖ′ₗ′   for all (k, l) ≠ (k′, l′).

This eleven degree of freedom hypothesis is similar to the global ANOVA hypothesis testing
equal effects of all treatments. Under this hypothesis [5.61] reduces to a two-parameter model

    H₀: no treatment effects;   E[Yₖₗₘ] = 100 βxₘ^γ / (1 + βxₘ^γ).      [5.62]

Models [5.61] and [5.62] are compared via sum of squares reduction tests. If [5.61] is not a
significant improvement over [5.62], stop. Otherwise the next step is to investigate the effects
of Herbicide and Size Class separately. This entails two more models, both reduced versions
of [5.61]:

    H₀: no herbicide effects;   E[Yₖₗₘ] = 100 βₗxₘ^γ / (1 + βₗxₘ^γ)     [5.63]
    H₀: no size class effects;  E[Yₖₗₘ] = 100 βₖxₘ^γ / (1 + βₖxₘ^γ).    [5.64]

Models [5.61] through [5.64] are fit in SAS® proc nlin with the following series of
statements. The full model [5.61] for the 2 × 3 factorial is fit first (Output 5.19). Size classes
are identified with the second subscript corresponding to the 3, 4, and 5 leaf stages. For
example, beta_14 is the parameter for herbicide 1 (glufosinate) and size class 2 (4 leaf
stage).
title 'Full Model [5.61]';
proc nlin data=velvet noitprint;
parameters beta_13=0.05 beta_14=0.05 beta_15=0.05
beta_23=0.05 beta_24=0.05 beta_25=0.05
gamma=-1.5;

alpha = 100;

t_1_3 = beta_13*(rate**gamma);
t_1_4 = beta_14*(rate**gamma);
t_1_5 = beta_15*(rate**gamma);
t_2_3 = beta_23*(rate**gamma);
t_2_4 = beta_24*(rate**gamma);
t_2_5 = beta_25*(rate**gamma);

model drywtpct = (alpha*t_1_3/(1+t_1_3))*(herb=1 and size=3) +


(alpha*t_1_4/(1+t_1_4))*(herb=1 and size=4) +
(alpha*t_1_5/(1+t_1_5))*(herb=1 and size=5) +
(alpha*t_2_3/(1+t_2_3))*(herb=2 and size=3) +
(alpha*t_2_4/(1+t_2_4))*(herb=2 and size=4) +
(alpha*t_2_5/(1+t_2_5))*(herb=2 and size=5);
run; quit;

Output 5.19.
The NLIN Procedure

NOTE: Convergence criterion met.

Estimation Summary

Method Gauss-Newton
Iterations 15
Subiterations 1
Average Subiterations 0.066667
R 5.673E-6
PPC(beta_25) 0.000012
RPC(beta_25) 0.000027
Object 2.35E-10
Objective 2463.247
Observations Read 36
Observations Used 36
Observations Missing 0

NOTE: An intercept was not specified for this model.

Sum of Mean Approx


Source DF Squares Square F Value Pr > F

Regression 7 111169 15881.3 186.97 <.0001


Residual 29 2463.2 84.9396
Uncorrected Total 36 113632
Corrected Total 35 47186.2

Approx Approximate 95% Confidence


Parameter Estimate Std Error Limits

beta_13 0.0179 0.00919 -0.00087 0.0367


beta_14 0.0329 0.0152 0.00186 0.0639
beta_15 0.0927 0.0356 0.0199 0.1655
beta_23 0.0129 0.00698 -0.00141 0.0271
beta_24 0.0195 0.00984 -0.00066 0.0396
beta_25 0.0665 0.0272 0.0109 0.1221
gamma -1.2007 0.1262 -1.4589 -0.9425

The proc nlin statements to fit the completely reduced model [5.62] and the models
without herbicide ([5.63]) and size class effects ([5.64]) are as follows.

title 'Completely Reduced Model [5.62]';
proc nlin data=velvet noitprint;
parameters beta=0.05 gamma=-1.5;
alpha = 100;
term = beta*(rate**gamma);
model drywtpct = alpha*term/(1+term);
run; quit;

title 'No Herbicide Effect [5.63]';


proc nlin data=velvet noitprint;
parameters beta_3=0.05 beta_4=0.05 beta_5=0.05 gamma=-1.5;
alpha = 100;
t_3 = beta_3*(rate**gamma);
t_4 = beta_4*(rate**gamma);
t_5 = beta_5*(rate**gamma);
model drywtpct = (alpha*t_3/(1+t_3))*(size=3) +
(alpha*t_4/(1+t_4))*(size=4) +
(alpha*t_5/(1+t_5))*(size=5);
run; quit;

title 'No Size Effect [5.64]';


proc nlin data=velvet noitprint;
parameters beta_1=0.05 beta_2=0.05 gamma=-1.5;
alpha = 100;
t_1 = beta_1*(rate**gamma);
t_2 = beta_2*(rate**gamma);
model drywtpct = (alpha*t_1/(1+t_1))*(herb1=1) +
(alpha*t_2/(1+t_2))*(herb1=2);
run; quit;

Once the residual sums of squares and degrees of freedom for the models are obtained
(Table 5.13, output not shown), the sum of squares reduction tests can be carried out:

H₀: no treatment effects    F_obs = (3,176.0/5)/(2,463.2/29) = 7.478   p = Pr(F₅,₂₉ ≥ 7.478) = 0.0001
H₀: no herbicide effects    F_obs = (247.9/3)/(2,463.2/29)  = 0.973   p = Pr(F₃,₂₉ ≥ 0.973) = 0.4188
H₀: no size class effects   F_obs = (2,940.7/4)/(2,463.2/29) = 8.655   p = Pr(F₄,₂₉ ≥ 8.655) = 0.0001

Table 5.13. Residual sums of squares for full and various reduced models

Model    Effects in Model    df_res   df_model   SS(θ̂)     SS(θ̂) − 2,463.2
[5.61]   Herbicide, Size     29       7          2,463.2    —
[5.62]   none                34       2          5,639.2    3,176.0
[5.63]   Size                32       4          2,711.1      247.9
[5.64]   Herbicide           33       3          5,403.9    2,940.7
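The three F statistics and p-values above can be reproduced from Table 5.13 with a data step patterned after the ones used elsewhere in this section (added here for illustration):

data ssrtests;
   msfull = 2463.2/29;                /* residual mean square of full model [5.61] */
   F_trt  = (3176.0/5)/msfull;   p_trt  = 1 - ProbF(F_trt,  5, 29);
   F_herb = ( 247.9/3)/msfull;   p_herb = 1 - ProbF(F_herb, 3, 29);
   F_size = (2940.7/4)/msfull;   p_size = 1 - ProbF(F_size, 4, 29);
run;
proc print data=ssrtests; run;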

The significant treatment effects appear to be due to a size effect alone, but comparing
models [5.61] through [5.64] is somewhat unsatisfactory. For example, model [5.63] of no
Herbicide effects reduces the model degrees of freedom by three although there are only two
herbicides. Model [5.63] not only eliminates a Herbicide main effect, but also the
Herbicide × Size interaction, and the resulting model contains a Size Class main effect only.
A similar phenomenon can be observed with model [5.64]. There are two degrees of freedom
for a Size Class main effect and two degrees of freedom for the interaction; model [5.64] con-
tains four parameters less than the full model. If one is interested in testing for main effects
and interactions in a similar fashion as in the linear analysis of variance model, a different
method is required to reduce the full model corresponding to the hypotheses

   ① H₀: no Herbicide main effect
   ② H₀: no Size Class main effect
   ③ H₀: no Herbicide × Size Class interaction.

The technique we suggest is an analog of the cell mean representation of a two-way
factorial (§4.3.1):

    μᵢⱼ = μ + (μᵢ. − μ) + (μ.ⱼ − μ) + (μᵢⱼ − μᵢ. − μ.ⱼ + μ) = μ + αᵢ + βⱼ + (αβ)ᵢⱼ,

where the four terms in the sum correspond to the grand mean, factor A main effects, factor B
main effects, and A × B interactions, respectively. The absence of the main effects and
interactions in a linear model can be represented by complete sets of contrasts among the cell
means as discussed in §4.3.3 (see also Schabenberger, Gregoire, and Kong 2000). With two
herbicide levels H₁ and H₂ and three size classes S₃, S₄, and S₅ the cell mean contrasts for
the respective effects are given in Table 5.14. Notice that the number of contrasts for each
effect equals the number of degrees of freedom for that effect.

Table 5.14. Contrasts for main effects and interactions in unfolded 2 × 3 factorial design

Effect           Contrast    H₁S₃   H₁S₄   H₁S₅   H₂S₃   H₂S₄   H₂S₅
Herbicide Main   H            1      1      1     −1     −1     −1
Size Main        S1           1     −1      0      1     −1      0
                 S2           1      1     −2      1      1     −2
Herb × Size      (H × S1)     1     −1      0     −1      1      0
                 (H × S2)     1      1     −2     −1     −1      2

To test whether the Herbicide main effect is significant, we fit the full model and test
whether the linear combination

    μ₁₃ + μ₁₄ + μ₁₅ − μ₂₃ − μ₂₄ − μ₂₅

differs significantly from zero. This can be accomplished with the contrast statement of the
nlmixed procedure. As in the previous application we fix the residual degrees of freedom and
the error variance estimate to equal those for the full model obtained with proc nlin.
proc nlmixed data=velvet df=29;
parameters beta_13=0.05 beta_14=0.05 beta_15=0.05
beta_23=0.05 beta_24=0.05 beta_25=0.05
gamma=-1.5;

s2 = 84.9396;
alpha = 100;

mu_13 = alpha*beta_13*(rate**gamma)/(1+beta_13*(rate**gamma));
mu_14 = alpha*beta_14*(rate**gamma)/(1+beta_14*(rate**gamma));
mu_15 = alpha*beta_15*(rate**gamma)/(1+beta_15*(rate**gamma));
mu_23 = alpha*beta_23*(rate**gamma)/(1+beta_23*(rate**gamma));
mu_24 = alpha*beta_24*(rate**gamma)/(1+beta_24*(rate**gamma));

mu_25 = alpha*beta_25*(rate**gamma)/(1+beta_25*(rate**gamma));

meanfunction = (mu_13)*(herb=1 and size=3) +


(mu_14)*(herb=1 and size=4) +
(mu_15)*(herb=1 and size=5) +
(mu_23)*(herb=2 and size=3) +
(mu_24)*(herb=2 and size=4) +
(mu_25)*(herb=2 and size=5);

model drywtpct ~ normal(meanfunction,s2);

contrast 'No Herbicide Effect' mu_13+mu_14+mu_15-mu_23-mu_24-mu_25;


contrast 'No Size Class Effect' mu_13-mu_14+mu_23-mu_24,
mu_13+mu_14-2*mu_15+mu_23+mu_24-2*mu_25;
contrast 'No Interaction' mu_13-mu_14-mu_23+mu_24,
mu_13+mu_14-2*mu_15-mu_23-mu_24+2*mu_25;

run; quit;

Output 5.20.
The NLMIXED Procedure

Specifications
Data Set WORK.VELVET
Dependent Variable drywtpct
Distribution for Dependent Variable Normal
Optimization Technique Dual Quasi-Newton
Integration Method None

Dimensions
Observations Used 36
Observations Not Used 0
Total Observations 36
Parameters 7

NOTE: GCONV convergence criterion satisfied.

Fit Statistics
-2 Log Likelihood 255.1
AIC (smaller is better) 269.1
AICC (smaller is better) 273.1
BIC (smaller is better) 280.2

Parameter Estimates
Standard
Parameter Estimate Error DF t Value Pr > |t| Lower Upper
beta_13 0.01792 0.01066 29 1.68 0.1036 -0.00389 0.0397
beta_14 0.03290 0.01769 29 1.86 0.0731 -0.00329 0.0690
beta_15 0.09266 0.04007 29 2.31 0.0281 0.01071 0.1746
beta_23 0.01287 0.008162 29 1.58 0.1257 -0.00382 0.0295
beta_24 0.01947 0.01174 29 1.66 0.1079 -0.00453 0.0435
beta_25 0.06651 0.03578 29 1.86 0.0732 -0.00667 0.1397
gamma -1.2007 0.1586 29 -7.57 <.0001 -1.5251 -0.8763

Contrasts
Num Den
Label DF DF F Value Pr > F
No Herbicide Effect 1 29 1.93 0.1758
No Size Class Effect 2 29 4.24 0.0242
No Interaction 2 29 0.52 0.6010

Based on the Contrasts table in Output 5.20 we reject the hypothesis of no Size Class
effect but fail to reject the hypotheses of no interaction and no Herbicide main effect at the 5%
level. Based on these results we could fit a model in which the β parameters vary only by
Size Class, that is, [5.63]. Using proc nlmixed, pairwise comparisons of the β parameters
among size classes are accomplished with
proc nlmixed data=velvet df=32;
parameters beta_3=0.05 beta_4=0.05 beta_5=0.05
gamma=-1.5;
s2=84.7219;
alpha = 100;
t_3 = beta_3*(rate**gamma);
t_4 = beta_4*(rate**gamma);
t_5 = beta_5*(rate**gamma);
mu_3 = alpha*t_3/(1+t_3);
mu_4 = alpha*t_4/(1+t_4);
mu_5 = alpha*t_5/(1+t_5);
meanfunction = (mu_3)*(size=3) +
(mu_4)*(size=4) +
(mu_5)*(size=5);
model drywtpct ~ normal(meanfunction,s2);
contrast 'beta_3 - beta_4' beta_3 - beta_4;
contrast 'beta_3 - beta_5' beta_3 - beta_5;
contrast 'beta_4 - beta_5' beta_4 - beta_5;
run; quit;

The Contrasts table added to the procedure output reveals that differences between size
classes are significant at the 5% level except for the 3 and 4 leaf stages.

Output 5.21.
Contrasts

Num Den
Label DF DF F Value Pr > F

beta_3 - beta_4 1 32 2.03 0.1636


beta_3 - beta_5 1 32 6.11 0.0190
beta_4 - beta_5 1 32 5.30 0.0280

5.8.6 Modeling Hormetic Dose Response through Switching
      Functions
A model commonly applied in dose-response investigations is the log-logistic model, which
incorporates sigmoidal behavior symmetric about a point of inflection. A nice feature of this
model is its simple reparameterization in terms of effective dosages, such as LD₅₀ or LD₉₀
values, which can be estimated directly from empirical data by fitting the model with a non-
linear regression package. In §5.7.2 we reparameterized the log-logistic model

    E[Y] = δ + (α − δ)/(1 + ψexp{β ln(x)}),

where Y is the response to dosage x, in terms of λ_K, the dosage at which the response is K%
between the asymptotes δ and α. For example, λ₅₀ would be the dosage that achieves a
response halfway between the lower and upper asymptotes. The general formula for this
reparameterization is

    E[Y|x] = δ + (α − δ)/(1 + {K/(100 − K)}exp{β ln(x/λ_K)}),

so that in the special case of K = 50 we get

    E[Y|x] = δ + (α − δ)/(1 + exp{β ln(x/λ₅₀)}).

Although the model is popular in dose-response studies, it does not necessarily fit all data sets
of this type. It assumes, for example, that the trend between δ and α is sigmoidal and mono-
tonically increasing or decreasing. The frequent application of log-logistic models in herbicide
dose-response studies (see, e.g., Streibig 1980, Streibig 1981, Lærke and Streibig 1995,
Seefeldt et al. 1995, Hsiao et al. 1996, Sandral et al. 1997) tends to elevate the model in the
eyes of some to a law of nature to which all data must comply. Whether the relationship
between dose and response is best described by a linear, log-logistic, or other model must be
re-assessed for every application and every set of empirical data. Figure 5.27 shows the
sample mean relative growth percentages of barnyardgrass (Echinochloa crus-galli (L.) P.
Beauv.) treated with glufosinate [2-amino-4-(hydroxymethylphosphinyl) butanoic
acid] + (NH₄)₂SO₄ (open circles) and glyphosate [isopropylamine salt of N-
(phosphonomethyl)glycine] + (NH₄)₂SO₄ (closed circles). Growth is expressed relative to an
untreated control, and the data points shown are sample means calculated across eight
replicate values at each concentration. A log-logistic model appears appropriate for the
glufosinate response but not for the glyphosate response, which exhibits an effect known as
hormesis. The term hormesis originates from the Greek for “setting into motion”; the notion
that every toxicant is a stimulant at low levels (Schulz 1988, Thiamann 1956) is also
known as the Arndt-Schulz law.

[Figure 5.27 appears here: relative growth % (20 to 120) against ln(dose) (ln kg ae/ha,
−4 to 0) for the two active ingredients.]

Figure 5.27. Mean relative growth percentages for barnyard grass as a function of log
dosage. Open circles represent active ingredient glufosinate, closed circles glyphosate. Data
kindly provided by Dr. James J. Kells, Department of Crop and Soil Sciences, Michigan State
University. Used with permission.

Hormetic effects can be defined as the failure of a dose-response relationship to behave at
small dosages as the extrapolation from larger dosages under a theoretical model or otherwise
"stylized fact" would lead one to believe. The linear no-threshold hypothesis in radiation
studies, for example, invokes a linear extrapolation to zero dose, implying that radiation
response is proportional to radiation dosage over the entire range of possible dosages
(UNSCEAR 1958). Even in the absence of beneficial effects at low dosages, linear trends
across a larger dosage range are unlikely in most dose-response investigations. More common
are sigmoidal or hyperbolic relationships between dosage and average response. The failure
of the linear no-threshold hypothesis in the presence of hormetic effects is twofold: (i) a
linear model does not capture the dose-response relationship when there is no hormesis, and
(ii) extrapolations to low dosages do not account for hormetic effects.
   Several authors have noted that for low herbicide dosages a hormetic effect can occur
which raises the average response for low dosages above the control value (Miller et al. 1962,
Freney 1965, Wiedman and Appleby 1972). Allender (1997) and Allender et al. (1997)
suggested that influx of Ca²⁺ may be involved in the growth stimulation associated with hor-
mesis. The log-logistic function does not accommodate such behavior, and Brain and Cousens
(1989) suggested a modification to allow for hormesis, namely,

    E[Y|x] = δ + (α − δ + γx)/(1 + θexp{β ln(x)}),                [5.65]

where γ measures the initial rate of increase at low dosages. The Brain-Cousens model [5.65]
is a simple modification of the log-logistic, and it is perhaps somewhat surprising that adding
a term γx in the numerator should do the trick. In this parameterization it is straightforward to
test the hypothesis of hormetic effects statistically. Fit the model and observe whether the
asymptotic confidence interval for γ includes 0. If the confidence interval fails to include 0,
the hypothesis of the absence of a hormetic effect is rejected.
But how can a dose-response model other than the log-logistic be modified if the re-
searcher anticipates hormetic effects or wishes to test their presence? To construct hormetic
models we rely on the idea of combining mathematical switching functions. Schabenberger
and Birch (2001) proposed hormetic models constructed by this device, and the Brain-Cous-
ens model is a special case thereof. In process models for plant growth, switching mechanisms
are widely used (e.g., Thornley and Johnson 1990), for example, to switch on or off a
mathematical function or constant or to switch from one function to another. The switching
functions from which we build dose-response models are mathematical functions S(x) that
take values between 0 and 1 as dosage x varies. In the log-logistic model

    Y = δ + (α − δ)/(1 + exp{β ln(x/λ₅₀)})

the term [1 + exp{β ln(x/λ₅₀)}]⁻¹ is a switch-off function for β > 0 (Figure 5.28) and a
switch-on function for β < 0.
With " ž !, $ is the lower and ! the upper asymptote of dose response. The role of the
switching function is to determine how the transition between the two extrema takes place.
This suggests the following technique to develop dose-response models. Let W aBß )b be a
switch-off function and notice that V aBß )b œ "  W aBß )b is a switch-on function. Denote the
min and max mean dose-response in the absence of any hormetic effects as .min and .max . A

© 2003 by CRC Press LLC




general dose-response model is then given by


] œ .min € a.max  .min bW aBß )b. [5.66]

[Figure 5.28 appears here: the log-logistic switching function S(x) (0 to 1) against
log-dosage ln(x) (0.1 to 0.9), showing switch-off behavior.]

Figure 5.28. Switch-off behavior of the log-logistic term [1 + exp{β ln(x/λ₅₀)}]⁻¹ for
λ₅₀ = 0.4, β = 4.

By choosing the switching function from a flexible family of mathematical models, the
nonhormetic dose response can take on many shapes, not necessarily sigmoidal and sym-
metric as implied by the log-logistic model. To identify possible switch-on functions one can
choose R(x, θ) as the cumulative distribution function (cdf) of a continuous random variable
with unimodal density. Choosing the cdf of a random variable uniformly distributed on the
interval (a, b) leads to the switch-off function

    S(x, θ) = b/(b − a) − x/(b − a)

and a linear interpolation between μmin and μmax. A probably more useful cdf that permits a
sigmoidal transition between μmin and μmax is that of a two-parameter Weibull random
variable, which leads to the switch-off function (Figure 5.29)

    S(x, θ) = exp{−(x/α)^β}.

Gregoire and Schabenberger (1996b) use a switching function derived from an extreme value
distribution in the context of modeling the merchantable volume in a tree bole (see
application §8.4.1), namely

    S(x, θ) = exp{−αxe^(−βx)}.

This model was derived by considering a switch-off function derived from the Gompertz
growth model (Seber and Wild, 1989, p. 330),

    S(x, θ) = exp{−exp{−α(x − β)}},

which is sigmoidal with inflection point at x = β but not symmetric about the inflection
point. To model the transition in tree profile from neiloid to parabolic to cone-shaped seg-
ments, Valentine and Gregoire (2001) use the switch-off function

    S(x, θ) = 1/(1 + (x/α)^β),

which is derived from the family of growth models developed for modeling nutritional intake
by Morgan et al. (1975). For β > 1 this switching function has a point of inflection at
x = α{(β − 1)/(β + 1)}^(1/β) and is hyperbolic for β < 1. Swinton and Lyford (1996) use the
model by Morgan et al. (1975) to test whether crop yield as a function of weed density takes
on hyperbolic or sigmoidal shape. Figure 5.29 displays some of the sigmoidal switch-off
functions.
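To get a feel for these shapes, the switch-off functions can be evaluated on a dosage grid with a short data step. The sketch below is illustrative only: the parameter values are guesses chosen to place the transitions near x = 0.5 and are not the values used to produce Figure 5.29 (note in particular that the Gompertz form needs a negative α to act as a switch-off rather than a switch-on function).

data switch;
   do x = 0.01 to 1 by 0.01;
      S_gompertz = exp(-exp(10*(x - 0.5)));    /* Gompertz form with alpha = -10,
                                                  beta = 0.5 (inflection at x = 0.5)  */
      S_weibull  = exp(-(x/0.55)**4);          /* Weibull cdf form, alpha=0.55, beta=4 */
      S_morgan   = 1/(1 + (x/0.55)**4);        /* Morgan et al. (1975) form            */
      S_logistic = 1/(1 + exp(4*log(x/0.5)));  /* log-logistic with lambda50 = 0.5     */
      output;
   end;
run;
/* plot the four curves against x, e.g., with proc gplot, to mimic Figure 5.29 */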

[Figure 5.29 appears here: four switch-off functions (Gompertz, Weibull, Morgan et al.
(1975), Logistic) decreasing from 1 to 0 against dosage or application rate x (0 to 1).]

Figure 5.29. Some switch-off functions S(x, θ) discussed in the text. The functions were
selected to have inflection points at x = 0.5.

[Figure 5.30 appears here: response (0 to 120) against dosage or application rate x (0 to 1),
showing the components S₁(x, θ₁) and S₂(x, θ₂), the combined curve
S₁(x, θ₁) + f(x, φ)S₂(x, θ₂), the hormetic zone, and the DMS and LDS dosages.]

Figure 5.30. Hormetic and nonhormetic dose response. LDS is the limiting dosage for
stimulation. DMS is the dosage of maximum stimulation.

Dose-response models without hormetic effect suggest monotonic changes in the response
with increasing or decreasing dosage. A hormetic effect is a deviation from this general
pattern; in the case of reduced response with increasing dose, a beneficial effect is usually
observed at low dosages (Figure 5.30).
   The method proposed by Schabenberger and Birch (2001) to incorporate hormetic be-
havior consists of combining a standard model without hormetic effect and a model for the
hormetic component. If S(x, θ) is a switch-off function and f(x, φ) is a monotonically
increasing function of dosage, then

    Y = μmin + (μmax − μmin)S₁(x, θ₁) + f(x, φ)S₂(x, θ₂)          [5.67]

is a hormetic model. The switching functions S₁(·) and S₂(·) will often be of the same func-
tional form, but this is not necessary. One might, for example, combine a Weibull switching
function to model the dose-response trend without hormesis with a hormetic component
f(x, φ)S₂(x, θ) where S₂(x, θ) is a logistic switching function. The Brain-Cousens model
(Brain and Cousens 1989)

    Y = δ + (α − δ + γx)/(1 + θexp{β ln(x)})

is a special case of [5.67] where S₁(x, θ) = S₂(x, θ) is a log-logistic switch-off function and
f(x, φ) = γx. When constructing hormetic models, f(x, φ) should be chosen so that
f(x, φ) = 0 for a known set of parameters. The absence of a hormetic effect can then be
tested. To prevent a beneficial effect at zero dose we would furthermore require that
f(0, φ) = 0. The hormetic model will exhibit a maximum for some dosage (the dosage of
maximum stimulation, Figure 5.30) if the equation

    (∂f(x, φ)/∂x) S₂(x, θ₂) = −f(x, φ) ∂S₂(x, θ₂)/∂x

has a solution in x.
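To make the condition concrete, consider the Brain-Cousens special case with f(x, φ) = γx and the log-logistic switch-off S₂(x, θ₂) = [1 + θx^β]⁻¹. This worked example is added here and simply follows the condition as stated above:

\[
\frac{\gamma}{1+\theta x^{\beta}}
= \gamma x \cdot \frac{\theta\beta x^{\beta-1}}{\left(1+\theta x^{\beta}\right)^{2}}
\;\Longrightarrow\;
1+\theta x^{\beta} = \theta\beta x^{\beta}
\;\Longrightarrow\;
x_{\mathrm{DMS}} = \left\{\theta(\beta-1)\right\}^{-1/\beta},
\]

so a dosage of maximum stimulation exists only for β > 1.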
The limiting dose for stimulation and the maximum dose of stimulation are only defined
for models with a hormetic zone (Figure 5.30). Dosages beyond this zone are interpreted in
the same fashion as for a nonhormetic model. This does not imply that the researcher can ig-
nore the presence of hormetic effects when only dosages beyond the hormetic zone (such as
LD₅₀) are of importance. Through simulation, Schabenberger and Birch (2001) demonstrate
the effects of ignoring hormesis. Biases of up to 13% in estimating λ₂₀, λ₅₀, and λ₇₅ were ob-
served when hormesis was not taken into account in modeling the growth response. The esti-
mate of the response at the limiting dose for stimulation also had severe negative bias. Once
the model accounted for hormesis through the switching function mechanism, these biases
were drastically reduced.
In the remainder of this application we fit a log-logistic model to the barnyardgrass data
from which Figure 5.27 was created. Table 5.15 shows the dosages of the two active ingre-
dients and the growth percentages averaged across eight independent replications at each
dosage. The data not averaged across replications are analyzed in detail in Schabenberger,
Tharp, Kells, and Penner (1999).

Table 5.15. Relative growth data for barnyard grass as a function of the concentration of
glufosinate and glyphosate† (A control dosage with 100% growth was added)

      Glufosinate (j = 1)   Growth    Glyphosate (j = 2)   Growth
i     kg ae/ha              %         kg ae/ha             %
1     0.018                 109.8     0.047                132.3
2     0.036                 100.2     0.094                132.9
3     0.071                  75.4     0.188                 14.5
4     0.143                  45.7     0.375                  5.3
5     0.286                   7.7     0.750                  1.9
6     0.572                   1.4     1.500                  1.8

† Data made kindly available by Dr. James J. Kells, Department of Crop and Soil Sciences,
Michigan State University. Used with permission.

To test whether either of the two responses is hormetic, we can fit the Brain-Cousens model

    E[Yᵢⱼ] = δⱼ + (αⱼ − δⱼ + γⱼxᵢⱼ)/(1 + θⱼexp{βⱼ ln(xᵢⱼ)})

to the combined data, where Yᵢⱼ denotes the response for ingredient j at dosage xᵢⱼ. The main
interest in this application is to compare the dosages that lead to 50% reduction in growth
response. The Brain-Cousens model is appealing in this regard since it allows fitting a hormetic
model to the glyphosate response and a standard log-logistic model to the glufosinate
response. It does not incorporate an effective dosage as a parameter in the parameterization
[5.65], however. Schabenberger et al. (1999) changed the parameterization of the Brain-
Cousens model to enable estimation of λ_K using the method of defining relationships dis-
cussed in §5.7.2. The model in which λ₅₀, for example, can be estimated whether the
response is hormetic (γ > 0) or not (γ = 0), is

    E[Yᵢⱼ] = δⱼ + (αⱼ − δⱼ + γⱼxᵢⱼ)/(1 + ωⱼexp{βⱼ ln(xᵢⱼ/λ₅₀ⱼ)}),
    ωⱼ = 1 + 2γⱼλ₅₀ⱼ/(αⱼ − δⱼ).                                   [5.68]

See Table 1 in Schabenberger et al. (1999) for other parameterizations that allow estimation
of general λ_K, LDS, and DMS in the Brain-Cousens model. The proc nlin statements to fit
model [5.68] are
proc nlin data=hormesis method=newton noitprint;
parameters alpha_glu=100 delta_glu=4 beta_glu=2.0 RD50_glu=0.2
alpha_gly=100 delta_gly=4 beta_gly=2.0 RD50_gly=0.2
gamma_glu=300 gamma_gly=300;
bounds gamma_glu > 0, gamma_gly > 0;
omega_glu = 1 + 2*gamma_glu*RD50_glu / (alpha_glu-delta_glu);
omega_gly = 1 + 2*gamma_gly*RD50_gly / (alpha_gly-delta_gly);
term_glu = 1 + omega_glu * exp(beta_glu*log(rate/RD50_glu));
term_gly = 1 + omega_gly * exp(beta_gly*log(rate/RD50_gly));
model barnyard =
(delta_glu + (alpha_glu - delta_glu + gamma_glu*rate) / term_glu ) *
(Tx = 'glufosinate') +
(delta_gly + (alpha_gly - delta_gly + gamma_gly*rate) / term_gly ) *
(Tx = 'glyphosate') ;
run;

The parameters are identified as *_glu for glufosinate and *_gly for the glyphosate
response. The bounds statement ensures that proc nlin constrains the estimates of the hor-
mesis parameters to be positive. The omega_* statements calculate ω₁ and ω₂ and the term_*
statements the denominator of model [5.68]. The model statement uses logical variables to
choose between the mean functions for glufosinate and glyphosate.
At this stage we are focusing on the parameter estimates for γ₁ and γ₂, as they represent the
hormetic component. The parameters are coded as gamma_glu and gamma_gly (Output 5.22).
The asymptotic 95% confidence interval for γ₁ includes 0 ([−1776.5, 3909.6]), whereas that
for γ₂ does not ([306.0, 1086.8]). This confirms the hormetic effect for the glyphosate re-
sponse. The p-value for the test of H₀: γ₁ = 0 vs. H₁: γ₁ > 0 can be calculated with

data pvalue; p=1-ProbT(1066.6/1024.0,4); run; proc print data=pvalue; run;

and turns out to be p = 0.1782, sufficiently large to dismiss the notion of hormesis for the
glufosinate response. The p-value for H₀: γ₂ = 0 versus H₁: γ₂ > 0 is p = 0.0038 and is ob-
tained similarly with the statements

data pvalue; p=1-ProbT(696.4/140.6,4); run; proc print data=pvalue; run;

Output 5.22.
The NLIN Procedure

NOTE: Convergence criterion met.

Estimation Summary
Method Newton
Iterations 11
Subiterations 7
Average Subiterations 0.636364
R 6.988E-7
PPC(beta_glu) 4.248E-8
RPC(gamma_glu) 0.00064
Object 1.281E-7
Objective 81.21472
Observations Read 14
Observations Used 14
Observations Missing 0

Sum of Mean Approx


Source DF Squares Square F Value Pr > F
Regression 10 85267.1 8526.7 198.00 <.0001
Residual 4 81.2147 20.3037
Uncorrected Total 14 85348.3
Corrected Total 13 36262.8

Approx Approximate 95% Confidence


Parameter Estimate Std Error Limits

alpha_glu 100.6 4.5898 87.8883 113.4


delta_glu -19.0717 17.9137 -68.8075 30.6642
beta_glu 1.7568 0.3125 0.8892 2.6245
RD50_glu 0.1434 0.0343 0.0482 0.2386
gamma_glu 1066.6 1024.0 -1776.5 3909.6
alpha_gly 99.9866 4.5055 87.4776 112.5
delta_gly 2.9241 2.6463 -4.4231 10.2712
beta_gly 6.2184 0.8018 3.9922 8.4445
RD50_gly 0.1403 0.00727 0.1201 0.1605
gamma_gly 696.4 140.6 306.0 1086.8

The model we focus on to compare λ₅₀ values among the two herbicides thus has a hor-
metic component for glyphosate (j = 2), but not for glufosinate (j = 1). Fitting these two
apparently different response functions with proc nlin is easy; all that is required is to
constrain γ₁ to zero in the previous code. This is not the only change we make to the proc
nlin statements. To obtain a test for H₀: λ₅₀(2) − λ₅₀(1) = 0, we code a new parameter Δλ
that measures this difference. In other words, instead of coding λ₅₀(1) as RD50_glu and λ₅₀(2)
as RD50_gly we use

    RD50_glu = λ₅₀(1)
    RD50_gly = λ₅₀(1) + Δλ.

We proceed similarly for other parameters, e.g.,

    beta_glu = β₁
    beta_gly = β₁ + Δβ.

The advantage of this coding method is that the parameters for which proc nlin gives estimates,
estimated asymptotic standard errors, and asymptotic confidence intervals express differen-
ces between the two herbicides. Output 5.23 was generated from the following statements.
proc nlin data=hormesis method=newton noitprint;
parameters alpha_glu=100 delta_glu=4 beta_glu=2.0 RD50_glu=0.2
alpha_dif=0 delta_dif=0 beta_dif=0 RD50_dif=0
gamma_gly=300;
bounds gamma_gly > 0;

alpha_gly = alpha_glu + alpha_dif;


delta_gly = delta_glu + delta_dif;
beta_gly = beta_glu + beta_dif;
RD50_gly = RD50_glu + RD50_dif;
gamma_glu = 0;

omega_glu = 1 + 2*gamma_glu*RD50_glu / (alpha_glu-delta_glu);


omega_gly = 1 + 2*gamma_gly*RD50_gly / (alpha_gly-delta_gly);

term_glu = 1 + omega_glu * exp(beta_glu*log(rate/RD50_glu));


term_gly = 1 + omega_gly * exp(beta_gly*log(rate/RD50_gly));

model barnyard =
(delta_glu + (alpha_glu - delta_glu + gamma_glu*rate) / term_glu ) *
(Tx = 'glufosinate') +
(delta_gly + (alpha_gly - delta_gly + gamma_gly*rate) / term_gly ) *
(Tx = 'glyphosate') ;
run;

Notice that

alpha_gly=100 delta_gly=4 beta_gly=2.0 RD50_gly=0.2

was removed from the parameters statement and replaced by

alpha_dif=0 delta_dif=0 beta_dif=0 RD50_dif=0.

The glyphosate parameters are then reconstructed below the bounds statement. A side effect
of this coding method is the ability to choose zeros as starting values for the Δ parameters,
assuming initially that the two treatments produce the same response.

Output 5.23.
The NLIN Procedure

NOTE: Convergence criterion met.

Estimation Summary
Method Newton
Iterations 11
Subiterations 10
Average Subiterations 0.909091
R 1.452E-8
PPC 7.24E-9
RPC(alpha_dif) 0.000055
Object 9.574E-9
Objective 125.6643
Observations Read 14
Observations Used 14
Observations Missing 0

Sum of Mean Approx


Source DF Squares Square F Value Pr > F

Regression 9 85222.6 9469.2 179.73 <.0001


Residual 5 125.7 25.1329
Uncorrected Total 14 85348.3
Corrected Total 13 36262.8

Approx Approximate 95% Confidence


Parameter Estimate Std Error Limits

alpha_glu 105.5 3.5972 96.2174 114.7


delta_glu -4.0083 6.3265 -20.2710 12.2543
beta_glu 2.1651 0.3710 1.2113 3.1188
RD50_glu 0.1229 0.0128 0.0899 0.1559
alpha_dif -5.4777 6.1699 -21.3376 10.3823
delta_dif 6.9324 6.9781 -11.0050 24.8699
beta_dif 4.0533 0.9662 1.5697 6.5369
RD50_dif 0.0174 0.0152 -0.0216 0.0564
gamma_gly 696.4 156.4 294.3 1098.6

The difference in λ₅₀ values between the two herbicides is positive (Output 5.23). The
λ₅₀ estimate for glufosinate is 0.1229 kg ae/ha and that for glyphosate is 0.1229 + 0.0174 =
0.1403 kg ae/ha. The difference is not statistically significant at the 5% level, since the
asymptotic 95% confidence interval for Δλ includes zero ([−0.0216, 0.0564]).
   Predicted values for the two herbicides are shown in Figure 5.31. The hormetic effect for
glyphosate is very pronounced. The negative estimate δ̂₁ = −4.0083 suggests that the lower
asymptote of relative growth percentages is negative, which is not very meaningful. Fortu-
nately, the growth responses do not achieve that lower asymptote across the range of dosages
observed, which defuses this issue.

[Figure 5.31 appears here: relative growth % (0 to 120) against ln(dose) (ln kg ae/ha,
−4 to 0), with the glufosinate data and its log-logistic fit and the glyphosate data and its
hormetic fit.]

Figure 5.31. Predicted responses for glufosinate and glyphosate for barnyard grass data. Esti-
mated λ₅₀ values of 0.1229 kg ae/ha and 0.1403 kg ae/ha are also shown.

What will happen if the hormetic effect for the glyphosate response is ignored, that is,
we fit a log-logistic model

    E[Yᵢⱼ] = δⱼ + (αⱼ − δⱼ)/(1 + exp{βⱼ ln(xᵢⱼ/λ₅₀ⱼ)})?

The solid line in Figure 5.31 will be forced to decrease monotonically like the dashed line.
Because of the solid circles in excess of 100%, the glyphosate model will attempt to stay ele-
vated for as long as possible and then decline sharply toward the solid circles on the right
(Figure 5.32). The estimate of β₂ will be large and have very low precision. Also, the residual
sum of squares should increase dramatically.
All of these effects are apparent in Output 5.24 which was generated by the statements
proc nlin data=hormesis method=newton noitprint;
parameters alpha_glu=100 delta_glu=4 beta_glu=2.0 RD50_glu=0.122
alpha_dif=0 delta_dif=0 beta_dif=0 RD50_dif=0;

alpha_gly = alpha_glu + alpha_dif;


delta_gly = delta_glu + delta_dif;
beta_gly = beta_glu + beta_dif;
RD50_gly = RD50_glu + RD50_dif;

term_glu = 1 + exp(beta_glu*log(rate/RD50_glu));
term_gly = 1 + exp(beta_gly*log(rate/RD50_gly));

model barnyard = (delta_glu + (alpha_glu - delta_glu) / term_glu ) *


(Tx = 'glufosinate') +
(delta_gly + (alpha_gly - delta_gly) / term_gly ) *
(Tx = 'glyphosate') ;
run;

Output 5.24.
The NLIN Procedure

NOTE: Convergence criterion met.

Estimation Summary
Method Newton
Iterations 77
Subiterations 16
Average Subiterations 0.207792
R 9.269E-6
PPC(beta_dif) 0.002645
RPC(beta_dif) 0.005088
Object 8.27E-11
Objective 836.5044
Observations Read 14
Observations Used 14
Observations Missing 0

Sum of Mean Approx


Source DF Squares Square F Value Pr > F
Regression 8 84511.8 10564.0 36.30 0.0002
Residual 6 836.5 139.4
Uncorrected Total 14 85348.3
Corrected Total 13 36262.8

Approx Approximate 95% Confidence


Parameter Estimate Std Error Limits

alpha_glu 105.5 8.4724 84.7330 126.2


delta_glu -4.0083 14.9006 -40.4688 32.4521
beta_glu 2.1651 0.8738 0.0268 4.3033
RD50_glu 0.1229 0.0302 0.0489 0.1969
alpha_dif 16.2837 10.8745 -10.3252 42.8926
delta_dif 7.0527 16.3860 -33.0424 47.1477
beta_dif 31.7039 4595.8 -11213.8 11277.2
RD50_dif 0.0531 1.5800 -3.8131 3.9193

[Figure 5.32: observed Relative Growth % against ln(dose) (ln kg ae/ha) with log-logistic fits for both glufosinate and glyphosate.]

Figure 5.32. Fit of log-logistic model to barnyard grass data.



The parameter Δβ is estimated as 31.7039 and hence β̂2 = 2.1651 + 31.7039 = 33.869. This is a very unreasonable value and its estimated asymptotic standard error

     ese(Δ̂β) = 4,595.8

is nonsensical. Notice, however, that the fit of the model to the glufosinate data has not changed (compare to Output 5.23, and compare Figures 5.31 and 5.32).

5.8.7 Modeling a Yield-Density Relationship


Mead (1970) notes the long history of research into the relationship between crop yield and plant density. For a considerable time the area of most substantial progress and practical importance was the study of competition among plants of a single crop as a function of their density. Mead (1979) classifies experiments in this area into yield-density and genotype competition studies. In this subsection we briefly review the theory behind some common yield-density models and provide a detailed analysis of a data set examined earlier by Mead (1970). There is little to be added to the thorough analysis of these data; this subsection will illustrate the main steps and their implementation with The SAS® System. Additional yield-density models and background material can be found in §A5.9.1 and the references cited there.
To fix ideas let Y denote the yield per plant of a species and x the density per unit area at which the species grows. The product U = Yx then measures the yield per unit area. Two main theories of intraspecies competition are reflected in common yield-density models:
• E[U] is an increasing function of x that reaches an asymptote for some density x*. Decreasing the density below x* will decrease area yield. For densities above x*, yield per plant decreases at the same rate as the density increases, holding the yield per unit area constant (Figure 5.33a). Yield-density relationships of this kind are termed asymptotic and have been established for peas (Nichols and Nonnecke 1974), tomatoes (Nichols et al. 1973), dwarf beans (Nichols 1974a), and onions (Mead 1970), among other crops. The asymptotic yield can be interpreted as the growth limit for a species in a particular environment.
• E[U] is a parabolic function without an asymptote, but with a maximum that occurs at density x_max (Figure 5.33b). For densities in excess of x_max the yield per plant decreases more quickly than the density increases, reducing the unit area yield. Parabolic relationships were established, for example, for parsnips (Bleasdale and Thompson 1966), sweet corn (Nichols 1974b), and cotton (Hearn 1972).

In either case the yield per plant is a convex function of plant density. Several parameters are of particular interest in the study of yield-density models. If Y(x) tends to a constant λ_s as density tends toward 0, λ_s reflects the species' potential in the absence of competition from other plants. Similarly, if U(x) tends to a constant λ_e as density increases toward infinity, λ_e measures the species' potential under increased competition for environmental resources. Ratkowsky (1983, p. 50) terms λ_s the genetic potential and λ_e the environmental potential of the species. For asymptotic relationships agronomists are often interested not only in λ_s and λ_e but also in the density that produces a certain percentage of the asymptotic yield. In Figure 5.33a the density at which 80% of the asymptotic yield (U(∞) = 1.25) is achieved is shown as x_0.8. If the yield-density relationship is parabolic, the density related to a certain percentage of U_max is usually not of interest, in part because it may not be uniquely determined. The modeler is instead interested in the density at which unit area yield is maximized (x_max in Figure 5.33b).

[Figure 5.33: yield per area U(x) and yield per plant Y(x) against density x; panel a) α = 1.0, β = 0.8, θ = 1 with x_0.8 marked; panel b) α = 0.7, β = 0.35, θ = 0.5 with x_max marked.]

Figure 5.33. Asymptotic (a) and parabolic (b) yield-density relationships. U(x) denotes yield per unit area at density x, Y(x) denotes yield per plant at density x. Both models are based on the Bleasdale-Nelder model U(x) = x(α + βx)^(−1/θ) discussed in the text and in §A5.9.1.

The most basic nonlinear yield-density model is the reciprocal simple linear regression

     E[Y] = (α + βx)^(−1)

due to Shinozaki and Kira (1956). Its area yield function E[U] = x(α + βx)^(−1) is strictly asymptotic with genetic potential λ_s = 1/α and asymptotic yield per unit area U(∞) = 1/β. In applications one may not want to restrict modeling efforts from the outset to asymptotic relationships and employ a model which allows both asymptotic and parabolic relationships, depending on parameter values. A simple extension of the Shinozaki-Kira model is known as the Bleasdale-Nelder model (Bleasdale and Nelder 1960),

     E[Y] = (α + βx)^(−1/θ)
     E[U] = x(α + βx)^(−1/θ).

This model is extensively discussed in Mead (1970), Gillis and Ratkowsky (1978), Mead (1979), and Ratkowsky (1983). A more general form with four parameters that was originally proposed is discussed in §A5.9.1. For θ = 1 the model is asymptotic and for θ < 1 parabolic (Figure 5.33). A one-sided statistical test of H0: θ = 1 vs. H1: θ < 1 allows testing for asymptotic vs. parabolic structure of the relationship between U and x. Because of its biological relevance, such a test should always be performed when the Bleasdale-Nelder



model is considered. Values of θ greater than one are not permissible since U(x) does not have a maximum then. In the Bleasdale-Nelder model with θ = 1 the density at which K% of the asymptotic yield (1/β) is achieved is given by

     x_(K/100) = (α/β) × (K/100)/(1 − K/100)

and in the parabolic case (θ < 1) the maximum yield per unit area of

     U_max = (θ/β) × ((1 − θ)/α)^((1−θ)/θ)

is obtained at density x_max = (α/β) × {θ/(1 − θ)}.
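As a numerical check, consider the parameter values used in Figure 5.33. In panel a) (α = 1.0, β = 0.8, θ = 1) the asymptote is U(∞) = 1/β = 1.25 and

     x_0.8 = (α/β) × (0.8/0.2) = (1.0/0.8) × 4 = 5.0,

the density marked in the figure. In panel b) (α = 0.7, β = 0.35, θ = 0.5),

     x_max = (0.7/0.35) × (0.5/0.5) = 2.0,   U_max = (0.5/0.35) × (0.5/0.7)^1 ≈ 1.02,

consistent with the position and height of the maximum of U(x) in the figure.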


Ratkowsky (1983, 1990) established rather severe curvature effects in this model, whereas the Shinozaki-Kira model behaves close-to-linear. Through simulation studies Gillis and Ratkowsky (1978) showed that even if the true relationship between yield and density is Y = (α + βx)^(−1/θ), fitting the Bleasdale-Nelder model may incur non-negligible bias. These authors prefer the model

     E[Y] = (α + βx + γx²)^(−1)

due to Holliday (1960), which also allows parabolic and asymptotic relationships and whose parameter estimators show close-to-linear behavior.
When fitting yield-density models, care should be exercised because the variance of the plant yield Y typically increases with the yield. Mead (1970) states that the assumption

     Var[Y] = σ²E[Y]²

is often tenable. Under these circumstances the logarithm of Y has approximately constant variance. If f(x, θ) is the yield-density model, one approach to estimation of the parameters θ is to fit

     E[ln{Y}] = ln{f(x, θ)}

assuming that the errors of this model are zero-mean Gaussian random variables with constant variance σ². For the Bleasdale-Nelder model this leads to

     E[ln{Y}] = ln{(α + βx)^(−1/θ)} = −(1/θ) ln{α + βx}.
Alternatively, one can fit the model Y = f(x, θ) assuming that Y follows a distribution with variance proportional to E[Y]². The family of Gamma distributions has this property, for example. Here we will use the logarithmic transformation and revisit the fitting of yield-density models in §6.7.3 under the assumption of Gamma-distributed yields.
     The data in Table 5.16 and Figure 5.34 represent yields per plant of three onion varieties grown at varying densities. There were three replicates of each density; the data values represent their averages. An exploratory graph of 1/Y versus density shows that a linear relationship is not unreasonable, confirmed by a loess smooth of 1/Y vs. x (Figure 5.34).



Table 5.16. Onion yield-density data of Mead (1970)
(y represents observed yield per plant in grams, x the density in plants × ft⁻²)

              Variety 1          Variety 2          Variety 3
            y        x         y        x         y        x
        105.6     3.07     131.6     2.14     116.8     2.48
         89.4     3.31     109.1     2.65      91.6     3.53
         71.0     5.97      93.7     3.80      72.7     4.45
         60.3     6.99      72.2     5.24      52.8     6.23
         47.6     8.67      53.1     7.83      48.8     8.23
         37.7    13.39      49.7     8.72      39.1     9.59
         30.3    17.86      37.8    10.11      30.3    16.87
         24.2    21.57      33.3    16.08      24.2    18.69
         20.8    28.77      24.5    21.22      20.0    25.74
         18.5    31.08      18.3    25.71      16.3    30.33

Reproduced from Table 2 in Mead (1970). Copyright © 1970 by the Royal Statistical Society. Used with permission of Blackwell Publishers, Oxford, UK.

[Figure 5.34: inverse plant yield (grams⁻¹) against density x, one panel per variety (Variety 1, 2, 3), each with a dashed loess smooth.]

Figure 5.34. Relationships between inverse plant yield and plant densities for three onion varieties. Dashed line is a nonparametric loess fit. Data from Mead (1970).

The relationship between U and density is likely of an asymptotic nature for any of the three varieties. The analysis commences with a fit of the full model

     ln{Y_ij} = ln{(α_i + β_i x_ij)^(−1/θ_i)} + e_ij,     [5.69]

where the subscript i = 1,…,3 denotes the varieties and x_ij is the jth density (j = 1,…,10) at which the yield of variety i was observed. The following hypotheses are to be addressed subsequently:



① H0: θ1 = θ2 = θ3 = 1 ⟺ the three varieties do not differ in the parameter θ and the relationship is asymptotic.
② If the hypothesis in ① is rejected, then we test separately H0: θ_i = 1 to find out which variety exhibits parabolic behavior. The alternative hypothesis for these tests should be H1: θ_i < 1. If the test in ① was not rejected, we proceed with a modified full model
     ln{Y_ij} = −ln{α_i + β_i x_ij} + e_ij.
③ H0: α1 = α2 = α3 ⟺ The varieties do not differ in the parameter α. Depending on the outcomes of ① and ② the genetic potential is estimated as α_i^(−1/θ), α_i^(−1), or α_i^(−1/θ_i). If the hypothesis H0: α1 = α2 = α3 is rejected one can proceed to test the parameters in pairs.
④ H0: β1 = β2 = β3 ⟺ The varieties do not differ in the parameter β. Proceed as in ③.

Prior to tackling these hypotheses the full model should be tested against the most reduced model

     ln{Y_ij} = ln{(α + βx_ij)^(−1/θ)} + e_ij

or even

     ln{Y_ij} = ln{(α + βx_ij)^(−1)} + e_ij

to determine whether there are any differences among the three varieties. In addition we are also interested in obtaining confidence intervals for the genetic potential λ_s, for the density x_max should the relationship be parabolic, and for the density x_0.9 and the asymptote U(∞) should the relationship be asymptotic. In order to obtain these intervals one could proceed by reparameterizing the model such that λ_s, x_max, x_0.9, and U(∞) are parameters and refitting the resulting model(s). Should the relationship be parabolic this proves difficult since λ_s, for example, is a function of both α and θ. Instead we fit the final model with the nlmixed procedure of The SAS® System, which permits the estimation of arbitrary functions of the model parameters and calculates the standard errors of the estimated functions by the delta method.
The full model is fit with the SAS® statements
proc nlin data=onions method=marquardt;
parameters a1=5.4 a2=5.4 a3=5.4
b1=1.7 b2=1.7 b3=1.7
t1=1 t2=1 t3=1;
term1 = (-1/t1)*log(a1 + b1*density);
term2 = (-1/t2)*log(a2 + b2*density);
term3 = (-1/t3)*log(a3 + b3*density);
model logyield = term1*(variety=1)+term2*(variety=2)+term3*(variety=3);
run;

Starting values for the α_i and β_i were obtained by assuming θ1 = θ2 = θ3 = 1 and fitting the inverse relationship 1/(Y/1000) = α + βx with a simple linear regression package. For scaling purposes, the plant yield was expressed in kilograms rather than in grams as in Table 5.16. The full model achieves a residual sum of squares of S(θ̂) = 0.1004 on 21 degrees of freedom. The asymptotic model ln{Y_ij} = −ln{α + βx_ij} + e_ij with common potentials has a residual sum of squares of 0.2140 on 28 degrees of freedom. The initial test for determining whether there are any differences in yield-density response among the three varieties rejects the null hypothesis:

     F_obs = ((0.2140 − 0.1004)/7) / (0.1004/21) = 3.394,   p = 0.0139.

The model restricted under H0: θ1 = θ2 = θ3 = 1 is fit with the statements
proc nlin data=onions method=marquardt;
parameters a1=5.4 a2=5.4 a3=5.4 b1=1.7 b2=1.7 b3=1.7;
term1 = -log(a1 + b1*density); term2 = -log(a2 + b2*density);
term3 = -log(a3 + b3*density);
model logyield = term1*(variety=1)+term2*(variety=2)+term3*(variety=3);
run;

and achieves S(θ̂)_H0 = 0.1350 on 24 degrees of freedom. The F-test has test statistic F_obs = (0.1350 − 0.1004)/(3 × 0.1004/21) = 2.412 and p-value p = Pr(F_{3,21} > 2.412) = 0.095. At the 5% significance level H0 cannot be rejected and the model

     ln{Y_ij} = −ln{α_i + β_i x_ij} + e_ij

is used as the full model henceforth. Varying β and fixing α at a common value for the varieties is accomplished with the statements
proc nlin data=onions method=marquardt;
parameters a=4.5 b1=1.65 b2=1.77 b3=1.90;
term1 = -log(a + b1*density); term2 = -log(a + b2*density);
term3 = -log(a + b3*density);
model logyield = term1*(variety=1)+term2*(variety=2)+term3*(variety=3);
run;

This model has S(θ̂)_H0 = 0.1519 with 26 residual degrees of freedom. The test for H0: α1 = α2 = α3 leads to F_obs = (0.1519 − 0.1350)/(2 × 0.1350/24) = 1.50 (p = 0.243). It is reasonable to assume that the varieties share a common genetic potential. Similarly, the invariance of the β_i can be tested with
proc nlin data=onions method=marquardt;
parameters a1=5.4 a2=5.4 a3=5.4 b=1.7;
term1 = -log(a1 + b*density); term2 = -log(a2 + b*density);
term3 = -log(a3 + b*density);
model logyield = term1*(variety=1)+term2*(variety=2)+term3*(variety=3);
run;

This model achieves S(θ̂)_H0 = 0.1843 with 26 residual degrees of freedom. The test for H0: β1 = β2 = β3 leads to F_obs = (0.1843 − 0.1350)/(2 × 0.1350/24) = 4.38 (p = 0.024). The notion of invariant β parameters is rejected.
     We are now in a position to settle on a final model for the onion yield-density data,

     ln{Y_ij} = −ln{α + β_i x_ij} + e_ij.     [5.70]

Had we started with the Holliday model instead of the Bleasdale-Nelder model, the initial test for an asymptotic relationship would have yielded F_obs = 1.575 (p = 0.225) and the final model would have been the same.
We are now interested in calculating confidence intervals for the common genetic potential (1/α), the variety-specific environmental potentials (1/β_i), the density that produces 90% of the asymptotic yield (x_0.9), and the magnitude of 90% of the asymptotic yield (U_0.9). Furthermore we want to test for significant varietal differences in the environmental potentials and the x_0.9 values. The quantities of interest are functions of the parameters α and β_i. Care must be exercised to calculate the appropriate standard errors. For example, if ese(α̂) is the estimated standard error of α̂, then ese(1/α̂) ≠ 1/ese(α̂). The correct standard errors of nonlinear functions of the parameters can be obtained by the delta method. The SAS® procedure nlmixed enables the estimation of any function of the parameters in a nonlinear model by this method. Nlmixed is designed to fit nonlinear models to clustered data where the model contains more than one random effect (§8), but can also be employed to fit regular nonlinear models. A slight difference lies in the estimation of the residual variance Var[e_ij] = σ² between the nlmixed and nlin procedures. To ensure that the two analyses agree, we force proc nlmixed to use the residual mean square estimate σ̂² = 0.1519/26 = 0.00584 of the finally selected model. The complete statements are
proc nlmixed data=onions df=26;
parameters a=4.5 b1=1.65 b2=1.77 b3=1.90;
s2=0.00584;
term1 = -log(a + b1*density);
term2 = -log(a + b2*density);
term3 = -log(a + b3*density);
meanmodel = term1*(variety=1)+term2*(variety=2)+term3*(variety=3);
model logyield ~ Normal(meanmodel,s2);
estimate 'x_09 (1)' (a/b1)*(0.9/0.1);
estimate 'x_09 (2)' (a/b2)*(0.9/0.1);
estimate 'x_09 (3)' (a/b3)*(0.9/0.1);

estimate 'U_09 (1)' 1000*(0.9/b1);


estimate 'U_09 (2)' 1000*(0.9/b2);
estimate 'U_09 (3)' 1000*(0.9/b3);

estimate 'genetic potential ' 1000/a; /* common genetic potential */


estimate 'U(infinity) (1) ' 1000/b1; /* U asymptote variety 1 */
estimate 'U(infinity) (2) ' 1000/b2; /* U asymptote variety 2 */
estimate 'U(infinity) (3) ' 1000/b3; /* U asymptote variety 3 */

/* Test differences among the yield per unit area asymptotes */

estimate 'U(i)(1) - U(i)(2)' 1000/b1 - 1000/b2;


estimate 'U(i)(1) - U(i)(3)' 1000/b1 - 1000/b3;
estimate 'U(i)(2) - U(i)(3)' 1000/b2 - 1000/b3;

/* Test differences among the densities producing 90% of asympt. yield */

estimate 'x_09(1) - x_09(2) ' (a/b1)*(0.9/0.1) - (a/b2)*(0.9/0.1);


estimate 'x_09(1) - x_09(3) ' (a/b1)*(0.9/0.1) - (a/b3)*(0.9/0.1);
estimate 'x_09(2) - x_09(3) ' (a/b2)*(0.9/0.1) - (a/b3)*(0.9/0.1);
run;

The final parameter estimates are α̂ = 4.5364, β̂1 = 1.6611, β̂2 = 1.7866, and β̂3 = 1.9175 (Output 5.25). The predicted yield per area at density x (Figure 5.35) is thus calculated (in grams × ft⁻²) as

     Variety 1: Û = 1000 × x(4.5364 + 1.6611x)^(−1)
     Variety 2: Û = 1000 × x(4.5364 + 1.7866x)^(−1)
     Variety 3: Û = 1000 × x(4.5364 + 1.9175x)^(−1).



Output 5.25. The NLMIXED Procedure

Specifications
Data Set WORK.ONIONS
Dependent Variable logyield
Distribution for Dependent Variable Normal
Optimization Technique Dual Quasi-Newton
Integration Method None

Iteration History

Iter Calls NegLogLike Diff MaxGrad Slope


1 4 -36.57033 0.173597 1.145899 -90.5784
2 7 -36.572828 0.002499 0.207954 -1.30563
3 9 -36.574451 0.001623 0.513823 -0.046
4 11 -36.576213 0.001763 0.072027 -0.37455
5 12 -36.576229 0.000016 0.000596 -0.00003
6 14 -36.576229 1.604E-9 9.455E-7 -3.21E-9
NOTE: GCONV convergence criterion satisfied.

Parameter Estimates

Standard
Parameter Estimate Error DF t Value Pr > |t| Lower Upper
a 4.5364 0.3467 26 13.08 <.0001 3.8237 5.2492
b1 1.6611 0.06212 26 26.74 <.0001 1.5334 1.7888
b2 1.7866 0.07278 26 24.55 <.0001 1.6370 1.9362
b3 1.9175 0.07115 26 26.95 <.0001 1.7713 2.0637

Additional Estimates

Standard t
Label Estimate Error DF Value Pr > |t| Lower Upper
x_09 (1) 24.57 2.5076 26 9.80 <.0001 19.4246 29.733
x_09 (2) 22.85 2.4217 26 9.44 <.0001 17.8741 27.829
x_09 (3) 21.29 2.1638 26 9.84 <.0001 16.8447 25.740

U_09 (1) 541.81 20.2611 26 26.74 <.0001 500.17 583.46


U_09 (2) 503.74 20.5197 26 24.55 <.0001 461.56 545.92
U_09 (3) 469.36 17.4148 26 26.95 <.0001 433.57 505.16

genetic potential 220.44 16.8493 26 13.08 <.0001 185.80 255.07

U(infinity) (1) 602.01 22.5123 26 26.74 <.0001 555.74 648.29


U(infinity) (2) 559.71 22.7997 26 24.55 <.0001 512.85 606.58
U(infinity) (3) 521.51 19.3498 26 26.95 <.0001 481.74 561.29

U(i)(1) - U(i)(2) 42.30 26.1924 26 1.62 0.1184 -11.5377 96.140


U(i)(1) - U(i)(3) 80.50 24.8330 26 3.24 0.0032 29.4559 131.55
U(i)(2) - U(i)(3) 38.19 24.5919 26 1.55 0.1324 -12.3500 88.748

x_09(1) - x_09(2) 1.72 1.0716 26 1.61 0.1191 -0.4756 3.9298


x_09(1) - x_09(3) 3.28 1.0628 26 3.09 0.0047 1.1021 5.4712
x_09(2) - x_09(3) 1.55 1.0257 26 1.52 0.1404 -0.5487 3.6679

The genetic potential is estimated as 220.44 grams with asymptotic 95% confidence interval [185.8, 255.1]. The estimated yield per unit area asymptotes for varieties 1, 2, and 3 are 602.01, 559.71, and 521.51 grams × ft⁻², respectively. Only varieties 1 and 3 differ significantly in this parameter (p = 0.0032) and in the density that produces 90% of the asymptotic yield (p = 0.0047).
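To illustrate the delta-method standard errors in Output 5.25, consider the genetic potential 1000/α. A first-order expansion gives

     ese(1000/α̂) ≈ (1000/α̂²) × ese(α̂) = (1000 × 0.3467)/4.5364² = 16.85,

in agreement with the standard error 16.8493 reported for the estimate 220.44.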



[Figure 5.35: observed and predicted yield per area (U in grams × ft⁻²) against density x for the three varieties, with x_0.9 marked.]

Figure 5.35. Observed and predicted values for onion yield-density data.

5.8.8 Weighted Nonlinear Least Squares Analysis with Heteroscedastic Errors
Heterogeneity of the error variance in nonlinear models presents similar problems as in linear models. In §5.6.2 we discussed various approaches to remedy the heteroscedasticity problem. One relied on applying a Box-Cox type transformation to both sides of the model that stabilizes the variance and fitting the transformed model. The second approach leaves the response unchanged but fits the model by weighted nonlinear least squares where the weights reflect the degree of variance heterogeneity. We generally prefer the second approach over the power-transformation approach because it does not alter the response and the results are interpreted on the original scale of measurement. However, it does require that the relationship between variability and the response can be expressed in mathematical terms so that weights can be defined. In the setting of §5.6.2 the weighted least squares approach relied on the fact that in the model

     Y_i = f(x_i, θ) + e_i

the errors e_i are uncorrelated with variance proportional to a power of the mean,

     Var[e_i] = σ²f(x_i, θ)^λ = σ²E[Y_i]^λ.

Consequently, e_i* = e_i/f(x_i, θ)^(λ/2) will have constant variance σ², and a weighted least squares approach would obtain updates of the parameter estimates as

     θ̂_(u+1) = θ̂_u + (F̂′W⁻¹F̂)⁻¹ F̂′W⁻¹(y − f(x, θ̂_u)),

where W is a diagonal matrix of weights f(x_i, θ)^λ. Since the weights depend on θ they should be updated whenever the parameter vector is updated, i.e., at every iteration. This problem can be circumvented when the variance of the model errors is not proportional to a power of the mean, but proportional to some other function g(x_i) that does not depend on the parameters of the model. In this case e_i* = e_i/√g(x_i) will be the variance-stabilizing transform for the errors. But how can we find this function g(x_i)? One approach is by trial and error: try different weight functions, examine the weighted nonlinear residuals, and settle on the weight function that stabilizes the residual variation. The approach we demonstrate here is also an ad hoc procedure, but it utilizes the data more.
Before going into further details, we take a look at the data and the model we try to fit. The Richards curve (Richards 1959) is a popular model for depicting plant growth owing to its flexibility and simple interpretation. It is not known, however, for excellent statistical properties of its parameter estimates. The Richards model — also known as the Chapman-Richards model (Chapman 1961) — exists in a variety of parameterizations, for example,

     Y_i = α(1 − e^(βx_i))^γ + e_i.     [5.71]

Here, α is the maximum growth achievable (upper asymptote), β is the rate of growth, and the parameter γ determines the shape of the curve near the origin. For γ > 1 the shape is sigmoidal. The covariate x in the Richards model is often the age of an organism, or a measure of its size. Values for γ in the neighborhood of 1.0 are common, as are values for β of approximately −0.5.

[Figure 5.36: tree height (m) against age (years); scatter of 100 simulated Sitka spruce trees.]

Figure 5.36. Height in meters of 100 Sitka spruce trees as a function of tree age in years. Data generated according to discussion in Rennolls (1993).

Inspired by results in Rennolls (1993), the simulated heights in meters of 100 Sitka spruces (Picea sitchensis (Bong.) Carr.) are plotted as a function of their age in years in Figure 5.36. There appears to be an upper height asymptote at approximately 45 meters but it is not obvious whether the model should be sigmoidal or not. Fitting the Richards model [5.71] with x_i = age_i, the age of the ith tree, we can let the parameter γ guide whether the shape of the growth response is sigmoidal near the origin. There is clearly variance heterogeneity in these data as the height of older trees varies more than the height of younger trees.
In §4.5.2 we attempted to unlock the relationship between variances and means in an experimental design situation by fitting a linear model between the log sample standard deviation and the log treatment means. This was made possible by having replicate values for each treatment from which the variance in the treatment group could be estimated. In a regression context such as this, there are no or only a few replicate values. One possibility is to group nearby ages and estimate the sample variance within the groups. The groups should be chosen large enough to allow a reasonably stable estimate of the variance and small enough so that the assumption of a constant mean within the group is tenable. An alternative approach is to fit the model by ordinary nonlinear least squares and to plot the squared fitted residuals against the regressor, trying to glean the error variance/regressor relationship from this plot (Figure 5.38c).
     Applying the grouping approach with six points per group, the sample variances in the groups are plotted against the square roots of the average ages in the groups in Figure 5.37. That the variance is proportional to √age is not an unreasonable assumption based on the figure. The diagonal weight matrix W for nonlinear weighted least squares should have entries age_i^(1/2).
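A sketch of the grouping computation in SAS® (the data set spruce with variables age and height is the one fit below; the group size of six is the choice discussed above):

proc sort data=spruce; by age; run;
data grouped;
   set spruce;
   group = ceil(_n_/6);    /* six consecutive trees per group */
run;
proc means data=grouped noprint;
   by group;
   var age height;
   output out=groupstat mean(age)=mean_age var(height)=s2_height;
run;

Plotting s2_height (or its square root) against sqrt(mean_age) reproduces Figure 5.37.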

[Figure 5.37: sample variance by group against √(age).]

Figure 5.37. Sample standard deviations for groups of six observations against the square root of tree age.

To obtain a weighted nonlinear least squares analysis in SAS®, we call upon the _weight_ variable in proc nlin. The _weight_ variable can refer to a variable in the data set or to a valid SAS® expression. SAS® calculates it for each observation and assigns the reciprocal values as diagonal elements of the W matrix. The statements (Output 5.26) to model the variances of an observation as a multiple of the root ages are


proc nlin data=spruce;
parameters alpha=50 beta=-0.05 gamma=1.0;
model height = alpha*(1-exp(beta*age))**gamma;
_weight_ = 1/sqrt(age);
run;

Output 5.26.
The NLIN Procedure

NOTE: Convergence criterion met.

Estimation Summary
Method Gauss-Newton
Iterations 5
R 6.582E-7
PPC(gamma) 5.639E-7
RPC(gamma) 0.000015
Object 6.91E-10
Objective 211.0326
Observations Read 100
Observations Used 100
Observations Missing 0

Sum of Mean Approx


Source DF Squares Square F Value Pr > F
Regression 3 18008.7 6002.9 2759.19 <.0001
Residual 97 211.0 2.1756
Uncorrected Total 100 18219.7
Corrected Total 99 5641.2

Approx Approximate 95% Confidence


Parameter Estimate Std Error Limits
alpha 44.9065 1.5307 41.8684 47.9446
beta -0.0682 0.00822 -0.0845 -0.0519
gamma 1.5222 0.1533 1.2179 1.8265

Figure 5.38 displays the residuals from the ordinary nonlinear least squares analysis [(a) and (c)] and the weighted residuals from the weighted analysis (b). The heterogeneous variances in the plot of unweighted residuals are apparent, and the weighted residuals show a homogeneous band as residuals of a proper model should. The squared ordinary residuals plotted against the regressor (panel c) do not lend themselves easily to discerning how the error variance relates to the regressor. Because of the tightness of the observations for young trees (Figure 5.36) many residuals are close to zero, and a few large residuals for older trees overpower the plot. It is because we find this plot of squared residuals hard to interpret for these data that we prefer the grouping approach in this application.
     What has been gained by applying weighted nonlinear least squares instead of ordinary nonlinear least squares? It turns out that the parameter estimates differ little between the two analyses and the predicted trends will be almost indistinguishable from each other. The culprit, as so often is the case, is the estimation of the precision of the coefficients and the precision of the predictions. The variance of an observation around the mean trend is assumed to be constant in ordinary nonlinear least squares. From Figure 5.36 it is obvious that this variability is small for young trees and grows with tree age. The estimate of the common variance will then be too small for older trees and too large for younger trees. This effect becomes apparent when we calculate prediction (or confidence) intervals for the weighted and unweighted analyses (Figure 5.39). The 95% prediction intervals from the weighted analysis are narrower for young trees than the intervals from the unweighted analysis, since they take into account the actual variability around the regression line. The opposite effect can be observed for older trees: the unweighted analysis underestimates variability about the regression and its prediction intervals are not wide enough.

[Figure 5.38: panel a) unweighted ordinary residuals vs. age; panel b) weighted residuals vs. age; panel c) squared unweighted residuals vs. age.]

Figure 5.38. Residuals from an ordinary unweighted nonlinear least squares analysis (a), a weighted nonlinear least squares analysis (b), and the square of the ordinary residuals plotted against the regressor (c).

[Figure 5.39: tree height (m) against age, with the weighted fit and 95% prediction bands from both analyses.]

Figure 5.39. Predictions of tree height in weighted nonlinear least squares analysis (solid line in center of point cloud). Upper and lower solid lines are 95% prediction intervals in weighted analysis, dashed lines are 95% prediction intervals from ordinary nonlinear least squares analysis.



Chapter 6

Generalized Linear Models

“The objection is primarily that the theory of errors assumes that errors are purely random, i.e., 1. errors of all magnitudes are possible; 2. smaller errors are more likely to occur than larger ones; 3. positive and negative errors of equal absolute value are equally likely. The theory of Rodewald makes insufficient accommodation of this theory; only two errors are possible, in particular: if K is the germination percentage of seeds, the errors are −K and 1 − K.” J. C. Kapteyn, objecting to Rodewald's discussion of seed germination counts in terms of binomial probabilities. In Rodewald, H. Zur Methodik der Keimprüfungen, Die Landwirtschaftlichen Versuchs-Stationen, vol. 49, p. 260. (Quoted in German; translated by the first author.)

6.1 Introduction
6.2 Components of a Generalized Linear Model
6.2.1 Random Component
6.2.2 Systematic Component and Link Function
6.2.3 Generalized Linear Models in The SAS® System
6.3 Grouped and Ungrouped Data
6.4 Parameter Estimation and Inference
6.4.1 Solving the Likelihood Problem
6.4.2 Testing Hypotheses about Parameters and Their Functions
6.4.3 Deviance and Pearson's X² Statistic
6.4.4 Testing Hypotheses through Deviance Partitioning
6.4.5 Generalized R² Measures of Goodness-of-Fit
6.5 Modeling an Ordinal Response
6.5.1 Cumulative Link Models
6.5.2 Software Implementation and Example



6.6 Overdispersion
6.7 Applications
6.7.1 Dose-Response and LD50 Estimation in a Logistic Regression Model
6.7.2 Binomial Proportions in a Randomized Block Design —
the Hessian Fly Experiment
6.7.3 Gamma Regression and Yield Density Models
6.7.4 Effects of Judges' Experience on Bean Canning Quality Ratings
6.7.5 Ordinal Ratings in a Designed Experiment with Factorial
Treatment Structure and Repeated Measures
6.7.6 Log-Linear Modeling of Rater Agreement
6.7.7 Modeling the Sample Variance of Scab Infection
6.7.8 A Poisson/Gamma Mixing Model for Overdispersed Poppy Counts



6.1 Introduction
Box 6.1 Generalized Linear Models

• Generalized linear models (GLMs) are statistical models that combine


elements of linear and nonlinear models.

• GLMs apply if responses are distributed independently in the exponential


family of distributions, a large family that contains distributions such as the
Bernoulli, Binomial, Poisson, Gamma, Beta and Gaussian distributions.

• Each GLM has three components: the link function, the linear predictor,
and the random component.

• The Gaussian linear regression and analysis of variance models are special
cases of generalized linear models.

In the preceding chapters we explored statistical models where the response variable is
continuous. Although these models cover a wide range of situations they do not suffice for
many data in the plant and soil sciences. For example, the response may not be a continuous
variable, but a count or a frequency. The distribution of the errors may have a mean of zero,
but may be far from a Gaussian distribution. These data/model breakdowns can be addressed
by relying on asymptotic or approximate results, by transforming the data, or by using models
specifically designed for the particular response distribution. Poisson-distributed counts, for
example, can be approximated by Gaussian random variables if the average count is
sufficiently large. In binomial experiments consisting of independent and identical binary ran-
dom variables the Gaussian approximation can be invoked provided the product of sample
size and the smaller of success or failure probability is sufficiently large (≥ 5). When such
approximations allow discrete responses to be treated as Gaussian, the temptation to invoke
standard analysis of variance or regression analysis is understandable. The analyst must keep
in mind, however, that other assumptions may still be violated. Since for Poisson random
variables the mean equals the variance, treatments where counts are large on average will also
have large variability compared to treatments where counts are small on average. The homo-
scedasticity assumption in an experiment with count responses is likely to be violated even if
a Gaussian approximation to the response distribution holds. When Gaussian approximations
fail, transformations of the data can achieve greater symmetry, remove variance heteroge-
neity, and create a scale on which effects can be modeled additively (Table 6.1). Transform-
ing the data is not without problems, however. The transformation that establishes symmetry
may not be the one that homogenizes the variances. Results of statistical analyses are to be
interpreted on the transformed scale, which may not be the most meaningful. The square root
of weed counts or the arcsine of the proportion of infected plants is not a natural metric for
interpretation.
If the probability distribution of the response is known, one should not attempt to force
the statistical analysis in a Gaussian framework if tools are available specifically designed for
that distribution. Generalized Linear Models (GLMs) extend linear statistical modeling to re-





sponse distributions that belong to a broad family of distributions, known as the exponential
family. It contains the Bernoulli, Binomial, Poisson, Negative Binomial, Gamma, Gaussian,
Beta, Weibull, and other distributions. GLM theory is based on work by Nelder and Wedder-
burn (1972) and Wedderburn (1974). It was subsequently popularized in the monograph by
McCullagh and Nelder (1989). GLMs combine elements from linear and nonlinear models
and we caution the reader on the outset not to confuse the acronym GLM with the glm
procedure of The SAS® System. The glm procedure fits linear models and conducts inference
assuming Gaussian errors, a very special case of a generalized linear model. The SAS®
acronym stands for General Linear Model, the generality being that it can fit regression,
analysis of variance, and analysis of covariance models by unweighted or weighted least
squares, not that it can fit generalized linear models. Some of the procedures in The SAS®
System that can fit generalized linear models are proc genmod, proc logistic, proc probit,
proc nlmixed, and proc catmod (see §6.2.3).

Table 6.1. Some common transformations for non-Gaussian data to achieve symmetry and/or stabilize variance

   Variable y       Transformation
   Continuous       √y     ln{y}     1/y
   Count (> 0)      √y     √(y + 0.375)     ln{y}
   Count (≥ 0)      √y     √(y + 0.375)     ln{y + c}†
   Proportion       arcsin(√y) = sin⁻¹(√y)

† Adding a small value c to count variables that can take on the value 0 enables the logarithmic transformation if y = 0.
The arcsine transformation is useful for binary proportions. If y is a percentage, use sin⁻¹(√(y/100)). The transformation √(y + 0.375) was found to be superior to √y in applications where the average count is small (< 3). For binomial data where n < 50 and the observed proportions are not all between 0.3 and 0.7 it is recommended to apply sin⁻¹(√((y* + 0.375)/(n + 0.75))), where y* is the binomial count (not the proportion). If n ≥ 50 or all observed proportions lie between 0.3 and 0.7, use sin⁻¹(√(y*/n)).
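For illustration, the count and proportion transformations of Table 6.1 might be computed in a data step such as the following sketch (the input data set counts, with count y and binomial denominator n, is hypothetical; arsin() is the arcsine function):

data transformed;
   set counts;                    /* hypothetical input data set */
   sqrty  = sqrt(y + 0.375);      /* recommended for small average counts */
   logy   = log(y + 1);           /* c = 1 permits y = 0 */
   arcsnp = arsin(sqrt((y + 0.375)/(n + 0.75)));  /* binomial count y of n */
run;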

Linear models are not well-suited for modeling the effects of experimental factors and
covariates on discrete outcomes for a number of reasons. The following example highlights
some of the problems encountered when a linear model is applied to a binary response.

Example 6.1. Groundwater pesticide contamination is measured in randomly selected


wells under various cropping systems to determine whether wells are above or below
acceptable contamination levels. Because of the difficulties in accurately measuring the
concentration in parts per million for very low contamination levels, it is only recorded
if the concentration at a given well exceeds the acceptable level. An observation from well i thus can take on only two possible values. For purposes of the analysis it is convenient to code the variable as

     Y_i = 1  if level exceeded (well contaminated)
     Y_i = 0  if level not exceeded (well not contaminated).



Y_i is a Bernoulli random variable with probability mass function

     p(y_i) = π_i^(y_i) (1 − π_i)^(1−y_i),

where π_i denotes the probability that the ith well is contaminated above threshold level. The expected value of Y_i is easily found as

     E[Y_i] = Σ y_i p(y_i) = 1 × Pr(Y_i = 1) + 0 × Pr(Y_i = 0) = Pr(Y_i = 1) = π_i.

If we wish to determine whether the probability of well contamination depends on the amount of pesticide applied (X), a traditional linear model

     Y_i = β0 + β1 x_i + e_i

is deficient in a number of ways. The mean response is a probability in the interval [0, 1] but no such restriction is placed on the mean function β0 + β1 x_i. Predictions from this model can fall outside the permissible range (Figure 6.1). Also, limits are typically not approached in a linear fashion, but as asymptotes. A reasonable model for the probability of contamination might be sigmoidal in shape. Figure 6.1 shows simulated data for this example highlighting that the response can take on only two values, 0 and 1. The sigmoidal trend stems from a generalized linear model where it is assumed that probabilities approach 0.0 at the same rate with which they approach 1.0.

[Figure 6.1: simulated binary responses (0/1) against x ∈ [0, 2], with a fitted straight line and a sigmoidal curve.]

Figure 6.1. Simulated data for well contamination example. Straight line is obtained by fitting the linear regression model Y_i = β0 + β1 x_i, the sigmoidal line from a logistic regression model, a generalized linear model.
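The sigmoidal curve in Figure 6.1 is of the logistic form; under the logit link taken up later in this chapter the contamination probability is modeled as

     π_i = Pr(Y_i = 1) = exp{β0 + β1 x_i}/(1 + exp{β0 + β1 x_i})  ⟺  ln{π_i/(1 − π_i)} = β0 + β1 x_i,

which confines π_i to (0, 1) for any value of the linear predictor.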

Generalized linear models inherit from linear models a linear combination of covariates and parameters, x_i′β, termed the linear predictor. This additive systematic part of the model is not an expression for the mean response E[Y_i], but for a transformation of E[Y_i]. This transformation, g(E[Y_i]), is called the link function of the generalized linear model. It maps the average response onto a scale where covariate effects are additive and it ensures range restrictions.
Every distribution in the exponential family suggests a particular link function known as
the canonical link function, but the user is at liberty to pair any suitable link function with any
distribution in the exponential family. The three components of a GLM, linear predictor, link
function, and random component are discussed in §6.2. This section also introduces various
procedures of The SAS® System that are capable of fitting generalized linear models to data.
How to estimate the parameters of a generalized linear model and how to perform statistical
inference is the focus in §6.4. Because of the importance of multinomial, in particular ordinal,
responses in agronomy, a separate section is devoted to cumulative link models for ordered
outcomes (§6.5).
With few exceptions, distributions in the exponential family have functionally related moments. For example, the mean and variance of a Binomial random variable are E[Y] = nπ and Var[Y] = nπ(1 − π) = E[Y](1 − π), where n is the binomial sample size and π is the success probability. If Y has a Poisson distribution then E[Y] = Var[Y]. Means and variances cannot be determined independently as is the case for Gaussian data. The modeler of discrete data often encounters situations where the data appear more dispersed than is permissible for a particular distribution. This overdispersion problem is addressed in §6.6.
Applications of generalized linear models to problems that arise in the plant and soil
sciences follow in §6.7. Mathematical details can be found in Appendix A on the CD-ROM
as §A6.8.
The models covered in this chapter extend the previously discussed statistical models to
non-Gaussian distributions. We assume, however, that data are uncorrelated. The case of
correlated, non-Gaussian data in the clustered data setting is discussed in §8 and in the spatial
setting in §9.

6.2 Components of a Generalized Linear Model


Box 6.2 Components of a GLM

• A generalized linear model consists of a random component, a systematic component, and a link function.

• The random component is the distribution of the response chosen from the exponential family.

• The systematic component is a linear predictor function x′β that relates a transformation of the mean response to the covariates or effects in x.

• The link function is a transformation of the mean response so that covariate effects are additive and range restrictions are ensured. Every distribution in the exponential family has a natural link function.




6.2.1 Random Component

The random component of a generalized linear model consists of independent observations Y_1, …, Y_n from a distribution that belongs to the exponential family (see §A6.8.1 for details). A probability mass or density function is a member of the exponential family if it can be written as

     f(y) = exp{(yθ − b(θ))/φ + c(y, φ)}     [6.1]

for some functions b(·) and c(·). The parameter θ is called the natural parameter and φ is a dispersion (scale) parameter. Some important members of the exponential family are shown in Table 6.2.

Table 6.2. Important distributions in the exponential family (the Bernoulli, Binomial, Negative Binomial, and Poisson are discrete distributions)

Distribution                   b(θ)              E[Y] = b′(θ) = μ               h(μ)           φ      θ(μ)
Bernoulli, B(π)                ln{1 + e^θ}       π = e^θ/(1 + e^θ)              π(1 − π)       1      ln{π/(1 − π)}
Binomial, B(n, π)              n ln{1 + e^θ}     nπ = n e^θ/(1 + e^θ)           nπ(1 − π)      1      ln{π/(1 − π)}
Negative Binomial, NB(k, π)    −k ln{1 − e^θ}    k(1 − π)/π = k e^θ/(1 − e^θ)   k(1 − π)/π²    1      ln{1 − π}
Poisson, P(λ)                  e^θ               λ = exp{θ}                     λ              1      ln{λ}
Gaussian, G(μ, σ²)             θ²/2              μ = θ                          1              σ²     μ
Gamma, Gamma(μ, α)             −ln{−θ}           μ = −1/θ                       μ²             α⁻¹    −μ⁻¹
Exponential, E(μ)              −ln{−θ}           μ = −1/θ                       μ²             1      −μ⁻¹
Inverse Gaussian, IG(μ, σ²)    −(−2θ)^0.5        μ = (−2θ)^(−0.5)               μ³             σ²     −1/(2μ²)

The function b(θ) is important because it relates the natural parameter to the mean and variance of Y. We have E[Y] = μ = b′(θ) and Var[Y] = b″(θ)φ, where b′(θ) and b″(θ) are the first and second derivatives of b(θ), respectively. The second derivative b″(θ) is also termed the variance function of the distribution. When the variance function is expressed in terms of the mean μ, instead of the natural parameter θ, it is denoted as h(μ). Hence, Var[Y] = h(μ)φ (Table 6.2). The variance function h(μ) depends on the mean for all distributions in Table 6.2, except the Gaussian. An estimate π̂ of the success probability π for Bernoulli data thus lends itself directly to a moment estimator of the variance, π̂(1 − π̂). For Gaussian data an estimate of the mean does not provide any information about the variability in the data.
The natural parameter θ can be expressed as a function of the mean μ = E[Y] by inverting the relationship b′(θ) = μ. For example, in the Bernoulli case,

     E[Y] = e^θ/(1 + e^θ) = μ  ⟺  θ = ln{μ/(1 − μ)}.

Denoted θ(μ) in Table 6.2, this function is called the natural or canonical link function. It is frequently the link function of choice, but it is not a requirement to retain the canonical link (§6.2.2). We now discuss the important distributions shown in Table 6.2 in more detail.
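To see how the entries of Table 6.2 arise, write the Poisson(λ) mass function in form [6.1]:

     p(y) = (λ^y/y!) e^(−λ) = exp{y ln{λ} − λ − ln{y!}},

so that θ = ln{λ}, b(θ) = e^θ, φ = 1, and c(y, φ) = −ln{y!}. Then b′(θ) = e^θ = λ = μ and b″(θ) = λ, reproducing E[Y] = Var[Y] = λ.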
The Bernoulli distribution is a discrete distribution for binary (success/failure) outcomes. If the two possible outcomes are coded

     Y = 1  if the outcome is a success
     Y = 0  if the outcome is a failure,

the probability mass function (pmf) is

     p(y) = π^y (1 − π)^(1−y),  y ∈ {0, 1},     [6.2]

where π is the success probability π = Pr(Y = 1). A binomial experiment consists of a series of n independent and identical Bernoulli trials. For example, germinating 100 seeds constitutes a binomial experiment if the seeds germinate independently of each other and are a random sample from a seed lot. Within a binomial experiment, several random variables can be defined. The total number of successes is a Binomial random variable with probability mass function

     p(y) = C(n, y) π^y (1 − π)^(n−y) = [n!/(y!(n − y)!)] π^y (1 − π)^(n−y),  y = 0, 1, …, n.     [6.3]

A Binomial (B(n, π)) random variable can thus be thought of as the sum of n independent Bernoulli (B(π)) random variables.

Example 6.2. Seeds are stored at four temperature regimes (T1 to T4) and under addition of chemicals at four different concentrations (0, 0.1, 1.0, 10). To study the effects of temperature and chemical concentration a completely randomized experiment is conducted with a 4 × 4 factorial treatment structure and four replications. For each of the 64 experimental sets, 50 seeds were placed on a dish and the number of seeds that germinated under standard conditions was recorded. The data, taken from Mead, Curnow, and Hasted (1993, p. 325), are shown in Table 6.3.

Let Y_ijk denote the number of seeds germinating for the kth replicate of temperature i and chemical concentration j. For example, y_121 = 13 and y_122 = 12 are the realized values for the first and second replication of temperature T1 and concentration 0.1. These are realizations of B(50, π_12) random variables if the seeds germinated independently in a dish and there are no differences that affect germination between the dishes. Alternatively, one can think of each seed for this treatment combination as a Bernoulli B(π_12) random variable. Let

     X_ijkl = 1  if seed l in dish k for temperature i and concentration j germinates
     X_ijkl = 0  otherwise;

the counts shown in Table 6.3 are then Σ_{l=1}^{50} x_ijkl = y_ijk.

Table 6.3. Germination data from Mead, Curnow, and Hasted (1993, p. 325)†
(Values represent counts out of 50 seeds for four replicates)

                                 Chemical Concentration
Temperature      0 (j = 1)         0.1 (j = 2)       1.0 (j = 3)       10 (j = 4)
T1 (i = 1)       9, 9, 3, 7        13, 12, 14, 15    21, 23, 24, 27    40, 32, 43, 34
T2 (i = 2)       19, 30, 21, 29    33, 32, 30, 26    43, 40, 37, 41    48, 48, 49, 48
T3 (i = 3)       7, 7, 2, 5        1, 2, 4, 4        8, 10, 6, 7       3, 4, 8, 5
T4 (i = 4)       4, 9, 3, 7        13, 6, 15, 7      16, 13, 18, 19    13, 18, 11, 16

† Used with permission.

As for an experiment with continuous response where interest lies in comparing treatment means, we may be interested in similar comparisons of the form

     π_ij = π_i′j,
     π_1j = π_2j = π_3j = π_4j,
     π·1 = π·2 = π·3 = π·4,

corresponding to pairwise differences, a slice at concentration j, and the marginal concentration effects. The analysis of these data must recognize, however, that these means are probabilities.
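A generalized linear model for these germination counts could be fit along the following lines (a sketch only; the data set seeds and the variables temp, conc, y, and n = 50 are assumed, and the events/trials syntax of proc genmod is taken up in §6.2.3):

proc genmod data=seeds;
   class temp conc;
   model y/n = temp conc temp*conc / dist=binomial link=logit;
run;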

The number of trials in a binomial experiment until the kth success occurs follows the Negative Binomial law. One of the many ways in which the probability mass function of a Negative Binomial random variable can be written is

     p(y) = C(k + y − 1, y) (1 − π)^y π^k,  y = 0, 1, ….     [6.4]

See Johnson et al. (1992, Ch. 5) and §A6.8.1 for other parameterizations. A special case of the Negative Binomial distribution is the Geometric distribution, for which k = 1. Notice that the support of the Negative Binomial distribution has no upper bound and the distribution can thus be used to model counts without natural denominator, i.e., counts that cannot be converted to proportions. Examples are the number of weeds per m², the number of aflatoxin-contaminated peanuts per m³, and the number of earthworms per ft³. Traditionally, the Poisson distribution is more frequently applied to model such counts than the Negative Binomial distribution. The probability mass function of a Poisson(λ) random variable is

     p(y) = (λ^y/y!) e^(−λ),  y = 0, 1, ….     [6.5]

A special feature of the Poisson random variable is the identity of mean and variance, E[Y] = Var[Y] = λ. Many count data suggest variation that exceeds the mean count. The Negative Binomial distribution has the same support (y = 0, 1, …) as the Poisson distribution but allows greater variability. It is a good alternative model for count data that exhibit excess variation compared to the Poisson model. This connection between Poisson and Negative Binomial distributions can be made more precise if one considers the parameter λ a Gamma-distributed random variable (see §A6.8.6 for details and §6.7.8 for an application).
The family of Gamma distributions encompasses continuous, non-negative, right-skewed probability densities (Figure 6.2). The Gamma distribution has two non-negative parameters, α and β, and density function

     f(y) = [1/(Γ(α)β^α)] y^(α−1) exp{−y/β},  y ≥ 0,     [6.6]

where Γ(α) = ∫₀^∞ t^(α−1) e^(−t) dt is known as the gamma function. The mean and variance of a Gamma random variable are E[Y] = αβ = μ and Var[Y] = αβ² = μ²/α. The density function can be rewritten in terms of the mean μ and the scale parameter α as

     f(y) = [1/(Γ(α)y)] (yα/μ)^α exp{−(α/μ)y},  y ≥ 0,     [6.7]

from which the exponential family terms in Table 6.2 were derived. We refer to the parameterization [6.7] when we denote a Gamma(μ, α) random variable. The Exponential distribution, for which α = 1, and the Chi-squared distribution with ν degrees of freedom, for which α = ν/2 and β = 2, are special cases of Gamma distributions (Figure 6.2).

[Figure 6.2: Gamma densities f(y) for α = 0.5, 1.0, 2.0, 3.0, 4.0, each with μ = 1.]

Figure 6.2. Gamma distributions in parameterization [6.7] with μ = 1. As α → ∞, the Gamma distribution tends to a Gaussian.




Because of its skewness, Gamma distributions are useful to model continuous, non-negative, right-skewed outcomes with heterogeneous variances and play an important role in analyzing time-to-event data. If on average α events occur independently in μ/α time units, the time that elapses until the αth event occurs is a Gamma(μ/α, α) random variable.
     The variance of a Gamma random variable is proportional to the square of its mean. Hence, the coefficient of variation of a Gamma-distributed random variable remains constant as its mean changes. This suggests a Gamma model for data in which the standard deviation of the outcome increases linearly with the mean; this can be assessed in experiments with replication by calculating standard deviations across replicates.
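In the parameterization [6.7] the constancy of the coefficient of variation is immediate:

     CV[Y] = √Var[Y]/E[Y] = √(μ²/α)/μ = 1/√α,

free of μ.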

Example 6.3. McCullagh and Nelder (1989, pp. 317-320) discuss an experiment in which various seed densities of barley and the weed Sinapis alba were grown in competition with three replications (blocks). We focus here on a subset of their data, the monoculture barley dry weight yields (Table 6.4).

Table 6.4. Monoculture barley yields and seeding densities in three blocks
(Experimental units were individual pots)

                         Dry weights
Seeds sown    Block 1    Block 2    Block 3       ȳ        s       CV
      3         2.07       5.32       3.14      3.51     1.65    47.19
      5        10.57      13.59      14.69     12.95     2.13    16.47
      7        20.87       9.97       5.45     12.09     7.93    65.53
     10         6.59      21.40      23.12     17.04     9.09    53.34
     15         8.08      11.07       8.28      9.14     1.67    18.28
     23        16.70       6.66      19.48     14.28     6.74    47.22
     34        21.22      14.25      38.11     24.53    12.27    50.02
     51        26.57      39.37      25.53     30.49     7.71    25.28
     77        23.71      21.44      19.72     21.62     2.00     9.26
    115        20.46      30.92      41.02     30.80    10.28    33.37

Reproduced from McCullagh and Nelder (1989, pp. 317-320) with permission.

[Figure 6.3: sample standard deviation against sample mean dry weight.]

Figure 6.3. Sample standard deviation as a function of sample mean in barley yield monoculture.




Sample standard deviations calculated from only three observations are not reliable. When plotting s against ȳ, however, a trend between standard deviation and sample mean is obvious and a linear trend √Var[Y] = γE[Y] does not seem unreasonable.

In §5.8.7 it was stated that the variability of the yield per plant Y generally increases with Y in yield-density studies and competition experiments. Mead (1970), in a study of the Bleasdale-Nelder yield-density model (Bleasdale and Nelder 1960), concludes that it is reasonable for yield data to assume that Var[Y] ∝ E[Y]², precisely the mean-variance relationship implied by the Gamma distribution. Many research workers would choose a logarithmically transformed model

     E[ln{Y}] = ln{f(x, β)},

where x denotes plant density, assuming that Var[ln{Y}] is constant and ln{Y} is Gaussian. As an alternative, one could model the relationship

     E[Y] = f(x, β)

assuming that Y is a Gamma random variable. An application of the Gamma distribution to yield-density models is presented in §6.7.3.
Right-skewed distributions should also be expected when the outcome of interest is a dispersion parameter. For example, when modeling variances it is hardly appropriate to assume that these are Gaussian-distributed. The following device can be invoked instead. If Y_1, …, Y_n are a random sample from a Gaussian distribution with variance σ² and

     S² = [1/(n − 1)] Σ_{i=1}^n (Y_i − Ȳ)²

denotes the sample variance, then

     (n − 1)S²/σ²

follows a Chi-squared distribution with n − 1 degrees of freedom. Consequently, S² follows a Gamma distribution with parameters α = (n − 1)/2, β = 2σ²/(n − 1). In the parameterization [6.7], one obtains

     S² ~ Gamma(μ = σ², α = (n − 1)/2).

When modeling sample variances, one should not resort to a Gaussian distribution, but instead draw on a properly scaled Gamma distribution.
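The moments can be checked against the Gamma(μ, α) parameterization: with μ = σ² and α = (n − 1)/2,

     E[S²] = μ = σ²,   Var[S²] = μ²/α = 2σ⁴/(n − 1),

which is exactly the variance of σ²χ²_{n−1}/(n − 1).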

Example 6.4. Hart and Schabenberger (1998) studied the variability of the mycotoxin deoxynivalenol (DON) on truckloads of wheat kernels. DON is a toxic secondary metabolite produced by the fungus Gibberella zeae during the infection process. Data were gathered in 1996 by selecting at random ten trucks arriving at mill elevators. For each truck ten double-tubed probes were inserted at random from the top into the truckload and the kernels trapped in the probe were extracted, milled, and submitted to enzyme-linked immunosorbent assay (ELISA). Figure 6.4 shows the truck probe-to-probe sample variance as a function of the sample mean toxin concentration per truck.

[Figure 6.4: probe-to-probe variance (ppm²) against DON concentration (ppm).]

Figure 6.4. Probe-to-probe variances (ppm²) as a function of average truck toxin concentration. Data kindly provided by Dr. L. Patrick Hart, Department of Crop and Soil Sciences, Michigan State University. Used with permission.

To model the probe-to-probe variability, we choose

     S² ~ Gamma(μ = g(x_i′β), α = (10 − 1)/2),

where g(·) is a properly chosen link function. These data are analyzed in §6.7.7.

The Log-Gaussian distribution is also right-skewed and has variance proportional to the squared expectation. It is popular in modeling bioassay data or growth data where a logarithmic transformation establishes Gaussianity. If ln{Y} is Gaussian with mean μ and variance σ², then Y is a Log-Gaussian random variable with mean E[Y] = exp{μ + σ²/2} and variance Var[Y] = exp{2μ + σ²}(exp{σ²} − 1). It is also a reasonable model for right-skewed data that can be transformed to symmetry by taking logarithms. Unfortunately, it is not a member of the exponential family and does not permit a generalized linear model. In the GLM framework, the Gamma distributions provide an excellent alternative in our opinion. Amemiya (1973) and Firth (1988) discuss testing Log-Gaussian vs. Gamma distributions and vice versa.
The Inverse Gaussian distribution is also skew-symmetric with a long right tail and is a
member of the exponential family. It is not really related to the Gaussian distribution, although
there are some parallel developments for the two families. For example, the independence of
sample mean and sample variance that holds for the Gaussian also holds for the Inverse
Gaussian distribution. The name stems from the inverse relationship between the cumulant
generating functions of the two distributions. Also, the formulas of the Gaussian and Inverse
Gaussian probability density functions bear some resemblance to each other. Folks and
Chhikara (1978) suggested naming it the Tweedie distribution in recognition of fundamental
work on this distribution by Tweedie (1945, 1957a, 1957b). A random variable $Y$ is said to
have an Inverse Gaussian distribution if its probability density function (Figure 6.5) is given
by

    $f(y) = \frac{1}{\sqrt{2\pi y^3 \sigma^2}}\exp\left\{-\frac{(y-\mu)^2}{2\sigma^2\mu^2 y}\right\}, \quad y > 0.$  [6.8]

Figure 6.5. Inverse Gaussian densities $f(y)$ with $\sigma^2 = 1$ and $\mu = 1.0, 2.0, 3.0, 5.0$.

The distribution finds application in stochastic processes as the distribution of the first
passage time in a Brownian motion, and in the analysis of lifetime and reliability data. The
relationship with a passage time in Brownian motion suggests its use as the time a tracer remains
in an organism. Folks and Chhikara (1978) pointed out that the skewness of the Inverse
Gaussian distribution makes it an attractive candidate for right-skewed, non-negative, continuous
outcomes whether or not the particular application relates to passage time in a
stochastic process. The mean and variance of the Inverse Gaussian are given by $\mathrm{E}[Y] = \mu$
and $\mathrm{Var}[Y] = \mu^3\sigma^2$. Notice that $\sigma^2$ is not an independent scale parameter. The variability of
$Y$ is determined by $\mu$ and $\sigma^2$ jointly, whereas for a Gaussian distribution $\mathrm{Var}[Y] = \sigma^2$. The
variance of the Inverse Gaussian increases more sharply in the mean than that of the Gamma
distribution (for $\sigma^2 = \alpha^{-1}$).

6.2.2 Systematic Component and Link Function

The systematic component of a generalized linear model is a linear combination of covariates
(or design effects) and fixed effects parameters. As in linear models we denote the systematic
component for the $i$th observation as

    $\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} = \mathbf{x}_i'\boldsymbol{\beta}$

and term it the linear predictor $\eta_i$. The linear predictor is chosen in generalized linear
models in much the same way as the mean function is built in classical linear regression or
classification models. For example, if the binomial counts or proportions are analyzed in the
completely randomized design of Example 6.2 (p. 306), the linear predictor contains an intercept,
temperature and concentration effects, and their interactions. In contrast to the classical
linear model, where $\eta_i$ is the mean of an observation, the linear predictor in a generalized
linear model is set equal to a transformation of the mean,

    $g(\mathrm{E}[Y_i]) = \eta_i = \mathbf{x}_i'\boldsymbol{\beta}.$  [6.9]

This transformation $g(\cdot)$ is called the link function and serves several purposes. It is a
transformation of the mean $\mu_i$ onto a scale where the covariate effects are additive. In the
terminology of §5.6.1, the link function is a linearizing transform and the generalized linear
model is intrinsically linear. If one studies a model with mean function

    $\mu_i = \exp\{\beta_0 + \beta_1 x_i\} = \exp\{\beta_0\}\exp\{\beta_1 x_i\}$,

then $\ln\{\mu_i\}$ is a linearizing transformation of the nonlinear mean function. A second purpose
of the link function is to confine predictions under the model to a suitable range. If $Y_i$ is a
Bernoulli outcome, then $\mathrm{E}[Y_i] = \pi_i$ is a success probability which must lie between $0$ and $1$.
Since no restrictions are placed on the parameters in the linear predictor $\mathbf{x}_i'\boldsymbol{\beta}$, the linear
predictor can range from $-\infty$ to $+\infty$. To ensure that the predictions are in the proper range,
one chooses a link function that maps from $(0,1)$ to $(-\infty,\infty)$. One such possibility is the
logit transformation

    $\mathrm{logit}(\pi) = \ln\left\{\frac{\pi}{1-\pi}\right\}$,  [6.10]

which is also the canonical link for the Bernoulli and Binomial distributions (see Table 6.2).
Models with logit link are termed logistic models and can be expressed as

    $\ln\left\{\frac{\pi}{1-\pi}\right\} = \mathbf{x}'\boldsymbol{\beta}.$  [6.11]

Link functions must be monotonic and invertible. If $g(\mu_i) = \eta_i$, inversion leads to
$\mu_i = g^{-1}(\eta_i)$, and the function $g^{-1}(\cdot)$ is called the inverse link function. In the case of the
logistic model the inverted relationship between link and linear predictor is

    $\pi = \frac{\exp\{\mathbf{x}'\boldsymbol{\beta}\}}{1+\exp\{\mathbf{x}'\boldsymbol{\beta}\}} = \frac{1}{1+\exp\{-\mathbf{x}'\boldsymbol{\beta}\}}.$  [6.12]

Once parameter estimates in the generalized linear model have been obtained (§6.4), the
mean of the outcome at any value of $\mathbf{x}$ is predicted as

    $\hat{\mu} = g^{-1}\left(\mathbf{x}'\hat{\boldsymbol{\beta}}\right)$.

Several link functions can properly restrict the expectation but provide different scales on
which the covariate effects are additive. The representation of the probability (mass) density
functions in Table 6.2 suggests canonical link functions, shown there as $\theta(\mu)$. The canonical
link for Binomial data is thus the logit link, for Poisson counts the log link, and for Gaussian
data the identity link (no transformation). Although relying on the canonical link leads to
some simplifications in parameter estimation, these are not of concern to the user of
generalized linear models in practice. Functions other than the canonical link may be of
interest. We now review popular link functions for binary data and proportions, counts, and
continuous variables.

Link Functions for Binary Data and Binomial Proportions

The expected values of binary outcomes and binomial proportions are probabilities confined
to the $[0,1]$ interval. Since the linear predictor $\eta_i = \mathbf{x}_i'\boldsymbol{\beta}$ can range from $-\infty$ to $\infty$, the link
function must be a mapping $(0,1) \mapsto (-\infty,\infty)$. Similarly, the inverse link function is a mapping
$(-\infty,\infty) \mapsto (0,1)$. A convenient way of deriving such link functions is as follows. Let
$F(y) = \Pr(Y \le y)$ denote a cumulative distribution function (cdf) of a random variable $Y$
that ranges over the entire real line (from $-\infty$ to $\infty$). Since $0 \le F(y) \le 1$, the cdf could be
used as an inverse link function. Unfortunately, the inverse cdf (or quantile function) $F^{-1}(\cdot)$
does not exist in closed form for many continuous random variables, although numerically
accurate methods exist to calculate $F^{-1}(\cdot)$.

The most popular generalized linear model for Bernoulli and Binomial data is probably
the logistic regression model. It applies the logit link, which is the inverse of the cumulative
distribution function of a Logistic random variable. If $Y$ is a Logistic random variable with
parameters $\mu$ and $\alpha$, then

    $F(y) = \pi = \frac{\exp\{(y-\mu)/\alpha\}}{1+\exp\{(y-\mu)/\alpha\}}, \quad -\infty < y < \infty,$  [6.13]

with mean $\mu$ and variance $(\alpha\pi)^2/3$. If $\mu = 0$, $\alpha = 1$, the distribution is called the Standard
Logistic with cdf $\pi = \exp\{y\}/(1+\exp\{y\})$. Inverting the standard logistic cdf yields the
logit function

    $F^{-1}(\pi) = \ln\left\{\frac{\pi}{1-\pi}\right\}$.

In terms of a generalized linear model for $Y_i$ with $\mathrm{E}[Y_i] = \pi_i$ and linear predictor $\eta_i = \mathbf{x}_i'\boldsymbol{\beta}$ we
obtain

    $F^{-1}(\pi_i) = \mathrm{logit}(\pi_i) = \ln\left\{\frac{\pi_i}{1-\pi_i}\right\} = \mathbf{x}_i'\boldsymbol{\beta}.$  [6.14]

In the logistic model the parameters have a simple interpretation in terms of log odds
ratios. Consider a two-group comparison of successes and failures. Define a dummy variable
as

    $x_{ij} = \begin{cases} 1 & \text{if the } i\text{th observation is in the treated group } (j=1) \\ 0 & \text{if the } i\text{th observation is in the control group } (j=0), \end{cases}$

and the response as

    $Y_{ij} = \begin{cases} 1 & \text{if the } i\text{th observation in group } j \text{ results in a success} \\ 0 & \text{if the } i\text{th observation in group } j \text{ results in a failure.} \end{cases}$

Because $x_{ij}$ is a dummy variable, the logistic model

    $\mathrm{logit}(\pi_j) = \beta_0 + \beta_1 x_{ij}$

reduces to

    $\mathrm{logit}(\pi_j) = \begin{cases} \beta_0 + \beta_1 & j = 1 \text{ (treated group)} \\ \beta_0 & j = 0 \text{ (control group).} \end{cases}$

The gradient $\beta_1$ measures the change in the logit between the control and the treated group. In
terms of the success and failure probabilities one can construct a $2 \times 2$ table.

Table 6.5. Success and failure probabilities in two-group logistic model

              Control Group                                        Treated Group

  Success     $\pi_0 = \frac{1}{1+\exp\{-\beta_0\}}$               $\pi_1 = \frac{1}{1+\exp\{-\beta_0-\beta_1\}}$

  Failure     $1-\pi_0 = \frac{\exp\{-\beta_0\}}{1+\exp\{-\beta_0\}}$    $1-\pi_1 = \frac{\exp\{-\beta_0-\beta_1\}}{1+\exp\{-\beta_0-\beta_1\}}$

The odds $O$ are defined as the ratio of the success and failure probabilities in a particular
group,

    $O_{\mathrm{control}} = \frac{\pi_0}{1-\pi_0} = e^{\beta_0}$
    $O_{\mathrm{treated}} = \frac{\pi_1}{1-\pi_1} = e^{\beta_0+\beta_1}$.

In the control group successes are $\exp\{\beta_0\}$ times as likely as failures, and in the treated
group they are $\exp\{\beta_0+\beta_1\}$ times as likely as failures. How much the odds have changed by
applying the treatment is expressed by the odds ratio

    $OR = \frac{O_{\mathrm{treated}}}{O_{\mathrm{control}}} = \frac{\exp\{\beta_0+\beta_1\}}{\exp\{\beta_0\}} = e^{\beta_1}$,

or the log odds ratio $\ln(OR) = \beta_1$. If the log odds ratio is zero, the success/failure ratio is the
same in both groups. Successes are then no more likely relative to failures under the treatment
than under the control. A test of $H_0{:}\,\beta_1 = 0$ thus tests for equal success/failure odds in the two
groups. From Table 6.5 it is seen that this implies equal success probabilities in the groups.
These ideas generalize to comparisons of more than two groups.
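As an illustration (our own sketch, not from the text), the two-group model could be fit in
proc genmod as follows; the data set twogroup with a 0/1 response y and a 0/1 dummy trt is
hypothetical. The estimate statement returns $\hat{\beta}_1$, the estimated log odds ratio, and its
exp option reports $e^{\hat{\beta}_1}$ along with a confidence interval.

proc genmod data=twogroup descending;
/* the slope for trt is the log odds ratio of treated vs. control */
model y = trt / dist=binomial link=logit;
estimate 'log odds ratio' trt 1 / exp;
run;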
Why is the logistic distribution our first choice for developing a link function, and not the
omnipotent Gaussian distribution? Assume we choose the standard Gaussian cdf

    $\pi = \Phi(\eta) = \int_{-\infty}^{\eta} \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}t^2\right\}\,dt$  [6.15]

as the inverse link function. The link function is then given by $\Phi^{-1}(\pi) = \eta$, but neither
$\Phi^{-1}(\pi)$ nor $\Phi(\eta)$ exist in closed form and must be evaluated numerically. Models using the
inverse Gaussian cdf are termed probit models (please note that the inverse Gaussian function
here refers to the inverse cdf of a Gaussian random variable and is not the same as the Inverse
Gaussian random variable, which is a variable with a right-skewed density). Often, there is
little to be gained in practice from using a probit over a logistic model. The cumulative distribution
functions of the standard logistic and standard Gaussian are very similar (Figure 6.6). Both
are sigmoidal and symmetric about $\eta = 0$. The main difference is that the Gaussian tails are
less heavy than the Logistic tails and thus approach probabilities $0$ and $1$ more quickly. If the
distributions are scaled to have the same mean and variance, the cdfs agree even more closely than is
evident in Figure 6.6. Since $\mathrm{logit}(\pi)$ is less cumbersome numerically than $\Phi^{-1}(\pi)$, it is often
preferred.
preferred.


Figure 6.6. Cumulative distribution functions as inverse link functions. The corresponding
link functions are the logit for the Standard Logistic, the probit for the Standard Gaussian, the
log-log link for the Type-1 Extreme (max) value and complementary log-log link for the
Type-1 Extreme (min) value distribution.

The symmetry of the Standard Gaussian and Logistic distributions implies that $\pi = 0$ is
approached at the same rate as $\pi = 1$. If $\pi$ departs from $0$ slowly and approaches $1$ quickly, or
vice versa, the probit and logit links are not appropriate. Asymmetric link functions can be
derived from appropriate cumulative distribution functions. A Type-I Extreme value distribution
(Johnson et al. 1995, Ch. 23) is given by

    $F(y) = \exp\{-\exp\{-(y-\alpha)/\beta\}\}$  [6.16]

and has a standardized form with $\alpha = 0$, $\beta = 1$. This distribution, also known as the Gumbel
or double-exponential distribution, arises as the distribution of the largest value in a random
sample of size $n$. By putting $x = -y$ one can obtain the distribution of the smallest value.
Consider the standardized form $\pi = F(\eta) = \exp\{-\exp\{\eta\}\}$. Inverting this cdf yields the link
function

    $\ln\{-\ln\{\pi\}\} = \mathbf{x}_i'\boldsymbol{\beta}$,  [6.17]

known as the log-log link. Its complement $\ln\{-\ln\{1-\pi\}\}$, obtained by changing successes
to failures and vice versa, is known as the complementary log-log link, derived from
$F(\eta) = 1 - \exp\{-\exp\{\eta\}\}$. The complementary log-log link behaves like the logit for $\pi$ near
$0$ and has smaller values as $\pi$ increases. The log-log link behaves like the logit for $\pi$ near $1$
and yields larger values for small $\pi$ (Figure 6.7).


Figure 6.7. Link functions for Bernoulli and Binomial responses.
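A short data step (our own sketch, using only standard SAS functions) tabulates the four
link functions of Figure 6.7 on a grid of probabilities:

data links;
do p = 0.01 to 0.99 by 0.01;
   logit_  = log(p/(1-p));      /* logit link [6.10]              */
   probit_ = probit(p);         /* inverse standard Gaussian cdf  */
   loglog  = log(-log(p));      /* log-log link [6.17]            */
   cloglog = log(-log(1-p));    /* complementary log-log link     */
   output;
end;
run;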

Link Functions for Counts

Here we consider counts other than Binomial counts, i.e., those without a natural denominator
(§2.2). The support of these counts has no upper bound, in contrast to the Binomial, which represents
the number of successes out of $n$ trials. Binomial counts can be converted to proportions
by dividing the number of successes by the binomial sample size $n$. Counts without a natural
denominator cannot be expressed as proportions, for example, the number of weeds per m².
The probability models applied most frequently to such counts are the Poisson and Negative
Binomial. A suitable link function for these is an invertible, monotonic mapping
$(0,\infty) \mapsto (-\infty,\infty)$. The log link $\ln\{\mathrm{E}[Y_i]\} = \mathbf{x}_i'\boldsymbol{\beta}$ with its inverse $\mathrm{E}[Y_i] = \exp\{\mathbf{x}_i'\boldsymbol{\beta}\}$ has this
property and is canonical for both distributions (Table 6.2). The log link leads to parameters
interpreted in terms of multiplicative, rather than additive, effects. Consider the linear predictor
$\eta_i = \beta_0 + \beta_1 x_i$ and a log link function. Then $\exp(\beta_1)$ measures the relative increase in
the mean if the covariate changes by one unit:

    $\frac{\mu|(x_i+1)}{\mu|x_i} = \frac{\exp\{\beta_0 + \beta_1(x_i+1)\}}{\exp\{\beta_0 + \beta_1 x_i\}} = \exp\{\beta_1\}$.

Generalized linear models with log link are often called log-linear models. They play an
important role in regression analysis of counts and in the analysis of contingency tables.
Consider the generic layout of a two-way contingency table in Table 6.6. The count $n_{ij}$ in
row $i$, column $j$ of the table represents the number of times variable $X$ was observed at level $j$
while variable $Y$ simultaneously took on level $i$.

Table 6.6. Generic layout of a two-way contingency table
($n_{ij}$ denotes the observed count in row $i$, column $j$)

  Categorical                     Categorical variable $X$
  variable $Y$      $j=1$     $j=2$     $j=3$     $\cdots$   $j=J$     Row totals
  $i=1$             $n_{11}$  $n_{12}$  $n_{13}$  $\cdots$   $n_{1J}$  $n_{1.}$
  $i=2$             $n_{21}$  $n_{22}$  $n_{23}$  $\cdots$   $n_{2J}$  $n_{2.}$
  $i=3$             $n_{31}$  $n_{32}$  $n_{33}$  $\cdots$   $n_{3J}$  $n_{3.}$
  $\vdots$          $\vdots$  $\vdots$  $\vdots$             $\vdots$  $\vdots$
  $i=I$             $n_{I1}$  $n_{I2}$  $n_{I3}$  $\cdots$   $n_{IJ}$  $n_{I.}$
  Column totals     $n_{.1}$  $n_{.2}$  $n_{.3}$             $n_{.J}$  $n_{..}$

If the row and column variables ($Y$ and $X$) are independent, the cell counts $n_{ij}$ are
determined by the marginal row and column totals alone. Under a Poisson sampling model,
where the count in each cell is the realization of a Poisson($\lambda_{ij}$) random variable, the row and
column totals are Poisson($\lambda_{i.} = \sum_{j=1}^{J}\lambda_{ij}$) and Poisson($\lambda_{.j} = \sum_{i=1}^{I}\lambda_{ij}$) variables, respectively,
and the total sample size is a Poisson($\lambda_{..} = \sum_{i,j}\lambda_{ij}$) random variable. The expected count
$\lambda_{ij}$ under independence is then related to the marginal expected counts by

    $\lambda_{ij} = \frac{\lambda_{i.}\lambda_{.j}}{\lambda_{..}}$.

Taking logarithms leads to

    $\ln\{\lambda_{ij}\} = -\ln(\lambda_{..}) + \ln\{\lambda_{i.}\} + \ln\{\lambda_{.j}\} = \mu + \alpha_i + \beta_j$,  [6.18]

a generalized linear model with log link for Poisson-distributed random variables and a linear
predictor consisting of a grand mean $\mu$, row effects $\alpha_i$, and column effects $\beta_j$. The linear predictor
is akin to that in a two-way layout without interactions, such as a randomized block
design, which is precisely the layout of Table 6.6. We can think of $\alpha_i$ and $\beta_j$ as main effects
of the row and column variables. The Poisson sampling scheme applies if the total number of
observations ($n_{..}$) is itself a random variable, i.e., prior to data collection the total number of
observations being cross-classified is unknown. If the total sample size is known, one is
fortuitously led to the same general decomposition for the expected cell counts as in [6.18].
Conditional on $n_{..}$, the $I \times J$ counts in the table are realizations of a multinomial distribution
with cell probabilities $\pi_{ij}$ and marginal probabilities $\pi_{i.}$ and $\pi_{.j}$. The expected count in cell
$i,j$, if $X$ and $Y$ are independent, is

    $\lambda_{ij} = n_{..}\pi_{i.}\pi_{.j}$

and taking logarithms leads to

    $\ln\{\lambda_{ij}\} = \ln\{n_{..}\} + \ln\{\pi_{i.}\} + \ln\{\pi_{.j}\} = \mu + \alpha_i + \beta_j.$  [6.19]


In §6.7.6 the agreement between two raters of the same experimental material is analyzed
by comparing a series of log-linear models that structure the interaction between the ratings.
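A sketch of how the independence model [6.18] (equivalently [6.19]) would be fit with
proc genmod follows; the data set twoway with classification variables row and col and cell
counts n is hypothetical.

proc genmod data=twoway;
class row col;
/* main effects only: the independence decomposition mu + alpha_i + beta_j */
model n = row col / dist=poisson link=log;
run;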

Link Functions for Continuous Data

The identity link is historically the most common link function applied to continuous outcomes.
It is the canonical link if data are Gaussian. Hence, classical linear regression and
analysis of variance models are special cases of generalized linear models where the random
component is Gaussian and the link is the identity function, $\mu = \mathbf{x}'\boldsymbol{\beta}$. Table 6.2 shows that the
identity link is not the canonical link in other cases. For Gamma-distributed data the canonical
link is the reciprocal link $\eta = 1/\mu$, and for Inverse Gaussian-distributed data $\eta = \mu^{-2}$. The
reciprocal link, although canonical, is not necessarily a good choice for skewed, non-negative
data. Since the linear predictor ranges over the real line, there is no guarantee that $\mu = 1/\eta$ is
positive. When using the reciprocal link for Gamma-distributed outcomes, for example, the
requirement $\eta > 0$ calls for additional restrictions on the parameters in the model.
Constrained estimation may need to be employed to ensure that these restrictions hold. It is
simpler to resort to link functions that map $(0,\infty)$ onto $(-\infty,\infty)$, such as the log link. For
Gamma and Inverse Gaussian-distributed outcomes, the log link is thus a frequent choice.
One should bear in mind, however, that the log link implies multiplicative covariate effects,
whereas the reciprocal link implies additive effects on the inverse scale. McCullagh and
Nelder (1989, p. 293) provide arguments that if the variability in the data is small, it is difficult to
distinguish between Gaussian models on the logarithmic scale and Gamma models with
multiplicative covariate effects, lending additional support for modeling right-skewed, non-negative
continuous responses as Gamma variables with logarithmic link function. Whether
such data are modeled with a reciprocal or logarithmic link also depends on whether the rate
of change or the log rate of change is the more meaningful measure.

In yield-density studies it is commonly assumed that yield per plant is inversely related to
plant density (see §5.8.7, §5.9.1). If $x$ denotes plant density and a straight-line linear
relationship holds between inverse yield and density, the basic Shinozaki and Kira model
(Shinozaki and Kira 1956)

    $\mathrm{E}[Y] = (\alpha + \beta x)^{-1}$  [6.20]

applies. This is a generalized linear model with inverse link and linear predictor $\alpha + \beta x$. Its
yield per unit area equation is

    $\mathrm{E}[U] = \frac{x}{\alpha + \beta x}.$  [6.21]

As for yield per plant, we can model yield per unit area as a generalized linear model with
inverse link and linear predictor $\beta + \alpha/x$. This is a hyperbolic function of plant density and
the reciprocal link is adequate. In terms of the linear predictor $\eta$ this hyperbolic model gives
rise to

    $\eta = \mathrm{E}[U]^{-1} = \beta + \alpha/x.$  [6.22]

We note in passing that a modification of the hyperbolic function is achieved by including a
linear term in plant density, known as the inverse quadratic,


    $\eta = \beta + \alpha_1/x + \alpha_2 x$,

a model for unit area yield due to Holliday (1960).

A flexible family of link functions for positive continuous responses is given by the
family of power transformations made popular by Box and Cox (1964) (see also §5.6.2):

    $\eta = \begin{cases} \left(\mu^{\lambda}-1\right)/\lambda & \lambda \ne 0 \\ \ln\{\mu\} & \lambda = 0. \end{cases}$  [6.23]

These transformations include as special cases the logarithmic transform, reciprocal transform,
and square root transform. If used as link functions, the inverse link functions are

    $\mu = \begin{cases} \left(\eta\lambda + 1\right)^{1/\lambda} & \lambda \ne 0 \\ \exp\{\eta\} & \lambda = 0. \end{cases}$

Figure 6.8 shows the inverse link functions for $\lambda = 0.3, 0.5, 1.0, 1.5$, and $2.0$.

Figure 6.8. Inverse Box-Cox transformations as inverse links ($\lambda = 0.3, 0.5, 1.0, 1.5, 2.0$).
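In proc genmod, power links of this type are available through the link=power() option;
the sketch below is hypothetical (data set and variable names are ours). Note that genmod's
power link is $\eta = \mu^{\lambda}$ rather than the shifted form [6.23]; when the model contains an
intercept the two parameterizations are equivalent for $\lambda \ne 0$, since the shift and scaling
are absorbed by the linear predictor.

proc genmod data=yourdata;
/* square-root-type power link for a Gamma response */
model y = x / dist=gamma link=power(0.5);
run;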

6.2.3 Generalized Linear Models in The SAS® System

In this subsection we do not fit generalized linear models to data yet. The general approach to
parameter estimation and statistical inference in these models must be discussed first (§6.4)
before the methodology is applied to real data. However, The SAS® System offers various
procedural alternatives for GLM analysis, and we review here how to specify the various
GLM components in some of these procedures. We prefer the powerful procedures genmod
and nlmixed for fitting generalized linear models, but other procedures such as logistic
offer functionality that cannot be found in genmod and nlmixed, for example, residual-based
logistic regression diagnostics and automated variable selection routines. For cross-classified
data the catmod procedure can be used. The log-linear models of interest to us can be fit
easily with the genmod procedure.


The LOGISTIC Procedure


One of the first procedures for generalized linear models in The SAS® System was proc
logistic. This procedure allows fitting regression type models to Bernoulli, Binomial, and
ordinal responses by maximum likelihood (§6.4.1 and §A6.8.2). The possible link functions
are the logit, probit, log-log, and complementary log-log links. For ordered responses it fits
McCullagh's proportional odds model (see §6.5). The logistic procedure offers functional-
ity for logistic regression analysis similar to the reg procedure for standard linear regression
models. It enables automated variable selection routines and regression diagnostics based on
Pearson or deviance residuals (see §6.4.3). Until Release 7.0 of The SAS® System proc
logistic did not permit a class statement; factor and treatment variables had to be coded
with a series of dummy variables. An experimental procedure in Release 7.0 (proc
tlogistic) allowed for the presence of classification variables through a class statement. In
Release 8.0, the class statement was incorporated into proc logistic and proc tlogistic
has disappeared.
Similar to the powerful genmod procedure (see below), the response variable can be
entered in two different ways in proc logistic. The single-trial syntax applies to Bernoulli
and ordinal responses. The basic model statement then is
model response = <covariates and classification effects> / options;

Binomial responses are coded in the events/trial syntax. This syntax requires two data set
variables. The number of Bernoulli trials (= the size of the binomial experiment) is coded as
variable trials and the number of successes as events. Consider the seed germination data
in Example 6.2 (Table 6.3, p. 307). The linear predictor of the full model fitted to these data
contains temperature and concentration main effects and their interactions. If the data set is
entered as shown in Table 6.3, the events/trial syntax would be used.
data germination;
input temp $ conc germnumber;
trials = 50;
datalines;
T1 0 9
T1 0 9
T1 0 3
T1 0 7
T1 0.1 13
T1 0.1 12
and so forth
;;
run;
proc logistic data=germination;
class temp conc / param=glm;
model germnumber/trials = temp conc temp*conc;
run;

The logistic procedure in Release 8.0 of The SAS® System inherits from the
experimental tlogistic procedure in Release 7.0 the ability to use different coding methods
for classification variables. The coding method is selected with the param= option of the
class statement. We prefer the coding scheme for classification variables that corresponds to
the coding method in the glm procedure. This is not the default of proc logistic and hence
we use the param=glm option in the example code above. We prefer glm-type coding because
the specification of contrast coefficients in the contrast statement of proc logistic is then
identical to the specification of contrast coefficients in proc glm (§4.3.4).


To analyze ordinal responses or Bernoulli variables, the single-trial syntax is used. The
next example shows rating data from a factorial experiment with a 4 × 2 treatment structure
arranged in a completely randomized design. The ordered response variable has three levels,
Poor, Medium, and Good. Also, a Bernoulli response variable (medresp) is created taking the
value 1 if the response was Medium, 0 otherwise.
data ratings;
input REP A B RESP $;
medresp = (resp='Medium');
datalines;
1 1 1 Medium
1 1 2 Medium
1 2 1 Medium
1 2 2 Medium
1 3 1 Good
1 3 2 Good
1 4 1 Good
1 4 2 Good
2 1 1 Poor
2 1 2 Medium
2 2 1 Poor
2 2 2 Medium
2 3 1 Good
2 3 2 Good
2 4 1 Medium
2 4 2 Good
3 1 1 Medium
3 1 2 Medium
3 2 1 Poor
3 2 2 Medium
3 3 1 Good
3 3 2 Good
3 4 1 Good
3 4 2 Good
4 1 1 Poor
4 1 2 Medium
4 2 1 Poor
4 2 2 Medium
4 3 1 Good
4 3 2 Good
4 4 1 Medium
4 4 2 Good
run;

The proportional odds model (§6.5) for ordered data is fit with the statements
proc logistic data=ratings;
class A B / param=glm;
model resp = A B A*B;
run;

By default, the values of the response categories are sorted according to their internal
format. For a character variable such as RESP, the sort order is alphabetical. This results in the
correct order here, since the alphabetical order corresponds to Good-Medium-Poor. If, for
example, the Medium category were renamed Average, the internal order of the categories
would be Average-Good-Poor. To ensure proper category arrangement in this case, one can
use the order= option of the proc logistic statement. For example, one can arrange the data
such that all responses rated Good appear first followed by the Average and the Poor
responses. Then, the correct order is established with the statements


proc logistic data=ratings order=data;
class A B / param=glm;
model resp = A B A*B;
run;

For Bernoulli responses coded 0 and 1, proc logistic will also arrange the categories
according to internal formatting. For numeric variables this is an ascending order. Consequently,
proc logistic will model the probability that the variable takes on the value 0.
Modeling the probability that the variable takes on the value 1 is usually preferred, since this
is the mean of the response. This can be achieved with the descending option of the proc
logistic statement:

proc logistic data=ratings descending;
class A B / param=glm;
model medresp = A B A*B;
run;

By default, proc logistic will use a logit link. Different link functions are selected with
the link= option of the model statement. To model the Bernoulli response medresp with a
complementary log-log link, for example, the statements are
proc logistic data=ratings descending;
class A B / param=glm;
model medresp = A B A*B / link=cloglog;
run;

The model statement provides numerous other options, for example, the selection=
option to perform automated covariate selection with backward, forward, and stepwise
methods. The ctable option produces a classification table for Bernoulli responses which
classifies observed responses depending on whether the predicted responses are above or
below some probability threshold, useful for establishing the sensitivity and specificity of a
logistic model for purposes of classification. The online manuals, help files, and documentation
available from SAS Institute discuss additional options and features of the procedure.

The GENMOD Procedure


The genmod procedure is the flagship of The SAS® System for fitting generalized linear
models. It is more general than the logistic procedure in that it allows fitting of models to
responses other than Bernoulli, Binomial, and Multinomial. The built-in distributions are the
Bernoulli, Binomial, Negative Binomial, Poisson, and Multinomial for discrete data and the
Gaussian, Inverse Gaussian, and Gamma distributions for continuous data (the Negative
Binomial distribution was added in Release 8.0). The built-in link functions are the identity,
log, logit, probit, power, and complementary log-log links. For responses with multinomial
distributions (e.g., ordinal responses) cumulative versions of the logit, probit, and comple-
mentary log-log links are available (see §6.5). Users who desire to use a link and/or distribu-
tion function not in this list can define their own link functions through the fwdlink and
invlink statements of the procedure and the distribution functions through the deviance and
variance statements.

Like proc logistic the genmod procedure accepts responses coded in single-trial or
events/trial syntax. The latter is reserved for grouped Binomial data (see §6.3 on grouped vs.
ungrouped data). The order= option of the proc genmod statement affects the ordering of
classification variables as in proc logistic, but not the ordering of the response variable. A


separate option (rorder=) of the proc genmod statement is used to determine the ordering of
the response.
The code to produce a logistic analysis in the seed germination Example 6.2 (Table 6.3,
p. 307) with proc genmod is
data germination;
input temp $ conc germnumber;
trials = 50;
datalines;
T1 0 9
T1 0 9
T1 0 3
T1 0 7
T1 0.1 13
T1 0.1 12
and so forth
;;
run;
proc genmod data=germination;
class temp conc;
model germnumber/trials = temp conc temp*conc link=logit dist=binomial;
run;

The link function and distribution are selected with the link= and dist= options of the
model statement. The next statements perform a Poisson regression with linear predictor
$\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ and a log link:
proc genmod data=yourdata;
model count = x1 x2 / link=log dist=poisson;
run;

For each distribution, proc genmod will apply a default link function if the link= option
is omitted. These are the canonical links for the distributions in Table 6.2 and the cumulative
logit for the multinomial distribution. This does not work the other way around: by specifying
a link function but not a distribution function, proc genmod does not select a distribution
for which this is the canonical link. That would be impossible, since the link function does
not identify the distribution. The Negative Binomial and the Poisson distributions both use a
log link, for example. Since the default distribution of proc genmod is the Gaussian
distribution (if the response is in single-trial syntax), statements such as
proc genmod data=ratings;
class A B;
model medresp = A B A*B / link=logit;
run;

do not fit a Bernoulli response with a logit link, but a Gaussian response with a logit link
(which is not sensible if medresp takes on only the values 0 and 1). For the analysis of a
Bernoulli random variable in proc genmod use instead
proc genmod data=ratings;
class A B;
model medresp = A B A*B / link=logit dist=binomial;
run;

If the events/trials syntax is used the distribution of the response will default to the
Binomial and the link to the logit.


The genmod procedure has a contrast statement and an estimate statement akin to the
statements of the same name in proc glm. An lsmeans statement is also available except for
ordinal responses. The statements
proc genmod data=ratings;
class A B;
model medresp = A B A*B / dist=binomial link=logit;
lsmeans A A*B / diff;
run;

perform pairwise comparisons for the levels of factor A and the A × B cell means.
Because there is only one method of coding classification variables in genmod (in contrast
to logistic, see above), and this method is identical to the one used in proc glm, coefficients
are entered in exactly the same way as in glm. Consider the ratings example above with a
4 × 2 factorial treatment structure.
proc genmod data=ratings;
class A B;
model resp = A B A*B / dist=multinomial link=cumlogit;
contrast 'A1+A2-2A3=0' A 1 1 -2 0;
run;

The contrast statement tests a hypothesis of the form $\mathbf{A}\boldsymbol{\beta} = \mathbf{0}$ based on the asymptotic
distribution of the linear combination $\mathbf{A}\hat{\boldsymbol{\beta}}$, and the estimate statement estimates the linear
combination $\mathbf{a}'\boldsymbol{\beta}$. Here, $\boldsymbol{\beta}$ are the parameters in the linear predictor. In other words, the hypothesis
is tested on the scale of the linear predictor, not the scale of the mean response. In some
instances hypotheses about the mean values can be expressed as linear functions of the $\beta$s and
the contrast or estimate statements are sufficient. Consider, for example, a logistic regression
model for Bernoulli data,

    $\mathrm{logit}(\pi_i) = \ln\left\{\frac{\pi_i}{1-\pi_i}\right\} = \beta_0 + \beta_1 x_{ij}$,

where $x_{ij}$ is a dummy variable,

    $x_{ij} = \begin{cases} 1 & \text{if observation } i \text{ is from the treated group } (j=1) \\ 0 & \text{if observation } i \text{ is from the control group } (j=0). \end{cases}$

The success probabilities in the two groups are then

    $\pi_{\mathrm{control}} = \frac{1}{1+\exp\{-\beta_0\}}$
    $\pi_{\mathrm{treated}} = \frac{1}{1+\exp\{-\beta_0-\beta_1\}}$.

The hypothesis $H_0{:}\,\pi_{\mathrm{control}} = \pi_{\mathrm{treated}}$ can be tested as the simple linear hypothesis $H_0{:}\,\beta_1 = 0$.
In other instances hypotheses or quantities of interest do not reduce, or are not equivalent, to
simple linear functions of the parameters. An estimate of the ratio

    $\frac{\pi_{\mathrm{control}}}{\pi_{\mathrm{treated}}} = \frac{1+\exp\{-\beta_0-\beta_1\}}{1+\exp\{-\beta_0\}}$,

for example, is a nonlinear function of the parameters. The variance of the estimated ratio
$\hat{\pi}_{\mathrm{control}}/\hat{\pi}_{\mathrm{treated}}$ must be approximated from a Taylor series expansion, although exact methods
for this particular ratio exist (Fieller 1940, see our §A6.8.5). The nlmixed procedure,
although not specifically designed for generalized linear models, can be used to our
advantage in this case.

The NLMIXED Procedure


The nlmixed procedure was first introduced into The SAS® System as an experimental proce-
dure in Release 7.0 and has been a full production procedure since Release 8.0. It is designed
for models more general than the generalized linear models considered in this chapter. It per-
forms (approximate) maximum likelihood inference in models with multiple random effects
(mixed models) and nonlinear mean function (see §8). The procedure can be used, however,
even if all model parameters are fixed in the same fashion as linear models can be fit with
proc nlin. The nlmixed procedure allows fitting of nonlinear (fixed and mixed effects)
models to data from any distribution, provided that the log-likelihood function can be speci-
fied with SAS® programming statements. Several of the common distribution functions are
already built into the procedure, among them the Bernoulli, Binomial, Poisson, and Gaussian
distribution (the Negative Binomial since Release 8.1). The syntax of the nlmixed procedure
is akin to that of proc nlin with slight differences in the specification of the model statement.
One of the decisive advantages of proc nlmixed is its ability to estimate complicated, non-
linear functions of the parameters and obtain their estimated standard errors by the delta
method (Taylor series expansions). We illustrate with a simple example from a dose-response
investigation.
Mead et al. (1993, p. 336) provide data on a dose-response experiment with Binomial
responses. At each of seven concentrations of an insecticide, twenty larvae were exposed to
the insecticide and the number of larvae that did not survive the exposure was recorded. Mead
et al. (1993) fit a probit regression model to express the proportion of larvae killed as a
function of the $\log_{10}$ concentration of the insecticide. Their model was

    $\mathrm{probit}(\pi) = \Phi^{-1}(\pi) = \beta_0 + \beta_1\log_{10}(x)$,

where $x$ denotes the insecticide concentration. Alternatively, a logit transformation could be
used (see our application in §6.7.1), yielding

    $\mathrm{logit}(\pi) = \ln\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1\log_{10}(x)$.

The investigators are interested in estimating the dosage at which the probability that a
randomly chosen larva will be killed is $0.5$, the so-called $LD_{50}$ (dosage lethal for $50\%$ of the
subjects). For any link function $g(\pi)$ the $LD_{50}$ is found by solving

    $g(0.5) = \beta_0 + \beta_1\log_{10}(LD_{50})$

for $LD_{50}$. For the probit or logit link we have $g(0.5) = 0$ and thus

    $\log_{10}(LD_{50}) = -\beta_0/\beta_1$
    $LD_{50} = 10^{-\beta_0/\beta_1}$.

Once the parameters of the probit or logistic regression model have been estimated, the
obvious estimates for the lethal dosages on the logarithmic and original scales of insecticide
concentration are

    $-\hat{\beta}_0/\hat{\beta}_1$  and  $10^{-\hat{\beta}_0/\hat{\beta}_1}$.

These are nonlinear functions of the parameter estimates, and standard errors must be
approximated from Taylor series expansions unless one is satisfied with fiducial limits for the
$LD_{50}$ (see §A6.8.5). The data set and the proc nlmixed code to fit the logistic regression
model and to estimate the lethal dosages are as follows.

data kills;
input concentration kills;
trials = 20;
logc = log10(concentration);
datalines;
0.375 0
0.75 1
1.5 8
3.0 11
6.0 16
12.0 18
24.0 20
;;
run;

proc nlmixed data=kills;
parameters intcpt=-1.7 b=4.0;
pi = 1/(1+exp(-intcpt - b*logc));
model kills ~ binomial(trials,pi);
estimate 'LD50' -intcpt/b;
estimate 'LD50 original' 10**(-intcpt/b);
run;

The parameters statement defines the parameters to be estimated and assigns starting
values, in the same fashion as the parameters statement of proc nlin. The statement pi =
1/(1+exp(-intcpt - b*logc)) calculates the probability of an insect being killed through
the inverse link function. This is a regular SAS® programming statement for $\pi = g^{-1}(\eta)$. The
model statement specifies that observations of the data set variable kills are realizations of
Binomial random variables with binomial sample size given by the data set variable trials
and success probability according to pi. The estimate statements calculate estimates for the
$LD_{50}$ values on the logarithmic and original scales of insecticide concentration.
If a probit analysis is desired, as in Mead et al. (1993), the code changes only slightly.
Only the statement for $\pi = g^{-1}(\eta)$ must be altered. The inverse of the probit link is the
cumulative Standard Gaussian distribution function

    $\pi = \Phi(\eta) = \int_{-\infty}^{\eta} \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}t^2\right\}\,dt$,

which can be calculated with the probnorm() function of The SAS® System (the SAS®
function calculating the linked value is, not surprisingly, called the probit() function; a call
to probnorm(1.96) returns the result 0.975, and probit(0.975) returns 1.96). Because of the
similarities of the logit and probit link functions, the same set of starting values can be used
for both analyses.
proc nlmixed data=kills;
parameters intcpt=-1.7 b=4.0;
pi = probnorm(intcpt + b*logc);
model kills ~ binomial(trials,pi);
estimate 'LD50' -intcpt/b;
estimate 'LD50 original' 10**(-intcpt/b);
run;

Like the genmod procedure, nlmixed allows the user to perform inference for distributions
other than the built-in distributions. In genmod this is accomplished by programming the
deviance (§6.4.3) and variance function (deviance and variance statements). In nlmixed the
model statement is altered to

model response ~ general(logl);

where logl is the log-likelihood function (see §6.4.1 and §A6.8.2) of the data constructed
with SAS® programming statements. For the Bernoulli distribution the log-likelihood for an
individual $(0,1)$ observation is simply $l(\pi; y) = y\ln\{\pi\} + (1-y)\ln\{1-\pi\}$ and for the
Binomial the log-likelihood kernel for the binomial count $y$ is $l(\pi; y) = y\ln\{\pi\} +
(n-y)\ln\{1-\pi\}$. The following nlmixed code also fits the probit model above.
proc nlmixed data=kills;
parameters intcpt=-1.7 b=4.0;
p = probnorm(intcpt + b*logc);
logl = kills*log(p) + (trials-kills)*log(1-p);
model kills ~ general(logl);
estimate 'LD50' -intcpt/b;
estimate 'LD50 original' 10**(-intcpt/b);
run;

6.3 Grouped and Ungrouped Data

Box 6.3 Grouped Data

• Data are grouped if the outcomes represent sums or averages of observations
  that share the same set of explanatory variables.

• Grouping data changes weights in the exponential family models and impacts
  the asymptotic behavior of parameter estimates.

• Binomial counts are always grouped. They are sums of independent
  Bernoulli variables.

So far we have implicitly assumed that each data point represents a single observation. This is
not necessarily the case. Consider an agronomic field trial in which four varieties of wheat are
to be compared with respect to their resistance to infestation with the Hessian fly (Mayetiola
destructor). The varieties are arranged in a randomized block design, and each experimental
unit is a $3.7 \times 3.7$ m field plot. $n_{ij}$ plants are sampled in the $j$th block for variety $i$, $z_{ij}$ of
which show damage. If plants on a plot are infected independently of each other, the data from
each plot can also be considered a set of independent and identically distributed Bernoulli
variables

    $Y_{ijk} = \begin{cases} 1 & \text{if the } k\text{th plant for variety } i \text{ in block } j \text{ shows damage} \\ 0 & \text{if the } k\text{th plant for variety } i \text{ in block } j \text{ shows no damage,} \end{cases} \quad k = 1,\ldots,n_{ij}$.

A hypothetical data set for the outcomes of such an experiment with two blocks is shown in
Table 6.7. The data set contains 56 total observations, 23 of which correspond to damaged
plants.

Table 6.7. Hypothetical data for Hessian fly experiment (four varieties in two blocks;
blank cells indicate that no plant was sampled)

                       Block $j=1$                          Block $j=2$
  $k$       Entry $i=1$  Entry 2  Entry 3  Entry 4    Entry 1  Entry 2  Entry 3  Entry 4
  1              1          0        0        1          1        0        0        0
  2              0          0        1        0          0        0        0        0
  3              0          1        0        0          1        1        1        0
  4              1          0        0        0          1        0        0        1
  5              0          1        0        1          1        0        1        0
  6              1          0        1        1          0                 0        0
  7              1                   1        0          0                 1        0
  8              0                            1                                     1
  $n_{ij}$       8          6        7        8          7        5        7        8
  $z_{ij}$       4          2        3        4          4        1        3        2

One could model the $Y_{ijk}$ with a generalized linear model for 56 binary outcomes.
Alternatively, one could model the number of damaged plants ($z_{ij}$) per plot or the proportion
of damaged plants per plot ($z_{ij}/n_{ij}$). The number of damaged plants corresponds to the sum
of the Bernoulli variables

    $z_{ij} = \sum_{k=1}^{n_{ij}} y_{ijk}$

and the proportion of damaged plants corresponds to their sample average

    $\overline{y}_{ij} = \frac{1}{n_{ij}} z_{ij}$.

The sums $z_{ij}$ and averages $\overline{y}_{ij}$ are grouped versions of the original data. The sum of independent
and identical Bernoulli variables is a Binomial random variable and of course in the
exponential family (see Table 6.2 on p. 305). It turns out that the distribution of the average
of random variables in the exponential family is also a member of the exponential family. But
the number of grouped observations is smaller than the size of the original data set. In Table
6.7 there are 56 ungrouped and 8 grouped observations. A generalized linear model needs to
be properly adjusted to reflect this grouping. This adjustment is made either through the variance
function or by introducing weights into the analysis. In the Hessian fly example, the
variance function $h(\pi_{ij})$ of a Bernoulli observation for variety $i$ in block $j$ is $\pi_{ij}(1-\pi_{ij})$, but
$h(\pi_{ij}) = n_{ij}\pi_{ij}(1-\pi_{ij})$ for the counts $z_{ij}$ and $h(\pi_{ij}) = n_{ij}^{-1}\pi_{ij}(1-\pi_{ij})$ for the proportions
$\overline{y}_{ij}$. The introduction of weights into the exponential family density or mass function
accomplishes the same. Instead of [6.1] we consider the weighted version


aC)  , a)bb
0 aC b œ expœ € - aCß <,Ab. [6.24]
<A

If C is an individual (ungrouped) observation then A œ ", if C represents an average of 8


observations choose A œ 8" , and A œ 8 if C is a sum of 8 observations. If data are grouped,
the number of data points is equal to the number of groups, denoted by 8a1b .

Example 6.5. Earthworm Counts. In 1995 earthworms (Lumbricus terrestris L.) were
counted in four replications of a $2^4$ factorial experiment at the W.K. Kellogg Biological
Station in Battle Creek, Michigan. The treatment factors and levels were Tillage (chisel-plow
and no-till), Input Level (conventional and low), Manure application (yes/no) and
Crop (corn and soybean). Of interest was whether the L. terrestris density varies under
these management protocols and how the various factors act and interact. Table 6.8
displays the total worm counts for the 64 ($2^4 \times 4$ replicates) experimental units (juvenile
and adult worms).
Table 6.8. Ungrouped worm count data (#/ft²) in $2^4$ factorial design (numbers
in each cell of the table correspond to counts on the four replicates)

                                      Tillage
                       Chisel-Plow                        No Tillage
  Crop     Manure      Input Level                        Input Level
                       Low           Conventional         Low           Conventional
  Corn     Yes         5, 5, 4, 2    5, 1, 5, 0           8, 4, 6, 4    14, 9, 9, 6
           No          3, 11, 0, 0   2, 0, 6, 1           2, 2, 11, 4   15, 9, 6, 4
  Soybean  Yes         8, 6, 0, 3    8, 4, 2, 2           2, 2, 13, 7   5, 3, 6, 0
           No          8, 5, 3, 11   2, 6, 9, 4           7, 5, 18, 3   23, 12, 17, 9

Unless the replication effects are block effects, the four observations per cell share the
same linear predictor consisting of Tillage, Input Level, Crop, and Manure effects and their
interactions. Grouping to model the averages reduces the 64 observations to $n^{(g)} = 16$
observations.

Table 6.9. Grouped (averaged) worm count data; $\omega = 1/4$

                                      Tillage
                       Chisel-Plow                        No Tillage
  Crop     Manure      Input Level                        Input Level
                       Low           Conventional         Low           Conventional
  Corn     Yes         4.00          2.75                 5.50          9.50
           No          3.50          2.25                 4.75          8.50
  Soybean  Yes         4.25          4.00                 6.00          3.50
           No          6.75          5.25                 8.25          15.25
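A sketch of the corresponding analysis (ours; data set and variable names are hypothetical):
the ungrouped counts of Table 6.8 can be fit directly as Poisson outcomes. For the averages
of Table 6.9, a weight variable could communicate $\omega = 1/4$ to the procedure through
genmod's weight (alias scwgt) statement.

proc genmod data=worms;
class till input manure crop;
/* full factorial linear predictor for the ungrouped counts */
model count = till|input|manure|crop / dist=poisson link=log;
run;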


When grouping data, observations that have the same set of covariates or design effects,
i.e., share the same linear predictor $\mathbf{x}'\boldsymbol{\beta}$, are summed or averaged. In the Hessian fly example
each block × entry combination is unique, but $n_{ij}$ observations were collected on each experimental
unit. In experiments where treatments are replicated, grouping is often possible
even if only a single observation is gathered on each unit. If covariates are continuous and
their values unique, grouping is not possible.

In the previous two examples it appears to be a matter of convenience whether data are
grouped or not. But this choice has subtle implications. Diagnosing the model-data disagreement
in generalized linear models based on residuals or goodness-of-fit measures such as
Pearson's $X^2$ statistic or the deviance (§6.4.3) is only meaningful if data are grouped. Grouping
is a special case of clustering where the elements of a cluster are reduced to a single
observation (the cluster total or average). Asymptotic results for grouped data can be obtained
by increasing the number of groups while holding the size of each group constant, or by
assuming that the number of groups is fixed and the group size grows. The respective asymptotic
results are not identical. For ungrouped data, it is only reasonable to consider asymptotic
results under the assumption that the sample size $n$ grows to infinity. No distinction between
group size and group number is made. Finally, if data are grouped, computations are less
time-consuming. Since generalized linear models are fit by iterative procedures, grouping
large data sets as much as possible is recommended.

6.4 Parameter Estimation and Inference

Box 6.4 Estimation and Inference

• Parameters of GLMs are estimated by maximum likelihood.

• The fitting algorithm is a weighted least squares method that is executed
  iteratively, since the weights change from iteration to iteration; hence the
  name iteratively reweighted least squares (IRLS).

• Hypotheses are tested with Wald, likelihood-ratio, or score tests. The
  respective test statistics have large-sample (asymptotic) Chi-squared
  distributions.

6.4.1 Solving the Likelihood Problem


Estimating the parameters in the linear predictor of a generalized linear model usually
proceeds by maximum likelihood methods. Likelihood-based inference requires knowledge
of the joint distribution of the $n$ data points (or $n^{(g)}$ groups). For many discrete data such as
counts and frequencies, the joint distribution is given by a simple sampling model. If an
experiment has only two possible outcomes, which occur with probability $\pi$ and $(1-\pi)$, the
distribution of the response is necessarily Bernoulli. Similarly, many count data are analyzed
under a Poisson model.


For random variables with distribution in the exponential family the specification of the
joint distribution of the data is made simple by the relationship between mean and variance
(§6.2.1). If the $i = 1, \ldots, n$ observations are independent, the likelihood for the complete
response vector

    $\mathbf{y} = [y_1, \ldots, y_n]'$

becomes

    $L(\boldsymbol{\theta}, \phi; \mathbf{y}) = \prod_{i=1}^{n} L(\theta_i, \phi; y_i) = \prod_{i=1}^{n}\exp\{(y_i\theta_i - b(\theta_i))/\phi + c(y_i, \phi)\}$,  [6.25]

and the log-likelihood function in terms of the natural parameter is

    $l(\boldsymbol{\theta}, \phi; \mathbf{y}) = \sum_{i=1}^{n}(y_i\theta_i - b(\theta_i))/\phi + c(y_i, \phi)$.  [6.26]

Written in terms of the vector of mean parameters, the log-likelihood becomes

    $l(\boldsymbol{\mu}, \phi; \mathbf{y}) = \sum_{i=1}^{n}\left(y_i\theta(\mu_i) - b(\theta(\mu_i))\right)/\phi + c(y_i, \phi)$.  [6.27]

Since $\mu_i = g^{-1}(\eta_i) = g^{-1}(\mathbf{x}_i'\boldsymbol{\beta})$, where $g^{-1}(\cdot)$ is the inverse link function, the log-likelihood
is a function of the parameter vector $\boldsymbol{\beta}$ and estimates are found as the solutions of

    $\frac{\partial l(\boldsymbol{\mu}, \phi; \mathbf{y})}{\partial\boldsymbol{\beta}} = \mathbf{0}.$  [6.28]

Details of this maximization problem are found in §A6.8.2.
Since generalized linear models are nonlinear, the estimating equations resemble those
for nonlinear models. For a general link function these equations are

    $\mathbf{F}'\mathbf{V}^{-1}(\mathbf{y} - \boldsymbol{\mu}) = \mathbf{0}$,  [6.29]

and

    $\mathbf{X}'(\mathbf{y} - \boldsymbol{\mu}) = \mathbf{0}$  [6.30]

if the link is canonical. Here, $\mathbf{V}$ is a diagonal matrix containing the variances of the responses
on its diagonal ($\mathbf{V} = \mathrm{Var}[\mathbf{Y}]$) and $\mathbf{F}$ contains derivatives of $\boldsymbol{\mu}$ with respect to $\boldsymbol{\beta}$. Furthermore,
if the link is the identity, it follows that a solution to [6.30] is

    $\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{V}^{-1}\mathbf{y}.$  [6.31]

This would suggest a generalized least squares estimator $\hat{\boldsymbol{\beta}}^* = \left(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}$. The difficulty
with generalized linear models is that the variances in $\mathbf{V}$ are functionally dependent on
the means. In order to calculate $\hat{\boldsymbol{\beta}}^*$, which determines the estimate of the mean $\hat{\boldsymbol{\mu}}$, $\mathbf{V}$ must be
evaluated at some estimate of $\boldsymbol{\mu}$. Once $\hat{\boldsymbol{\beta}}^*$ is calculated, $\mathbf{V}$ should be updated. The procedure
to solve the maximum likelihood problem in generalized linear models is hence iterative,
with variance estimates updated after updates of $\hat{\boldsymbol{\beta}}$, and is known as iteratively reweighted
least squares (IRLS, §A6.8.3). Upon convergence of the IRLS algorithm, the variance of $\hat{\boldsymbol{\beta}}$ is
estimated as

    $\widehat{\mathrm{Var}}\left[\hat{\boldsymbol{\beta}}\right] = \left(\hat{\mathbf{F}}'\hat{\mathbf{V}}^{-1}\hat{\mathbf{F}}\right)^{-1}$,  [6.32]

where the variance and derivative matrices are evaluated at the converged iterate.
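To make the algorithm concrete, the following is a bare-bones sketch (ours, not the internals
of proc genmod) of the IRLS iterations for a logistic regression with canonical link, written in
SAS/IML. It assumes a 0/1 response vector y and a design matrix X have already been read
into the session, and it omits the convergence checks and safeguards of a production routine.

proc iml;
/* y (n x 1, coded 0/1) and X (n x p) are assumed to exist already */
beta = j(ncol(X), 1, 0);                 /* starting values           */
do iter = 1 to 25;                       /* fixed number of sweeps    */
   eta = X * beta;                       /* linear predictor          */
   pi  = 1 / (1 + exp(-eta));            /* inverse logit link        */
   W   = diag(pi # (1 - pi));            /* variances h(mu) on diag   */
   z   = eta + inv(W) * (y - pi);        /* working response          */
   beta = inv(X` * W * X) * X` * W * z;  /* weighted LS update        */
end;
print beta;
quit;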
The mean response is estimated by evaluating the linear predictor at $\hat{\boldsymbol{\beta}}$ and substituting
the result into the inverse link function,

    $\hat{\mathrm{E}}[Y] = g^{-1}(\hat{\eta}) = g^{-1}\left(\mathbf{x}'\hat{\boldsymbol{\beta}}\right).$  [6.33]

Because of the nonlinearity of most link functions, $\hat{\mathrm{E}}[Y]$ is not an unbiased estimator of $\mathrm{E}[Y]$,
even if $\hat{\boldsymbol{\beta}}$ is unbiased for $\boldsymbol{\beta}$, which usually it is not. An exception is the identity link function,
where $\mathrm{E}[\mathbf{x}'\hat{\boldsymbol{\beta}}] = \mathbf{x}'\boldsymbol{\beta}$ if the estimator is unbiased and $\mathrm{E}[\mathbf{x}'\hat{\boldsymbol{\beta}}] = \mathrm{E}[Y]$ provided the model is
correct. Estimated standard errors of the predicted mean values are usually derived from
Taylor series expansions and are approximate in the following sense. The Taylor series of
$g^{-1}(\mathbf{x}'\hat{\boldsymbol{\beta}})$ around some value $\boldsymbol{\beta}^*$ is an approximate linearization of $g^{-1}(\mathbf{x}'\hat{\boldsymbol{\beta}})$. The variance of
this linearization is a function of the model parameters and is estimated by substituting the
parameter estimates without taking into account the uncertainty in these estimates themselves.
A Taylor series of $\hat{\mu} = g^{-1}(\hat{\eta}) = g^{-1}(\mathbf{x}'\hat{\boldsymbol{\beta}})$ around $\eta$ leads to the linearization

    $g^{-1}(\hat{\eta}) \approx g^{-1}(\eta) + \left.\frac{\partial g^{-1}(\hat{\eta})}{\partial\hat{\eta}}\right|_{\eta}(\hat{\eta} - \eta) = \mu + \left.\frac{\partial g^{-1}(\hat{\eta})}{\partial\hat{\eta}}\right|_{\eta}\left(\mathbf{x}'\hat{\boldsymbol{\beta}} - \mathbf{x}'\boldsymbol{\beta}\right).$  [6.34]

The approximate variance of the predicted mean is thus

    $\mathrm{Var}[\hat{\mu}] \approx \left[\frac{\partial g^{-1}(\eta)}{\partial\eta}\right]^2 \mathbf{x}'\left(\mathbf{F}'\mathbf{V}^{-1}\mathbf{F}\right)^{-1}\mathbf{x}$

and is estimated as

    $\widehat{\mathrm{Var}}[\hat{\mu}] \approx \left[\frac{\partial g^{-1}(\hat{\eta})}{\partial\hat{\eta}}\right]^2 \mathbf{x}'\left(\hat{\mathbf{F}}'\hat{\mathbf{V}}^{-1}\hat{\mathbf{F}}\right)^{-1}\mathbf{x} = \left[\frac{\partial g^{-1}(\hat{\eta})}{\partial\hat{\eta}}\right]^2 \mathbf{x}'\,\widehat{\mathrm{Var}}\left[\hat{\boldsymbol{\beta}}\right]\mathbf{x}.$  [6.35]

If the link function is canonical, a simplification arises. In that case $\eta = \theta(\mu)$, the derivative
$\partial g^{-1}(\eta)/\partial\eta$ can be written as $\partial\mu/\partial\theta(\mu)$, and the standard error of the predicted mean is
estimated as

    $\widehat{\mathrm{Var}}[\hat{\mu}] \approx h(\hat{\mu})^2\,\mathbf{x}'\,\widehat{\mathrm{Var}}\left[\hat{\boldsymbol{\beta}}\right]\mathbf{x}.$  [6.36]
6.4.2 Testing Hypotheses about Parameters and Their Functions

Under certain regularity conditions, which differ for grouped and ungrouped data and are
outlined by, e.g., Fahrmeir and Tutz (1994, p. 43), the maximum likelihood estimates $\hat{\boldsymbol{\beta}}$ have
an asymptotic Gaussian distribution with variance-covariance matrix $\left(\mathbf{F}'\mathbf{V}^{-1}\mathbf{F}\right)^{-1}$. Tests of
linear hypotheses of the form $H_0{:}\,\mathbf{A}\boldsymbol{\beta} = \mathbf{d}$ are thus based on standard Gaussian theory. A
special case of this linear hypothesis is a test of $H_0{:}\,\beta_j = 0$, where $\beta_j$ is the $j$th element of $\boldsymbol{\beta}$.
The standard approach of dividing the estimate of $\beta_j$ by its estimated standard error is useful
for generalized linear models, too.

The statistic

    $W = \frac{\hat{\beta}_j^2}{\widehat{\mathrm{Var}}\left[\hat{\beta}_j\right]} = \left(\frac{\hat{\beta}_j}{\mathrm{ese}\left(\hat{\beta}_j\right)}\right)^2$  [6.37]

has an asymptotic Chi-squared distribution with one degree of freedom. [6.37] is a special
case of a Wald test statistic. More generally, to test $H_0{:}\,\mathbf{A}\boldsymbol{\beta} = \mathbf{d}$, compare the test statistic

    $W = \left(\mathbf{A}\hat{\boldsymbol{\beta}} - \mathbf{d}\right)'\left[\mathbf{A}\left(\hat{\mathbf{F}}'\hat{\mathbf{V}}^{-1}\hat{\mathbf{F}}\right)^{-1}\mathbf{A}'\right]^{-1}\left(\mathbf{A}\hat{\boldsymbol{\beta}} - \mathbf{d}\right)$  [6.38]

against cutoffs from a Chi-squared distribution with $q$ degrees of freedom (where $q$ is the
rank of the matrix $\mathbf{A}$). Such tests are simple to carry out because they only require fitting a
single model. The contrast statement in proc genmod of The SAS® System implements such
linear hypotheses (with $\mathbf{d} = \mathbf{0}$) but does not produce Wald tests by default. Instead it
calculates a likelihood ratio test statistic, which is computationally more involved but also has
better statistical properties (see §1.3). Assume you fit a generalized linear model and obtain
the parameter estimates $\hat{\boldsymbol{\beta}}_f$ from the IRLS algorithm. The subscript $f$ denotes the full model.
A reduced model is obtained by invoking the constraint $\mathbf{A}\boldsymbol{\beta} = \mathbf{d}$. Call the estimates obtained
under this constraint $\hat{\boldsymbol{\beta}}_r$. If $l(\hat{\boldsymbol{\mu}}_f, \phi; \mathbf{y})$ is the log-likelihood attained in the full and $l(\hat{\boldsymbol{\mu}}_r, \phi; \mathbf{y})$
is the log-likelihood in the reduced model, twice their difference

    $\Lambda = 2\left\{l\left(\hat{\boldsymbol{\mu}}_f, \phi; \mathbf{y}\right) - l\left(\hat{\boldsymbol{\mu}}_r, \phi; \mathbf{y}\right)\right\}$  [6.39]

is asymptotically distributed as a Chi-squared random variable with $q$ degrees of freedom,
where $q$ is the rank of $\mathbf{A}$, i.e., $q$ equals the number of constraints imposed by the hypothesis.
The log-likelihood is also used to calculate a measure of the goodness-of-fit between model
and data, called the deviance. In §6.4.3 we define this measure and in §6.4.4 we apply the
likelihood ratio test idea as a series of deviance reduction tests.
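For example, in the seed germination model of §6.2.3 both versions of the test could be
requested as follows (a sketch assuming, for illustration, that a contrast with coefficients 1
and -1 among the temperature levels is of interest; the wald option replaces the default
likelihood ratio statistic with [6.38]):

proc genmod data=germination;
class temp conc;
model germnumber/trials = temp conc temp*conc / dist=binomial link=logit;
contrast 'temp contrast, LR test'   temp 1 -1;
contrast 'temp contrast, Wald test' temp 1 -1 / wald;
run;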
Hypotheses about nonlinear functions of the parameters can often be expressed in terms
of equivalent linear hypotheses. Rather than using approximations based on series expansions
to derive standard errors and using a Wald type test, one should test the equivalent linear
hypothesis. Consider a two-group comparison of Binomial proportions with a logistic model
logita14 b œ "! € "" B34 ß a4 œ "ß #b

where
" if observation 3 is from group "
B34 œ œ
! if observation 3 is from group #.

The ratio

© 2003 by CRC Press LLC


Parameter Estimation and Inference 335

1" " € expe  "!  "" f


œ
1# " € expe  "! f

is a nonlinear function of "! and "" but the hypothesis L! : 1" Î1# œ " is equivalent to the
linear hypothesis L! : "" œ !. In other cases it may not be possible to find an equivalent linear
hypothesis. In the logistic dose-response model
logita1b œ "! € "" B,

where B is concentration (dosage) and 1 the probability of observing a success at that


concentration, PH&! œ  "! Î"" . A &% level test whether the PH&! is equal to some
concentration B! could proceed as follows. At convergence of the IRLS algorithm, estimate
s
s&! œ  " !
PH
s"
"

and obtain its estimated standard error. Calculate an approximate *&% confidence interval for
PH&! , relying on the asymptotic Gaussian distribution of "s , as

s&! „ "Þ*' eseˆPH


PH s&! ‰.

If the confidence interval does not cover the concentration B! , reject the hypothesis that
PH&! œ B! ; otherwise fail to reject. A slightly conservative approach is to replace the
standard Gaussian cutoff with a cutoff from a > distribution (which is what proc nlmixed
does). The key is to derive a good estimate of the standard error of the nonlinear function of
the parameters. For general nonlinear functions of the parameters we prefer approximate stan-
dard errors calculated from Taylor series expansions. This method is very general and typi-
cally produces good approximations. The estimate statement of the nlmixed procedure in
The SAS® System implements the calculation of standard errors by a first-order Taylor series
for nonlinear functions of the parameters. For certain functions, such as the $LD_{50}$ above,
exact formulas have been developed. Finney (1978, pp. 80-82) gives formulas for fiducial
intervals for the $LD_{50}$ based on a theorem by Fieller (1940) and applies the result to test the
identity of equipotent dosages for two assay formulations. Fiducial intervals are akin to
confidence intervals; the difference between the two approaches is largely philosophical. The
interpretation of confidence limits appeals to conceptual, repeated sampling such that the
repeatedly calculated intervals include the true parameter value with a specified frequency.
Fiducial limits are values of the parameter that would produce an observed statistic such as
$\widehat{LD}_{50}$ with a given probability. See Schwertman (1996) and Wang (2000) for further details
on the comparison between fiducial and frequentist inference. In §A6.8.5 Fieller's derivation
is examined and compared to the expression for the standard error of a ratio of two random
variables derived from a first-order Taylor series expansion. As it turns out, the Taylor series
method results in a very good approximation provided that the slope $\beta_1$ in the dose-response
model is considerably different from zero. For the approximation to be satisfactory, the
standard $t$ statistic for testing $H_0\colon \beta_1 = 0$,

$$t_{obs} = \frac{\hat{\beta}_1}{\text{ese}(\hat{\beta}_1)},$$

should be greater in absolute value than the $t$ cutoff divided by $\sqrt{0.05}$,

$$|t_{obs}| \ge t_{\alpha/2,\nu}\big/\sqrt{0.05}.$$

Here, $\nu$ denotes the degrees of freedom associated with the model deviance and $\alpha$ the
significance level. For a 5% significance level this translates into a $t_{obs}$ of about 9 or more in
absolute value (Finney 1978, p. 82). It should be noted, however, that the fiducial limits of
Fieller (1940) are derived under the assumption of Gaussianity and unbiasedness of the
estimators.
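
In practice the Wald-type interval for the $LD_{50}$ is obtained directly with the estimate
statement of proc nlmixed, which applies the first-order Taylor series method described
above. A minimal sketch follows; the data set, variable names, and starting values are
hypothetical, with y successes out of n trials observed at dose x:

proc nlmixed data=doseresp;
   parms b0=-2 b1=0.5;              /* starting values (assumed)        */
   p = 1/(1+exp(-(b0 + b1*x)));     /* logistic dose-response model     */
   model y ~ binomial(n, p);
   estimate 'LD50' -b0/b1;          /* delta-method standard error and
                                       t-based confidence limits        */
run;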

6.4.3 Deviance and Pearson's $X^2$ Statistic

The quality of agreement between model and data in generalized linear models is assessed by
two statistics, the Pearson $X^2$ statistic and the model deviance $D$. Pearson's statistic is
defined as

$$X^2 = \sum_{i=1}^{n^{(g)}} \frac{(y_i - \hat{\mu}_i)^2}{h(\hat{\mu}_i)}, \qquad [6.40]$$

where $n^{(g)}$ is the size of the grouped data set and $h(\hat{\mu}_i)$ is the variance function evaluated
at the estimated mean (see §6.2.1 for the definition of the variance function). $X^2$ thus takes
the form of a weighted residual sum of squares. The deviance of a generalized linear model is
derived from the likelihood principle. It is proportional to twice the difference between the
maximized log likelihood evaluated at the estimated means $\hat{\mu}_i$ and the largest achievable
log likelihood, obtained by setting $\hat{\mu}_i = y_i$. Two versions of the deviance are distinguished,
depending on whether the distribution of the response involves a scale parameter $\phi$ or not.
Recall from §6.3 the weighted exponential family density

$$f(y_i) = \exp\{(y_i\theta_i - b(\theta_i))/(\phi\omega_i) + c(y_i, \phi)\}.$$

If $\hat{\theta}_i = \theta(\hat{\mu}_i)$ is the estimate of the natural parameter in the model under consideration
and $\dot{\theta}_i = \theta(y_i)$ is the canonical link evaluated at the observations, the scaled deviance is
defined as

$$D^*(y; \hat{\mu}) = 2\sum_{i=1}^{n^{(g)}} \ell(\dot{\theta}_i, \phi; y_i) - 2\sum_{i=1}^{n^{(g)}} \ell(\hat{\theta}_i, \phi; y_i)
= \frac{2}{\phi}\sum_{i=1}^{n^{(g)}}\Bigl\{y_i\bigl(\dot{\theta}_i - \hat{\theta}_i\bigr) - b\bigl(\dot{\theta}_i\bigr) + b\bigl(\hat{\theta}_i\bigr)\Bigr\}
= 2\{\ell(y, \phi; y) - \ell(\hat{\mu}, \phi; y)\}. \qquad [6.41]$$

When $D^*(y; \hat{\mu})$ is multiplied by the scale parameter $\phi$, we simply refer to the deviance
$D(y; \hat{\mu}) = \phi D^*(y; \hat{\mu}) = \phi\,2\{\ell(y, \phi; y) - \ell(\hat{\mu}, \phi; y)\}$. In [6.41], $\ell(y, \phi; y)$ refers to the
log likelihood evaluated at $\mu = y$, and $\ell(\hat{\mu}, \phi; y)$ to the log likelihood obtained from
fitting a particular model. If the fitted model is saturated, i.e., fits the data perfectly, the
(scaled) deviance and $X^2$ statistics are identically zero.

The utility of $X^2$ and $D^*(y; \hat{\mu})$ lies in the fact that, under certain conditions, both have a
well-known asymptotic distribution. In particular,

$$X^2/\phi \xrightarrow{d} \chi^2_{n^{(g)}-p}, \qquad D^*(y; \hat{\mu}) \xrightarrow{d} \chi^2_{n^{(g)}-p}, \qquad [6.42]$$

where $p$ is the number of estimated model parameters (the rank of the model matrix $X$).
Table 6.10 shows deviance functions for various distributions in the exponential family.
The deviance in the Gaussian case is simply the residual sum of squares. The deviance-based
scale estimate $\hat{\phi} = D(y; \hat{\mu})/(n^{(g)} - p)$ in this case is the customary residual mean
square error. Furthermore, the deviance and Pearson $X^2$ statistics are then identical and
their scaled versions have exact (rather than approximate) Chi-squared distributions.

Table 6.10. Deviances for some exponential family distributions (in the Binomial case
$n_i$ denotes the Binomial sample size and $y_i$ the number of successes)

Distribution          $D(y; \hat{\mu})$
Bernoulli             $2\sum_{i=1}^{n}\bigl[-y_i\ln\{\hat{\mu}_i/(1-\hat{\mu}_i)\} - \ln\{1-\hat{\mu}_i\}\bigr]$
Binomial              $2\sum_{i=1}^{n}\bigl[y_i\ln\{y_i/\hat{\mu}_i\} + (n_i-y_i)\ln\{(n_i-y_i)/(n_i-\hat{\mu}_i)\}\bigr]$
Negative Binomial     $2\sum_{i=1}^{n}\bigl[y_i\ln\{y_i/\hat{\mu}_i\} - (y_i+1/k)\ln\{(y_i+1/k)/(\hat{\mu}_i+1/k)\}\bigr]$
Poisson               $2\sum_{i=1}^{n}\bigl[y_i\ln\{y_i/\hat{\mu}_i\} - (y_i-\hat{\mu}_i)\bigr]$
Gaussian              $\sum_{i=1}^{n}(y_i-\hat{\mu}_i)^2$
Gamma                 $2\sum_{i=1}^{n}\bigl[-\ln\{y_i/\hat{\mu}_i\} + (y_i-\hat{\mu}_i)/\hat{\mu}_i\bigr]$
Inverse Gaussian      $\sum_{i=1}^{n}(y_i-\hat{\mu}_i)^2/(\hat{\mu}_i^2 y_i)$
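
To make the Poisson entry of the table concrete, its contributions can be accumulated by
hand from a fitted model. The following is a sketch under the assumption that a data set
fitted (hypothetical name) contains the response y and the fitted mean muhat from a
previous fit:

data devcheck;
   set fitted;                            /* hypothetical: y and muhat     */
   if y > 0 then dev_i = 2*(y*log(y/muhat) - (y - muhat));
   else          dev_i = 2*muhat;         /* limiting contribution at y=0  */
run;
proc means data=devcheck sum;             /* sum of dev_i equals D(y;muhat) */
   var dev_i;
run;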

As the agreement between the data ($y_i$) and the model fit ($\hat{\mu}_i$) improves, $X^2$ decreases
in value. On the contrary, a model not fitting the data well will result in a large value of $X^2$
(and a large value of $D^*(y; \hat{\mu})$). If the conditions for the asymptotic result hold, one can
calculate the $p$-value for $H_0$: the model fits the data as

$$\Pr\Bigl(X^2/\phi > \chi^2_{n^{(g)}-p}\Bigr) \quad \text{or} \quad \Pr\Bigl(D^*(y; \hat{\mu}) > \chi^2_{n^{(g)}-p}\Bigr).$$

If the $p$-value is sufficiently large, the model is acceptable as a description of the
data-generating mechanism. Before one can rely on this goodness-of-fit test, the conditions
under which the asymptotic result holds must be met and understood (McCullagh and Nelder
1989, p. 118). The first requirement is independence of the observations. If overdispersion
arises from autocorrelation or randomly varying parameters, both of which induce
correlations, $X^2$ and $D^*(y; \hat{\mu})$ do not have asymptotic Chi-squared distributions. More
importantly, it is assumed that the data are grouped, the number of groups ($n^{(g)}$) remains
fixed, and the sample size in each group tends to infinity, thereby driving the within-group
variance to zero. If these conditions are not met, large values of $X^2$ or $D^*(y; \hat{\mu})$ do not
necessarily indicate poor fit and should be interpreted with caution. Therefore, if data are
ungrouped (group size is 1, $n^{(g)} = n$), one should not rely on $X^2$ or $D^*(y; \hat{\mu})$ as
goodness-of-fit measures.
The asymptotic distributions of $X^2/\phi$ and $D(y; \hat{\mu})/\phi$ suggest a simple method to
estimate the extra scale parameter $\phi$ in a generalized linear model. Equating $X^2/\phi$ with its
asymptotic expectation,

$$\frac{1}{\phi}\mathrm{E}\bigl[X^2\bigr] \approx n^{(g)} - p,$$

suggests the estimator $\hat{\phi} = X^2/(n^{(g)} - p)$. Similarly, $D(y; \hat{\mu})/(n^{(g)} - p)$ is a
deviance-based estimate of the scale parameter $\phi$.

For distributions where $\phi = 1$, a value of $D(y; \hat{\mu})/(n^{(g)} - p)$ or $X^2/(n^{(g)} - p)$
substantially larger than one indicates that the data are more dispersed than is permissible
under the assumed probability distribution. This can be used to diagnose an improperly
specified model or overdispersion (§6.6). It should be noted that if the ratio of deviance or
$X^2$ to its degrees of freedom is large, one should not automatically conclude that the data
are overdispersed. Akin to a linear model, where omitting an important variable or effect
increases the error variance, the deviance will increase if an important effect is unaccounted
for.
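
In proc genmod these two moment estimators are requested with the pscale and dscale
options of the model statement. As a hedged sketch, again using the germination data of
§6.4.4, the deviance-based estimate $\hat{\phi} = D(y; \hat{\mu})/(n^{(g)} - p)$ is obtained with

proc genmod data=germrate;
   class temp conc;
   model germ/trials = temp conc / dist=binomial link=logit dscale;
run;

The parameter estimates are unchanged; the reported standard errors are inflated by the
square root of the estimated scale.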

6.4.4 Testing Hypotheses through Deviance Partitioning


Despite the potential problems of interpreting the deviance for ungrouped data, it has
paramount importance in statistical inference for generalized linear models. While the
Chi-squared approximation for $D^*(y; \hat{\mu})$ may not hold, the distribution of differences of
deviances among nested models can be well approximated by Chi-squared distributions, even
if data are not grouped. This forms the basis for change-in-deviance tests, the GLM
equivalent of the sum of squares reduction test. Consider a full model $M_f$ and a second
(reduced) model $M_r$ obtained by eliminating $q$ parameters of $M_f$. Usually this means
setting one or more parameters in the linear predictor to zero by eliminating treatment effects
from $M_f$. If $\hat{\mu}_f$ and $\hat{\mu}_r$ are the respective estimated means, then the increase in
deviance incurred by eliminating the $q$ parameters is given by

$$D^*(y; \hat{\mu}_r) - D^*(y; \hat{\mu}_f), \qquad [6.43]$$

and has an asymptotic Chi-squared distribution with $q$ degrees of freedom. If [6.43] is
significantly large, reject model $M_r$ in favor of $M_f$. Since $D^*(y; \hat{\mu}) = 2\{\ell(y, \phi; y) - \ell(\hat{\mu}, \phi; y)\}$
is twice the difference between the maximal and the maximized log likelihood, [6.43] can be
rewritten as

$$D^*(y; \hat{\mu}_r) - D^*(y; \hat{\mu}_f) = 2\bigl\{\ell(\hat{\mu}_f, \phi; y) - \ell(\hat{\mu}_r, \phi; y)\bigr\} = \Lambda, \qquad [6.44]$$

and it is thereby established that this procedure is also the likelihood ratio test (§A6.8.4) for
testing $M_f$ versus $M_r$.
To demonstrate the testing of hypotheses through deviance partitioning, we use Example 6.2
(p. 306, data appear in Table 6.3). Recall that the experiment involves a $4 \times 4$ factorial
treatment structure with factors temperature ($T_1, T_2, T_3, T_4$) and concentration of a
chemical ($0, 0.1, 1.0, 10$) and their effect on the germination probability of seeds. For each
of the 16 treatment combinations, 4 dishes with 50 seeds each are prepared and the number of
germinating seeds is counted in each dish. From an experimental design standpoint this is a
completely randomized design with a $4 \times 4$ treatment structure. Hence, we are interested in
determining the significance of the Temperature main effect, the main effect of the chemical
Concentration, and the Temperature × Concentration interaction. If the seeds within a dish
and between dishes germinate independently of each other, the germination count in each
dish can be modeled as a Binomial($50, \pi_{ij}$) random variable, where $\pi_{ij}$ denotes the
germination probability if temperature $i$ and concentration $j$ are applied. Table 6.11 lists the
models successively fit to the data.

Table 6.11. Models successively fit to seed germination data ("×" in a cell
of the table implies that the particular effect is present in the model;
grand represents the presence of a grand mean (intercept) in the model)

Model   Grand   Temperature   Concentration   Temp. × Conc.
(1)     ×
(2)     ×       ×
(3)     ×                     ×
(4)     ×       ×             ×
(5)     ×       ×             ×               ×

Applying a logit link, model (1) is fit to the data with the genmod procedure statements:
proc genmod data=germrate;
model germ/trials = /link=logit dist=binomial;
run;

Output 6.1. The GENMOD Procedure

Model Information
Data Set WORK.GERMRATE
Distribution Binomial
Link Function Logit
Response Variable (Events) germ
Response Variable (Trials) trials
Observations Used 64
Number Of Events 1171
Number Of Trials 3200

Criteria For Assessing Goodness Of Fit


Criterion DF Value Value/DF
Deviance 63 1193.8014 18.9492
Scaled Deviance 63 1193.8014 18.9492
Pearson Chi-Square 63 1087.5757 17.2631
Scaled Pearson X2 63 1087.5757 17.2631
Log Likelihood -2101.6259
Algorithm converged.
Analysis Of Parameter Estimates
Standard Wald 95% Chi-
Parameter DF Estimate Error Confidence Limits Square Pr > ChiSq
Intercept 1 -0.5497 0.0367 -0.6216 -0.4778 224.35 <.0001
Scale 0 1.0000 0.0000 1.0000 1.0000
NOTE: The scale parameter was held fixed.


The deviance for this model is 1193.80 on 63 degrees of freedom (Output 6.1). It is
almost 19 times larger than 1, which clearly indicates that the model does not account for the
variability in the data. This could be due to the seed counts being more dispersed than
Binomial random variables and/or the absence of important effects from the model. The
intercept estimate $\hat{\beta}_0 = -0.5497$ translates into an estimated success probability of

$$\hat{\pi} = \frac{1}{1 + \exp\{0.5497\}} = 0.366.$$

This is the overall proportion of germinating seeds. Tallying all successes (= germinations) in
Table 6.3, one obtains 1,171 germinations in 64 dishes containing 50 seeds each and

$$\hat{\pi} = \frac{1{,}171}{64 \times 50} = 0.366.$$

Notice that the degrees of freedom for the model equal the number of groups ($n^{(g)} = 64$)
minus the number of estimated parameters.
Models (2) through (5) are fit similarly with the SAS® statements (output not shown):

/* model (2) */
proc genmod data=germrate;
   class temp;
   model germ/trials = temp /link=logit dist=binomial;
run;
/* model (3) */
proc genmod data=germrate;
   class conc;
   model germ/trials = conc /link=logit dist=binomial;
run;
/* model (4) */
proc genmod data=germrate;
   class temp conc;
   model germ/trials = temp conc /link=logit dist=binomial;
run;
/* model (5) */
proc genmod data=germrate;
   class temp conc;
   model germ/trials = temp conc temp*conc /link=logit dist=binomial;
run;

Their degrees of freedom and deviances are shown in Table 6.12.

Table 6.12. Deviances and $X^2$ for models (1)-(5) in Table 6.11
(DF denotes degrees of freedom of the deviance and $X^2$ statistic)

Model   DF   Deviance    Deviance/DF   $X^2$      $X^2$/DF
(1)     63   1,193.80    18.94         1,087.58   17.26
(2)     60     430.11     7.17           392.58    6.54
(3)     60     980.09    16.33           897.27   14.95
(4)     57     148.10     2.59           154.95    2.72
(5)     48      55.64     1.15            53.95    1.12

The deviance and $X^2$ statistics for any of the five models are very close. To test
hypotheses about the various treatment factors, differences of deviances between two models are
compared to cutoffs from Chi-squared distributions with degrees of freedom equal to the
difference in DF for the models. For example, comparing the deviances of models (1) and (2)
yields $1{,}193.80 - 430.11 = 763.69$ on 3 degrees of freedom. The $p$-value for this test can
be calculated in SAS® with

data pvalue; p = 1-probchi(763.69,3); run; proc print; run;

The result is a $p$-value near zero. But what does this test mean? We are comparing a model
with Temperature effects only (model (2)) against a model without any effects (model (1)),
that is, in the absence of any concentration effects and/or interactions. We tested the
hypothesis that a model with Temperature effects explains the variation in the data as well as
a model containing only an intercept. Similarly, the question of a significant Concentration
effect in the absence of temperature effects (and the interaction) is addressed by comparing
the deviance difference $1{,}193.80 - 980.09 = 213.71$ against a Chi-squared distribution
with 3 degrees of freedom ($p < 0.0001$). The significance of the Concentration effect in a
model already containing a Temperature main effect is assessed by the deviance difference
$430.11 - 148.10 = 282.01$. This value differs from the deviance reduction of 213.71 that was
obtained by adding Concentration effects to the null model. Because of the nonlinearity of
the logistic link function, the effects in the generalized linear model are not orthogonal. The
significance of a particular effect depends on which other effects are present in the model, a
feature of sequential tests (§4.3.3) under nonorthogonality. Although either Chi-squared
statistic would be significant in this example, it is easy to see that it makes a difference in
which order the effects are tested. The most meaningful test that can be derived from Table
6.12 is that of the Temperature × Concentration interaction, obtained by comparing the
deviances of models (4) and (5). Here, the full model includes all possible effects (two main
effects and the interaction) and the reduced model (model (4)) excludes only the interaction.
From this comparison, with a deviance difference of $148.10 - 55.64 = 92.46$ on
$57 - 48 = 9$ degrees of freedom, a $p$-value of $< 0.0001$ is obtained, sufficient to declare a
significant Temperature × Concentration interaction.
An approach to deviance testing that does not depend on the order in which terms enter
the model is to use partial deviances, where the contribution of an effect is evaluated as the
deviance decrement incurred by adding the effect to a model containing all other effects. In
proc genmod of The SAS® System this is accomplished by adding the type3 option to the
model statement. The type1 option of the model statement will conduct a sequential test of
model effects. The following statements request sequential (type1) and partial (type3) likeli-
hood ratio tests in the full model. The ods statement preceding the proc genmod code excludes
the lengthy table of parameter estimates from the output.
ods exclude parameterestimates;
proc genmod data=germrate;
class temp conc;
model germ/trials = temp conc temp*conc /link=logit
dist=binomial type1 type3;
run;

The sequential (Type1) and partial (Type3) deviance decrements are not identical (Output
6.2). Adding Temperature effects to a model containing no other effects yields a deviance
reduction of 763.69 (as also calculated from the data in Table 6.12). Adding Temperature
effects to a model containing Concentration effects (and the interactions) yields a deviance
reduction of 804.24. The partial and sequential tests for the interaction are the same, because
this term entered the model last.


Wald tests for the partial or sequential hypotheses, instead of likelihood ratio tests, are
requested with the wald option of the model statement. In this case both the Chi-squared
statistics and the $p$-values will change, because the Wald test statistics do not correspond to
a difference in deviances between a full and a reduced model. We prefer likelihood ratio tests
over Wald tests unless data sets are so large that obtaining the computationally more involved
likelihood ratio test statistic is prohibitive.

Output 6.2.
The GENMOD Procedure

Model Information

Data Set WORK.GERMRATE


Distribution Binomial
Link Function Logit
Response Variable (Events) germ
Response Variable (Trials) trials
Observations Used 64
Number Of Events 1171
Number Of Trials 3200

Class Level Information


Class Levels Values

temp 4 1 2 3 4
conc 4 0 0.1 1 10

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 48 55.6412 1.1592


Scaled Deviance 48 55.6412 1.1592
Pearson Chi-Square 48 53.9545 1.1241
Scaled Pearson X2 48 53.9545 1.1241
Log Likelihood -1532.5458

Algorithm converged.

LR Statistics For Type 1 Analysis


Chi-
Source Deviance DF Square Pr > ChiSq

Intercept 1193.8014
temp 430.1139 3 763.69 <.0001
conc 148.1055 3 282.01 <.0001
temp*conc 55.6412 9 92.46 <.0001

LR Statistics For Type 3 Analysis


Chi-
Source DF Square Pr > ChiSq

temp 3 804.24 <.0001


conc 3 198.78 <.0001
temp*conc 9 92.46 <.0001


6.4.5 Generalized $R^2$ Measures of Goodness-of-Fit

In the nonlinear regression models of Chapter 5 we faced the problem of determining an
$R^2$-type summary measure that expresses the degree to which the model and data agree.
The rationale of $R^2$-type measures is to express the degree of variation in the data that is
explained or unexplained by a particular model. That has led to the Pseudo-$R^2$ measure
suggested for nonlinear models

$$\text{Pseudo-}R^2 = 1 - \frac{SSR}{SST_m}, \qquad [6.45]$$

where $SST_m = \sum_{i=1}^{n}(y_i - \bar{y})^2$ is the total sum of squares corrected for the mean and
$SSR = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ is the residual (error) sum of squares. Even in the absence of
an intercept in the model, $SST_m$ is the correct denominator, since it is the sample mean
$\bar{y}$ that would be used to predict $y$ if the response were unrelated to the covariates in the
model. The ratio $SSR/SST_m$ can be interpreted as the proportion of variation unexplained
by the model. Generalized linear models are also nonlinear models (unless the link function is
the identity link) and a goodness-of-fit measure akin to [6.45] seems reasonable to measure
model-data agreement. Instead of sums of squares, however, the measure should rest on
deviances. Since for Gaussian data with identity link the deviance is an error sum of squares
(Table 6.10), the measure should in this case also reduce to the standard $R^2$ measure of
linear models. It thus seems natural to build a goodness-of-fit measure that involves the
deviance of the fitted model and compares it to the deviance of a null model not containing
any explanatory variables. Since differences of scaled deviances are also differences in log
likelihoods between full and reduced models, we can use

$$\ell(\hat{\mu}_f, \phi; y) - \ell(\hat{\mu}_0, \phi; y),$$

where $\ell(\hat{\mu}_f, \phi; y)$ is the log likelihood of the fitted model and $\ell(\hat{\mu}_0, \phi; y)$ is the log
likelihood of the model containing only an intercept (the null model). For binary response
models a generalized $R^2$ measure was suggested by Maddala (1983). Nagelkerke (1991)
points out that it was also proposed for any model fit by the maximum likelihood principle by
Cox and Snell (1989, pp. 208-209) and Magee (1990), apparently independently:

$$-\ln\bigl\{1 - R^2_*\bigr\} = \frac{2}{n}\bigl\{\ell(\hat{\mu}_f, \phi; y) - \ell(\hat{\mu}_0, \phi; y)\bigr\}$$

$$R^2_* = 1 - \exp\Bigl\{-\frac{2}{n}\bigl(\ell(\hat{\mu}_f, \phi; y) - \ell(\hat{\mu}_0, \phi; y)\bigr)\Bigr\}. \qquad [6.46]$$

Nagelkerke (1991) discusses that this generalized $R^2$ measure has several appealing
properties. If the covariates in the fitted model have no explanatory power, the log likelihood
of the fitted model, $\ell(\hat{\mu}_f, \phi; y)$, will be close to the likelihood of the null model and
$R^2_*$ approaches zero. $R^2_*$ has a direct interpretation in terms of explained variation in the
sense that it partitions the contributions of covariates in nested models. But unlike the $R^2$
measure in linear models, [6.46] is not bounded by 1 from above. Its maximum value is

$$\max\bigl\{R^2_*\bigr\} = 1 - \exp\Bigl\{\frac{2}{n}\ell(\hat{\mu}_0, \phi; y)\Bigr\}. \qquad [6.47]$$

Nagelkerke (1991) thus recommends scaling $R^2_*$ and using the measure

$$\bar{R}^2 = \frac{R^2_*}{\max\{R^2_*\}} \qquad [6.48]$$

instead, which is bounded between 0 and 1 and is referred to as the rescaled generalized
$R^2$. The logistic procedure of The SAS® System calculates both generalized $R^2$ measures
if requested with the rsquare option of the model statement.
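
Both measures can also be reproduced by hand from the log likelihoods of the null and fitted
models. A minimal sketch follows; the sample size and the −2 Log L values (of the kind proc
logistic reports) are assumed for illustration:

data genr2;
   n  = 64;                     /* sample size (assumed)                  */
   l0 = -129.667/2;             /* log likelihood, intercept-only model   */
   lf = -57.124/2;              /* log likelihood, fitted model           */
   r2     = 1 - exp(-(2/n)*(lf - l0));   /* generalized R2, [6.46]        */
   r2max  = 1 - exp((2/n)*l0);           /* maximum attainable, [6.47]    */
   r2resc = r2/r2max;                    /* rescaled generalized R2, [6.48] */
run;
proc print data=genr2; run;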

6.5 Modeling an Ordinal Response


Box 6.5 The Proportional Odds Model

• The proportional odds model (POM) is a statistical model for ordered


responses developed by McCullagh (1980). It belongs to the family of
cumulative link models and is not a generalized linear model in the narrow
sense.

• The POM can be thought of as a series of logistic curves and in the two-
category case reduces to logistic regression.

• The POM can be fit to data with the genmod procedure of The SAS®
System that enables statistical inference very much akin to what
practitioners expect from an ANOVA-based package.

Ordinal responses arise frequently in the study of soil and plant data. An ordinal (or ordered)
response is a categorical variable whose values are related in a greater/lesser sense. The
assessment of turf quality in nine categories from best to worst results in an ordered response
variable as does the grouping of annual salaries in income categories. The difference between
the two types of ordered responses is that salary categories stem from categorizing an under-
lying (latent) continuous variable. Anderson (1984) terms this a grouped ordering. The
assignment to a category can be made without error and different interpreters will assign
salaries to the same income categories provided they use the same grouping. Assessed
orderings, on the contrary, involve a more complex process of determining the outcome of an
observation. A turf scientist rating the quality of a piece of turf combines information about
the time of day, the brightness of the sun, the expectation for the particular grass species, past
experience, and the disease and management history of the experimental area. The final
assessment of turf quality is a complex aggregate and compilation of these various factors. As
a result, there will be variability in the ratings among different interpreters that complicates
the analysis of such data. The development of clear-cut rules for category assignment helps to
reduce the interrater variability in assessed orderings, but some room for interpretation of
these rules invariably remains.
In this section we are concerned with fitting statistical models to ordinal responses in
general and side-step the issue of rater agreement. Log-linear models for contingency tables
can be used to describe and infer the degree to which interpreters of the same material rate


independently or interact. An application of log-linear modeling of rater association and


agreement can be found in §6.7.6. Applications of modeling ordered outcomes with the
methods presented in this subsection appear in §6.7.4 and §6.7.5.
The categories of an ordinal variable are frequently coded with numerical values. In the
turf sciences, for example, it is customary to rate plant quality or color on a scale between 1
(worst case) and 9 (best case) with integer or even half steps in between. Plant injury is often
assessed in 10% (or coarser) categories. Farmers report on a questionnaire whether they
perform low, medium, or high input whole-field management and the responses are coded as
1, 2, and 3 in the data file. The obvious temptation is to treat such ordinal outcomes as if they
represent measurements of continuous variables. Rating the plant quality of three replications
of a particular growth regulator application as 4, 5, and 8 then naturally leads to estimates of
the mean quality such as $(1/3)(4 + 5 + 8) = 5.67$. If the category values are coded with
letters $a, b, c, d, \ldots$ instead, it is obvious that the calculation of an average rating as
$(1/3)(d + e + h)$ is meaningless. Operations like addition or subtraction require that
distances between the category values be well-defined. In particular for assessed orderings,
this is hardly the case. The difference between a low and a medium response is not the same
as that between a medium and a high response, even if the levels are coded 1, 2, and 3. Using
numerical values for ordered categories is a mere labeling convenience that does not alter the
essential feature of the response as ordinal. The scoring system 1, 2, 3, 4 implies the same
ordering as 1, 20, 30, 31 but will lead to different numerical results. Standard analysis of
variance followed by standard hypothesis testing procedures requires continuous, Gaussian,
univariate, and independent responses. Ordinal response variables violate all of these
assumptions. They are categorical rather than continuous, multinomial rather than
Gaussian-distributed, multivariate rather than univariate, and not independent. If ten
responses were observed in three categories, four of which fell into the third category, there
are only six responses to be distributed among the remaining categories. Given the responses
in the third category, the probability that any of the remaining observations will fall into the
first category has changed; hence the counts are not independent. By assigning numerical
values to the categories and using standard analysis of variance or regression methods, the
user declares the underlying assumptions as immaterial or the procedure as sufficiently
robust against their violation.
Snedecor and Cochran (1989, 8th ed., pp. 206-208) conclude that standard analyses such
as ANOVA may be appropriate for ordered outcomes if “the [...] classes constructed [...] re-
present equal gradations on a continuous scale.” Unless a latent variable can be identified,
verification of this key assumption is impossible and even then may not be tenable. In rating
of color, for example, one could construct a latent continuous color variable as a function of
red, green, and blue intensity, hue, saturation, and lightness and view the color rating as its
categorization in equally spaced intervals. But assessed color rating categories do not
necessarily represent equal gradations of this process, even if it can be constructed. In the
case of assessed orderings where a latent variable may not exist at all, the assumption of equal
gradations on a continuous scale is most questionable. Even if it holds, the ordered response
is nevertheless discrete and multivariate rather than continuous and univariate.
Instead of appealing to the restrictive conditions under which analysis of variance might
be appropriate, we prefer statistical methods that have been specifically developed for ordinal
data that take into account the distributional properties of ordered responses, and perhaps
most importantly, that do not depend on the actual scoring system in use. The model we rely
on most heavily is McCullagh's proportional odds model (POM), which belongs to the class


of cumulative link models (McCullagh 1980, McCullagh 1984, McCullagh and Nelder 1989).
It is not a bona fide generalized linear model, but it is very closely related to logistic
regression models, which justifies its discussion in this chapter (the correspondence of these
models to GLMs can be made more precise by using composite link functions; see, e.g.,
Thompson and Baker 1981). For only two ordered categories the POM reduces to a standard
GLM for Bernoulli or Binomial outcomes, because the Binomial distribution is a special case
of the multinomial distribution, where one counts the number of outcomes out of $N$
independent Bernoulli experiments that fall into one of $J$ categories. Like other generalized
linear models, cumulative link models apply a link function to map the parameter of interest
onto a scale where effects are linear. Unlike the models for Bernoulli or Binomial data, the
link function is not applied to the probability that the response takes on a certain value, but
to the cumulative probability that the response occurs in a particular category or below. It is
this ingenious construction from which essential simplifications arise. Our focus on the
proportional odds model is motivated not only by its elegant formulation, convenient
mathematics, and straightforward interpretation. It can furthermore be fitted easily with the
logistic and genmod procedures of The SAS® System and is readily available to those
familiar with fitting generalized linear models.

6.5.1 Cumulative Link Models

Assume there exists a continuous variable $X$ with probability density function $f(x)$, and
that the support of $X$ is divided into categories by a series of cutoff parameters $\alpha_j$; this
establishes a grouped ordering in the sense of Anderson (1984). Rather than the latent
variable $X$ we observe the response $Y$ and assign it the value $j$ whenever $X$ falls between
the cutoffs $\alpha_{j-1}$ and $\alpha_j$ (Figure 6.9). If the cutoffs are ordered in the sense that
$\alpha_{j-1} < \alpha_j$, the response $Y$ is ordinal. Cumulative link models do not require that a
latent continuous variable actually exists, or that the ordering is grouped (rather than
assessed). They are simply motivated most easily in the case where $Y$ is a grouping of a
continuous, unobserved variable $X$.

The distribution of the latent variable in Figure 6.9 is $N(4, 1)$ and cutoffs were placed at
$\alpha_1 = 2.3$, $\alpha_2 = 3.7$, and $\alpha_3 = 5.8$. These cutoffs define a four-category ordinal
response $Y$. The probability of observing $Y = 1$, for example, is obtained as the difference

$$\Pr(X < 2.3) - \Pr(X < -\infty) = \Pr(Z < -1.7) - 0 = 0.045,$$

where $Z$ is a standard Gaussian random variable. Similarly, the probability of observing $Y$
in at most category 2 is

$$\Pr(Y \le 2) = \Pr(X \le 3.7) = \Pr(Z \le -0.3) = 0.382.$$

For a given distribution of the latent variable, the placement of the cutoff parameters
determines the probabilities with which the ordinal variable $Y$ is observed. When fitting a
cumulative link model to data, these parameters are estimated along with the effects of
covariates and experimental factors. Notice that the number of cutoff parameters that need to
be estimated is one less than the number of ordered categories.

To motivate an application, consider an experiment conducted in a completely
randomized design. If $\tau_i$ denotes the effect of the $i$th treatment and $k$ indexes the
replications, we put

$$X_{ik} = \mu + \tau_i + e_{ik}$$

as the model for the latent variable $X$ observed for replicate $k$ of treatment $i$. The
probability that $Y_{ik}$, the ordinal outcome for replicate $k$ of treatment $i$, is at most in
category $j$ is then determined by the distribution of the experimental errors $e_{ik}$ as

$$\Pr(Y_{ik} \le j) = \Pr(X_{ik} \le \alpha_j) = \Pr(e_{ik} \le \alpha_j - \mu - \tau_i) = \Pr(e_{ik} \le \alpha^*_j - \tau_i).$$

[Figure 6.9: density $f(x)$ of the latent variable $X$ with the cutoffs $\alpha_1 = 2.3$,
$\alpha_2 = 3.7$, and $\alpha_3 = 5.8$ marked on the abscissa; the areas under the density
correspond to $\Pr(Y = 1) = 0.045$, $\Pr(Y = 2) = 0.338$, $\Pr(Y = 3) = 0.582$, and
$\Pr(Y = 4) = 0.035$.]

Figure 6.9. Relationship between the latent variable $X \sim N(4, 1)$ and an ordinal outcome
$Y$. The probability of observing a particular ordered value depends on the distribution of
the latent variable and the spacing of the cutoff parameters $\alpha_j$; $\alpha_0 = -\infty$, $\alpha_4 = \infty$.
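
The probabilities annotated in Figure 6.9 can be reproduced (up to rounding) with the
Gaussian cumulative distribution function; a minimal sketch:

data fig69;
   mu = 4; sd = 1;                          /* latent X ~ N(4,1)          */
   p1 = probnorm((2.3 - mu)/sd);            /* Pr(Y=1) = 0.045            */
   p2 = probnorm((3.7 - mu)/sd) - p1;       /* Pr(Y=2) = 0.338            */
   p3 = probnorm((5.8 - mu)/sd) - p1 - p2;  /* Pr(Y=3) = 0.582            */
   p4 = 1 - probnorm((5.8 - mu)/sd);        /* Pr(Y=4) = 0.036            */
run;
proc print data=fig69; run;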

The cutoff parameters $\alpha_j$ and the grand mean $\mu$ have been combined into a new cutoff
$\alpha^*_j = \alpha_j - \mu$ in the last equation. The probability that the ordinal outcome for
replicate $k$ of treatment $i$ is at most in category $j$ is a cumulative probability, denoted
$\gamma_{ikj}$.

Choosing a probability distribution for the experimental errors is as easy or difficult as in
a standard analysis. The most common choices are to assume that the errors follow a Logistic
distribution, $\Pr(e \le t) = 1/(1 + e^{-t})$, or a Gaussian distribution. The Logistic error
model leads to a model with a logit link function; the Gaussian model results in a probit link
function. With a logit link function we are led to

$$\text{logit}(\Pr(Y_{ik} \le j)) = \text{logit}(\gamma_{ikj}) = \ln\biggl\{\frac{\Pr(Y_{ik} \le j)}{\Pr(Y_{ik} > j)}\biggr\} = \alpha^*_j - \tau_i. \qquad [6.49]$$

The term cumulative link model is now apparent, since the link is applied to the cumulative
probabilities $\gamma_{ikj}$. This model was first described by McCullagh (1980, 1984) and
termed the proportional odds model (see also McCullagh and Nelder, 1989, §5.2.2). The
name stems
from the fact that $\gamma_{ikj}$ is a measure of cumulative odds (Agresti 1990, p. 322), and hence
the logarithm of the cumulative odds ratio for two treatments is (proportional to) the
treatment difference

$$\text{logit}(\gamma_{ikj}) - \text{logit}(\gamma_{i'kj}) = \tau_{i'} - \tau_i.$$

In a regression example where $\text{logit}(\gamma_{ij}) = \alpha_j - \beta x_i$, the cutoff parameters serve as
separate intercepts on the logit scale. The slope $\beta$ measures the change in the cumulative
logit if the regressor $x$ changes by one unit. The change in cumulative logits between $x_i$
and $x_{i'}$ is

$$\text{logit}(\gamma_{ij}) - \text{logit}(\gamma_{i'j}) = \beta(x_{i'} - x_i)$$

and proportional to the difference in the regressors. Notice that this effect of the regressors or
treatments on the logit scale does not depend on $j$; it is the same for all categories.

By inverting the logit transform, the probability of observing at most category $j$ for
replicate $k$ of treatment $i$ is easily calculated as

$$\Pr(Y_{ik} \le j) = \frac{1}{1 + \exp\{-\alpha^*_j + \tau_i\}},$$

and category probabilities are obtained from differences:

$$\pi_{ikj} = \Pr(Y_{ik} = j) = \begin{cases} \Pr(Y_{ik} \le 1) = \gamma_{ik1} & j = 1 \\ \Pr(Y_{ik} \le j) - \Pr(Y_{ik} \le j-1) = \gamma_{ikj} - \gamma_{ik,j-1} & 1 < j < J \\ 1 - \Pr(Y_{ik} \le J-1) = 1 - \gamma_{ik,J-1} & j = J. \end{cases}$$

Here, $\pi_{ikj}$ is the probability that an outcome for replicate $k$ of treatment $i$ falls into
category $j$. Notice that the probability of falling into the last category ($J$) is obtained by
subtracting from 1 the cumulative probability of falling into the previous category. In fitting
the proportional odds model to data this last probability is obtained automatically, which is
the reason why only $J - 1$ cutoff parameters are needed to model an ordered response with
$J$ categories.
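
Given fitted cutoffs and a treatment effect, the category probabilities follow by differencing
the inverse-linked cumulative probabilities. A sketch for three categories follows; all
parameter values are assumed for illustration:

data catprob;
   a1 = -1.0; a2 = 1.5;             /* cutoffs alpha*_1 < alpha*_2 (assumed) */
   tau = 0.4;                       /* treatment effect (assumed)            */
   g1 = 1/(1 + exp(-(a1 - tau)));   /* Pr(Y <= 1), inverted [6.49]           */
   g2 = 1/(1 + exp(-(a2 - tau)));   /* Pr(Y <= 2)                            */
   p1 = g1;                         /* category 1                            */
   p2 = g2 - g1;                    /* category 2                            */
   p3 = 1 - g2;                     /* category 3 (last category)            */
run;
proc print data=catprob; run;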
The proportional odds model has several important features. It allows one to model
ordinal data independently of the scoring system in use. Whether categories are labeled
$a, b, c, \ldots$ or $1, 2, 3, \ldots$ or $1, 20, 34, \ldots$ or mild, medium, heavy, $\ldots$, the
analysis will be the same. Users are typically more interested in the probabilities that an
outcome is in a certain category than in cumulative probabilities. The former are easily
obtained from the cumulative probabilities by taking differences. The proportional odds
model is furthermore invariant under category amalgamations. If the model applies to an
ordered outcome with $J$ categories, it also applies if neighboring categories are combined
into a new response with $J^* < J$ categories (McCullagh 1980, Greenwood and Farewell
1988). This is an important property, since ratings may be collected on a scale finer than that
eventually used in the analysis, and parameter interpretation should not depend on the
number of categories. The development of the proportional odds model was motivated by the
existence of a latent variable. The existence of such a latent variable is not a requirement for
the validity of the model, and it can be used for grouped and assessed orderings alike
(McCullagh and Nelder 1989, p. 154; Schabenberger 1995).


Other statistical models for ordinal data have been developed. Fienberg's continuation
ratio model (Fienberg 1980) models logits of the conditional probabilities of observing
category $j$, given that the observation was at least in category $j$, instead of cumulative
probabilities (see also Cox 1988, Engel 1988). Continuation ratio models are based on
factoring marginal probabilities into a series of conditional probabilities, and standard GLMs
for binomial outcomes (Nelder and Wedderburn 1972, McCullagh and Nelder 1989) can be
applied to the terms in the factorization separately. A disadvantage is that the factorization is
not unique. Agresti (1990, p. 318) discusses adjacent-category logits, where probabilities are
modeled relative to a baseline category. Läärä and Matthews (1985) establish an equivalence
between continuation ratio and cumulative models if a complementary log-log instead of a
logit transform is applied. Studying a biomedical example, Greenwood and Farewell (1988)
found that the proportional odds and the continuation ratio models led to the same
conclusions regarding the significance of effects.

Cumulative link models are fit by maximum likelihood, and estimates are derived by
iteratively reweighted least squares as for generalized linear models. The testing of
hypotheses proceeds along similar lines as discussed in §6.4.2.

6.5.2 Software Implementation and Example


The proportional odds model can be fit to data in various statistical packages. The SAS®
System fits cumulative logit models in proc catmod, proc logistic, and proc genmod
(starting with Release 7.0). The logistic procedure is specifically designed for the propor-
tional odds model. Whenever the response variable has more than two categories, the proce-
dure defaults to the proportional odds model; otherwise it defaults to a logistic regression
model for Bernoulli or Binomial responses. Prior to Release 8.0, the logistic procedure did
not accommodate classification variables in a class statement (see §6.2.3 for details). Users
of Release 7.0 can access the experimental procedure proc tlogistic to fit the proportional
odds model with classification variables (treatments, etc.). The genmod procedure has been
extended in Release 7.0 to fit cumulative logit and other multinomial models. It allows the
use of classification variables and performs likelihood ratio and Wald tests. Although proc
genmod enables the generalized estimating equation (GEE) approach of Liang and Zeger
(1986) and Zeger and Liang (1986), only an independence working correlation matrix is
permissible for ordinal responses. The proportional odds model is also implemented in the
Minitab® package (module OLOGISTIC).
The software implementation and the basic calculations in the proportional odds model
are now demonstrated with hypothetical data from an experimental design. Assume four
treatments are applied in a completely randomized design with four replications, and ordinal
ratings (poor, average, good) from each experimental unit are obtained at four occasions
($d = 1, 2, 3, 4$). The data are shown in Table 6.13. For example, at occasion $d = 1$ all four
replicates of treatment $A$ were rated in the poor category, and at occasion $d = 2$ two
replicates were rated poor and two replicates were rated average.

The model fit to these data is a logistic model for the cumulative probabilities that
contains a classification effect for the treatment variable and a continuous covariate for the
time effect:

$$\text{logit}\{\gamma_{ikj}\} = \alpha^*_j + \tau_i + \beta t_{ik}. \qquad [6.50]$$

Here, $t_{ik}$ is the time point at which replicate $k$ of treatment $i$ was observed
($i = 1, \ldots, 4$; $k = 1, \ldots, 4$; $j = 1, 2$).

Table 6.13. Observed category frequencies for four treatments at four dates
(shown are the total counts across four replicates at each occasion)

Category          $A$ ($i=1$)   $B$ ($i=2$)   $C$ ($i=3$)   $D$ ($i=4$)   $\Sigma$
poor ($j=1$)      4, 2, 4, 4    4, 3, 4, 4    0, 0, 0, 0    1, 0, 0, 0    9, 5, 8, 8
average ($j=2$)   0, 2, 0, 0    0, 1, 0, 0    1, 0, 4, 4    2, 2, 4, 4    3, 5, 8, 8
good              0, 0, 0, 0    0, 0, 0, 0    3, 4, 0, 0    1, 2, 0, 0    4, 6, 0, 0

The SAS® data step and proc logistic statements are:


data ordexample;
input tx $ time rep rating $;
datalines;
A 1 1 poor
A 1 2 poor
A 1 3 poor
A 1 4 poor
A 2 1 poor
A 2 2 poor
A 2 3 average
A 2 4 average
A 3 1 poor
A 3 2 poor
A 3 3 poor
A 3 4 poor
A 4 1 poor

... and so forth ...

;
run;

proc logistic data=ordexample order=data;
   class tx / param=glm;
   model rating = tx time / link=logit rsquare covb;
run;

The order=data option of proc logistic ensures that the categories of the response
variable rating are internally ordered as they appear in the data set, that is, poor before
average before good. If the option were omitted, proc logistic would sort the levels
alphabetically, implying the category order average before good before poor, which is not
the correct ordination. The param=glm option of the class statement asks proc logistic to
code the classification variable in the same way as proc glm. This means that a separate
effect for the last level of the treatment variable tx is not estimated; the constraint
$\tau_4 = 0$ is imposed, $\tau_4$ is absorbed into the cutoff parameters, and the estimates for
the other treatment effects reported by proc logistic represent differences from $\tau_4$.
One should always study the Response Profile table on the procedure output to make
sure that the category ordering used by proc logistic (and influenced by the order= option)


agrees with the intended ordering. The Class Level Information Table shows the levels
of all variables listed in the class statement as well as their coding in the design matrix
(Output 6.3).
The Score Test for the Proportional Odds Assumption is a test of the assumption
that changes in cumulative logits are proportional to changes in the explanatory variables.
Two models are compared to calculate this test: a full model in which the slopes and
gradients vary by category, and a reduced model in which the slopes are the same across
categories. Rather than actually fitting the two models, proc logistic performs a score test
that requires only the reduced model to be fit (see §A6.8.4). The reduced model in this case
is the proportional odds model, and rejecting the test leads to the conclusion that it is not
appropriate for these data. In this example the score test cannot be rejected; the $p$-value of
0.1538 is sufficiently large not to call the proportionality assumption into doubt (Output 6.3).

Output 6.3. The LOGISTIC Procedure

Model Information
Data Set WORK.ORDEXAMPLE
Response Variable rating
Number of Response Levels 3
Number of Observations 64
Link Function Logit
Optimization Technique Fisher's scoring

Response Profile
Ordered Total
Value rating Frequency
1 poor 30
2 average 24
3 good 10

Class Level Information


Design Variables
Class Value 1 2 3 4
tx A 1 0 0 0
B 0 1 0 0
C 0 0 1 0
D 0 0 0 1

Score Test for the Proportional Odds Assumption

Chi-Square DF Pr > ChiSq


6.6808 4 0.1538

Model Fit Statistics


Intercept
Intercept and
Criterion Only Covariates
AIC 133.667 69.124
SC 137.985 82.077
-2 Log L 129.667 57.124

R-Square 0.6781 Max-rescaled R-Square 0.7811

Type III Analysis of Effects


Wald
Effect DF Chi-Square Pr > ChiSq
tx 3 24.5505 <.0001
time 1 6.2217 0.0126


Output 6.3 (continued).


Analysis of Maximum Likelihood Estimates
Standard
Parameter DF Estimate Error Chi-Square Pr > ChiSq

Intercept 1 -5.6150 1.5459 13.1920 0.0003


Intercept2 1 -0.4865 0.9483 0.2632 0.6079
tx A 1 5.7068 1.4198 16.1557 <.0001
tx B 1 6.5190 1.5957 16.6907 <.0001
tx C 1 -1.4469 0.8560 2.8571 0.0910
tx D 0 0 . . .
time 1 0.8773 0.3517 6.2217 0.0126

The rsquare option of the model statement in proc logistic requests the two
generalized $R^2$ measures discussed in §6.4.5. Denoted as R-Square is the generalized
measure of Cox and Snell (1989, pp. 208-209) and Magee (1990), and Max-rescaled
R-Square denotes the generalized measure by Nagelkerke (1991) that ranges between 0 and
1. The log likelihoods for the null and fitted models are
$\ell(\hat{\mu}_0, \phi; y) = -129.667/2 = -64.8335$ and $\ell(\hat{\mu}_f, \phi; y) = -28.562$,
respectively, so that

$$R^2_* = 1 - \exp\Bigl\{-\frac{2}{64}(-28.562 + 64.8335)\Bigr\} = 0.6781.$$

The Analysis of Maximum Likelihood Estimates table shows the parameter estimates
and their standard errors, as well as Chi-squared tests of each parameter against zero.
Notice that the estimate for $\tau_4$ is shown as 0 with no standard error, since this effect is
absorbed into the cutoffs. The cutoff parameters labeled Intercept and Intercept2 thus are
estimates of

$$\alpha^*_1 + \tau_4 \qquad \text{and} \qquad \alpha^*_2 + \tau_4,$$

and the parameters shown as tx A, tx B, and tx C correspond to $\delta_1 = \tau_1 - \tau_4$,
$\delta_2 = \tau_2 - \tau_4$, and $\delta_3 = \tau_3 - \tau_4$, respectively. To estimate the probability of
observing at most an average rating for treatment $B$ at time $t = 1$, for example, calculate

$$\hat{\alpha}^*_2 + \hat{\tau}_4 + \hat{\delta}_2 + \hat{\beta} \times 1 = -0.4865 + 6.5190 + 0.8773 \times 1 = 6.909$$

$$\Pr(Y \le 2 \text{ at time } 1) = \frac{1}{1 + \exp\{-6.909\}} = 0.999.$$

Similarly, for the probability of at most a poor rating at time $t = 1$,

$$\hat{\alpha}^*_1 + \hat{\tau}_4 + \hat{\delta}_2 + \hat{\beta} \times 1 = -5.6150 + 6.5190 + 0.8773 \times 1 = 1.7813$$

$$\Pr(Y \le 1 \text{ at time } 1) = \frac{1}{1 + \exp\{-1.7813\}} = 0.856.$$

For an experimental unit receiving treatment 2, there is an 85.6% chance of receiving a poor
rating and only a $99.9\% - 85.6\% = 14.3\%$ chance of receiving an average rating at the first
time point (Table 6.14). Each block of three numbers in Table 6.14 is an estimate of the
multinomial distribution for a given treatment at a particular time point. A graph of the linear
predictors shows the linearity of the model in treatment effects and time on the logit scale,
and the proportionality assumption, which results in parallel lines on that scale (Figure 6.10).
Inverting the logit transform to calculate cumulative probabilities from the linear predictors
shows the nonlinear dependence of the probabilities on treatments and the time covariate
(Figure 6.11).
To compare the treatments at a given time point, we formulate linear combinations of the
cumulative logits, which lead to linear combinations of the parameters. For example,
comparing treatments $A$ and $B$ at time $t = 1$, the linear combination is

$$\text{logit}\{\gamma_{1j1}\} - \text{logit}\{\gamma_{2j1}\} = \alpha^*_j + \tau_1 + \beta - (\alpha^*_j + \tau_2 + \beta) = \tau_1 - \tau_2.$$

The cutoff parameters have no effect on this comparison; the treatment difference has the
same magnitude regardless of the category (Figure 6.10). In terms of the quantities proc
logistic estimates, the contrast is identical to $\delta_1 - \delta_2$. The variance-covariance
matrix (obtained with the covb option of the model statement in proc logistic; output not
given) is shown in Table 6.15. For example, the standard error for $\hat{\alpha}^*_1 + \hat{\tau}_4$ is
$\sqrt{2.389} = 1.545$, as appears on the output in the Analysis of Maximum Likelihood
Estimates table.

Table 6.14. Predicted category probabilities by treatment and occasion

                     Treatment
        Category     $A$      $B$      $C$      $D$
Time 1  Poor         0.725    0.856    0.002    0.008
        Average      0.273    0.143    0.256    0.588
        Good         0.002    0.001    0.742    0.404
Time 2  Poor         0.864    0.935    0.005    0.021
        Average      0.135    0.065    0.450    0.760
        Good         0.001    0.000    0.545    0.219
Time 3  Poor         0.939    0.972    0.012    0.048
        Average      0.061    0.028    0.656    0.847
        Good         0.000    0.000    0.332    0.105
Time 4  Poor         0.973    0.988    0.028    0.109
        Average      0.027    0.012    0.801    0.845
        Good         0.000    0.000    0.171    0.046

Table 6.15. Estimated variance-covariance matrix of the parameter estimates as obtained
from proc logistic (covb option of the model statement)

                       $\alpha^*_1+\tau_4$  $\alpha^*_2+\tau_4$  $\delta_1$  $\delta_2$  $\delta_3$  $\beta$
$\alpha^*_1+\tau_4$     2.389    0.891   -1.696   -1.735   -0.004   -0.389
$\alpha^*_2+\tau_4$     0.891    0.899   -0.462   -0.486   -0.313   -0.241
$\delta_1$             -1.696   -0.462    2.015    1.410    0.099    0.169
$\delta_2$             -1.735   -0.486    1.410    2.546    0.094    0.182
$\delta_3$             -0.004   -0.313    0.099    0.094    0.733   -0.053
$\beta$                -0.389   -0.241    0.169    0.182   -0.053    0.124


[Figure 6.10: linear predictors (logit scale, vertical axis) plotted against time $d$
(horizontal axis, 1 to 4) for treatments $A$, $B$, and $C$, with one line per category
($j = 1, 2$); the six lines are parallel on the logit scale.]

Figure 6.10. Linear predictors for treatments $A$, $B$, and $C$. The vertical difference
between the lines for categories $j = 1$ and $j = 2$ is constant for all time points and the
same ($5.6150 - 0.4865 = 5.1285$) for all treatments.

[Figure 6.11: predicted cumulative probabilities $\Pr(Y \le j)$ (vertical axis, 0 to 1)
plotted against time $d$ (horizontal axis, 1 to 4) for treatments $A$, $B$, and $C$ and
categories $j = 1, 2$.]

Figure 6.11. Predicted cumulative probabilities for treatments $A$, $B$, and $C$. Because
the line for $A$, $j = 2$ lies completely above the line for $A$, $j = 1$ in Figure 6.10, the
cumulative probability of observing a response in at most category 2 is greater than the
cumulative probability of observing a response in at most category 1. Since the cumulative
probabilities are ordered, the category probabilities are guaranteed to be non-negative.

The estimated variance of the linear combination $\hat{\delta}_1 - \hat{\delta}_2$ is thus

$$\text{Var}\bigl[\hat{\delta}_1\bigr] + \text{Var}\bigl[\hat{\delta}_2\bigr] - 2\,\text{Cov}\bigl[\hat{\delta}_1, \hat{\delta}_2\bigr] = 2.015 + 2.546 - 2 \times 1.410 = 1.741$$

and the Wald test statistic becomes

$$W = \frac{(5.7068 - 6.5190)^2}{1.741} = 0.378.$$

From a Chi-squared distribution with one degree of freedom the $p$-value 0.538 is obtained.
This test can be performed in proc logistic by adding the statement

contrast 'A vs. B' tx 1 -1 0 0;

to the proc logistic code above. The test for the treatment main effect can be performed
with a set of orthogonal contrasts, for example

$$\hat{c} = \begin{bmatrix} \hat{\tau}_1 - \hat{\tau}_2 \\ \hat{\tau}_1 + \hat{\tau}_2 - 2\hat{\tau}_3 \\ \hat{\tau}_1 + \hat{\tau}_2 + \hat{\tau}_3 - 3\hat{\tau}_4 \end{bmatrix}
= \begin{bmatrix} \hat{\delta}_1 - \hat{\delta}_2 \\ \hat{\delta}_1 + \hat{\delta}_2 - 2\hat{\delta}_3 \\ \hat{\delta}_1 + \hat{\delta}_2 + \hat{\delta}_3 \end{bmatrix}
= \begin{bmatrix} 1 & -1 & 0 \\ 1 & 1 & -2 \\ 1 & 1 & 1 \end{bmatrix}\begin{bmatrix} \hat{\delta}_1 \\ \hat{\delta}_2 \\ \hat{\delta}_3 \end{bmatrix},$$

which has estimated variance

$$\text{Var}\bigl[\hat{c}\bigr] = \begin{bmatrix} 1 & -1 & 0 \\ 1 & 1 & -2 \\ 1 & 1 & 1 \end{bmatrix}
\begin{bmatrix} 2.015 & 1.410 & 0.099 \\ 1.410 & 2.546 & 0.094 \\ 0.099 & 0.094 & 0.733 \end{bmatrix}
\begin{bmatrix} 1 & 1 & 1 \\ -1 & 1 & 1 \\ 0 & -2 & 1 \end{bmatrix}
= \begin{bmatrix} 1.741 & -0.541 & -0.526 \\ -0.541 & 9.541 & 5.722 \\ -0.526 & 5.722 & 8.500 \end{bmatrix}.$$

The Wald statistic for the treatment main effect ($H_0\colon c = 0$) becomes

$$\hat{c}'\,\text{Var}\bigl[\hat{c}\bigr]^{-1}\hat{c} = [-0.812,\; 15.119,\; 10.779]
\begin{bmatrix} 0.587 & 0.019 & 0.023 \\ 0.019 & 0.176 & -0.118 \\ 0.023 & -0.118 & 0.198 \end{bmatrix}
\begin{bmatrix} -0.812 \\ 15.119 \\ 10.779 \end{bmatrix} = 24.55.$$
This test is shown on the proc logistic output in the Type III Analysis of Effects table.
The same analysis can be obtained in proc genmod. The statements, including all pairwise
treatment comparisons, follow.

proc genmod data=ordexample rorder=data;
   class tx;
   model rating = tx time / link=clogit dist=multinomial type3;
   contrast 'A vs. B' tx 1 -1 0 0;
   contrast 'A vs. C' tx 1 0 -1 0;
   contrast 'A vs. D' tx 1 0 0 -1;
   contrast 'B vs. C' tx 0 1 -1 0;
   contrast 'B vs. D' tx 0 1 0 -1;
   contrast 'C vs. D' tx 0 0 1 -1;
run;

More complicated proportional odds models can be fit easily with these procedures. On
occasion one may encounter a warning message regarding the separability of data points and
a possibly questionable model fit. This can occur, for example, when one treatment's
responses are all in the same category, since then there is no variability among the replicates.


This phenomenon is more likely for small data sets and applications where many classifica-
tion variables are involved in particular interactions. There are several possibilities to correct
this: amalgamate adjacent categories to reduce the number of categories; fit main effects and
low-order interactions only and exclude high-order interactions; include effects as continuous
covariates rather than as classification variables when they relate to some underlying
continuous metric such as rates of application or times of measurements.
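
As a sketch of the first remedy, adjacent categories can be amalgamated in a data step before
refitting; here the two highest categories of the example above are merged (variable names as
in the ordexample data set):

data collapsed;
   set ordexample;
   if rating = 'good' then rating = 'average';  /* merge adjacent categories */
run;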

6.6 Overdispersion
Box 6.6 Overdispersion

• Overdispersion is the condition by which the variability of the data exceeds


the variability expected under a particular probability distribution.
Gaussian data are never overdispersed, since the mean and variance can be
chosen freely. Overdispersion is an issue for those generalized linear models
where the mean and variance are functionally dependent.

• Overdispersion can arise from choosing the wrong distributional model,


from ignoring important explanatory variables, and from correlations
among the observations.

• A common remedy for overdispersion is to add a multiplicative


overdispersion factor to the variance function. The resulting analysis is no
longer maximum likelihood but quasi-likelihood.

If the variability of a set of data exceeds the variability expected under some reference model
we call it overdispersed (relative to that reference). Counts, for example, may exhibit more
variability than is permissible under a Binomial or Poisson probability model. Overdispersion
is a potential problem in statistical models where the first two moments of the response distri-
bution are linked and means and variances are functionally dependent. In Table 6.2 the scale
parameter $\phi$ is not present for the discrete distributions, and data modeled under these
distributions are potentially overdispersed.
that overdispersion may be the norm in practice, rather than the exception. In part this is due
to the fact that users resort to a small number of probability distributions to model their data.
Almost automatically one is led to the Binomial distribution for count variables with a natural
denominator and to the Poisson distribution for counts without a natural denominator. One
remedy of the overdispersion problem lies in choosing a proper distribution that permits more
variability than these standard models such as the Beta-Binomial in place of the Binomial
model and the Negative Binomial in place of the Poisson model. Overdispersion can also be
caused by an improper choice of covariates and effects to model the data. This effect was
obvious for the seed germination data modeled in §6.4.4. When temperature and/or
concentration effects were omitted the ratio of the deviance and its degrees of freedom
exceeded the benchmark value of one considerably (Table 6.12). Such cases of overdisper-
sion must be addressed by altering the set of effects and covariates, not by postulating a
different probability distribution for the data. In what follows we assume that the mean of the


responses has been modeled correctly, but that the data nevertheless exhibit variability in
excess of our expectation under a certain reference distribution.
Overdispersion is a problem foremost because it affects the estimated precision of the
parameter estimates. In §A6.8.2 it is shown that the scale parameter $\phi$ is of no consequence
in estimating $\beta$ and can be dropped from the estimating equations. The (asymptotic)
variance-covariance matrix of the maximum likelihood estimates is given by
$(F'V^{-1}F)^{-1}$, where $V$ is a diagonal matrix containing the variances
$\text{Var}[Y_i] = h(\mu_i)\phi$. Extracting the scale parameter $\phi$, we can simplify:

$$\text{Var}\bigl[\hat{\beta}\bigr] = \phi\bigl(F'\,\text{Diag}(1/h(\mu_i))\,F\bigr)^{-1}.$$

If under a given probability model the scale parameter $\phi$ is assumed to be 1 but
overdispersion exists ($\text{Var}[Y_i] > h(\mu_i)$), the variability of the estimates is larger than
what is assumed under the model. The precision of the parameter estimates is overstated,
standard error estimates are too small, and as a result test statistics are inflated and
$p$-values are too small. Covariates and effects may be declared significant even when they are not. It is thus
important to account for overdispersion present in the data and numerous approaches have
been developed to that end. The following four categories are sufficiently broad to cover
many overdispersion mechanisms and remedies.
• Extra scale parameters are added to the variance function of the generalized linear
  model. For Binomial data one can assume $\text{Var}[Y] = \phi n\pi(1-\pi)$ instead of the
  nominal variability $\text{Var}[Y] = n\pi(1-\pi)$. If $\phi > 1$ the model is overdispersed
  relative to the Binomial, and if $\phi < 1$ it is underdispersed. Underdispersion is far less
  likely and a far less serious problem in data analysis. For count data one can model
  overdispersion relative to the Poisson($\lambda$) distribution as $\text{Var}[Y] = \lambda\phi$,
  $\text{Var}[Y] = \lambda(1+\phi)/\phi$, or $\text{Var}[Y] = \lambda + \lambda^2/\phi$. These models
  have some stochastic foundation in certain mixing models (see below and §A6.8.6).
  Models with a multiplicative overdispersion parameter such as
  $\text{Var}[Y] = \phi n\pi(1-\pi)$ for Binomial and $\text{Var}[Y] = \phi\lambda$ for Poisson data
  can be handled easily with proc genmod of The SAS® System. The overdispersion
  parameter is then estimated from Pearson or deviance residuals by the method
  discussed in §6.4.3 (pscale and dscale options of the model statement).
• Positive autocorrelation among observations leads to overdispersion in sums and
averages. Let Zᵢ be an arbitrary random variable with mean μ and variance σ² and
assume that the Zᵢ are equicorrelated, Cov[Zᵢ, Zⱼ] = ρσ² (i ≠ j). Assume further that
ρ ≥ 0. We are interested in modeling Y = Σⁿᵢ₌₁ Zᵢ, the sum of the Zᵢ. The mean and
variance of this sum follow from first principles as

E[Y] = nμ

and

Var[Y] = nVar[Zᵢ] + 2 Σᵢ Σ_{j>i} Cov[Zᵢ, Zⱼ] = nσ² + n(n−1)σ²ρ
       = nσ²(1 + (n−1)ρ) ≥ nσ².

If the elements in the sum were uncorrelated, Var[Y] = nσ², but the positive autocor-
relation thus leads to an overdispersed sum, relative to the model of stochastic inde-
pendence. With n = 20 and ρ = 0.1, for example, the variance of the sum is inflated
by the factor 1 + 19(0.1) = 2.9. This type of overdispersion is accounted for by
incorporating the stochastic dependency in the model.
• The parameters of the reference distribution are not assumed to be constant but random
variables, and the reference distribution is reckoned conditionally. We term this the
mixing model approach (not to be confused with the mixed model approach of §7).
As a consequence, the unconditional (marginal) distribution is more dispersed than the
conditional reference distribution. If the marginal distribution is also in the expo-
nential family, this approach enables maximum likelihood estimation in a genuine
generalized linear model. A famous example is the hierarchical model for counts
where the average count λ is a Gamma-distributed random variable. If the conditional
reference distribution (the distribution of Y for a fixed value of λ) is Poisson, the
unconditional distribution of the counts is Negative Binomial. Since the Negative
Binomial distribution is a member of the exponential family with canonical link
log{μ}, a straightforward generalized linear model for counts emerges that permits
more variability than the Poisson(λ) distribution (for an application see §6.7.8).
• Random effects and coefficients are added to the linear predictor. This approach gives
rise to generalized linear mixed models (GLMM, §8). For example, consider a
simple generalized linear regression model with log link,

ln{E[Y]} = β₀ + β₁x ⟺ E[Y] = exp{β₀ + β₁x},

and possible overdispersion. Adding a random effect e with mean 0 and variance σ²
in the exponent turns the linear predictor into a mixed linear predictor. The resulting
model can also be reckoned conditionally: E[Y|e] = exp{β₀ + β₁x + e}, e ~ (0, σ²).
If we assume that Y|e is Poisson-distributed, the resulting unconditional distribution
will be overdispersed relative to a Poisson(exp{β₀ + β₁x}) distribution. Unless the
distribution of the random effects is chosen carefully, the marginal distribution may
not be in the exponential family and maximum likelihood estimation of the parameters
may be difficult. Numerical methods for maximizing the marginal (unconditional) log-
likelihood function via linearization, quadrature integral approximation, importance
sampling, and other devices exist, however. The nlmixed procedure of The SAS®
System is designed to fit such models (§8).

In this chapter we consider a remedy for overdispersion by estimating extra scale
parameters (§6.7.2) and by using mixing schemes such as the Poisson/Gamma model. An
application of the latter approach is presented in §6.7.8 where poppy counts in a randomized
complete block design are more dispersed than is expected under a Poisson model. The poppy
count data are revisited in §8.4.2 and modeled with a generalized linear mixed model.
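As a concrete illustration of the first remedy, the following call is a minimal sketch of
requesting a deviance-based multiplicative overdispersion parameter in proc genmod; the data
set and variable names (mydata, y, n, trt) are placeholders and not data discussed in this text.

proc genmod data=mydata;
   class trt;
   /* dscale estimates the overdispersion parameter from the  */
   /* deviance; pscale uses Pearson's statistic (see §6.4.3)  */
   model y/n = trt / dist=binomial link=logit dscale;
run;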

6.7 Applications
The first two applications in this chapter model Binomial outcomes. In §6.7.1 a simple
logistic regression model with a single covariate is fit to model the mortality rate of insect
larvae exposed to an insecticide. Of particular interest is the estimation of the LD50, the
insecticide dosage at which the probability that a randomly chosen larva succumbs to the
insecticide exposure is 0.5. In §6.7.2 a field experiment with Binomial outcomes is examined.
Sixteen varieties are arranged in a randomized complete block design and the number of
plants infested with the Hessian fly is recorded. Interesting aspects of this experiment are
varying Binomial sample sizes among the experimental units, which invalidates the variance-
stabilizing arcsine transformation, and possible overdispersion. Yield density models that were
examined earlier as nonlinear regression models in §5.8.7 are revisited in §6.7.3. Rather than
relying on inverse or logarithmic transformation we treat the yield responses as Gamma-
distributed random variables and apply generalized linear model techniques. §6.7.4 and
§6.7.5 are dedicated to the analysis of ordinal data. In both cases the treatment structure is a
simple two-way factorial. The analysis in §6.7.5 is further complicated by the fact that experi-
mental units were measured repeatedly over time. The analysis of contingency tables is a
particularly fertile area for the deployment of generalized linear models. A special class of
models, log-linear models for square contingency tables, is discussed in §6.7.6. These
models allow estimation of the agreement or disagreement between interpreters of the same
material. Generalized linear models can also be employed successfully when the outcome of
interest is not a mean but a dispersion parameter, for example a variance. In §6.7.7 we use
Gamma regression to model the variability between deoxynivalenol (vomitoxin) probe
samples from truckloads of wheat kernels as a function of the toxin load. The final applica-
tion (§6.7.8) considers count data and demonstrates that the Poisson distribution is not
necessarily a suitable model for such data. In the presence of overdispersion, the Negative
Binomial distribution is a more reasonable model. We show how to fit models with Negative
Binomial responses with the nlmixed procedure of The SAS® System.

6.7.1 Dose-Response and LD50 Estimation in a Logistic Regression Model

Mead et al. (1993, p. 336) discuss probit analysis for a small data set of the proportion of
larvae killed as a function of the concentration of an insecticide. For each of seven concentra-
tions, 20 larvae were exposed to the insecticide and the number of larvae killed was recorded
(Table 6.16). Each larva's exposure to the insecticide represents a Bernoulli random variable
with outcomes larva killed and larva survived. If the 20 larvae exposed to the same
concentration react independently to the insecticide and if their survival probabilities are the
same for a given concentration, then each number in Table 6.16 is a realization of a
Binomial(20, π(x)) random variable, where π(x) is the mortality probability if concentration
x is applied. If, furthermore, the concentrations are applied independently to the sets of 20
larvae, the experiment consists of seven independent Binomial random variables. Notice that
these data are grouped with n_g = 7 groups and n = 140 observations.

Table 6.16. Insecticide concentrations and number of larvae killed out of 20

Concentration x        0.375%  0.75%  1.5%   3%   6%  12%  24%
No. of larvae killed        0      1     8   11   16   18   20

Data from Mead, Curnow, and Hasted (1993, p. 336) and used with permission.

Plots of the logits of the sample proportions against the concentrations and the log₁₀
concentrations are shown in Figure 6.12. The relationship between sample logits and concen-
trations is clearly not linear. It appears at least quadratic and would suggest a generalized
linear model

logit(π) = β₀ + β₁x + β₂x².

The quadratic trend does not ensure that the logits are monotonically increasing in x, and a
more reasonable model posits a linear dependence of the logits on the log₁₀ concentration,

logit(π) = β₀ + β₁log₁₀{x}.   [6.51]

Model [6.51] is a classical logistic regression model with a single covariate (log₁₀{x}). The
analysis by Mead et al. (1993) uses a probit link, and the results are very similar to those from
a logistic analysis because of the similarity of the two link functions.

Figure 6.12. Logit of sample proportions against insecticide concentration and logarithm of
concentration. A linear relationship between the logit and the log₁₀ concentration is
reasonable.

The key relationship in this experiment is the dependence of the probability π that a larva
is killed on the insecticide concentration. Once this relationship is modeled, other quantities of
interest can be estimated. In bioassay and dose-response studies one is often interested in esti-
mating dosages that produce a certain response, for example, the dosage lethal to a randomly
selected larva with probability 0.5 (the so-called LD50). In model [6.51] we can establish
more generally that if x_α denotes the dosage with mortality rate α, 0 ≤ α ≤ 1, then from

logit(α) = β₀ + β₁log₁₀{x_α}

the inverse prediction of dosage follows as

log₁₀{x_α} = (logit(α) − β₀)/β₁
x_α = 10^{(logit(α) − β₀)/β₁}.   [6.52]

In the case of LD50, for example, α = 0.5, logit(0.5) = 0, and log₁₀{x_0.5} = −β₀/β₁. For
this particular ratio, fiducial intervals were developed by Finney (1978, pp. 80-82) based on
work by Fieller (1940) under the assumption that β̂₀ and β̂₁ are Gaussian-distributed. These
intervals are also developed in our §A6.8.5. For an estimate of the dosage x_0.5 on the original,
rather than the log₁₀, scale these intervals do not apply directly. We prefer obtaining standard
errors for the quantities log₁₀{x_α} and x_α based on Taylor series expansions. If the ratio
β̂₁/ese(β̂₁) is sufficiently large, the Taylor series based standard errors are very accurate
(§6.4.2).
The first step in our analysis is to fit model [6.51] and to determine whether the relation-
ship between mortality probability and insecticide concentration is sufficiently strong. The
data step and proc genmod statements for this logistic regression problem are as follows.

data kills;
input concentration kills;
trials = 20;
logc = log10(concentration);
datalines;
0.375 0
0.75 1
1.5 8
3.0 11
6.0 16
12.0 18
24.0 20
;;
run;

proc genmod data=kills;
   model kills/trials = logc / dist=binomial link=logit;
run;

The proc genmod output (Output 6.4) indicates a deviance of 4.6206 based on 5 degrees
of freedom [n_g = 7 groups minus two estimated parameters (β₀, β₁)]. The deviance/df ratio
is close to one and we conclude that overdispersion is not a problem for these data. The
parameter estimates are β̂₀ = −1.7305 and β̂₁ = 4.1651.

Output 6.4. The GENMOD Procedure

Model Information
Data Set WORK.KILLS
Distribution Binomial
Link Function Logit
Response Variable (Events) kills
Response Variable (Trials) trials
Observations Used 7
Number Of Events 74
Number Of Trials 140

Criteria For Assessing Goodness Of Fit


Criterion DF Value Value/DF
Deviance 5 4.6206 0.9241
Scaled Deviance 5 4.6206 0.9241
Pearson Chi-Square 5 3.8258 0.7652
Scaled Pearson X2 5 3.8258 0.7652
Log Likelihood -50.0133
Algorithm converged.

Analysis Of Parameter Estimates


Standard Wald 95% Chi-
Parameter DF Estimate Error Confidence Limits Square Pr > ChiSq

Intercept 1 -1.7305 0.3741 -2.4637 -0.9973 21.40 <.0001


logc 1 4.1651 0.6520 2.8872 5.4430 40.81 <.0001
Scale 0 1.0000 0.0000 1.0000 1.0000
NOTE: The scale parameter was held fixed.


The Wald test for H₀: β₁ = 0 has test statistic W = 40.81 and the hypothesis is clearly
rejected. There is a significant relationship between larva mortality and the log₁₀ insecticide
concentration. The positive slope estimate indicates that mortality probability increases with
the log₁₀ concentration. For example, the probabilities that a randomly selected larva is killed
at concentrations x = 1.5% and x = 6% are

1/(1 + exp{−β̂₀ − β̂₁log₁₀{1.5}}) = 0.269
1/(1 + exp{−β̂₀ − β̂₁log₁₀{6}}) = 0.819.

The note that The scale parameter was held fixed at the end of the proc genmod output
indicates that no extra scale parameters were estimated.
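These fitted probabilities are quickly verified in a short data step; the following is a minimal
sketch using the reported estimates (the data set and variable names are ours, not part of the
original program).

data fitted;
   b0 = -1.7305; b1 = 4.1651;       /* estimates from Output 6.4  */
   do conc = 1.5, 6;                /* concentrations in percent  */
      pi = 1/(1 + exp(-b0 - b1*log10(conc)));
      put conc= pi=;                /* prints 0.269 and 0.819     */
   end;
run;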
How well does this logistic regression model fit the data? To this end we calculate the
generalized R² measures discussed in §6.4.5. The log likelihood for the full model containing
a concentration effect is shown in the output above as ℓ(μ̂_f, ψ; y) = ℓ(μ̂_f, 1; y) = −50.0133.
The log likelihood for the null model is obtained as ℓ(μ̂₀, 1; y) = −96.8119 with the
statements (output not shown)

proc genmod data=kills;
   model kills/trials = / dist=binomial link=logit;
run;

The generalized R² measure

R²_* = 1 − exp{−(2/n)[ℓ(μ̂_f, 1; y) − ℓ(μ̂₀, 1; y)]}

is then

R²_* = 1 − exp{−(2/140)(−50.0133 + 96.8119)} = 0.4875.

This value does not appear very large but it should be kept in mind that this measure is not
bounded by 1. Also notice that the denominator in the exponent is n = 140, the total number
of observations, rather than n_g = 7, the number of groups. The rescaled measure R̄² is
obtained by dividing R²_* by

max{R²_*} = 1 − exp{(2/n)ℓ(μ̂₀, 1; y)} = 1 − exp{−(2/140)96.8119} = 0.749,

hence R̄² = 0.4875/0.749 = 0.6508. With almost 2/3 of the variability in mortality propor-
tions explained by the log₁₀ concentration and a t_obs ratio for the slope parameter of
t_obs = 4.1651/0.6520 = 6.388, we are reasonably satisfied with the model fit and proceed to
an estimation of the dosages that are lethal to 50% or 80% of the larvae. Based on the esti-
mates of β₀ and β₁ as well as [6.52] we obtain the point estimates

log₁₀{x̂_0.5} = 1.7305/4.1651 = 0.4155
x̂_0.5 = 10^0.4155 = 2.603
log₁₀{x̂_0.8} = (logit(0.8) + 1.7305)/4.1651 = 0.7483
x̂_0.8 = 10^0.7483 = 5.601.

To obtain standard errors and confidence intervals for these four quantities proc nlmixed is
used because of its ability to obtain standard errors for nonlinear functions of parameter esti-
mates by first-order Taylor series. As starting values in the nlmixed procedure we use the
converged iterates of proc genmod. The df=5 option was added to the proc nlmixed statement
to make sure that proc nlmixed uses the same degrees of freedom for the determination of p-
values as proc genmod. The complete SAS® code, including the estimation of the lethal
dosages on the log₁₀ and the original scale, follows.
proc nlmixed data=kills df=5;
parameters intcpt=-1.7305 b=4.165;
p = 1/(1+exp(-intcpt - b*logc));
model kills ~ binomial(trials,p);
estimate 'LD50' -intcpt/b;
estimate 'LD50 original' 10**(-intcpt/b);
estimate 'LD80' (log(0.8/0.2)-intcpt)/b;
estimate 'LD80 original' 10**((log(0.8/0.2)-intcpt)/b);
run;

Output 6.5.
The NLMIXED Procedure

Specifications
Description Value
Data Set WORK.KILLS
Dependent Variable kills
Distribution for Dependent Variable Binomial
Optimization Technique Dual Quasi-Newton
Integration Method None

Dimensions
Description Value
Observations Used 7
Observations Not Used 0
Total Observations 7
Parameters 2

Parameters
intcpt b NegLogLike
-1.7305 4.165 9.50956257

Iteration History
Iter Calls NegLogLike Diff MaxGrad Slope
1 4 9.50956256 8.157E-9 0.00056 -0.00004
2 7 9.50956255 1.326E-8 0.000373 -6.01E-6
3 8 9.50956254 4.875E-9 5.963E-7 -9.76E-9

NOTE: GCONV convergence criterion satisfied.


Output 6.5 (continued).


Fit Statistics
Description Value
-2 Log Likelihood 19.0
AIC (smaller is better) 23.0
BIC (smaller is better) 22.9
Log Likelihood -9.5
AIC (larger is better) -11.5
BIC (larger is better) -11.5

Parameter Estimates
Standard
Parameter Estimate Error DF t Value Pr > |t| Alpha
intcpt -1.7305 0.3741 5 -4.63 0.0057 0.05
b 4.1651 0.6520 5 6.39 0.0014 0.05

Additional Estimates
Standard
Label Estimate Error DF t Value Pr > |t| Lower Upper
LD50 0.4155 0.06085 5 6.83 0.0010 0.2716 0.5594
LD50 original 2.6030 0.3647 5 7.14 0.0008 1.7406 3.4655
LD80 0.7483 0.07944 5 9.42 0.0002 0.5605 0.9362
LD80 original 5.6016 1.0246 5 5.47 0.0028 3.1788 8.0245

The nlmixed procedure converges after three iterations and reports the same parameter
estimates and standard errors as proc genmod (Output 6.5). The log likelihood value reported
by proc nlmixed (−9.5) does not agree with that of the genmod procedure (−50.013). The
procedures differ with respect to the inclusion/exclusion of constants in the likelihood calcu-
lations. Differences in log likelihoods between nested models will be the same for the two
procedures. The null model log likelihood reported by proc nlmixed (code and output not
shown) is −56.3. The log likelihood difference between the two models is thus −56.3 +
9.5 = −96.81 + 50.01 = −46.8 in either procedure.

Figure 6.13. Predicted probabilities and observed proportions (dots) in logistic regression
model for insecticide kills. Estimated dosages lethal to 50% and 80% of the larvae are also
shown.


The table of Additional Estimates shows the output for the four estimate statements.
The point estimates for the lethal dosages agree with the manual calculations above and the
standard errors are obtained from a first-order Taylor series expansion. The values in the
columns Lower and Upper are asymptotic 95% confidence intervals for the estimated
quantities.

Figure 6.13 shows the predicted probabilities to kill a randomly selected larva as a
function of the log₁₀ concentration. The observed proportions are overlaid and the estimated
log₁₀{x_0.5} and log₁₀{x_0.8} dosages are shown.

6.7.2 Binomial Proportions in a Randomized Block Design — the Hessian Fly Experiment

Gotway and Stroup (1997) present data from an agronomic field trial in which sixteen
varieties of wheat are to be compared with respect to their resistance to infestation with the
Hessian fly. The varieties were arranged in a randomized complete block design with four
blocks on an 8 × 8 grid (Figure 6.14). For each of the 64 experimental units the number of
plants with insect damage was counted. Let Zᵢⱼ denote the number of damaged plants for
variety i in block j and nᵢⱼ the number of plants on the experimental unit. The outcome of
interest is the sample proportion Yᵢⱼ = Zᵢⱼ/nᵢⱼ. If infestations are independent from plant to
plant on an experimental unit and the plants are equally likely to become infested, Zᵢⱼ is a
Binomial(nᵢⱼ, πᵢⱼ) random variable. By virtue of the random assignment of varieties to grid
cells in each block the Zᵢⱼ's are also independent of each other.

Figure 6.14. Design layout in Hessian fly experiment. Field plots are 3.7 m × 3.7 m. The area
of the squares is proportional to the sample proportion of damaged plants; numbers indicate
the variety. Block boundaries are shown as solid lines. Data used with permission of the
International Biometric Society.


A generalized linear model for this experiment can be set up with a linear predictor that
represents the experimental design and a link function for the probability that a randomly
selected plant is damaged by Hessian fly infestation. Choosing a logit link function this model
becomes

logit(πᵢⱼ) = ln{πᵢⱼ/(1 − πᵢⱼ)} = ηᵢⱼ = μ + τᵢ + ρⱼ,   [6.53]

where τᵢ is the effect of the ith variety and ρⱼ is the effect of the jth block.

Of interest are comparisons of the treatment effects adjusted for the block effects. For
example, one may want to test the hypothesis that varieties i and i′ have equal probability to
be damaged by infestations, i.e., H₀: πᵢ = πᵢ′. These probabilities are not the same as the
block-variety specific probabilities πᵢⱼ in model [6.53]. In a linear model Yᵢⱼ = μ + τᵢ + ρⱼ +
eᵢⱼ these comparisons are based on the least squares means of the treatment effects. If μ̂, τ̂ᵢ,
and ρ̂ⱼ denote the respective least squares estimates and there are j = 1,…,4 blocks as in this
example, the least squares mean for treatment i is calculated as

μ̂ + τ̂ᵢ + (1/4)(ρ̂₁ + ρ̂₂ + ρ̂₃ + ρ̂₄) = μ̂ + τ̂ᵢ + ρ̂· .   [6.54]

A similar approach can be taken in the generalized linear model. Let μ̂, τ̂ᵢ, and ρ̂ⱼ now denote
the converged IRLS estimates of the parameters in model [6.53]. The treatment-specific linear
predictor is calculated as the same estimable linear function as in the standard model:

η̂ᵢ· = μ̂ + τ̂ᵢ + ρ̂· .

The estimate of the marginal probability that variety i is damaged by the Hessian fly is then
obtained by inverting the link function, π̂ᵢ· = 1/(1 + exp{−η̂ᵢ·}). Hypothesis tests can be
based on a comparison of the η̂ᵢ·, which are linear functions of the parameter estimates, or the
π̂ᵢ·, which are nonlinear functions of the estimates.
The proc genmod code to fit the Binomial proportions with a logit link in a randomized
complete block design follows. The ods exclude statement suppresses the printing of various
default tables. The lsmeans entry / diff; statement requests the marginal linear predictors
η̂ᵢ· for the varieties (entries) as well as all pairwise tests of the form H₀: ηᵢ· = ηᵢ′·. Because
lsmeandiffs is included in the ods exclude statement, the lengthy table of differences
η̂ᵢ· − η̂ᵢ′· is not included on the printed output. The ods output lsmeandiffs=diff; statement
saves the 16·15/2 = 120 pairwise comparisons in the SAS® data set diff that is available
for post-processing after proc genmod concludes.
ods exclude ParmInfo ParameterEstimates lsmeandiffs;
proc genmod data=HessFly;
class block entry;
model z/n = block entry / link=logit dist=binomial type3;
lsmeans entry /diff;
ods output lsmeandiffs=diff;
run;

From the LR Statistics For Type 3 Analysis table we glean a significant variety
(entry) effect (p < 0.0001) (Output 6.6). It should not be too surprising that the sixteen
varieties do not have the same tendency to be damaged by the Hessian fly. The Least
Squares Means table lists the marginal linear predictors for the varieties, which can be
converted into damage probabilities by inverting the logit link function. For variety 1, for
example, this estimated marginal probability is π̂₁· = 1/(1 + exp{−1.4864}) = 0.815 and
for variety 8 this probability is only π̂₈· = 1/(1 + exp{0.1639}) = 0.459.
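These inversions can be scripted by saving the Least Squares Means table to a data set,
e.g., by adding ods output lsmeans=lsm; to the proc genmod run above; a minimal sketch of
the post-processing step follows (the data set names lsm and probs are ours).

data probs;                         /* invert the logit link           */
   set lsm;                         /* one row per entry from lsmeans  */
   pi_hat = 1/(1 + exp(-estimate)); /* marginal damage probability     */
run;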

Output 6.6.
The GENMOD Procedure

Model Information

Data Set WORK.HESSFLY


Distribution Binomial
Link Function Logit
Response Variable (Events) z No. of damaged plants
Response Variable (Trials) n No. of plants
Observations Used 64
Number Of Events 396
Number Of Trials 736

Class Level Information

Class Levels Values


block 4 1 2 3 4
entry 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 45 123.9550 2.7546


Scaled Deviance 45 123.9550 2.7546
Pearson Chi-Square 45 106.7426 2.3721
Scaled Pearson X2 45 106.7426 2.3721
Log Likelihood -440.6593
Algorithm converged.

LR Statistics For Type 3 Analysis

Chi-
Source DF Square Pr > ChiSq
block 3 4.27 0.2337
entry 15 132.62 <.0001

Least Squares Means

Standard Chi-
Effect entry Estimate Error DF Square Pr > ChiSq
entry 1 1.4864 0.3921 1 14.37 0.0002
entry 2 1.3453 0.3585 1 14.08 0.0002
entry 3 0.9963 0.3278 1 9.24 0.0024
entry 4 0.0759 0.2643 1 0.08 0.7740
entry 5 1.3139 0.3775 1 12.12 0.0005
entry 6 0.5758 0.3180 1 3.28 0.0701
entry 7 0.8608 0.3302 1 6.80 0.0091
entry 8 -0.1639 0.2975 1 0.30 0.5816
entry 9 0.0960 0.2662 1 0.13 0.7183
entry 10 0.8413 0.3635 1 5.36 0.0206
entry 11 0.0313 0.2883 1 0.01 0.9136
entry 12 0.0423 0.2996 1 0.02 0.8876
entry 13 -2.0941 0.5330 1 15.44 <.0001
entry 14 -1.0185 0.3538 1 8.29 0.0040
entry 15 -0.6303 0.2883 1 4.78 0.0288
entry 16 -1.4645 0.3713 1 15.56 <.0001


To determine which entries differ significantly in the damage probabilities we post-
process the data set diff by deleting those comparisons that are not significant at a desired
significance level and sorting the data set with respect to entries. Choosing significance level
α = 0.05, the statements
data diff; set diff; variety = entry+0; _variety = _entry+0;
drop entry _entry;
run;
proc sort data=diff(where=(ProbChiSq < 0.05));
by variety _variety ProbChiSq;
run;
proc print data=diff label;
var variety _variety Estimate StdErr Df ChiSq ProbChiSq;
run;

accomplish that (Output 6.7). The statements variety = entry+0; and _variety = _entry+0;
convert the values for entry into numeric format, since they are stored as character variables
by proc genmod. Entry 1, for example, differs significantly from entries 4, 8, 9, 11, 12, 13, 14,
15, and 16. Notice that the data set variable entry was renamed to variety to produce Output
6.7. The positive Estimate for the comparison of variety 1 and _variety 4, for example,
indicates that entry 1 has a higher damage probability than entry 4. Similarly, the negative
Estimate for the comparison of variety 4 and _variety 5 indicates a lower damage
probability of variety 4 compared to variety 5.

Output 6.7.
Chi
Obs variety _variety Estimate Std Err DF Square Pr>Chi

1 1 4 1.4104 0.4736 1 8.87 0.0029


2 1 8 1.6503 0.4918 1 11.26 0.0008
3 1 9 1.3903 0.4740 1 8.61 0.0034
4 1 11 1.4551 0.4863 1 8.95 0.0028
5 1 12 1.4440 0.4935 1 8.56 0.0034
6 1 13 3.5805 0.6614 1 29.30 <.0001
7 1 14 2.5048 0.5276 1 22.54 <.0001
8 1 15 2.1166 0.4862 1 18.95 <.0001
9 1 16 2.9509 0.5397 1 29.89 <.0001
10 2 4 1.2694 0.4454 1 8.12 0.0044
11 2 8 1.5092 0.4655 1 10.51 0.0012
12 2 9 1.2492 0.4467 1 7.82 0.0052
13 2 11 1.3140 0.4612 1 8.12 0.0044
14 2 12 1.3030 0.4669 1 7.79 0.0053
15 2 13 3.4394 0.6429 1 28.62 <.0001
16 2 14 2.3638 0.5042 1 21.98 <.0001
17 2 15 1.9756 0.4614 1 18.34 <.0001
18 2 16 2.8098 0.5158 1 29.68 <.0001
19 3 4 0.9204 0.4207 1 4.79 0.0287
20 3 8 1.1602 0.4427 1 6.87 0.0088
21 3 9 0.9003 0.4225 1 4.54 0.0331
22 3 11 0.9651 0.4370 1 4.88 0.0272
23 3 12 0.9540 0.4441 1 4.61 0.0317
24 3 13 3.0904 0.6266 1 24.32 <.0001
25 3 14 2.0148 0.4830 1 17.40 <.0001
26 3 15 1.6266 0.4378 1 13.80 0.0002
27 3 16 2.4608 0.4956 1 24.66 <.0001
28 4 5 -1.2380 0.4607 1 7.22 0.0072
29 4 13 2.1700 0.5954 1 13.28 0.0003


This analysis of the Hessian fly experiment seems simple enough. A look at the Criteria
For Assessing Goodness Of Fit table shows that not all is well, however. The deviance of
the fitted model, 123.955, exceeds its degrees of freedom (45) 2.7-fold. In a proper
model the deviance is expected to be about as large as its degrees of freedom. We do not
advocate formal statistical tests of the deviance/df ratio unless data are grouped. Usually the
modeler interprets the ratio subjectively to decide whether a deviation of the ratio from one is
reason for concern. First we notice that ratios in excess of one indicate a potential overdisper-
sion problem and then inquire how overdispersion could arise. Omitting important variables
from the model leads to excess variability since the linear predictor does not account for
important effects. In an experimental design the modeler usually builds the linear predictor
from the randomization, treatment, and blocking protocol. In a randomized complete block
design, ηᵢⱼ = μ + τᵢ + ρⱼ is the appropriate linear predictor since all other systematic effects
should have been neutralized by randomization. If all necessary effects are included in the
model, overdispersion could arise from positive correlations among the observations. Two
levels of correlation must be considered here. First, it was assumed that the counts on each
experimental unit follow the Binomial law, which implies that the nᵢⱼ Bernoulli(πᵢⱼ) variables
are independent. In other words, the probability of a plant being damaged does not depend on
whether neighboring plants on the same experimental unit are infested or not. This seems
quite unlikely. We expect infestations to appear in clusters and the Zᵢⱼ may not be Binomial-
distributed. Instead, a probability model that allows for overdispersion relative to the
Binomial, for example, the Beta-Binomial model, could be used. Second, there are some
doubts whether the counts of neighboring units are independent as assumed in the analysis.
There may be spatial dependencies among grid cells in the sense that units near each other are
more highly correlated than units further apart. This is plausible if, for example, the
propensity for infestation is linked to a soil variable that varies spatially. In other cases,
spatial correlations induced by a spatially varying covariate have indeed been confirmed. Randomi-
zation of the varieties to experimental units neutralizes such spatial dependencies. On
average, each treatment is affected by these effects equally and the overall effect is balanced
out. However, the variability due to these spatial effects is not removed from the data. To that
end, blocks need to be arranged in such a way that experimental units within a block are
homogeneous. Stroup et al. (1994) note that combining adjacent experimental units into
blocks in agricultural variety trials can be at variance with an assumption of homogeneity
within blocks when more than eight to twelve experimental units are grouped. Spatial trends
will then be removed only incompletely, and this source of overdispersion prompted Gotway
and Stroup (1997) to analyze the Hessian fly data with a model that takes into account the
spatial dependence among counts of different experimental units explicitly. We will return to
such models and the Hessian fly data in §9.
A quick fix for overdispersed data that does not address the real cause of the overdis-
persion problem is to estimate a separate scale parameter φ in models that would not contain
such a parameter otherwise. In the Hessian fly example, this is accomplished by adding the
dscale or pscale option to the model statement in proc genmod. The former estimates the
overdispersion parameter based on the deviance, the latter based on Pearson's statistic (see
§6.4.3). The variance of a count Zᵢⱼ is then modeled as Var[Zᵢⱼ] = φnᵢⱼπᵢⱼ(1 − πᵢⱼ) rather
than Var[Zᵢⱼ] = nᵢⱼπᵢⱼ(1 − πᵢⱼ). From the statements


ods exclude ParmInfo ParameterEstimates lsmeandiffs;
proc genmod data=HessFly;
   class block entry;
   model z/n = block entry / link=logit dist=binomial type3 dscale;
   lsmeans entry /diff;
   ods output lsmeandiffs=diff;
run;

a new data set of treatment differences is obtained. After post-processing of the diff data set
one obtains Output 6.8. Fewer entries are now found significantly different from entry 1 than
in the analysis that does not account for overdispersion. Also notice that the estimates of the
treatment differences have not changed. The additional overdispersion parameter is a multipli-
cative parameter that has no effect on the parameter estimates, only on their standard errors.
Using the dscale estimation method the overdispersion parameter is estimated as φ̂ =
123.955/45 = 2.7546, which is the ratio of deviance and degrees of freedom in the model
fitted initially. All standard errors in the preceding partial output are √2.7546 = 1.66 times
larger than the standard errors in the analysis without the overdispersion parameter.

Output 6.8.
Chi
Obs variety _variety Estimate Std Err DF Square Pr>Chi

1 1 8 1.6503 0.8162 1 4.09 0.0432


2 1 13 3.5805 1.0978 1 10.64 0.0011
3 1 14 2.5048 0.8756 1 8.18 0.0042
4 1 15 2.1166 0.8069 1 6.88 0.0087
5 1 16 2.9509 0.8958 1 10.85 0.0010
6 2 13 3.4394 1.0670 1 10.39 0.0013
7 2 14 2.3638 0.8368 1 7.98 0.0047
8 2 15 1.9756 0.7657 1 6.66 0.0099
9 2 16 2.8098 0.8561 1 10.77 0.0010
10 3 13 3.0904 1.0400 1 8.83 0.0030
11 3 14 2.0148 0.8016 1 6.32 0.0120
12 3 15 1.6266 0.7267 1 5.01 0.0252
13 3 16 2.4608 0.8225 1 8.95 0.0028
14 4 13 2.1700 0.9882 1 4.82 0.0281
15 4 16 1.5404 0.7575 1 4.14 0.0420
... and so forth ...

6.7.3 Gamma Regression and Yield Density Models


We now return to the yield density data of McCullagh and Nelder (1989, pp. 317-320) shown
in Table 6.4 (p. 309). The data consist of dry weights of barley sown at various seeding
densities with three replicates. Recall from §5.8.7 that a customary approach to model a yield-
density relationship is to assume that the inverse of yield per plant is a linear function of plant
density. The Shinozaki-Kira model (Shinozaki and Kira 1956)

E[Y] = (β₀ + β₁x)⁻¹

and the Holliday model (Holliday 1960)

E[Y] = (β₀ + β₁x + β₂x²)⁻¹

are representatives of this class of models. Here, x denotes the plant (seeding) density. A
standard nonlinear regression approach is then to model, for example, Yᵢ =
1/(β₀ + β₁xᵢ) + eᵢ where the eᵢ are independent random errors with mean 0 and variance σ².
Figure 6.3 (p. 309) suggests that the variability is not homogeneous in these data, however.
Whereas one could accommodate variance heterogeneity in the nonlinear model by using
weighted nonlinear least squares, Figure 6.3 alerts us to a more subtle problem. The standard
deviation of the barley yields seems to be related to the mean yield. Although Figure 6.3 is
quite noisy, it is not unreasonable to assume that the standard deviations are proportional to
the mean (a regression through the origin of s on ȳ). Figure 6.15 displays the reciprocal yields
and the reciprocals of the sample means across the three blocks against the seeding density. An
inverse quadratic relationship

1/E[Y] = β₀ + β₁x + β₂x²

as suggested by the Holliday model is reasonable. This model has linear predictor η =
β₀ + β₁x + β₂x² and reciprocal link function. For the random component we choose Y to be
not a Gaussian but a Gamma random variable. Gamma random variables are non-negative
(such as yields), and their standard deviation is proportional to their mean. The Gamma distri-
butions are furthermore not symmetric about the mean but right-skewed (see Figure 6.2, p.
308). The canonical link of a Gamma random variable is the reciprocal link, which provides
further support for using this model in yield density investigations where inverse polynomial
relationships are common. Unfortunately, the inverse link does not guarantee that the predic-
ted means are non-negative, since the linear predictor is not constrained to be positive. As an
alternative link function for Gamma-distributed random variables, the log link can be used.

Figure 6.15. Inverse yields and inverse replication averages against seeding density. Dis-
connected symbols represent observations from blocks 1 to 3, the connected symbols the
sample averages. An inverse quadratic relationship is reasonable.

Before fitting a generalized linear model with Gamma errors we must decide whether to
fit the model to the 30 observations from the three blocks or to the 10 block averages. In the
former case, we must include block effects in the full model and can then test whether it is
reasonable that some effects do not vary by blocks. The full model we consider here has
linear predictor

ηᵢⱼ = β₀ⱼ + β₁ⱼxᵢⱼ + β₂ⱼx²ᵢⱼ,   [6.55]

where the subscript j identifies the blocks. Combined with a reciprocal link function this
model will fit a separate inverse quadratic to data from each block. The proc genmod code to
fit this model with Gamma errors follows. The noint option was added to the model state-
ment to prevent the addition of an overall intercept term β₀. The link=power(-1) option
invokes the reciprocal link.
ods exclude ParameterEstimates;
proc genmod data=barley;
class block;
model bardrwgt = block block*seed block*seed*seed /
noint link=power(-1) dist=gamma type3;
run;

Output 6.9.
The GENMOD Procedure

Model Information
Data Set WORK.BARLEY
Distribution Gamma
Link Function Power(-1)
Dependent Variable BARDRWGT
Observations Used 30

Class Level Information


Class Levels Values
BLOCK 3 1 2 3

Criteria For Assessing Goodness Of Fit


Criterion DF Value Value/DF
Deviance 21 7.8605 0.3743
Scaled Deviance 21 31.2499 1.4881
Pearson Chi-Square 21 6.7872 0.3232
Scaled Pearson X2 21 26.9830 1.2849
Log Likelihood -102.6490
Algorithm converged.

LR Statistics For Type 3 Analysis


Source DF Chi-Square Pr > ChiSq
BLOCK 3 52.39 <.0001
seed*BLOCK 3 9.19 0.0269
seed*seed*BLOCK 3 5.66 0.1294

The full model has a log likelihood of ℓ(μ̂_f, ψ̂; y) = −102.649 and the LR Statistics
For Type 3 Analysis table shows that the effect which captures separate quadratic effects
for each block is not significant (p = 0.1294, Output 6.9). To see whether a common
quadratic effect is sufficient, we fit the model as
ods exclude obstats;
proc genmod data=barley;
class block;
model bardrwgt = block block*seed seed*seed /
noint link=power(-1) dist=gamma type3 obstats;
ods output obstats=stats;
run;


and obtain a log likelihood of −102.8819 (Output 6.10). Twice the difference of the log
likelihoods, Λ = 2(−102.649 + 102.8819) = 0.465, is not significant and we conclude
that the quadratic effects need not be varied by blocks (Pr(χ²₂ ≥ 0.465) = 0.793). The
common quadratic effect of seeding density is significant at the 5% level (p = 0.0227) and
will be retained in the model (Output 6.10). The obstats option of the model statement
requests a table of the linear predictors, predicted values, and various residuals to be calcu-
lated for the fitted model.

Output 6.10. The GENMOD Procedure

Model Information
Data Set WORK.BARLEY
Distribution Gamma
Link Function Power(-1)
Dependent Variable BARDRWGT
Observations Used 30

Class Level Information


Class Levels Values
BLOCK 3 1 2 3

Criteria For Assessing Goodness Of Fit


Criterion DF Value Value/DF

Deviance 23 7.9785 0.3469


Scaled Deviance 23 31.2678 1.3595
Pearson Chi-Square 23 6.7327 0.2927
Scaled Pearson X2 23 26.3856 1.1472
Log Likelihood -102.8819
Algorithm converged.

Analysis Of Parameter Estimates


Standard Wald 95% Confidence Chi- Pr >
Parameter DF Estimate Error Limits Square ChiSq

Intercept 0 0.0000 0.0000 0.0000 0.0000 . .


BLOCK 1 1 0.1014 0.0193 0.0637 0.1392 27.72 <.0001
BLOCK 2 1 0.1007 0.0189 0.0636 0.1378 28.28 <.0001
BLOCK 3 1 0.0930 0.0177 0.0582 0.1277 27.47 <.0001
seed*BLOCK 1 1 -0.0154 0.0053 -0.0257 -0.0051 8.51 0.0035
seed*BLOCK 2 1 -0.0162 0.0053 -0.0265 -0.0059 9.42 0.0021
seed*BLOCK 3 1 -0.0159 0.0052 -0.0261 -0.0057 9.38 0.0022
seed*seed 1 0.0009 0.0004 0.0002 0.0017 5.69 0.0170
Scale 1 3.9190 0.9719 2.4104 6.3719
NOTE: The scale parameter was estimated by maximum likelihood.

LR Statistics For Type 3 Analysis


Chi-
Source DF Square Pr > ChiSq

BLOCK 3 51.97 <.0001


seed*BLOCK 3 8.77 0.0324
seed*seed 1 5.19 0.0227

The ods exclude obstats; statement in conjunction with the ods output
obstats=stats; statement prevents the printing of these statistics to the output window and
saves the results in a SAS® data set (named stats here). The seeding densities were divided
by 10 prior to fitting of this model to allow sufficient significant digits to be displayed in the
Analysis Of Parameter Estimates table.


The parameterization of the Gamma distribution chosen by proc genmod corresponds to
our [6.7] (p. 308) and the parameter labeled Scale is our α parameter in [6.7]. In this param-
eterization we have E[Y] = μ and Var[Y] = μ²/α. We can thus estimate the mean and
variance of an observation from block 1 at seeding density 3 as

μ̂ = 1/(0.1014 − 0.0154(3/10) + 0.0009(9/100)) = 10.324
V̂ar[Y] = 10.324²/3.919 = 27.197.
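These hand calculations are easily scripted; the data step below is a minimal sketch using the
reported estimates (the data set and variable names are ours).

data gammafit;
   b0 = 0.1014; b1 = -0.0154; b2 = 0.0009;  /* block 1 estimates       */
   alpha = 3.9190;                          /* Scale parameter         */
   x   = 3/10;                              /* density 3, scaled by 10 */
   mu  = 1/(b0 + b1*x + b2*x*x);            /* inverse link, = 10.324  */
   var = mu**2/alpha;                       /* Var[Y] = mu^2/alpha     */
   put mu= var=;
run;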

The fitted barley yields for the three blocks are shown in Figure 6.16. If the model is cor-
rect, yields will attain a maximum at a seeding density around 80. While the seeding density
of highest yield depends little on the block, the maximum yield attained varies considerably.

Figure 6.16. Fitted barley yields in the three blocks based on a model with inverse linear
effects varied by blocks and a common inverse quadratic effect.

The stats data set with the output from the obstats option contains several residual
diagnostics, such as raw residuals, deviance residuals, Pearson residuals, and their standard-
ized versions. We plotted the Pearson residuals

r̂ᵢⱼ = (yᵢⱼ − μ̂ᵢⱼ)/√h(μ̂ᵢⱼ),

where h(μ̂ᵢⱼ) is the variance function evaluated at the fitted mean, in Figure 6.17 (open
circles). From Table 6.2 (p. 305) the variance function of a Gamma random variable is simply
the square of the mean and the Pearson residuals take on the form

r̂ᵢⱼ = (yᵢⱼ − μ̂ᵢⱼ)/μ̂ᵢⱼ.   [6.56]

For the observation y₁₁ = 2.07 from block 1 at seeding density 3 the Pearson residual is
r̂₁₁ = (2.07 − 10.324)/10.324 = −0.799. Also shown as closed circles are the studentized
residuals from fitting the model

Yᵢⱼ = 1/(β₀ⱼ + β₁ⱼxᵢⱼ + β₂x²ᵢⱼ) + eᵢⱼ

as a nonlinear model with symmetric and homoscedastic errors (in proc nlin). If the mean in-
creases with seeding density and the variation of the data is proportional to the mean, we
expect the variation in the nonlinear regression residuals to increase with seeding density.
This effect is obvious in Figure 6.17. The assumption of homoscedastic errors underpinning
the nonlinear regression analysis is not tenable; therefore, the Gamma regression is preferred.
The tightness of the Pearson residuals in the Gamma regression model at seeding density 77
is due to the fact that these values are close to the density producing the maximum yield
(Figure 6.16). Since this critical density is very similar from block to block, but the maximum
yields differ greatly, the denominators in [6.56] shrink the raw residuals yᵢⱼ − μ̂ᵢⱼ most for
those blocks with high yields.
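The Pearson residuals in [6.56] can be recomputed from the saved stats data set; the
following sketch assumes the fitted means are stored in the obstats column Pred and merges
the diagnostics back with the original data by position (the data set name pearson is ours).

data pearson;
   merge barley stats;              /* obstats rows are in data order */
   r = (bardrwgt - Pred)/Pred;      /* Pearson residual, eq. [6.56]   */
run;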

Figure 6.17. Pearson residuals (open circles) from generalized linear Gamma regression
model and studentized residuals from nonlinear Gaussian regression model (full circles).

6.7.4 Effects of Judges' Experience on Bean Canning Quality Ratings

Canning quality is one of the most essential traits required in all new dry bean (Phaseolus
vulgaris L.) varieties, and selection for this trait is a critical part of bean-breeding programs.
Advanced lines that are candidates for release as varieties must be evaluated for canning
quality for at least three years from samples grown at different locations. Quality is assessed
by a panel of judges with varying levels of experience in evaluating breeding lines for visual
quality traits. In 1996, 264 bean-breeding lines from four commercial classes were canned
according to the procedures described by Walters et al. (1997). These included 62 navy, 65
black, 55 kidney, and 82 pinto bean-breeding lines plus checks and controls. The visual
appearance of the processed beans was determined subjectively by a panel of 13 judges on a
seven-point hedonic scale (1 = very undesirable, 4 = neither desirable nor undesirable, 7 =
very desirable). The beans were presented to the panel of judges in a random order at the
same time. Prior to evaluating the samples, all judges were shown examples of samples rated
as satisfactory (4). Concern exists if certain judges, due to lack of experience, are unable to
correctly rate canned samples. From attribute-based product evaluations inferences about the
effects of experience can be drawn from the psychology literature. Wallsten and Budescu
(1981), for example, report that in the evaluation of a personality profile consisting of
fourteen factors, experienced clinical psychologists utilized four to seven factors, whereas
psychology graduate students tended to use only the two or three most salient factors. Prior to
the bean canning quality rating experiment it was postulated that less experienced judges rate
more severely than more experienced judges, but also that experience should have little or no
effect for navy beans, for which the canning procedure was developed. Judges are stratified
for purposes of analysis by experience (≤ 5 years, > 5 years). The counts by canning
quality, judges' experience, and bean-breeding line are listed in Table 6.17.

Table 6.17. Bean rating data. Kindly made available by Dr. Jim Kelly, Department of
Crop and Soil Sciences, Michigan State University. Used with permission.

            Black          Kidney         Navies         Pinto
Score    ≤5 ys  >5 ys   ≤5 ys  >5 ys   ≤5 ys  >5 ys   ≤5 ys  >5 ys
  1         13     32       7     10      10     22      13      2
  2         91     78      32     31      56     51      29     17
  3        123    124     136     96      84    107      91     68
  4         72    122     101    104      84     98     109    124
  5         24     31      47     71      51     52      60    109
  6          2      3       6     18      24     37      25     78
  7          0      0       1      0       1      5       1     12

A proportional odds model for the ordered canning scores is fit with proc genmod below.
The contrast statements test the effect of the judges' experience separately for the bean lines.
These contrasts correspond to interaction slices by bean lines. The estimate statements
calculate the linear predictors needed to derive the probabilities to rate each line in category 3
or less and category 4 or less depending on judges' experience.
ods exclude ParameterEstimates ParmInfo;
proc genmod data=beans;
   class class exper;
   model score = class exper class*exper /
         link=cumlogit dist=multinomial type3;
   contrast 'Experience effect for Black'
            exper 1 -1 class*exper 1 -1 0 0 0 0 0 0;
   contrast 'Experience effect for Kidney'
            exper 1 -1 class*exper 0 0 1 -1 0 0 0 0;
   contrast 'Experience effect for Navies'
            exper 1 -1 class*exper 0 0 0 0 1 -1 0 0;
   contrast 'Experience effect for Pinto'
            exper 1 -1 class*exper 0 0 0 0 0 0 1 -1;
   estimate 'Black, < 5 years, score < 4'  Intercept 0 0 1
            class 1 0 0 0 exper 1 0 class*exper 1 0 0 0 0 0 0 0;
   estimate 'Black, > 5 years, score < 4'  Intercept 0 0 1
            class 1 0 0 0 exper 0 1 class*exper 0 1 0 0 0 0 0 0;
   estimate 'Black, < 5 years, score =< 4' Intercept 0 0 0 1
            class 1 0 0 0 exper 1 0 class*exper 1 0 0 0 0 0 0 0;
   estimate 'Black, > 5 years, score =< 4' Intercept 0 0 0 1
            class 1 0 0 0 exper 0 1 class*exper 0 1 0 0 0 0 0 0;
run;

There is a significant interaction between bean lines (class) and judge experience
(Output 6.11). The results of comparing judges with more and less than 5 years experience
will depend on the bean line. The contrast slices address this interaction (Output 6.12). The
rating distributions for experienced and less experienced judges are clearly not significantly
different for navy beans and, at the 5% level, not significantly different for black beans
(p = 0.0964). There are differences between the ratings for kidney and pinto beans, however
(p = 0.0051 and p < 0.0001).

Output 6.11.
The GENMOD Procedure

Model Information
Data Set WORK.BEANS
Distribution Multinomial
Link Function Cumulative Logit
Dependent Variable SCORE
Observations Used 2795

Class Level Information


Class Levels Values
CLASS 4 Black Kidney Navies Pinto
EXPER 2 Less than 5 More than 5

Response Profile
Ordered Ordered
Level Value Count
1 1 109
2 2 385
3 3 829
4 4 814
5 5 445
6 6 193
7 7 20

Criteria For Assessing Goodness Of Fit


Criterion DF Value Value/DF
Log Likelihood -4390.5392
Algorithm converged.

LR Statistics For Type 3 Analysis


Source DF Chi-Square Pr > ChiSq
CLASS 3 262.95 <.0001
EXPER 1 37.39 <.0001
CLASS*EXPER 3 29.40 <.0001


To develop an impression of the differences in the rating distributions we calculated the
probabilities to obtain ratings below 4, of exactly 4, and above 4. Recall that a score of 4
represents satisfactory quality and that all judges were shown examples thereof prior to the
actual canning quality assessment. The probabilities are obtained from the linear predictors
calculated with the estimate statements (Output 6.12). For black beans and raters with less
than 5 years of experience we get

Pr(Score < 4) = 1/(1 + exp{−0.7891}) = 0.69
Pr(Score = 4) = 1/(1 + exp{−2.1897}) − 0.69 = 0.21
Pr(Score > 4) = 1 − 1/(1 + exp{−2.1897}) = 0.10.

Similar calculations for the other groups lead to the probability distributions shown in Figure
6.18.
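These conversions are conveniently done in a data step; the sketch below reproduces the
black bean, less experienced group from the two cumulative logits in Output 6.12 (the data
set and variable names are ours).

data blackle5;
   eta3 = 0.7891;                       /* cumulative logit, score <= 3 */
   eta4 = 2.1897;                       /* cumulative logit, score <= 4 */
   p_lt4 = 1/(1 + exp(-eta3));          /* Pr(Score < 4) = 0.69         */
   p_eq4 = 1/(1 + exp(-eta4)) - p_lt4;  /* Pr(Score = 4) = 0.21         */
   p_gt4 = 1 - 1/(1 + exp(-eta4));      /* Pr(Score > 4) = 0.10         */
   put p_lt4= p_eq4= p_gt4=;
run;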
Well-documented criteria and a canning procedure specifically designed for navy beans
explain the absence of differences due to the judges' experience for navy beans
(p = 0.7391). Black beans in general were of poor quality and low ratings dominated, creat-
ing very similar probability distributions for experienced and less experienced judges (Figure
6.18). For kidney and pinto beans experienced and inexperienced judges classify control
quality with similar odds, probably because they were shown such quality prior to judging.
Experienced judges have a tendency to assign higher quality scores than less experienced
judges for these two commercial classes.

Output 6.12.
Contrast Estimate Results

Standard
Label Estimate Error Confidence Limits

Black, < 5 years, score < 4 0.7891 0.1007 0.5917 0.9866


Black, > 5 years, score < 4 0.5677 0.0931 0.3852 0.7501
Black, < 5 years, score =< 4 2.1897 0.1075 1.9789 2.4005
Black, > 5 years, score =< 4 1.9682 0.0996 1.7730 2.1634

Contrast Results

Chi-
Contrast DF Square Pr > ChiSq Type

Experience effect for Black 1 2.76 0.0964 LR


Experience effect for Kidney 1 7.84 0.0051 LR
Experience effect for Navies 1 0.11 0.7391 LR
Experience effect for Pinto 1 58.32 <.0001 LR


Figure 6.18. Predicted probability distributions for the 4 × 2 interaction in bean rating
experiment. Categories 1 to 3 as well as categories 5 to 7 are amalgamated to emphasize
deviations from satisfactory ratings (score = 4).

6.7.5 Ordinal Ratings in a Designed Experiment with Factorial Treatment Structure and Repeated Measures

Turfgrass fertilization traditionally has been accomplished through surface applications. The
introduction of the Hydroject® (The Toro Company) has made possible subsurface placement
of soluble materials. A study was conducted during the 1997 growing season to compare
surface application and subsurface injection of nitrogen on the color of a one-year-old
creeping bentgrass (Agrostis palustris L. Huds) putting green. The treatment structure
comprised a complete 4 × 2 factorial of factors Management Practice (four levels) and
Application Rate (two levels) (Table 6.18). The eight treatment combinations were arranged
in a completely randomized design with four replications. Turf color was assessed on each
experimental unit in weekly intervals for four weeks as poor, average, good, or excellent
(Table 6.19).

Table 6.18. Treatment factors in nitrogen injection study

Management Practice                                            Rate of Application
1 = N surface applied with no supplemental water injection    2.5 g × m⁻²
2 = N surface applied with supplemental water injection       5.0 g × m⁻²
3 = N injected with 256 nozzle (7.6 cm depth of injection)
4 = N injected with 253 nozzle (12.7 cm depth of injection)


Table 6.19. Number of times a combination of management practice and application
rate received a particular rating across four replicates and four sampling occasions.
Data kindly provided by Dr. Douglas E. Karcher, Department of Horticulture,
University of Arkansas. Used with permission.

    Management 1                    Management 2
Quality     R1   R2  Total      Quality     R1   R2  Total
Poor        14    5    19       Poor        15    8    23
Average      2   11    13       Average      1    8     9
Good         0    0     0       Good         0    0     0
Excellent    0    0     0       Excellent    0    0     0
Total       16   16    32       Total       16   16    32

    Management 3                    Management 4
Quality     R1   R2  Total      Quality     R1   R2  Total
Poor         0    0     0       Poor         1    0     1
Average      9    2    11       Average     12    4    16
Good         7   14    21       Good         3   11    14
Excellent    0    0     0       Excellent    0    1     1
Total       16   16    32       Total       16   16    32

Of particular interest were the determination of the water injection effect, the subsurface
effect, and the comparison of injection vs. surface applications. These are contrasts among the
levels of the factor Management Practice and it first needs to be determined whether the
factor interacts with Application Rate.
We fit a proportional odds model to these data containing Management and nitrogen
Application Rate effects and their interaction as well as a continuous covariate to model the
temporal effects. This is not the most efficient method of accounting for repeated measures;
we address repeated measures data structures in more detail in §7. Inclusion of the time
variable significantly improves the model fit over a model containing only main effects and
interactions of the experimental factors, however. The basic proc genmod statements are
ods exclude ParameterEstimates ParmInfo;
proc genmod data=mgtN rorder=data;
class mgt nitro;
model resp = mgt nitro mgt*nitro date /
link=cumlogit dist=multinomial type3;
run;

The genmod output shows a nonsignificant interaction and significant Management,
Application Rate, and Time effects (Output 6.13). Since the interaction is not significant, we
can proceed to test the contrasts of interest based on the marginal management effects.
Adding the contrast statements
contrast 'WIC effect ' mgt 1 -1 ;
contrast 'Subsurface effect ' mgt 0 2 -1 -1 ;
contrast 'Injected vs surface ' mgt 1 1 -1 -1 ;

to the proc genmod code produces the additional Output 6.14.


Output 6.13.
The GENMOD Procedure
Model Information
Data Set WORK.MGTN
Distribution Multinomial
Link Function Cumulative Logit
Dependent Variable resp
Observations Used 128

Class Level Information


Class Levels Values
mgt 4 1 2 3 4
nitro 2 1 2

Response Profile
Ordered Ordered
Level Value Count
1 Poor 43
2 Average 49
3 Good 35
4 Excellen 1

Criteria For Assessing Goodness Of Fit


Criterion DF Value Value/DF
Log Likelihood -64.4218
Algorithm converged.

LR Statistics For Type 3 Analysis


Chi-
Source DF Square Pr > ChiSq
mgt 3 140.79 <.0001
nitro 1 43.01 <.0001
mgt*nitro 3 0.90 0.8262
date 1 18.06 <.0001

Supplementing nitrogen surface application with water injection does not alter the turf
quality significantly. However, the rating distribution of the average of the nitrogen injection
treatments is significantly different from the turf quality obtained with nitrogen application
and supplemental water injection (LR = 109.51, p < 0.0001). Similarly, the average injec-
tion treatment leads to a significantly different rating distribution than the average surface
application.

Output 6.14.
Contrast Results

Chi-
Contrast DF Square Pr > ChiSq Type

WIC effect 1 1.32 0.2512 LR


Subsurface effect 1 109.51 <.0001 LR
Injected vs surface 1 139.65 <.0001 LR

To determine these rating distributions for the marginal Management Practice effect and
the marginal Rate effect, we obtain the least squares means and convert them into probabili-
ties by inverting the link function. Unfortunately, proc genmod does not permit a lsmeans
statement in combination with the multinomial distribution, i.e., for ordinal response. The
marginal treatment means can be constructed with estimate statements, however. Adding to
the genmod code the statements


estimate 'nitro 1 mean (<= Poor)'    intercept 4 0 0 mgt 1 1 1 1 nitro 4 0
         mgt*nitro 1 0 1 0 1 0 1 0 date 10 / divisor=4;
estimate 'nitro 1 mean (<= Aver)'    intercept 0 4 0 mgt 1 1 1 1 nitro 4 0
         mgt*nitro 1 0 1 0 1 0 1 0 date 10 / divisor=4;
estimate 'nitro 1 mean (<= Good)'    intercept 0 0 4 mgt 1 1 1 1 nitro 4 0
         mgt*nitro 1 0 1 0 1 0 1 0 date 10 / divisor=4;
estimate 'Mgt 1 mean (<= Poor)'      intercept 2 0 0 mgt 2 0 0 0 nitro 1 1
         mgt*nitro 1 1 0 0 0 0 0 0 date 5 / divisor=2;
estimate 'Mgt 1 mean (<= Average)'   intercept 0 2 0 mgt 2 0 0 0 nitro 1 1
         mgt*nitro 1 1 0 0 0 0 0 0 date 5 / divisor=2;
estimate 'Mgt 1 mean (<= Good)'      intercept 0 0 2 mgt 2 0 0 0 nitro 1 1
         mgt*nitro 1 1 0 0 0 0 0 0 date 5 / divisor=2;

produces linear predictors from which the marginal probability distributions for rate 2.5
g·m⁻² and for surface application with no supplemental water injection can be obtained.
Adding similar statements for the second application rate and the other three management
practices we obtain the linear predictors in Output 6.15.

Output 6.15.
Contrast Estimate Results

Standard
Label Estimate Error Alpha Confidence Limits

nitro 1 mean (<= Poor) -0.9468 0.6192 0.05 -2.1605 0.2669


nitro 1 mean (<= Aver) 4.7147 0.7617 0.05 3.2217 6.2077
nitro 1 mean (<= Good) 10.5043 1.4363 0.05 7.6892 13.3194
nitro 2 mean (<= Poor) -3.9531 0.6769 0.05 -5.2799 -2.6264
nitro 2 mean (<= Average) 1.7084 0.6085 0.05 0.5156 2.9011
nitro 2 mean (<= Good) 7.4980 1.2335 0.05 5.0803 9.9157

Mgt 1 mean (<= Poor) 0.6986 0.4902 0.05 -0.2621 1.6594


Mgt 1 mean (<= Average) 6.3602 1.1924 0.05 4.0232 8.6971
Mgt 1 mean (<= Good) 12.1498 1.7154 0.05 8.7877 15.5119
Mgt 2 mean (<= Poor) 1.5627 0.6049 0.05 0.3772 2.7483
Mgt 2 mean (<= Average) 7.2243 1.2780 0.05 4.7194 9.7291
Mgt 2 mean (<= Good) 13.0138 1.7854 0.05 9.5145 16.5132
Mgt 3 mean (<= Poor) -6.5180 1.1934 0.05 -8.8570 -4.1790
Mgt 3 mean (<= Average) -0.8565 0.4455 0.05 -1.7296 0.0166
Mgt 3 mean (<= Good) 4.9331 1.1077 0.05 2.7621 7.1041
Mgt 4 mean (<= Poor) -5.5433 1.1078 0.05 -7.7146 -3.3720
Mgt 4 mean (<= Average) 0.1182 0.4710 0.05 -0.8049 1.0414
Mgt 4 mean (<= Good) 5.9078 1.1773 0.05 3.6004 8.2153
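For example, for the first management practice the cumulative logits in Output 6.15 invert to Pr(rating ≤ Poor) = 1/(1 + exp{−0.6986}) = 0.668 and Pr(rating ≤ Average) = 1/(1 + exp{−6.3602}) = 0.998, so that Pr(rating = Average) = 0.998 − 0.668 = 0.330; these are the Mgt 1 entries of Table 6.20 below.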

A comparison of these marginal distributions is straightforward with contrast or
estimate statements. For example,
estimate 'nitro 1 - nitro 2' nitro 1 -1 ;
estimate 'mgt 1 - mgt 2' mgt 1 -1;
estimate 'mgt 1 - mgt 3' mgt 1 0 -1;
estimate 'mgt 1 - mgt 4' mgt 1 0 0 -1;
estimate 'mgt 2 - mgt 3' mgt 0 1 -1;
estimate 'mgt 2 - mgt 4' mgt 0 1 0 -1;
estimate 'mgt 3 - mgt 4' mgt 0 0 1 -1;

produces all pairwise marginal comparisons:


Output 6.16.
Contrast Estimate Results

Standard Chi-
Label Estimate Error Alpha Square Pr > ChiSq

nitro 1 - nitro 2 3.0063 0.5525 0.05 29.61 <.0001

mgt 1 - mgt 2 -0.8641 0.7726 0.05 1.25 0.2634


mgt 1 - mgt 3 7.2167 1.2735 0.05 32.11 <.0001
mgt 1 - mgt 4 6.2419 1.1947 0.05 27.30 <.0001
mgt 2 - mgt 3 8.0808 1.3561 0.05 35.51 <.0001
mgt 2 - mgt 4 7.1060 1.2785 0.05 30.89 <.0001
mgt 3 - mgt 4 -0.9747 0.6310 0.05 2.39 0.1224

Table 6.20 shows the marginal probability distributions and indicates significant differences
among the treatment levels. The surface applications lead to poor turf quality with high
probability. Their ratings are at most average in over 90% of the cases. When nitrogen is
injected into the soil the rating distributions shift toward higher categories. The 256 nozzle
(7.6 cm depth of injection) leads to good turf quality in over 2/3 of the cases. The nozzle that
injects nitrogen up to 12.7 cm has a higher probability of average ratings, compared to the
256 nozzle, probably because the shallower nozzle places the nitrogen closer to the roots.
Although there are no significant differences between the rating distributions of the two
surface applications (p = 0.2634) or between those of the two injection treatments
(p = 0.1224), the two groups of treatments clearly separate. Based on the results of this
analysis one would recommend nitrogen injection at 5 g·m⁻².

Table 6.20. Predicted marginal probability distributions (by category) for
nitrogen injection study

                         Management Practice                    N Rate
 Quality           1        2        3        4        2.5 g·m⁻²    5 g·m⁻²
 Poor            0.668    0.827    0.001    0.004        0.279       0.020
 Average         0.330    0.172    0.297    0.525        0.712       0.826
 Good            0.002    0.001    0.695    0.472        0.008       0.153
 Excellent       0+       0+       0.007    0.003        0.001       0.001
                 a†       a        b        b            a           b

 † Columns with the same letter are not significantly different in their ratings
   distributions at the 5% significance level.

6.7.6 Log-Linear Modeling of Rater Agreement


In the applications discussed so far there is a clear distinction between the response and
explanatory variables in the model. For example, the response proportion of Hessian fly-
damaged plants was modeled as a function of (explanatory) block and treatment effects in
§6.7.2. This distinction between response and explanatory variables is not possible for certain
cross-tabulated data (contingency tables). The data in Table 6.21 represent the results of


rating the same 236 experimental units by two different raters on an ordinal scale from 1 to 5.
For example, 12 units were rated in category 2 by Rater 2 and in category 1 by Rater 1. An
obvious question is whether the ratings of the two interpreters are independent. Should we
tackle this by modeling the Rater 2 results as a function of the Rater 1 results or vice versa?
There is no response variable or explanatory variable here, only two categorical variables
(Rater 1 with five categories and Rater 2 with five categories) and a cross-tabulation of 236
outcomes.

Table 6.21. Observed absolute frequencies in two-rater cross-classification

                              Rater 1
 Rater 2        1      2      3      4      5    Total
    1          10      6      4      2      2      24
    2          12     20     16      7      2      57
    3           1     12     30     20      6      69
    4           4      5     10     25     12      56
    5           1      3      3      8     15      30
 Total         28     46     63     62     37     236

Table 6.21 is a very special contingency table since the row and column variable have the
same categories. We refer to such tables as matched-pairs tables. Some of the models dis-
cussed and fitted in this subsection are specifically designed for matched-pairs tables; others
(such as the independence model) apply to any contingency table. The interested reader can
find more details on the fitting of generalized linear models to contingency tables in the
monographs by Agresti (1990) and Fienberg (1980).
A closer look at the data table suggests that the ratings are probably not independent. The
highest counts appear on the diagonal of the table. If Rater 1 assigns an experimental unit to
category i, then there seems to be a high likelihood that Rater 2 also assigns the unit to
category i. If we reject the notion of independence, for which we need to develop a statistical
test, our interest will shift to determining how the two rating schemes depend on each other.
Is there more agreement between the ratings in the table than is expected by chance? Is there
more disagreement in the table than expected by chance? Is there structure to the disagree-
ment; for example, does Rater 1 systematically assign values to higher categories?
To develop a model for independence of the ratings let X denote the column variable, Y
the row variable, and N_ij the count observed in row i, column j of the contingency table. Let
I and J denote the number of rows and columns. In a square table such as Table 6.21 we
necessarily have I = J. The independence model does not apply to square or matched-pairs
tables alone and we discuss it more generally here. The generic layout of the two-way contin-
gency table we are referring to is shown in Table 6.6 (p. 318). Recall that n_ij denotes the
observed count in row i and column j of the table, n.. denotes the total sample size, and n_i.,
n_.j are the marginal totals. Under the Poisson sampling model where the count in each cell is
the realization of a Poisson(λ_ij) random variable, the row and column totals are
Poisson(λ_i. = Σ_{j=1}^{J} λ_ij) and Poisson(λ_.j = Σ_{i=1}^{I} λ_ij) variables, and the total
sample size is a Poisson(λ.. = Σ_{i,j} λ_ij) random variable. The expected cell count λ_ij
under independence is then related to the marginal expected counts by


    λ_ij = λ_i. λ_.j / λ.. .
Taking logarithms leads to a generalized linear model with log link for Poisson random
variables and linear predictor

    ln{λ_ij} = −ln{λ..} + ln{λ_i.} + ln{λ_.j} = μ + α_i + β_j.                  [6.57]

We think of α_i and β_j as the (marginal, main) effects of the row and column variables;
independence implies the absence of the (αβ)_ij interaction between the two variables.
There exists, of course, a well-known test for independence of categorical variables in
contingency tables based on the Chi-square distribution. It is sometimes referred to as
Pearson's Chi-square test. If n_ij is the observed count in cell i,j and e_ij = n_i. n_.j / n.. is the
expected count under independence, then

    X² = Σ_{i=1}^{I} Σ_{j=1}^{J} (n_ij − e_ij)² / e_ij                          [6.58]

follows asymptotically a Chi-squared distribution with (I − 1)(J − 1) degrees of freedom.
For the approximation to hold we need at least 80% of the expected cell counts to be at least 5
and permit at most one expected cell count of 1 (Cochran 1954). This test and a likelihood
ratio test for independence can be calculated in proc freq of The SAS® System. The
statements
statements
proc freq data=rating;
table rater1*rater2 /chisq nocol norow nopercent expected;
weight number;
run;

perform the Chi-square analysis of independence (option chisq). The options nocol,
norow, and nopercent suppress the printing of column, row, and cell percentages. The
expected option requests a printout of the expected frequencies under independence. None of
the expected frequencies is less than 1 and exactly 20 frequencies (80%) exceed 5 (Output
6.17).
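As a hand check of [6.58], the expected count in cell 1,1 is e_11 = n_1. n_.1 / n.. = 24 × 28/236 = 2.8475, matching the Expected entry in Output 6.17; this cell contributes (10 − 2.8475)²/2.8475 = 17.97 to X².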
The Chi-squared approximation holds and the Pearson test statistic is X² = 103.1089
(p < 0.0001). There is significant disagreement between the observed counts and the counts
expected under an independence model. The hypothesis of independence is rejected. The
likelihood ratio test with test statistic Λ = 95.3577 leads to the same conclusion. The calcula-
tion of Λ is discussed below. The expected cell counts on the diagonal of the contingency
table reflect the degree of chance agreement and are interpreted as follows: if the counts are
distributed completely at random to the cells conditional on preserving the marginal row and
column totals, one would expect this degree of agreement between the ratings.
The independence model is rarely the best-fitting model for a contingency table and the
modeler needs to consider other models incorporating dependence between the row and
column variable. This is certainly the case here since the notion of independence has been
clearly rejected. This requires the use of statistical procedures that can fit other models than
independence, such as proc genmod.


Output 6.17.
The FREQ Procedure

Table of rater1 by rater2

rater1 rater2

Frequency|
Expected | 1| 2| 3| 4| 5| Total
---------+--------+--------+--------+--------+--------+
1 | 10 | 6 | 4 | 2 | 2 | 24
| 2.8475 | 4.678 | 6.4068 | 6.3051 | 3.7627 |
---------+--------+--------+--------+--------+--------+
2 | 12 | 20 | 16 | 7 | 2 | 57
| 6.7627 | 11.11 | 15.216 | 14.975 | 8.9364 |
---------+--------+--------+--------+--------+--------+
3 | 1 | 12 | 30 | 20 | 6 | 69
| 8.1864 | 13.449 | 18.419 | 18.127 | 10.818 |
---------+--------+--------+--------+--------+--------+
4 | 4 | 5 | 10 | 25 | 12 | 56
| 6.6441 | 10.915 | 14.949 | 14.712 | 8.7797 |
---------+--------+--------+--------+--------+--------+
5 | 1 | 3 | 3 | 8 | 15 | 30
| 3.5593 | 5.8475 | 8.0085 | 7.8814 | 4.7034 |
---------+--------+--------+--------+--------+--------+
Total 28 46 63 62 37 236

Statistics for Table of rater1 by rater2

Statistic DF Value Prob


------------------------------------------------------
Chi-Square 16 103.1089 <.0001
Likelihood Ratio Chi-Square 16 95.3577 <.0001
Mantel-Haenszel Chi-Square 1 59.2510 <.0001
Phi Coefficient 0.6610
Contingency Coefficient 0.5514
Cramer's V 0.3305

Sample Size = 236

We start by fitting the independence model in proc genmod to show the equivalence to
the Chi-square analysis in proc freq.
title1 'Independence model for ratings';
ods exclude ParameterEstimates obstats;
proc genmod data=rating;
class rater1 rater2;
model number = rater1 rater2 /link=log error=poisson obstats;
ods output obstats=stats;
run;

title1 'Predicted cell counts under model of independence';
proc freq data=stats;
   table rater1*rater2 / nocol norow nopercent;
   weight pred;
run;

The ods output obstats=stats; statement saves the observation statistics table which
contains the predicted values (variable pred). The proc freq code following proc genmod
tabulates the predicted values for the independence model.


In the Criteria For Assessing Goodness Of Fit table we find a model deviance of
95.3577 with 16 degrees of freedom (Output 6.18). This is twice the difference between the
log likelihood of a full model containing Rater 1 and Rater 2 main effects and their inter-
action and the (reduced) independence model shown here.
The full model (code and output not shown) has a log likelihood of 368.4737 and the
deviance of the independence model becomes 2 × (368.4737 − 320.7949) = 95.357, which is
of course the likelihood ratio statistic for testing the absence of Rater 1 × Rater 2 interactions
and identical to the likelihood ratio statistic reported by proc freq above. Similarly, the
Pearson residual Chi-square statistic of 103.1089 in Output 6.18 is identical to the Chi-Square
statistic calculated by proc freq. That the independence model fits these data poorly is also
conveyed by the "overdispersion" factor of 5.96 (or 6.44). Some important effects are
unaccounted for; these are the interactions between the ratings.

Output 6.18.
Independence model for ratings

The GENMOD Procedure

Model Information
Data Set WORK.RATING
Distribution Poisson
Link Function Log
Dependent Variable number
Observations Used 25

Class Level Information


Class Levels Values
rater1 5 1 2 3 4 5
rater2 5 1 2 3 4 5

Criteria For Assessing Goodness Of Fit


Criterion DF Value Value/DF
Deviance 16 95.3577 5.9599
Scaled Deviance 16 95.3577 5.9599
Pearson Chi-Square 16 103.1089 6.4443
Scaled Pearson X2 16 103.1089 6.4443
Log Likelihood 320.7949

Algorithm converged.

Analysis Of Parameter Estimates

Standard Wald 95% Confidence Chi-


Parameter DF Estimate Error Limits Square

Intercept 1 1.5483 0.2369 1.0840 2.0126 42.71


rater1 1 1 -0.2231 0.2739 -0.7599 0.3136 0.66
rater1 2 1 0.6419 0.2256 0.1998 1.0839 8.10
rater1 3 1 0.8329 0.2187 0.4043 1.2615 14.51
rater1 4 1 0.6242 0.2263 0.1807 1.0676 7.61
rater1 5 0 0.0000 0.0000 0.0000 0.0000 .
rater2 1 1 -0.2787 0.2505 -0.7696 0.2122 1.24
rater2 2 1 0.2177 0.2208 -0.2151 0.6505 0.97
rater2 3 1 0.5322 0.2071 0.1263 0.9382 6.60
rater2 4 1 0.5162 0.2077 0.1091 0.9234 6.17
rater2 5 0 0.0000 0.0000 0.0000 0.0000 .
Scale 0 1.0000 0.0000 1.0000 1.0000


From the Analysis of Parameter Estimates table the predicted values can be con-
structed (Output 6.19). The predicted count in cell 1,1, for example, is obtained from the
estimated linear predictor

    η̂_11 = 1.5483 − 0.2231 − 0.2787 = 1.0465

and the inverse link function

    Ê[N_11] = e^1.0465 = 2.8476.

Output 6.19.
Predicted cell counts under model of independence

The FREQ Procedure

Table of rater1 by rater2


rater1 rater2

Frequency|1 |2 |3 |4 |5 | Total
---------+--------+--------+--------+--------+--------+
1 | 2.8475 | 4.678 | 6.4068 | 6.3051 | 3.7627 | 24
---------+--------+--------+--------+--------+--------+
2 | 6.7627 | 11.11 | 15.216 | 14.975 | 8.9364 | 57
---------+--------+--------+--------+--------+--------+
3 | 8.1864 | 13.449 | 18.419 | 18.127 | 10.818 | 69
---------+--------+--------+--------+--------+--------+
4 | 6.6441 | 10.915 | 14.949 | 14.712 | 8.7797 | 56
---------+--------+--------+--------+--------+--------+
5 | 3.5593 | 5.8475 | 8.0085 | 7.8814 | 4.7034 | 30
---------+--------+--------+--------+--------+--------+
Total 28 46 63 62 37 236

Simply adding a general interaction term between row and column variable will not solve
the problem because the model

    ln{λ_ij} = μ + α_i + β_j + (αβ)_ij                                          [6.59]

is saturated, that is, it fits the observed data perfectly. Just as in the case of a general two-
way layout (e.g., a randomized block design) adding interactions between the factors depletes
the degrees of freedom. The saturated model has a deviance of exactly 0 and
n_ij = exp{μ̂ + α̂_i + β̂_j + (αβ̂)_ij}. The deviation from independence must be structured in some
way to preserve degrees of freedom. We distinguish three forms of structured interactions:
• association that focuses on structured patterns of dependence between X and Y;
• agreement that focuses on the counts on the main diagonal;
• disagreement that focuses on the counts in off-diagonal cells.

Modeling association requires that the categories of X and Y are ordered; agreement and
disagreement can be modeled with nominal and/or ordered categories. It should be noted that
cell counts can show strong association but weak agreement, for example, if one rater
consistently assigns outcomes to higher categories than the other rater.


Linear-By-Linear Association for Ordered Categories


The linear-by-linear association model replaces the general interaction term in [6.59] by the
term γ u_i v_j, where γ is an association parameter to be estimated and u_i and v_j are scores
assigned to the ordered categories. This model, also known as the uniform association model
(Goodman 1979a, 1985; Agresti 1990, pp. 263-265), seeks to detect a particular kind of inter-
action requiring only one degree of freedom for the estimation of γ beyond the model of inde-
pendence. It is in spirit akin to Tukey's one degree of freedom test for nonadditivity in two-
way layouts without replication (Tukey 1949). The term exp{γ u_i v_j} can be thought of as a
multiplicative factor that increases or decreases the cell counts away from the independence
model. To demonstrate the effect of the linear-by-linear association term we use centered
scores. Let u_i = v_j = 1, 2, ..., 4 and define ū to be the average of the possible X scores and v̄
the corresponding average of the Y scores. Define the linear-by-linear association term as
γ u*_i v*_j where u*_i = u_i − ū, v*_j = v_j − v̄. For an association parameter of γ = 0.05 the terms
exp{γ u*_i v*_j} are shown in Table 6.22.

Table 6.22. Multiplicative factors exp{γ u*_i v*_j} for γ = 0.05 and centered scores
(centered scores u*_i and v*_j shown in parentheses)

                                     v_j
                       1         2         3         4
 u_i                 (−1.5)    (−0.5)    (0.5)     (1.5)
 1       (−1.5)       1.12      1.04      0.96      0.89
 2       (−0.5)       1.04      1.01      0.98      0.96
 3       (0.5)        0.96      0.98      1.01      1.03
 4       (1.5)        0.89      0.96      1.03      1.12
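The entries of Table 6.22 are easily reproduced with a short data step; a minimal sketch (the data set and variable names factors, u, v are ours, not from the text):

* multiplicative factors of Table 6.22 for gamma = 0.05;
data factors;
   gamma = 0.05;
   do u = 1 to 4;                               * row scores;
      do v = 1 to 4;                            * column scores;
         factor = exp(gamma*(u-2.5)*(v-2.5));   * centered scores u - 2.5, v - 2.5;
         output;
      end;
   end;
run;
proc print data=factors noobs; run;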

The multiplicative terms are symmetric about the center of the table and increase along
the diagonal toward the corners of the table. At the same time the expected counts are decre-
mented relative to an independence model toward the upper right and lower left corners of the
table. The linear-by-linear association model assumes that high (low) values of X pair more
frequently with high (low) values of Y than is expected under independence. At the same
time high (low) values of X pair less frequently with low (high) values of Y. In cases where
it is more difficult to assign outcomes to categories in the middle of the scale than to extreme
categories the linear-by-linear association model will tend to fit the data well. For the model
fit and the predicted values it does not matter whether the scores are centered or not. Because
of the convenient interpretation in terms of multiplicative effects as shown in Table 6.22 we
prefer to work with centered scores.
The linear-by-linear association model with centered scores is fit to the rater agreement
data using proc genmod with the statements
title3 'Uniform association model for ratings';
data rating; set rating;
sc1_centered = rater1-3; sc2_centered = rater2-3;
run;
ods exclude obstats;
proc genmod data=rating;
class rater1 rater2;
model number = rater1 rater2 sc1_centered*sc2_centered /
link=log error=poisson type3 obstats;


   ods output obstats=unifassoc;
run;
proc freq data=unifassoc;
   table rater1*rater2 / nocol norow nopercent;
   weight pred;
run;

The deviance of this model is much improved over the independence model (Output
6.20). The estimate for the association parameter is γ̂ = 0.4455 and the likelihood-ratio test
for H₀: γ = 0 shows that the addition of the linear-by-linear association significantly
improves the model. The likelihood ratio Chi-square statistic of 67.45 equals the difference
between the two model deviances (95.35 − 27.90).

Output 6.20.
Uniform association model for ratings

The GENMOD Procedure

Model Information

Data Set WORK.RATING


Distribution Poisson
Link Function Log
Dependent Variable number
Observations Used 25

Class Level Information

Class Levels Values


rater1 5 1 2 3 4 5
rater2 5 1 2 3 4 5

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF


Deviance 15 27.9098 1.8607
Scaled Deviance 15 27.9098 1.8607
Pearson Chi-Square 15 32.3510 2.1567
Scaled Pearson X2 15 32.3510 2.1567
Log Likelihood 354.5188

Algorithm converged.

Analysis Of Parameter Estimates

Standard Wald 95% Confidence


Parameter DF Estimate Error Limits
Intercept 1 0.6987 0.2878 0.1347 1.2628
rater1 1 1 -0.0086 0.3022 -0.6009 0.5837
rater1 2 1 1.1680 0.2676 0.6435 1.6924
rater1 3 1 1.4244 0.2621 0.9106 1.9382
rater1 4 1 1.0282 0.2452 0.5475 1.5088
rater1 5 0 0.0000 0.0000 0.0000 0.0000
rater2 1 1 -0.3213 0.2782 -0.8665 0.2239
rater2 2 1 0.5057 0.2477 0.0203 0.9912
rater2 3 1 0.9464 0.2369 0.4820 1.4108
rater2 4 1 0.8313 0.2228 0.3947 1.2680
rater2 5 0 0.0000 0.0000 0.0000 0.0000
sc1_cente*sc2_center 1 0.4455 0.0647 0.3187 0.5722
Scale 0 1.0000 0.0000 1.0000 1.0000

NOTE: The scale parameter was held fixed.


Output 6.20 (continued).


LR Statistics For Type 3 Analysis

Chi-
Source DF Square Pr > ChiSq
rater1 4 57.53 <.0001
rater2 4 39.18 <.0001
sc1_cente*sc2_center 1 67.45 <.0001

Predicted cell counts for linear-by-linear association model

The FREQ Procedure

Table of rater1 by rater2

rater1 rater2

Frequency|1 |2 |3 |4 |5 | Total
---------+--------+--------+--------+--------+--------+
1 | 8.5899 | 8.0587 | 5.1371 | 1.8786 | 0.3356 | 24
---------+--------+--------+--------+--------+--------+
2 | 11.431 | 16.742 | 16.662 | 9.5125 | 2.6533 | 57
---------+--------+--------+--------+--------+--------+
3 | 6.0607 | 13.858 | 21.532 | 19.192 | 8.3574 | 69
---------+--------+--------+--------+--------+--------+
4 | 1.6732 | 5.9728 | 14.488 | 20.16 | 13.706 | 56
---------+--------+--------+--------+--------+--------+
5 | 0.2455 | 1.3683 | 5.1816 | 11.257 | 11.948 | 30
---------+--------+--------+--------+--------+--------+
Total 28 46 63 62 37 236

Comparing the predicted cell counts to those of the independence model, it is seen how
the counts increase in the upper left and lower right corners of the table and decrease toward
the upper right and lower left corners.
The fit of the linear-by-linear association model is dramatically improved over the inde-
pendence model at the cost of only one additional degree of freedom. The model fit is not
satisfactory, however. The p-value for the model deviance, Pr(χ²_15 ≥ 27.90) = 0.022, indi-
cates that a significant discrepancy between model and data remains. The interactions
between Rater 1 and Rater 2 category assignments must be structured further.
Before proceeding with modeling structured interaction as agreement we need to point
out that the linear-by-linear association model requires that scores be assigned to the ordered
categories. This introduces a subjective element into the analysis; different modelers may
assign different sets of scores. Log-multiplicative models with predictor μ + α_i + β_j + γ φ_i ω_j
have been developed where the category scores φ_i and ω_j are themselves parameters to be
estimated. For more information about these log-multiplicative models see Becker (1989,
1990a, 1990b) and Goodman (1979b).

Modeling Agreement and Disagreement


Modeling agreement between ratings focuses on the diagonal cells of the table and param-
eterizes the beyond-chance agreement in the data. It does not require ordered categories as the
association models do. The simplest agreement model adds a single parameter δ that models


excess counts on the diagonal. Let z_ij be an indicator variable such that

    z_ij = 1 if i = j, and 0 otherwise.

The homogeneous agreement model is defined as

    ln{λ_ij} = μ + α_i + β_j + δ z_ij.

Interactions are modeled with a single degree of freedom term. A positive δ indicates that
more counts fall on the main diagonal than would be expected under a random assignment of
counts to the table (given the marginal totals). The agreement parameter δ can also be made
to vary with categories. This nonhomogeneous agreement model is defined as

    ln{λ_ij} = μ + α_i + β_j + δ_i z_ij.

The separate agreement parameters δ_i will saturate the model on the main diagonal; that is,
predicted and observed counts will agree perfectly for the diagonal cells. In proc genmod the
homogeneous and nonhomogeneous agreement models are fitted easily by defining an
indicator variable for the diagonal (output not shown).
data rating; set rating; diag = (rater1 = rater2); run;

title1 'Homogeneous Agreement model for ratings';
ods exclude obstats;
proc genmod data=rating;
   class rater1 rater2;
   model number = rater1 rater2 diag /link=log error=poisson type3;
run;

title1 'Nonhomogeneous Agreement model for ratings';
proc genmod data=rating;
   class rater1 rater2;
   model number = rater1 rater2 diag*rater1 /link=log error=poisson type3;
run;

Disagreement models place emphasis on cells off the main diagonal. For example, the
model

    ln{λ_ij} = μ + α_i + β_j + δ z_ij,    z_ij = 1 if |i − j| = 1, and 0 otherwise

adds an additional parameter δ to all cells adjacent to the main diagonal, and the model

    ln{λ_ij} = μ + α_i + β_j + δ⁺ z_ij + δ⁻ c_ij,
    z_ij = 1 if i − j = −1, and 0 otherwise;  c_ij = 1 if i − j = 1, and 0 otherwise

adds two separate parameters, one for the first band above the main diagonal (δ⁺) and one for
the first band below the main diagonal (δ⁻). For this and other disagreement structures see
Tanner and Young (1985).
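The banded disagreement models can be fit along the same lines by creating indicator variables for the off-diagonal bands; a possible sketch (the variable names band_up and band_low are ours, not from the text):

data rating; set rating;
   band_up  = (rater1 - rater2 = -1);   * first band above the main diagonal;
   band_low = (rater1 - rater2 =  1);   * first band below the main diagonal;
run;
proc genmod data=rating;
   class rater1 rater2;
   model number = rater1 rater2 band_up band_low / link=log error=poisson type3;
run;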
When categories are ordered, agreement and association parameters can be combined. A
linear-by-linear association model with homogeneous agreement, for example, becomes


" 3œ4
lne-34 f œ . € !3 € "4 € # ?3 @4 € $ D34 ß D34 œ œ
! otherwise.
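This combined model (model 5 in Table 6.23 below) can be fit by reusing the variables created earlier in this section; a sketch:

proc genmod data=rating;
   class rater1 rater2;
   model number = rater1 rater2 sc1_centered*sc2_centered diag /
         link=log error=poisson type3;
run;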

Table 6.23 lists the deviances and p-values for various models fit to the data in Table 6.21.

Table 6.23. Model deviances for various log-linear models fit to data in Table 6.21
           Linear-by-
           linear     Homog.     Nonhomog.               Deviance
 Model     assoc.     agreemt.   agreemt.     df      D       D/df    p-value
   1                                          16    95.36†    5.96   < 0.0001
   2         ✓                                15    27.90     1.86     0.0222
   3                    ✓                     15    43.99     2.93     0.0001
   4                               ✓          11    36.85     3.35     0.0001
   5         ✓          ✓                     14    16.61     1.18     0.2776
   6         ✓                     ✓          10    15.65     1.56     0.1101

 † The independence model

The best-fitting model is the combination of a linear-by-linear association term and a
homogeneous agreement term (p = 0.2776). The agreement between the two raters is
beyond chance and beyond what the linear association term allows. At the 5% significance
level the model with association and nonhomogeneous agreement (model 6) also does not exhibit a
significant deviance. The loss of four degrees of freedom for the extra agreement parameters
relative to model 5 is not offset by a sufficient decrease in the deviance, however. The more parsi-
monious model 5 is preferred. This model also leads to the deviance/df ratio which is closest
to one. It should be noted that the null hypothesis underlying the goodness-of-fit test for
model 5, for example, is that model 5 fits the data. A failure to reject this hypothesis is not
evidence that model 5 is the correct one. It simply cannot be ruled out as a possible data-
generating mechanism. The p-values for the deviance test should be sufficiently large to rule
out an error of the second kind before declaring a model as the one.

6.7.7 Modeling the Sample Variance of Scab Infection


In Example 6.4 (p. 310) a data set is displayed showing the variances in deoxynivalenol
(DON) concentration among probe samples from truckloads of wheat kernels as a function of
the average DON concentration on the truck. The trend in Figure 6.4 suggests that the varia-
bility and the mean concentration are not unrelated. If one were to make a sample size recom-
mendation as to how many probe samples are necessary to be within ±Δ of the true con-
centration with a certain level of confidence, an estimate of the variance sensitive to the level
of kernel contamination is needed.
A model relating the probe-to-probe variance to the mean concentration is a first step in
this direction. As discussed in §6.2.1 it is usually not reasonable to assume that sample
variances are Gaussian-distributed. If Y_1, ..., Y_n are a random sample from a Gaussian distri-
bution with variance σ², then (n − 1)S²/σ² is a Chi-squared random variable on n − 1
degrees of freedom. Here, S² denotes the sample variance,


    S² = 1/(n − 1) · Σ_{i=1}^{n} (Y_i − Ȳ)².

Consequently, S² is distributed as a Gamma random variable with mean μ = σ² and scale
parameter α = (n − 1)/2 (in the parameterization [6.7], p. 308). For σ² = 2 and n = 7, for
example, the skewness of the probability density function of S² is apparent in Figure 6.19.
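Under this parameterization the variance of a Gamma random variable is the squared mean divided by the scale parameter, so that Var[S²] = μ²/α = σ⁴/((n − 1)/2) = 2σ⁴/(n − 1), the familiar result for Gaussian samples; for σ² = 2 and n = 7 this gives Var[S²] = 2 · 4/6 ≈ 1.33.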

Figure 6.19. Probability density function of S² for σ² = 2, n = 7 when (n − 1)S²/σ² is
distributed as χ²_6.

Modeling S² as a Gamma random variable with α = (n − 1)/2, we need to find a model
for the mean μ. Figure 6.20 shows the logarithm of the truck sample variances plotted against
the (natural) logarithm of the deoxynivalenol concentration. From this plot, a model relating
the log variance against the log concentration could be considered. That is, we propose to
model

    ln{σ²_i} = β₀ + β₁ ln{x_i},

where x_i is the mean DON concentration on the i-th truck, and σ²_i = exp{β₀ + β₁ ln{x_i}} is the
mean of a Gamma random variable with scale parameter α = (n − 1)/2 = 4.5 (since each
sample variance is based on n = 10 probe samples per truck). Notice that the mean can also
be expressed in the form

    σ²_i = exp{β₀ + β₁ ln{x_i}} = e^{β₀} x_i^{β₁} = α₀ x_i^{β₁}.

The proc genmod statements to fit this model (Output 6.21) are
proc genmod data=don;
model donvar = logmean / link=log dist=gamma scale=4.5 noscale;
run;


Figure 6.20. Logarithm of probe-to-probe variance against log of average DON
concentration.

The scale=4.5 option sets the Gamma scale parameter α to (n − 1)/2 = 4.5 and the
noscale option prevents the scale parameter from being fit iteratively by maximum likeli-
hood. The combination of the two options fixes α = 4.5 throughout the estimation process so
that only the parameters β₀ and β₁ are being estimated.

Output 6.21. The GENMOD Procedure

Model Information
Data Set WORK.FITTHIS
Distribution Gamma
Link Function Log
Dependent Variable donvar
Observations Used 9
Missing Values 1

Criteria For Assessing Goodness Of Fit


Criterion DF Value Value/DF
Deviance 7 2.7712 0.3959
Scaled Deviance 7 12.4702 1.7815
Pearson Chi-Square 7 2.4168 0.3453
Scaled Pearson X2 7 10.8755 1.5536
Log Likelihood 0.0675
Algorithm converged.

Analysis Of Parameter Estimates

Standard Wald 95% Chi-


Parameter DF Estimate Error Confidence Limits Square Pr > ChiSq

Intercept 1 -1.2065 0.1938 -1.5864 -0.8266 38.74 <.0001


logmean 1 0.5832 0.1394 0.3100 0.8564 17.50 <.0001
Scale 0 4.5000 0.0000 4.5000 4.5000
NOTE: The scale parameter was held fixed.

Lagrange Multiplier Statistics


Parameter Chi-Square Pr > ChiSq
Scale 0.5098 0.4752


Fixing the scale parameter at a certain value imposes a constraint on the model since in
the regular Gamma regression model the scale parameter would be estimated (see the yield-
density application in §6.7.3, for example). The genmod procedure calculates a test of whether
this constraint is reasonable and lists it in the Lagrange Multiplier Statistics table
(Output 6.21). Based on the p-value of 0.4752 we conclude that estimating the scale
parameter rather than fixing it at 4.5 would not improve the model. Fixing the parameter is
reasonable.
The estimates for the intercept and slope are β̂₀ = −1.2065 and β̂₁ = 0.5832,
respectively, from which the variance at any DON concentration can be estimated. For
example, we expect the probe-to-probe variation on a truck with a deoxynivalenol
concentration of 5 parts per million to be

    σ̂² = exp{−1.2065 + 0.5832 ln{5}} = 0.765 ppm².
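The same prediction (with a standard error) can be obtained directly from proc genmod with an estimate statement; a minimal sketch, supplying ln{5} = 1.6094 as the covariate value (the label is ours):

proc genmod data=don;
   model donvar = logmean / link=log dist=gamma scale=4.5 noscale;
   estimate 'variance at 5 ppm' intercept 1 logmean 1.6094 / exp;
run;

The exp option exponentiates the estimated linear predictor and its confidence limits, so the estimate should return approximately 0.765 ppm² on the variance scale.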

Figure 6.21 displays the predicted variances and approximate 95% confidence bounds for
the predicted values. The model does not fit the data perfectly. The DON variance on the
truck with an average concentration of 6.4 ppm is considerably off the trend. The model fits
rather well for concentrations up to 5 ppm, however. It is noteworthy how the confidence
bands widen as the DON concentration increases. Two effects are causing this. First, the
probe-to-probe variances are not homoscedastic; according to the Gamma regression model

    Var[S²_i] = (1/4.5) exp{2β₀ + 2β₁ ln(x_i)}

increases sharply with x. Second, the data are very sparse for larger concentrations.

Figure 6.21. Predicted values for probe-to-probe variance from Gamma regression (solid
line). Dashed lines are asymptotic 95% confidence intervals for the mean variance at a given
DON concentration.


6.7.8 A Poisson/Gamma Mixing Model for Overdispersed Poppy Counts
Mead et al. (1993, p. 144) provide the data in Table 6.24 from a randomized complete block
design with six treatments in four blocks in which the response variable was the poppy count
on an experimental unit.

Table 6.24. Poppy count data from Mead, Curnow, and Hasted (1993, p. 144)†
 Treatment    Block 1    Block 2    Block 3    Block 4
    A           538        422        377        315
    B           438        442        319        380
    C            77         61        157         52
    D           115         57        100         45
    E            17         31         87         16
    F            18         26         77         20
 † Used with permission.

Since these are count data, one could assume that the responses are Poisson-distributed
and notice that for mean counts greater than 15 to 20, the Poisson distribution is closely
approximated by a Gaussian distribution (Figure 6.22). The temptation to analyze these data
by standard analysis of variance assuming Gaussian errors is thus understandable. From
Figure 6.22 it can be inferred, however, that the variance of the Gaussian distribution
approximating the Poisson mass function is linked to the mean. The distribution approxi-
mating the Poisson when the average counts are small will have a smaller variance than the
approximating distribution for large counts, thereby violating the homogeneous variance
assumption of the standard analysis of variance.

Figure 6.22. Poisson(20) probability mass function (bars) overlaid with probability density
function of a Gaussian(μ = 20, σ² = 20) random variable.


Mead et al. (1993, p. 145) highlight this problem by examining the mean square error
estimate from the analysis of variance, which is σ̂² = 2,653. An off-the-cuff confidence
interval for the mean poppy count for treatment F leads to

    ȳ_F ± 2 ese(ȳ_F) = 35.25 ± 2 √(2,653/4) = 35.25 ± 51.50 = [−16.25, 86.75].

Based on this confidence interval one would not reject the idea that the mean poppy count for
treatment F could be −10, say. This is a nonsensical result. The variability of the counts is
smaller for treatments with small average counts and larger for treatments with large average
counts. The analysis of variance mean square error is a pooled estimator of the residual varia-
bility, being too high for treatments such as F and too small for treatments such as A.
An analysis of these data based on a generalized linear model with Poisson-distributed
outcomes and a linear predictor that incorporates block and treatment effects is more
reasonable. Using the canonical log link, the following proc genmod statements fit this GLM.
ods exclude ParameterEstimates;
proc genmod data=poppies;
class block treatment;
model count = block treatment / link=log dist=Poisson;
run;

Output 6.22.
The GENMOD Procedure

Model Information

Data Set WORK.POPPIES


Distribution Poisson
Link Function Log
Dependent Variable count
Observations Used 24

Class Level Information

Class Levels Values


block 4 1 2 3 4
treatment 6 A B C D E F

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF


Deviance 15 256.2998 17.0867
Scaled Deviance 15 256.2998 17.0867
Pearson Chi-Square 15 273.0908 18.2061
Scaled Pearson X2 15 273.0908 18.2061
Log Likelihood 19205.2442

Algorithm converged.

The troubling statistic in Output 6.22 is the model deviance of 256.29 compared to its
degrees of freedom (15). The data are considerably overdispersed relative to a Poisson distri-
bution. A value of 17.08 should give the modeler pause (since the target ratio is 1). A
possible reason for the considerable overdispersion could be an incorrect linear predictor. If
these data stem from an experimental design with proper blocking and randomization pro-
cedure, no other effects apart from block and treatment effects should be necessary. Ruling
out an incorrect linear predictor, (positive) correlations among poppy counts could be the
cause of the overdispersion. Mead et al. (1993) explain the overdispersion by the fact that
whenever there is one poppy in an experimental unit, there are almost always several; hence
poppies are clustered and not distributed completely at random within (and possibly across)
experimental units.
The overdispersion problem can be fixed if one assumes that the counts have mean and
variance

    E[Y] = λ,    Var[Y] = φλ,

where φ is a multiplicative overdispersion parameter. In proc genmod such an overdispersion
parameter can be estimated from either the deviance or Pearson residuals by adding the
dscale or pscale options to the model statement. The statements

proc genmod data=poppies;
   class block treatment;
   model count = block treatment / link=log dist=Poisson dscale;
run;

fit the same model as above but estimate an extra overdispersion parameter as

    φ̂ = 256.29/15 = 17.08
and report its square root as Scale in the table of parameter estimates. This approach to
accommodating overdispersion has several disadvantages in our opinion.
• It adds a parameter to the variance of the response which is not part of the Poisson
  distribution. The variance Var[Y] = φλ is no longer that of a Poisson random variable
  with mean λ. The analysis is thus no longer a maximum likelihood analysis for a
  Poisson model. It is a quasi-likelihood analysis in the sense of McCullagh (1983) and
  McCullagh and Nelder (1989).
• The parameter estimates are not affected by the inclusion of the extra overdispersion
  parameter φ. Only their standard errors are. For the data set and model considered
  here, the standard errors of all parameter estimates will be √17.08 = 4.133 times
  larger in the model with Var[Y] = φλ. The predicted mean counts will remain the
  same, however.
• The addition of an overdispersion parameter does not induce correlations among the
  outcomes. If data are overdispersed because positive correlations among the obser-
  vations are ignored, a multiplicative overdispersion parameter is the wrong remedy.

Inducing correlations in random variables is relatively simple if one considers a param-
eter of the response distribution to be a random variable rather than a constant. For example,
in a Binomial(n, π) population one can either assume that the binomial sample size n or the
success probability π is random. Reasonable distributions for these parameters could be a
Poisson or truncated Poisson for n or the family of Beta distributions for π. In §A6.8.6 we
provide details on how these mixing models translate into marginal distributions where the
counts are correlated. More importantly, it is easy to show that the marginal variability of a


random variable reckoned over the possible values of a randomly varying parameter is larger
than the conditional variability one obtains if the parameter is fixed at a certain value.
This mixing approach has particular appeal if the marginal distribution of the response
variable remains a member of the exponential family. In this application we can assume that
the poppy counts for treatment i in block j are distributed as Poisson(λ_ij), but that λ_ij is a
random variable. Since the mean counts are non-negative we choose a probability distribution
for λ_ij that has nonzero density on the positive real line. If the mean of a Poisson(λ) variable
is distributed as Gamma(μ, α), it turns out that the marginal distribution of the count follows
the Negative Binomial distribution, which is a member of the exponential family (Table 6.2, p.
305). The details on how the marginal Negative Binomial distribution is derived are provided
in §A6.8.6.
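The variance inflation follows from the usual conditioning argument: Var[Y] = E[Var[Y | λ]] + Var[E[Y | λ]] = E[λ] + Var[λ], which exceeds the conditional (Poisson) variance whenever Var[λ] > 0; for the Gamma(μ, α) mixing distribution this marginal variance takes the Negative Binomial form μ + μ²/α.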
Since Release 8.0 of The SAS® System, Negative Binomial outcomes can be modeled
with proc genmod using the dist=negbin option of the model statement. The canonical link of
this distribution is the log link. Starting with Release 8.1, the Negative Binomial distribution
is coded into proc nlmixed. Using an earlier version of The SAS® System (Release 8.0), we
can also fit count data with this distribution in proc nlmixed by coding the log likelihood
with SAS® programming statements and using the general() formulation of the model
statement. For the poppy count data these statements are as follows.

proc nlmixed data=poppies df=14;
   parameters intcpt=3.4
              bl1=0.3 bl2=0.3 bl3=0.3
              tA=1.5 tB=1.5 tC=1.5 tD=1.5 tE=1.5 k=1;
   if block=1 then linp = intcpt + bl1;
   else if block=2 then linp = intcpt + bl2;
   else if block=3 then linp = intcpt + bl3;
   else if block=4 then linp = intcpt;
   if treatment = 'A' then linp = linp + tA;
   else if treatment = 'B' then linp = linp + tB;
   else if treatment = 'C' then linp = linp + tC;
   else if treatment = 'D' then linp = linp + tD;
   else if treatment = 'E' then linp = linp + tE;
   else if treatment = 'F' then linp = linp;
   b = exp(linp)/k;
   ll = lgamma(count+k) - lgamma(k) - lgamma(count + 1) +
        k*log(1/(b+1)) + count*log(b/(b+1));
   model count ~ general(ll);
run;

The parameters statement assigns starting values to the model parameters. Although
there are four block and six treatment effects, only three block and five treatment effects need
to be coded due to the sum-to-zero constraints on block and treatment effects. Because proc
nlmixed determines residual degrees of freedom by a different method than proc genmod we
added the df=14 option to the proc nlmixed statement. This will ensure the same degrees of
freedom as in the Poisson analysis minus one degree of freedom for the additional parameter
of the Negative Binomial that determines the degree of overdispersion relative to the Poisson
model. The several lines of if ... then ...; else ...; statements that follow the
parameters statement set up the linear predictor as a function of block effects (bl1...bl3)
and treatment effects (tA...tE). The statements


b = exp(linp)/k;
ll = lgamma(count+k) - lgamma(k) - lgamma(count + 1) +
k*log(1/(b+1)) + count*log(b/(b+1));

code the Negative Binomial log likelihood. We have chosen a log likelihood based on a
parameterization in §A6.8.6. The model statement finally instructs proc nlmixed to perform a
maximum likelihood analysis for the variable count where the log likelihood is determined
by the variable ll. This analysis results in maximum likelihood estimates for Negative Bi-
nomial responses with linear predictor

    η_ij = μ + τ_i + ρ_j.

The parameterization used in this proc nlmixed code expresses the mean response as simply
exp{η_ij} for ease of comparison with the Poisson analysis. The abridged nlmixed output
follows.
follows.

Output 6.23 (abridged).


The NLMIXED Procedure

Specifications
Description Value
Data Set WORK.POPPIES
Dependent Variable count
Distribution for Dependent Variable General
Optimization Technique Dual Quasi-Newton
Integration Method None

NOTE: GCONV convergence criterion satisfied.

Fit Statistics
Description Value
-2 Log Likelihood 233.1
AIC (smaller is better) 253.1
BIC (smaller is better) 264.9
Log Likelihood -116.5
AIC (larger is better) -126.5
BIC (larger is better) -132.4

Parameter Estimates
Standard
Parameter Estimate Error DF t Value Pr > |t|

intcpt 3.0433 0.2121 14 14.35 <.0001


bl1 0.3858 0.1856 14 2.08 0.0565
bl2 0.2672 0.1864 14 1.43 0.1737
bl3 0.8431 0.1902 14 4.43 0.0006
tA 2.6304 0.2322 14 11.33 <.0001
tB 2.6185 0.2326 14 11.26 <.0001
tC 0.9618 0.2347 14 4.10 0.0011
tD 0.9173 0.2371 14 3.87 0.0017
tE 0.08057 0.2429 14 0.33 0.7450
k 11.4354 3.7499 14 3.05 0.0087

Although it is simple to predict the performance of a treatment in a particular block based
on this analysis, this is usually not the final analytical goal. For example, the predicted count
for treatment A in block 2 is exp{3.0433 + 0.2672 + 2.6304} = 380.28. We are more
interested in a comparison of the treatments averaged across the blocks. Before embarking on
treatment comparisons we check the significance of the treatment effects. A reduced model
with block effects only is fit with the statements


proc nlmixed data=poppies df=19;
   parameters intcpt=3.4 bl1=0.3 bl2=0.3 bl3=0.3 k=1;
   if block=1 then linp = intcpt + bl1;
   else if block=2 then linp = intcpt + bl2;
   else if block=3 then linp = intcpt + bl3;
   else if block=4 then linp = intcpt;
   b = exp(linp)/k;
   ll = lgamma(count+k) - lgamma(k) - lgamma(count + 1)
        + k*log(1/(b+1)) + count*log(b/(b+1));
   model count ~ general(ll);
run;

Minus twice the log likelihood for this model is 295.4 (output not shown) and the likeli-
hood ratio test statistic to test for equal average poppy counts among the treatments is

    Λ = 295.4 − 233.1 = 62.3.

The p-value for this statistic is Pr(χ²₅ > 62.3) < 0.0001. Hence not all treatments have the
same average poppy count and we can proceed with pairwise comparisons based on treatment
averages. These averages can be calculated and compared with the estimate statement of the
nlmixed procedure. For example, to compare the predicted counts for treatments A and B, A
and C, and D and E we add the statements
estimate 'count(A)-count(B)' exp(intcpt+0.25*bl1+0.25*bl2+0.25*bl3+tA) -
                             exp(intcpt+0.25*bl1+0.25*bl2+0.25*bl3+tB);
estimate 'count(A)-count(C)' exp(intcpt+0.25*bl1+0.25*bl2+0.25*bl3+tA) -
                             exp(intcpt+0.25*bl1+0.25*bl2+0.25*bl3+tC);
estimate 'count(D)-count(E)' exp(intcpt+0.25*bl1+0.25*bl2+0.25*bl3+tD) -
                             exp(intcpt+0.25*bl1+0.25*bl2+0.25*bl3+tE);

to the nlmixed code. The linear predictors for each treatment take averages over the block
effects prior to exponentiation. The table added by these three statements to the proc nlmixed
output shown earlier follows.

Output 6.24.
Additional Estimates

Standard
Label Estimate Error DF t Value Pr > |t|

count(A)-count(B) 5.0070 89.3979 14 0.06 0.9561


count(A)-count(C) 343.39 65.2337 14 5.26 0.0001
count(D)-count(E) 43.2518 13.5006 14 3.20 0.0064

We conclude that there is no difference between treatments A and B (p = 0.9561), but
that there are significant differences in average poppy counts between treatments A and C
(p < 0.0001) and D and E (p = 0.0064). Notice that the estimate of 43.2518 for the
comparison of predicted counts for treatments D and E is close to the difference in average
counts of 41.5 in Table 6.24.
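As a cross-check, the same marginal Negative Binomial model can be requested directly through the dist=negbin option mentioned at the beginning of this subsection; a sketch:

proc genmod data=poppies;
   class block treatment;
   model count = block treatment / link=log dist=negbin type3;
run;

Note that genmod parameterizes the Negative Binomial dispersion differently from the k used in the nlmixed code above, so the reported dispersion estimates are not directly comparable, although the fitted mean counts should agree.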



Chapter 7

Linear Mixed Models for Clustered Data

“The new methods occupy an altogether higher plane than that in which
ordinary statistics and simple averages move and have their being.
Unfortunately, the ideas of which they treat, and still more, the many
technical phrases employed in them, are as yet unfamiliar. The arithmetic
they require is laborious, and the mathematical investigations on which
the arithmetic rests are difficult reading even for experts... This new
departure in science makes its appearance under conditions that are
unfavourable to its speedy recognition, and those who labour in it must
abide for some time in patience before they can receive sympathy from the
outside world.” Sir Francis Galton.

7.1 Introduction
7.2 The Laird-Ware Model
7.2.1 Rationale
7.2.2 The Two-Stage Concept
7.2.3 Fixed or Random Effects
7.3 Choosing the Inference Space
7.4 Estimation and Inference
7.4.1 Maximum and Restricted Maximum Likelihood
7.4.2 Estimated Generalized Least Squares
7.4.3 Hypothesis Testing
7.5 Correlations in Mixed Models
7.5.1 Induced Correlations and the Direct Approach
7.5.2 Within-Cluster Correlation Models
7.5.3 Split-Plots, Repeated Measures, and the Huynh-Feldt Conditions
7.6 Applications
7.6.1 Two-Stage Modeling of Apple Growth over Time



7.6.2 On-Farm Experimentation with Randomly Selected Farms
7.6.3 Nested Errors through Subsampling
7.6.4 Recovery of Inter-Block Information in Incomplete Block Designs
7.6.5 A Split-Strip-Plot Experiment for Soybean Yield
7.6.6 Repeated Measures in a Completely Randomized Design
7.6.7 A Longitudinal Study of Water Usage in Horticultural Trees
7.6.8 Cumulative Growth of Muskmelons in Subsampling Design



7.1 Introduction
Box 7.1 Linear Mixed Models

• A linear mixed effects model contains fixed and random effects and is linear
in these effects. Models for subsampling designs or split-plot-type models
are mixed models.

• Multifactor models where the levels of some factors are predetermined


while the levels of other factors are chosen at random are mixed models.

• Models in which regression coefficients vary randomly between groups of


observations are mixed models.

A distinction was made in §1.7.5 between fixed, random, and mixed effects models based on
the number of random variables and fixed effects involved in the statistical model. A mixed
effects model — or mixed model for short — contains fixed effects as well as at least two
random variables (one of which is the obligatory model error). Mixed models arise quite fre-
quently in designed experiments. In a completely randomized design with subsampling, for
example, t treatments are assigned at random to rt experimental units. A random subsample
of n observations is then drawn from every experimental unit. This design is practical if an
experimental unit is too large to be measured in its entirety, for example, a field plot contains
twelve rows of a particular crop but only three rows per plot can be measured and analyzed.
Soil samples are often randomly divided into subsamples prior to laboratory analysis. The
statistical model for such a design can be written as
    Y_ijk = μ + τ_i + e_ij + d_ijk,
    i = 1,...,t; j = 1,...,r; k = 1,...,n,                                      [7.1]

where e_ij ~ (0, σ²) and d_ijk ~ (0, σ_d²) are zero mean random variables representing the experi-
mental and observational errors, respectively. If the treatment effects τ_i are fixed, i.e., the
levels of the treatment factor were predetermined and not chosen at random, this is a mixed
model.
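A minimal sketch of how [7.1] might be fit with proc mixed follows; the data set and variable names (crd_sub, y, trt, eu) are ours, not from the text.

proc mixed data=crd_sub;
   class trt eu;
   model y = trt;          * fixed treatment effects tau_i;
   random eu(trt);         * experimental error e_ij between units;
run;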
In the past fifteen years, mixed linear models have risen to great importance in statistical
modeling because of their tremendous flexibility. As will be demonstrated shortly, mixed
models are more general than standard regression and classification models and contain the
latter. In the analysis of designed experiments mixed models have been in use for a long time.
They arise naturally through the process of randomization as shown in the introductory
example. In agricultural experiments, the most important traditional mixed models are those
for subsampling and split-plot type designs (Example 7.1).
Data from subsampling and split-plot designs are clustered structures (§2.4.1). In the for-
mer, experimental units are clusters for the subsamples and in the latter whole-plots are clus-
ters for the subplot treatments. In general, mixed models arise very naturally in situations
where data are clustered or hierarchically organized. This is by no means restricted to de-
signed experiments with splits or subsampling. Longitudinal studies and repeated measures





experiments where observations are gathered repeatedly on subjects or experimental units


over time also give rise to clustered data. Here, however, the selection of units within a clus-
ter is not random but ordered along some metric. This metric is usually temporal, but this is
not necessary. Lysimeter measurements made at three depths of the same soil profile are also
longitudinal in nature with measurements ordered by depth (spatial metric).

Example 7.1. An experiment is planned to investigate different agricultural manage-


ment strategies and cropping systems. The management alternatives chosen are an
organic strategy without use of pesticides (M1), an integrated strategy with low pesti-
cide input (M2), and an integrated strategy with high pesticide input (M3). The crops
selected for the experiment are corn, soybeans, and wheat. Since application of a ma-
nagement strategy requires large field units, which can accommodate several different
crops, it was decided to assign the management strategies in a randomized complete
block design with four blocks (replicates I to IV). Each replicate consists of three fields
made homogeneous as far as possible before the experiment by grouping fields accord-
ing to soil parameters sampled previously. Each field is then subdivided into three plots,
and the crops are assigned at random to the fields. This process is repeated in each field.
The layout of the design for management types is shown in Figure 7.1. The linear
model is Y_ij = μ + ρ_j + τ_i + e_ij (i = 1,...,3; j = 1,...,4), where ρ_j is the effect
associated with replicate j, τ_i is the effect of the i-th treatment, and e_ij ~ (0, σ²) is the
experimental error associated with the fields. Figure 7.2 shows the assignment of the
three crop types within each field after randomization.

Figure 7.1. Randomized complete block design with four blocks (= replicates) for
management strategies M1 to M3.

This experiment involves two separate stages of randomization. The management types
are assigned at random to the fields and independently thereof the crops are assigned to
the plots within a field. The variability associated with the fields should be independent
from the variability associated with the plots. Ignoring the management types and
focusing on the crop alone, the design is a randomized complete block with 3 × 4 = 12
blocks of size 3 and the model is



    Y_ijk = μ + ρ_ij + α_k + d_ijk,                                             [7.2]

k = 1,...,3, where ρ_ij is the effect of the ij-th block, α_k is the effect of the k-th crop type,
and d_ijk is the experimental error associated with the plots.

Figure 7.2. Split-plot design for management strategies (M1 to M3) and crop types.

Both RCBDs are analysis of variance models with a single error term. The mixed model
comes about when the two randomizations are combined into a single model, letting
ρ_ij = ρ_j + τ_i + e_ij:

    Y_ijk = μ + ρ_j + τ_i + e_ij + α_k + (τα)_ik + d_ijk.                       [7.3]

The two random variables depicting experimental error on the field and plot level are
e_ij and d_ijk, respectively, and the fixed effects are ρ_j for the j-th replicate, τ_i for the i-th
management strategy, α_k for the k-th crop type, and (τα)_ik for their interaction. This is a
classical split-plot design where the whole-plot factor management strategy is arranged
in a randomized block design with four replicates (blocks) and the subplot factor crop
type has three levels.

The distinction between longitudinal and repeated measures data adopted here is as
follows: if data are collected repeatedly on experimental material to which treatments were
applied initially, the data structure is termed a repeated measure. Data collected repeatedly
over time in an observational study are termed longitudinal. A somewhat different distinction
between repeated measures and longitudinal data in the literature states that it is assumed that
cluster effects do not change with time in repeated measures models, while time as an
explanatory variable of within-cluster variation related to growth, development, and aging
assumes a central role with longitudinal data (Rao 1965, Hayes 1973). In designed experi-
ments involving time the assumption of time-constant cluster effects is often not tenable.
Treatments are applied initially and their effects may wear off over time, changing the
relationship among treatments as the experiment progresses. Treatment-time interactions are
important aspects of repeated measures experiments and the investigation of trends over time
and how they change among treatments can be the focus of the investigation.





The appeal of mixed models lies in their flexibility to handle diverse forms of hierarchi-
cally organized data. Depending on application and circumstance the emphasis of data
analysis will differ. In designed experiments treatment comparisons, the estimation of treat-
ment effects and contrasts, and the estimation of sources of variability come to the fore. In re-
peated measures analyses investigating the interactions of treatment and time and modeling
the response trends over time play an important role in addition to treatment comparisons. For
longitudinal data emphasis is on developing regression-type models that account for cluster-
to-cluster and within-cluster variation and on estimating the mean response for the population
average and/or for the specific clusters. In short, designed experiments with clustered data
structure emphasize the between-cluster variation, longitudinal data analyses emphasize the
within-cluster variation, and repeated measures analyses place more emphasis on either one
depending on the goals of the analysis.
Mixed models are an efficient vehicle to separate between-cluster and within-cluster
variation, a separation that is essential in the analysis of clustered data. In a completely
randomized design (CRD) with subsampling,
$$Y_{ijk} = \mu + \tau_i + e_{ij} + d_{ijk},$$

where $e_{ij}$ denotes the experimental error (EE) associated with the $j$th replication of the $i$th
treatment and $d_{ijk}$ is the observational (subsampling) error among the $k$ subsamples from an
experimental unit, between-cluster variation of units treated alike is captured by $\text{Var}[e_{ij}]$ and
within-cluster variation by $\text{Var}[d_{ijk}]$. To gauge whether treatments are effective, the mean
square due to treatments should be compared to the magnitude of variation among experimen-
tal units treated alike, $MS(EE)$, not the observational error mean square. Analysis based on
a model with a single error term, $Y_{ijk} = \mu + \tau_i + \epsilon_{ijk}$, assuming the $\epsilon_{ijk}$ are independent,
would be inappropriate and could lead to erroneous conclusions about treatment performance.
We demonstrate this effect with the application in §7.6.3.
Not recognizing variability on the cluster and within-cluster level is dangerous from
another point of view. In the CRD with subsampling the experimental errors $e_{ij}$ are uncorrela-
ted, as are the observational errors $d_{ijk}$, owing to the random assignment of treatments to
experimental units and the random selection of samples from the units. Furthermore, the $e_{ij}$
and $d_{ijk}$ are not correlated with each other. Does that imply that the responses $Y_{ijk}$ are not
correlated? Some basic covariance operations provide the answer,
$$\text{Cov}[Y_{ijk}, Y_{ijk'}] = \text{Cov}[e_{ij} + d_{ijk},\ e_{ij} + d_{ijk'}] = \text{Cov}[e_{ij}, e_{ij}] = \text{Var}[e_{ij}] \neq 0.$$

While observations from different experimental units are uncorrelated, observations from the
same cluster are correlated. Ignoring the hierarchical structure of the two error terms by put-
ting $e_{ij} + d_{ijk} = \epsilon_{ijk}$ and assuming the $\epsilon_{ijk}$ are independent ignores correlations among the
experimental outcomes. Since $\text{Cov}[Y_{ijk}, Y_{ijk'}] = \text{Var}[e_{ij}] > 0$, the observations are positively
correlated. If correlations are ignored, $p$-values will be too small (§2.5.2), even if the $\epsilon_{ijk}$ re-
presented variability of experimental units treated alike, which they do not. In longitudinal and
repeated measures data, correlations enter the data more directly. Since measurements are
collected repeatedly on subjects or units, these measurements are likely autocorrelated.
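The strength of this within-cluster dependence can be expressed as a correlation. Writing
$\sigma_e^2 = \text{Var}[e_{ij}]$ and $\sigma_d^2 = \text{Var}[d_{ijk}]$ (shorthand introduced only for this small derivation),
the variance of an observation is $\text{Var}[Y_{ijk}] = \sigma_e^2 + \sigma_d^2$ and
$$\text{Corr}[Y_{ijk}, Y_{ijk'}] = \frac{\text{Cov}[Y_{ijk}, Y_{ijk'}]}{\sqrt{\text{Var}[Y_{ijk}]\,\text{Var}[Y_{ijk'}]}} = \frac{\sigma_e^2}{\sigma_e^2 + \sigma_d^2},$$
the intraclass correlation: the larger the experimental error variance relative to the
observational error variance, the stronger the correlation among subsamples from the same
experimental unit.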
Besides correlations among the observations, clustered data provide another challenge for
the analyst who must decide whether the emphasis is cluster-specific or population-average
inference. In a longitudinal study, for example, a natural focus of investigation is the trend over



time. There are two types of trends. One describes the behavior of the population-average (the
average trend in the universe of all clusters), the other the behavior of individual clusters. If
both types of trends can be modeled, a comparison of an individual cluster to the population
behavior allows conclusions about the conformity and similarity of clusters. The overall trend
in the population is termed the population-average (PA) trend and trends varying by clusters
are termed cluster-specific (CS) or subject-specific trends (Schabenberger et al. 1995,
Schabenberger and Gregoire 1996).

Example 7.2. Gregoire, Schabenberger, and Barrett (1995) analyze data from a longitu-
dinal study of naturally grown Douglas fir (Pseudotsuga menziesii (Mirb.) Franco) stands
(plots) scattered throughout the western Cascades and coastal range of Washington and
Oregon in the northwestern United States. Plots were visited repeatedly between 1970 and
1982, a minimum of 6 and a maximum of 10 times. Measurement intervals for a
given plot varied between 1 and 3 years. The works of Gregoire (1985, 1987) discuss
the data more fully.
[Figure 7.3 plot: stand height (m) against stand age (years) for ten stands.]

Figure 7.3. Height-age profiles of ten Douglas fir (Pseudotsuga menziesii) stands. Data
kindly provided by Dr. Timothy G. Gregoire, School of Forestry and Environmental
Studies, Yale University. Used with permission.

Figure 7.3 shows the height of the socially dominant trees vs. the stand age for ten of
the 65 stands. Each stand depicted represents a cluster of observations. The stands
differed in age at the onset of the study. The height development over the range of years
during which observations were taken is almost linear for the ten stands. However, the
slopes of the trends differ, as well as the maximum heights achieved. This could be due
to increased natural variability in height development with age or to differences in
micro-site conditions. Growing sites may be more homogeneous among younger than
among older stands, a feature often found in man-made forests. Figure 7.4 shows the
sample means for each stand (cross-hairs) and the population-averaged trend derived
from a simple linear regression model





$$H_{ij} = \beta_0 + \beta_1\,\text{age}_{ij} + e_{ij},$$

where $H_{ij}$ is the $j$th height measurement for the $i$th stand. To predict the height of a stand at a given
age not in the data set, the population-average trend would be used. However, to predict
the height of a stand for which data was collected, a more precise and accurate predic-
tion should be possible if information about the stand's trend relative to the population
trend is utilized. Focusing on the population average only, residuals are measured as
deviations between observed values and the population trend. Cluster-specific predic-
tions utilize smaller deviations between observed values and the specific trend.
[Figure 7.4 plot: stand height (m) against stand age (years); stand-specific sample means
and the population-average trend line.]

Figure 7.4. Population-average trend and stand- (cluster-) specific means for Douglas
fir data. The dotted line represents population-average prediction of heights at a given
age.

Cluster-specific inferences exploit an important property of longitudinal data analysis:
each cluster can serve as its own control in the separation of age and cohort effects. To
demonstrate this concept consider only the first observations of the stands shown in Figure
7.3 at the initial measurement date. These data, too, can be used to derive a height-age rela-
tionship (Figure 7.5). The advantage is that by randomly selecting stands the ten observations
are independent and standard linear regression methods can be deployed. But do these data
are independent and standard linear regression methods can be deployed. But do these data
represent true growth? The data in Figure 7.5 are cross-sectional, referring to a cross-section
of individuals at different stages of development. Only if the average height of thirty-year-old
stands twenty years from now is identical to the average height of fifty-year-old stands today,
can cross-sectional data be used for the purpose of modeling growth. The group-to-group dif-
ferences in development are called cohort effects. Longitudinal data allow the unbiased esti-
mation of growth and inference about growth by separating age and cohort effects.
The data in Figure 7.3 could be modeled with a purely fixed effects model. It seems
reasonable to assume that the individual trends emanate from the same point and to postulate
an intercept $\beta_0$ common to all stands. If the trends are linear, only the slopes would need to be
varied by stand. The model


$$H_{ij} = \beta_0 + \beta_{1i}\,\text{age}_{ij} + e_{ij} \qquad [7.4]$$

fits a separate slope $\beta_{1i}$ for each stand. A total of eleven fixed effects have to be estimated:
ten slopes and one intercept. If the intercepts are also varied by stands, the fixed effects
regression model
$$H_{ij} = \beta_{0i} + \beta_{1i}\,\text{age}_{ij} + e_{ij}$$

could be fitted, requiring a total of 20 parameters in the mean function.

[Figure 7.5 plot: stand height at initial visit (m) against stand age (years), with fitted
linear trend.]

Figure 7.5. Observed stand heights and ages at initial visit and linear trend.

Early approaches to modeling of longitudinal and repeated measures data employed this
philosophy: to separate the data into as many subsets as there are clusters and fit a model
separately to each cluster. Once the individual estimates $\hat\beta_{01},\dots,\hat\beta_{0,10}$ and $\hat\beta_{11},\dots,\hat\beta_{1,10}$ were
obtained, the population-average estimates were calculated as some weighted average of the
cluster-specific intercepts and slopes. This two-step approach is inefficient as it ignores infor-
mation contributed by other clusters in the estimation process and leads to parameter prolif-
eration. While the data points from cluster 3 do not contribute information about the slope in
cluster 1, the information in cluster 3 nevertheless can contribute to the estimation of the var-
iance of observations about the cluster means. As the complexity of the individual trends and
the cluster-to-cluster heterogeneity increases, the approach becomes impractical. If the num-
ber of observations per cluster is small, it may turn out to be actually impossible. If, for
example, the individual trends are quadratic, and only two observations were collected on a
particular cluster, the model cannot be fit to that cluster's data. If the mean function is non-
linear, fitting separate regression models to each cluster is plagued with numerical problems,
as nonlinear models require fairly large amounts of information to produce stable parameter
estimates. It behooves us to develop an approach to data analysis that allows cluster-specific
and population-average inference simultaneously without parameter proliferation. This is
achieved by allowing effects in the model to vary at random rather than treating them as
fixed. These ideas are cast in the Laird-Ware model.



7.2 The Laird-Ware Model
Box 7.2 Laird-Ware Model

• The Laird-Ware model is a two-stage model for clustered data where the
first stage describes the cluster-specific response and the second stage
captures cluster-to-cluster heterogeneity by randomly varying parameters
of the first-stage model.

• Most linear mixed models can be cast as Laird-Ware models, even if the
two-stage concept may not seem natural at first, e.g., split-plot designs.

7.2.1 Rationale
Although mixed model procedures were developed by Henderson (1950, 1963, 1973), Gold-
berger (1962), and Harville (1974, 1976a, 1976b) prior to the seminal article of Laird and
Ware (1982), it was the Laird and Ware contribution that showed the wide applicability of
linear mixed models and provided a convenient framework for parameter estimation and in-
ference. Although their discussion focused on longitudinal data, the applicability of the Laird-
Ware model to other clustered data structures is easily recognized. The basic idea is as fol-
lows. The probability distribution for the measurements within a cluster has the same general
form for all clusters, but some or all of the parameters defining this distribution vary ran-
domly across clusters. First, define $\mathbf{Y}_i$ to be the $(n_i \times 1)$ vector of observations for the $i$th
cluster. In a designed experiment, $\mathbf{Y}_i$ represents the data vector collected from a single experi-
mental unit. Typically, in this case, additional subscripts will be needed to identify replica-
tions, treatments, whole-plots, sub-plots, etc. For the time being, it is assumed without loss of
generality that the single subscript $i$ identifies an individual cluster. For the leftmost
(youngest) Douglas fir stand in Figure 7.3, for example, 10 observations were collected at
ages 14, 15, 16, 17, 18, 19, 22, 23, 24, and 25. The measured heights for this sixth stand in the
data set are assembled in the response vector

$$\mathbf{Y}_6 = \begin{bmatrix} Y_{61}\\ Y_{62}\\ Y_{63}\\ Y_{64}\\ Y_{65}\\ Y_{66}\\ Y_{67}\\ Y_{68}\\ Y_{69}\\ Y_{6,10} \end{bmatrix} = \begin{bmatrix} 11.60\\ 12.50\\ 13.41\\ 13.59\\ 15.19\\ 15.82\\ 18.23\\ 19.57\\ 20.43\\ 20.68 \end{bmatrix}.$$

The Laird-Ware model assumes that the average behavior of the clusters is the same for
all clusters, varied only by cluster-specific explanatory variables. In matrix notation this is



represented for the mean trend as
$$\text{E}[\mathbf{Y}_i] = \mathbf{X}_i\boldsymbol{\beta}, \qquad [7.5]$$

where $\mathbf{X}_i$ is an $(n_i \times p)$ design or regressor matrix and $\boldsymbol{\beta}$ is a $(p \times 1)$ vector of regression
coefficients. Observe that clusters share the same parameter vector $\boldsymbol{\beta}$, but clusters can have
different values of the regressor variables. If $\mathbf{X}_i$ contains a column of measurement times, for
example, these do not have to be the same across clusters. The Laird-Ware model easily
accommodates unequal spacing of measurements. Also, the number of cluster elements, $n_i$,
can vary from cluster to cluster. Clusters with the same set of regressors $\mathbf{X}_i$ do not, however,
necessarily elicit the same response, as the common parameter vector $\boldsymbol{\beta}$ in [7.5] would suggest.
To allow clusters to vary in the effect of the explanatory variables on the outcome we can
put $\text{E}[\mathbf{Y}_i\,|\,\mathbf{b}_i] = \mathbf{X}_i(\boldsymbol{\beta} + \mathbf{b}_i) = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{X}_i\mathbf{b}_i$. The $\mathbf{b}_i$ in this expression determine how much
the $i$th cluster population-average response $\mathbf{X}_i\boldsymbol{\beta}$ must be adjusted to capture the cluster-
specific behavior $\mathbf{X}_i\boldsymbol{\beta} + \mathbf{X}_i\mathbf{b}_i$. In a practical application not all of the explanatory variables
have effects that vary among clusters and we can add generality to the model by putting
$\text{E}[\mathbf{Y}_i\,|\,\mathbf{b}_i] = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{Z}_i\mathbf{b}_i$, where $\mathbf{Z}_i$ is an $(n_i \times k)$ design or regressor matrix. In this formula-
tion not all the columns of $\mathbf{X}_i$ are repeated in $\mathbf{Z}_i$, and on occasion one may place explanatory
variables in $\mathbf{Z}_i$ that are not part of $\mathbf{X}_i$ (although this is much less frequent than the opposite
case where the columns of $\mathbf{Z}_i$ are a subset of the columns of $\mathbf{X}_i$).
The expectation was reckoned conditionally, because $\mathbf{b}_i$ is a vector of random variables.
We assume that $\mathbf{b}_i$ has mean $\mathbf{0}$ and variance-covariance matrix $\mathbf{D}$. Laird and Ware (1982)
term this a two-stage model. The first stage specifies the conditional distribution of $\mathbf{Y}_i$, given
the $\mathbf{b}_i$, as
$$\mathbf{Y}_i\,|\,\mathbf{b}_i \sim G(\mathbf{X}_i\boldsymbol{\beta} + \mathbf{Z}_i\mathbf{b}_i,\ \mathbf{R}_i). \qquad [7.6]$$

Alternatively, we can write this as a linear model
$$\mathbf{Y}_i\,|\,\mathbf{b}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{Z}_i\mathbf{b}_i + \mathbf{e}_i, \qquad \mathbf{e}_i \sim G(\mathbf{0},\ \mathbf{R}_i).$$

In the second stage it is assumed that the $\mathbf{b}_i$ have a Gaussian distribution with mean $\mathbf{0}$ and
variance matrix $\mathbf{D}$, $\mathbf{b}_i \sim G(\mathbf{0},\ \mathbf{D})$. The random effects $\mathbf{b}_i$ are furthermore assumed indepen-
dent of the errors $\mathbf{e}_i$. The (marginal) distribution of the responses then is also Gaussian:
$$\mathbf{Y}_i \sim G(\mathbf{X}_i\boldsymbol{\beta},\ \mathbf{R}_i + \mathbf{Z}_i\mathbf{D}\mathbf{Z}_i'). \qquad [7.7]$$

In the text that follows we will frequently denote $\mathbf{R}_i + \mathbf{Z}_i\mathbf{D}\mathbf{Z}_i'$ as $\mathbf{V}_i$. An alternative
expression for the Laird-Ware model is
$$\mathbf{Y}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{Z}_i\mathbf{b}_i + \mathbf{e}_i$$
$$\mathbf{e}_i \sim G(\mathbf{0},\ \mathbf{R}_i), \qquad \mathbf{b}_i \sim G(\mathbf{0},\ \mathbf{D}) \qquad [7.8]$$
$$\text{Cov}[\mathbf{e}_i,\ \mathbf{b}_i] = \mathbf{0}.$$

Model [7.8] is a classical mixed linear model. It contains a fixed effects mean structure given
by $\mathbf{X}_i\boldsymbol{\beta}$ and a random structure given by $\mathbf{Z}_i\mathbf{b}_i + \mathbf{e}_i$. The $\mathbf{b}_i$ are sometimes called the random
effects if $\mathbf{Z}_i$ is a design matrix consisting of 0's and 1's, and random coefficients if $\mathbf{Z}_i$ is a
regressor matrix. We will refer to $\mathbf{b}_i$ simply as the random effects.
The extent to which clusters vary about the population-average response is expressed by
the variability of the $\mathbf{b}_i$. If $\mathbf{D} = \mathbf{0}$, the model reduces to a fixed effects regression or classifica-
tion model with a single error source $\mathbf{e}_i$. The $\mathbf{e}_i$ are sometimes called the within-cluster errors,
since their variance-covariance matrix captures the variability and stochastic dependency
within a cluster. The comparison of the conditional distribution of $\mathbf{Y}_i\,|\,\mathbf{b}_i$ and the unconditional
distribution of $\mathbf{Y}_i$ shows how cluster-specific $(\mathbf{X}_i\boldsymbol{\beta} + \mathbf{Z}_i\mathbf{b}_i)$ and population-average inference
$(\mathbf{X}_i\boldsymbol{\beta})$ are accommodated in the same modeling framework. To consider a particular cluster's
response we condition on $\mathbf{b}_i$. This leaves $\mathbf{e}_i$ as the only random component on the right-hand
side of [7.8] and the cluster-specific mean and variance for cluster $i$ are
$$\text{E}[\mathbf{Y}_i\,|\,\mathbf{b}_i] = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{Z}_i\mathbf{b}_i, \qquad \text{Var}[\mathbf{Y}_i\,|\,\mathbf{b}_i] = \mathbf{R}_i. \qquad [7.9]$$

Taking expectations over the distribution of the random effects, one arrives at
$$\text{E}[\mathbf{Y}_i] = \text{E}[\text{E}[\mathbf{Y}_i\,|\,\mathbf{b}_i]] = \text{E}[\mathbf{X}_i\boldsymbol{\beta} + \mathbf{Z}_i\mathbf{b}_i] = \mathbf{X}_i\boldsymbol{\beta}, \qquad \text{Var}[\mathbf{Y}_i] = \mathbf{R}_i + \mathbf{Z}_i\mathbf{D}\mathbf{Z}_i' = \mathbf{V}_i. \qquad [7.10]$$

The marginal variance follows from the standard result by which unconditional variances can
be derived from conditional expectations:
$$\text{Var}[Y] = \text{E}[\text{Var}[Y\,|\,X]] + \text{Var}[\text{E}[Y\,|\,X]].$$

Applying this to the mixed model [7.8] under the assumption that $\text{Cov}[\mathbf{b}_i, \mathbf{e}_i] = \mathbf{0}$ leads to
$$\text{E}[\text{Var}[\mathbf{Y}_i\,|\,\mathbf{b}_i]] + \text{Var}[\text{E}[\mathbf{Y}_i\,|\,\mathbf{b}_i]] = \mathbf{R}_i + \mathbf{Z}_i\mathbf{D}\mathbf{Z}_i'.$$

In contrast to [7.9], [7.10] expresses the marginal or population-average mean and variance of
cluster $i$.
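To make [7.10] concrete, consider a minimal sketch: a random intercept model with
$n_i = 3$ observations per cluster, $\mathbf{Z}_i = \mathbf{1}$ (a column of ones), $\mathbf{D} = \sigma_b^2$, and $\mathbf{R}_i = \sigma^2\mathbf{I}$
(the labels $\sigma_b^2$ and $\sigma^2$ are generic here, not tied to a particular example). Then
$$\mathbf{V}_i = \sigma^2\mathbf{I} + \sigma_b^2\mathbf{1}\mathbf{1}' = \begin{bmatrix} \sigma^2 + \sigma_b^2 & \sigma_b^2 & \sigma_b^2\\ \sigma_b^2 & \sigma^2 + \sigma_b^2 & \sigma_b^2\\ \sigma_b^2 & \sigma_b^2 & \sigma^2 + \sigma_b^2 \end{bmatrix},$$
equal variances with a constant positive within-cluster covariance, the same compound-
symmetric pattern encountered above for the CRD with subsampling.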
The Laird-Ware model [7.8] is quite general. If the design matrix for the random effects
is absent, $\mathbf{Z}_i = \mathbf{0}$, the Laird-Ware model reduces to the classical linear regression model.
Similarly, if the random effects do not vary, i.e., $\text{Var}[\mathbf{b}_i] = \mathbf{D} = \mathbf{0}$, all random effects
must be exactly $\mathbf{b}_i \equiv \mathbf{0}$ since $\text{E}[\mathbf{b}_i] = \mathbf{0}$ and the model reduces to a linear regression
model $\mathbf{Y}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{e}_i$. If the fixed effects coefficient vector $\boldsymbol{\beta}$ is zero, the model
becomes a random effects model $\mathbf{Y}_i = \mathbf{Z}_i\mathbf{b}_i + \mathbf{e}_i$. To motivate the latter consider the
following experiment.

Example 7.3. Twenty laboratories are randomly selected from a list of laboratories pro-
vided by the Association of Official Seed Analysts (AOSA). Each laboratory receives 4
bags of 100 seeds each, selected at random from a large lot of soybean seeds. The
laboratories perform germination tests on the seeds, separately for each of the bags and
report the results back to the experimenter. A statistical model to describe the variability
of germination test results must accommodate laboratory-to-laboratory differences and
inhomogeneities in the seed lot. The results from two different laboratories may differ
even if they perform exactly the same germination tests with the same precision and



accuracy since they received different samples. But even if the samples were exactly the
same, the laboratories will not produce exactly the same germination test results, due to
differences in technology, seed handling and storage at the facility, experience of
personnel, and other sources of variation particular to a specific laboratory. A model for
this experiment could be
$$Y_{ij} = \mu + \alpha_i + e_{ij},$$

where $Y_{ij}$ is the germination percentage reported by the $i$th laboratory for the $j$th 100-
seed sample it received, $\mu$ is the overall germination percentage of the seed lot, $\alpha_i$ is a
random variable with mean 0 and variance $\sigma_\alpha^2$ measuring the lab-specific deviation
from the overall germination percentage, and $e_{ij}$ is a random variable with mean 0 and
variance $\sigma^2$ measuring intralaboratory variability due to the four samples within a
laboratory.

Since apart from the grand mean $\mu$ all terms in the model are random, this is a random
effects model. In terms of the components of the Laird-Ware model we can define a
cluster to consist of the four samples sent to a laboratory and let $\mathbf{Y}_i = [Y_{i1},\dots,Y_{i4}]'$.
Then our model for the $i$th laboratory is
$$\mathbf{Y}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{Z}_i\mathbf{b}_i + \mathbf{e}_i = \begin{bmatrix}1\\ 1\\ 1\\ 1\end{bmatrix}\mu + \begin{bmatrix}1\\ 1\\ 1\\ 1\end{bmatrix}\alpha_i + \begin{bmatrix}e_{i1}\\ e_{i2}\\ e_{i3}\\ e_{i4}\end{bmatrix}.$$

7.2.2 The Two-Stage Concept


Depending on the application it may be more or less natural to appeal to the two-stage con-
cept. In cases where the modeler has in mind a particular class of models that applies in
general to different groups, clusters, or treatments, the concept applies immediately. In the
first stage we select the population-average model and decide in the second stage which
parameters of the model vary at random among the groups. In §5.8.7 nonlinear yield-density
models were fit to data from different onion varieties. The basic model investigated there was

$$\ln\{Y_{ij}\} = \ln\{(\alpha_i + \beta_i x_{ij})^{-1/\theta_i}\} + e_{ij},$$

where $Y_{ij}$ denotes the yield per plant of variety $i$ grown at density $x_{ij}$. The parameters $\alpha$, $\beta$,
and $\theta$ were initially assumed to vary among the varieties in a deterministic manner, i.e., were
fixed. We could also cast this model in the mixed model framework. Since we are concerned
with linear models in this chapter, we concentrate on the inverse plant yield $Y_{ij}^{-1}$ and its rela-
tionship to plant density as for the data shown in Table 5.15 and Figure 5.34 (p. 288). It
seems reasonable to assume that $Y^{-1}$ is linearly related to density for any of the three
varieties. The general model is
$$Y_{ij}^{-1} = \alpha_i + \beta_i x_{ij} + e_{ij}.$$



These data are not longitudinal, since each variety $\times$ density combination was grown inde-
pendently. The two-stage concept leading to a mixed model nevertheless applies. If the
variances are homogeneous, it is reasonable to put $\text{Var}[e_{ij}] = \sigma^2$, $\text{Cov}[e_{ij}, e_{i'j'}] = 0$ whenever
$i \neq i'$ or $j \neq j'$. For convenience we identify each variety $i$ (see Table 5.16) as a cluster. The
first stage is completed by identifying population parameters, cluster effects, and within-
cluster variation in the general model. To this end take $\alpha_i = \alpha + b_{1i}$ and $\beta_i = \beta + b_{2i}$. The
model can then be written as
$$Y_{ij}^{-1} = \alpha + b_{1i} + (\beta + b_{2i})x_{ij} + e_{ij}. \qquad [7.11]$$

In this formulation $\alpha$ and $\beta$ are the population parameters and $b_{1i}$, $b_{2i}$ measure the degree to
which the population-averaged intercept $(\alpha)$ and slope $(\beta)$ must be modified to accommodate
the $i$th variety's response. These are the cluster effects. The second stage constitutes the
assumption that $b_{1i}$ and $b_{2i}$ are randomly drawn from a universe of possible values for the
intercept and slope adjustment. In other words, it is assumed that $b_{1i}$ and $b_{2i}$ are random
variables with mean zero and variances $\sigma_1^2$ and $\sigma_2^2$, respectively. Assume $\sigma_2^2 = 0$ for the
moment. A random variable whose variance is zero is a constant that takes on its mean value,
which in this case is zero. If $\sigma_2^2 = 0$ the model reduces to
$$Y_{ij}^{-1} = \alpha + b_{1i} + \beta x_{ij} + e_{ij},$$

stating that varieties differ in the relationship between inverse plant yield and plant density
only in their intercept, not their slope. This is a model with parallel trends among varieties.
Imagine there are 30 varieties $(i = 1,\dots,30)$. The test of slope equality if the $\beta_i$ are fixed
effects is based on the hypothesis $H_0\!: \beta_1 = \beta_2 = \cdots = \beta_{30}$, a twenty-nine degree of freedom
hypothesis. In the mixed model setup the test of slope equality involves only a single param-
eter, $H_0\!: \sigma_2^2 = 0$. Even in this nonlongitudinal setting, the two-stage concept is immensely
appealing if we view varietal differences as random disturbances about a conceptual average
variety.
We have identified a population-average model for relating inverse plant yield to density,
$$\text{E}[Y_{ij}^{-1}] = \alpha + \beta x_{ij},$$

and how to modify the population average with random effects to achieve a cluster-specific
(= variety-specific) model parsimoniously. For five hypothetical varieties Figure 7.6 shows
the flexibility of the mixed model formulation for model [7.11] under the following
assumptions:
• $\sigma_1^2 = \sigma_2^2 = 0$. This is a purely fixed effects model where all varieties share the same
dependency on plant density (Figure 7.6a).
• $\sigma_2^2 = 0$. Varieties vary in intercept (Figure 7.6b).
• $\sigma_1^2 = 0$. Varieties vary in slope (Figure 7.6c).
• $\sigma_1^2 \neq 0$, $\sigma_2^2 \neq 0$. Varieties differ in slope and intercept (Figure 7.6d).



[Figure 7.6: four panels of inverse plant yield against plant density for five hypothetical
varieties; panel (a) 2 parameters: $\alpha$, $\beta$; panel (b) 3 parameters: $\alpha$, $\beta$, $\sigma_1^2$;
panel (c) 3 parameters: $\alpha$, $\beta$, $\sigma_2^2$; panel (d) 4 parameters: $\alpha$, $\beta$, $\sigma_1^2$, $\sigma_2^2$.]

Figure 7.6. Fixed and mixed model trends for five hypothetical varieties. Purely fixed effects
model (a), randomly varying intercepts (b), randomly varying slopes (c), randomly varying
intercepts and slopes (d). The population-averaged trend is shown as a dashed line in panels
(b) to (d). The same differentiation in cluster-specific effects as in (d) with a purely fixed
effects model would have required 10 parameters. The number of parameters cited excludes
$\text{Var}[e_{ij}] = \sigma^2$.

We complete this example by expressing [7.11] in terms of matrices and vectors as a
Laird-Ware model. Collect the ten observations for variety $i$ (see Table 5.16) into the vector
$\mathbf{Y}_i$. Collect the ten plant densities for variety $i$ into a two-column matrix $\mathbf{X}_i$, adding an inter-
cept. For variety 1, for example, we have
$$\mathbf{Y}_1 = \begin{bmatrix} 1/105.6\\ 1/98.4\\ 1/71.0\\ 1/60.3\\ \vdots\\ 1/18.5 \end{bmatrix}, \qquad \mathbf{X}_1 = \begin{bmatrix} 1 & 3.07\\ 1 & 3.31\\ 1 & 5.97\\ 1 & 6.99\\ \vdots & \vdots\\ 1 & 31.08 \end{bmatrix}.$$

If both intercept and slope vary at random among varieties, we have $\mathbf{Z}_i = \mathbf{X}_i$. If only the
intercepts vary, $\mathbf{Z}_i$ is the first column of $\mathbf{X}_i$. If only the slopes vary at random among
varieties, $\mathbf{Z}_i$ is the second column of $\mathbf{X}_i$. The model [7.11] for variety 1 with both parameters
randomly varying is
$$\mathbf{Y}_1 = \mathbf{X}_1\boldsymbol{\beta} + \mathbf{Z}_1\mathbf{b}_1 + \mathbf{e}_1 = \begin{bmatrix} 1 & 3.07\\ 1 & 3.31\\ 1 & 5.97\\ 1 & 6.99\\ \vdots & \vdots\\ 1 & 31.08 \end{bmatrix}\begin{bmatrix} \alpha\\ \beta \end{bmatrix} + \begin{bmatrix} 1 & 3.07\\ 1 & 3.31\\ 1 & 5.97\\ 1 & 6.99\\ \vdots & \vdots\\ 1 & 31.08 \end{bmatrix}\begin{bmatrix} b_{11}\\ b_{21} \end{bmatrix} + \begin{bmatrix} e_{11}\\ e_{12}\\ e_{13}\\ e_{14}\\ \vdots\\ e_{1,10} \end{bmatrix}.$$

Because of the experimental setup we can put $\text{Var}[\mathbf{e}_i] = \mathbf{R}_i = \sigma^2\mathbf{I}$.
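A proc mixed sketch of model [7.11] with both coefficients varying at random is given
below, assuming a data set onion with variables variety, density, and invyield
(hypothetical names, not taken from Table 5.16).

proc mixed data=onion;
   class variety;
   model invyield = density / s;   /* fixed effects: alpha and beta */
   /* random intercept and slope per variety; type=un lets b_1i and b_2i covary */
   random intercept density / subject=variety type=un;
run;

Dropping density (or intercept) from the random statement gives the reduced models
with $\sigma_2^2 = 0$ (or $\sigma_1^2 = 0$) discussed above.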


The two-stage concept may be less immediate when the mixed model structure arises
from hierarchical random processes of treatment assignment or sampling as, for example, in
split-plot and subsampling designs. A model-based selection of which parameters are random
and which are fixed quantities does not occur. Rather, the treatment assignment and/or
sampling scheme dictates whether an effect is fixed or random (see §7.2.3). Nevertheless, the
resulting models are within the framework of the Laird-Ware model. We illustrate with a
simple split-plot design. Consider two whole-plot treatments arranged in a randomized block
design with two replications. Each of the four whole-plots is split by two sub-plot treatments.
A smaller split-plot design is hardly imaginable. The total design has only eight data points.
The linear model for this experiment is
$$Y_{ijk} = \mu + \rho_j + \tau_i + e^*_{ij} + \alpha_k + (\tau\alpha)_{ik} + e_{ijk},$$

where the $\rho_j$ $(j = 1, 2)$ are the whole-plot block effects, $\tau_i$ $(i = 1, 2)$ are the whole-plot
treatment effects, $e^*_{ij}$ are the whole-plot experimental errors, $\alpha_k$ $(k = 1, 2)$ are the sub-plot
treatment effects, $(\tau\alpha)_{ik}$ are the interactions, and $e_{ijk}$ denotes the sub-plot experimental
errors. Using matrices and vectors the model can be expressed as follows:

$$\begin{bmatrix} Y_{111}\\ Y_{112}\\ Y_{121}\\ Y_{122}\\ Y_{211}\\ Y_{212}\\ Y_{221}\\ Y_{222} \end{bmatrix} =
\begin{bmatrix}
1&1&0&1&0&1&0&1&0&0&0\\
1&1&0&1&0&0&1&0&1&0&0\\
1&0&1&1&0&1&0&1&0&0&0\\
1&0&1&1&0&0&1&0&1&0&0\\
1&1&0&0&1&1&0&0&0&1&0\\
1&1&0&0&1&0&1&0&0&0&1\\
1&0&1&0&1&1&0&0&0&1&0\\
1&0&1&0&1&0&1&0&0&0&1
\end{bmatrix}
\begin{bmatrix} \mu\\ \rho_1\\ \rho_2\\ \tau_1\\ \tau_2\\ \alpha_1\\ \alpha_2\\ (\tau\alpha)_{11}\\ (\tau\alpha)_{12}\\ (\tau\alpha)_{21}\\ (\tau\alpha)_{22} \end{bmatrix} +
\begin{bmatrix}
1&0&0&0\\ 1&0&0&0\\ 0&1&0&0\\ 0&1&0&0\\ 0&0&1&0\\ 0&0&1&0\\ 0&0&0&1\\ 0&0&0&1
\end{bmatrix}
\begin{bmatrix} e^*_{11}\\ e^*_{12}\\ e^*_{21}\\ e^*_{22} \end{bmatrix} +
\begin{bmatrix} e_{111}\\ e_{112}\\ e_{121}\\ e_{122}\\ e_{211}\\ e_{212}\\ e_{221}\\ e_{222} \end{bmatrix}$$

or
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{b} + \mathbf{e}.$$
This is a mixed model with four clusters of size two; successive pairs of observations belong
to the same cluster (whole-plot). In the notation of the Laird-Ware model we identify for the
first whole-plot, for example,
$$\mathbf{X}_1 = \begin{bmatrix} 1 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0\\ 1 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \end{bmatrix}, \quad \mathbf{Z}_1 = \begin{bmatrix} 1\\ 1 \end{bmatrix}, \quad \mathbf{b}_1 = e^*_{11}, \quad \mathbf{e}_1 = \begin{bmatrix} e_{111}\\ e_{112} \end{bmatrix},$$
and
$$\boldsymbol{\beta} = [\mu,\ \rho_1,\ \rho_2,\ \tau_1,\ \tau_2,\ \alpha_1,\ \alpha_2,\ (\tau\alpha)_{11},\ (\tau\alpha)_{12},\ (\tau\alpha)_{21},\ (\tau\alpha)_{22}]'.$$

Observe that the $\mathbf{Z}$ matrix for the entire data set is block-diagonal,
$$\mathbf{Z} = \begin{bmatrix} \mathbf{Z}_1 & \mathbf{0} & \mathbf{0} & \mathbf{0}\\ \mathbf{0} & \mathbf{Z}_2 & \mathbf{0} & \mathbf{0}\\ \mathbf{0} & \mathbf{0} & \mathbf{Z}_3 & \mathbf{0}\\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{Z}_4 \end{bmatrix}.$$

This seems like a very tedious exercise. Fortunately, computer software such as proc mixed
of The SAS® System handles the formulation of the $\mathbf{X}$ and $\mathbf{Z}$ matrices. What the user needs to
know is which effects of the model are fixed (part of $\mathbf{X}$), and which effects are random (part
of $\mathbf{Z}$).
In the previous examples we focused on casting the models in the mixed model frame-
work by specifying $\mathbf{X}_i$, $\mathbf{Z}_i$, and $\mathbf{b}_i$. Little attention was paid to the variance-covariance
matrices $\mathbf{D}$ and $\mathbf{R}_i$. In split-plot and subsampling designs these matrices are determined by the
randomization and sampling protocol. In repeated measures and longitudinal studies the
modeler must decide whether random effects/coefficients in $\mathbf{b}_i$ are independent ($\mathbf{D}$ diagonal)
or not and must decide on the structure of $\mathbf{R}_i$. With $n_i$ observations per cluster, and if all
observations within a cluster are correlated and have unequal variances, there are $n_i$ variances
and $n_i(n_i - 1)/2$ covariances to be estimated in $\mathbf{R}_i$. To reduce the number of parameters in $\mathbf{D}$
and $\mathbf{R}_i$ these matrices are usually parameterized and highly structured. In §7.5 we examine
popular parsimonious parametric structures. The next example shows how, starting from a
simple model, accommodating the complexities of a real study leads to a mixed model where
the modeler makes subsequent adjustments to the fixed and random parts of the model,
including the $\mathbf{D}$ and $\mathbf{R}$ matrices, always with an eye toward parsimony of the final model.

Example 7.4. Soil nitrate levels and their dependence on the presence or absence of
mulch shoot are investigated on bare soils and under alfalfa management. The treatment
structure of the experiment is a $2 \times 2$ factorial of factors cover (alfalfa/none (bare soil))
and mulch (shoots applied/shoots not applied). Treatments are arranged in three
complete blocks, each accommodating four plots. Each plot receives one of the four
possible treatment combinations. There is a priori evidence that the two factors do not
interact. The basic linear statistical model for this experiment is given by
$$Y_{ijk} = \mu + \rho_i + \alpha_j + \beta_k + e_{ijk},$$

where $i = 1,\dots,3$ indexes the blocks, $\alpha_j$ are the effects of shoot application, $\beta_k$ the
effects of cover type, and the experimental errors $e_{ijk}$ are independent and identically
distributed random variables with mean 0 and variance $\sigma^2$. The variance of an individ-
ual observation is $\text{Var}[Y_{ijk}] = \sigma^2$.

In order to reduce the costs of the study, soil samples are collected on each plot at four
randomly chosen locations. The variability of an observation $Y_{ijkl}$, where $l = 1,\dots,4$
indexes the samples from plot $ijk$, is increased by the heterogeneity within a plot, $\sigma_p^2$
say,
$$\text{Var}[Y_{ijkl}] = \sigma^2 + \sigma_p^2.$$

The revised model must accommodate the two sources of random variation across plots
and within plots. This is accomplished by adding another random effect,
$$Y_{ijkl} = \mu + \rho_i + \alpha_j + \beta_k + e_{ijk} + f_{ijkl},$$

where $f_{ijkl} \sim (0, \sigma_p^2)$. This is a mixed model with fixed part $\mu + \rho_i + \alpha_j + \beta_k$ and
random part $e_{ijk} + f_{ijkl}$. It is reasonable to assume by virtue of randomization that the
two random effects are independent and also that $\text{Cov}[f_{ijkl}, f_{ijkl'}] = 0$:
there are no correlations among the measurements within a plot. The $\mathbf{D}$ matrix of the
model in Laird-Ware form will be diagonal.

It is imperative to the investigators to study changes in nitrate levels over time. To this
end soil samples at the four randomly chosen locations within a plot are collected in
five successive weeks. The data now have a repeated measurement structure in addition
to a subsampling structure. First, the fixed effects part must be modified to
accommodate systematic changes in nitrate levels over time. Treating time as a
continuous variable, coded as the number of days $t$ since the initial measurement, the
fixed effects part can be revised as
$$\text{E}[Y_{ijklm}] = \mu + \rho_i + \alpha_j + \beta_k + \gamma t_{im},$$

where $t_{im}$ is the time point at which all plots in block $i$ were measured. If the measure-
ment times differ across plots, the variable $t$ would receive subscript $ijk$ instead. The
random effects structure is now modified to (a) incorporate the variability of
measurements at the same spatial location over time; (b) account for residual temporal
autocorrelation among the repeated measurements.

A third random component, $g_{ijklm} \sim (0, \sigma_t^2)$, is added so that the model becomes
$$Y_{ijklm} = \mu + \rho_i + \alpha_j + \beta_k + \gamma t_{im} + e_{ijk} + f_{ijkl} + g_{ijklm}$$

and the variance of an individual observation is
$$\text{Var}[Y_{ijklm}] = \sigma^2 + \sigma_p^2 + \sigma_t^2.$$

With five measurements over time there are 10 unique correlations per sampling loca-
tion: $\text{Corr}[Y_{ijkl1}, Y_{ijkl2}]$, $\text{Corr}[Y_{ijkl1}, Y_{ijkl3}]$, $\dots$, $\text{Corr}[Y_{ijkl4}, Y_{ijkl5}]$. Furthermore, it is rea-
sonable that measurements should be more highly correlated the closer together they
were taken in time. Choosing a correlation model that depends explicitly on the time of
measurement can be accomplished with only a single parameter. The temporal corre-
lation model chosen is
$$\text{Corr}[g_{ijklm}, g_{ijklm'}] = \exp\{-\delta\,|t_{im} - t_{im'}|\},$$



known as the exponential or continuous AR(1) correlation structure (§7.5). The term
$|t_{im} - t_{im'}|$ measures the separation in weeks between two measurements and $\delta > 0$ is a
parameter to be estimated from the data; $\delta$ determines how quickly the correlations
decrease with temporal separation.
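For instance, with a hypothetical value of $\delta = 0.5$ per week (an illustration, not an
estimate from these data), measurements one week apart have correlation
$\exp\{-0.5\} \approx 0.61$, measurements two weeks apart $\exp\{-1.0\} \approx 0.37$, and
measurements four weeks apart only $\exp\{-2.0\} \approx 0.14$.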

The final linear mixed model for analysis is
$$Y_{ijklm} = \mu + \rho_i + \alpha_j + \beta_k + \gamma t_{im} + e_{ijk} + f_{ijkl} + g_{ijklm}$$
$$e_{ijk} \sim (0, \sigma^2), \quad f_{ijkl} \sim (0, \sigma_p^2), \quad g_{ijklm} \sim (0, \sigma_t^2)$$
$$\text{Cov}[g_{ijklm}, g_{ijklm'}] = \sigma_t^2 \exp\{-\delta\,|t_{im} - t_{im'}|\}.$$

If in the context of repeated measures data each soil sample location within a plot is
considered a cluster, $\sigma_t^2$ describes the within-cluster heterogeneity and $\sigma^2 + \sigma_p^2$ the
between-cluster heterogeneity.

This model can be represented in matrix-vector notation at various levels of clustering.
Assuming that a cluster is formed by the repeated measurements at a given location
within a plot (index $ijkl$),
$$\mathbf{Y}_{ijkl} = \mathbf{X}_{ijkl}\boldsymbol{\beta} + \mathbf{Z}_{ijkl}\mathbf{b}_{ijkl} + \mathbf{g}_{ijkl},$$

where, for example,
$$\mathbf{X}_{1123} = \begin{bmatrix} 1 & 1 & 0 & 1 & 0 & t_{11}\\ 1 & 1 & 0 & 1 & 0 & t_{12}\\ 1 & 1 & 0 & 1 & 0 & t_{13}\\ 1 & 1 & 0 & 1 & 0 & t_{14}\\ 1 & 1 & 0 & 1 & 0 & t_{15} \end{bmatrix}, \quad \boldsymbol{\beta} = \begin{bmatrix} \mu + \rho_3 + \alpha_2 + \beta_2\\ \rho_1 - \rho_3\\ \rho_2 - \rho_3\\ \alpha_1 - \alpha_2\\ \beta_1 - \beta_2\\ \gamma \end{bmatrix}$$

$$\mathbf{Z}_{1123} = \begin{bmatrix} 1 & 1\\ 1 & 1\\ 1 & 1\\ 1 & 1\\ 1 & 1 \end{bmatrix}, \quad \mathbf{b}_{1123} = \begin{bmatrix} e_{112}\\ f_{1123} \end{bmatrix}, \quad \mathbf{g}_{1123} = \begin{bmatrix} g_{11231}\\ g_{11232}\\ g_{11233}\\ g_{11234}\\ g_{11235} \end{bmatrix}.$$

The variance-covariance matrix of the random effects $\mathbf{b}_{ijkl}$ is
$$\text{Var}[\mathbf{b}_{ijkl}] = \mathbf{D} = \begin{bmatrix} \sigma^2 & 0\\ 0 & \sigma_p^2 \end{bmatrix}$$

and of the within-cluster disturbances
$$\text{Var}[\mathbf{g}_{ijkl}] = \mathbf{R}_{ijkl} = \sigma_t^2 \begin{bmatrix}
1 & e^{-\delta d_{i12}} & e^{-\delta d_{i13}} & e^{-\delta d_{i14}} & e^{-\delta d_{i15}}\\
e^{-\delta d_{i21}} & 1 & e^{-\delta d_{i23}} & e^{-\delta d_{i24}} & e^{-\delta d_{i25}}\\
e^{-\delta d_{i31}} & e^{-\delta d_{i32}} & 1 & e^{-\delta d_{i34}} & e^{-\delta d_{i35}}\\
e^{-\delta d_{i41}} & e^{-\delta d_{i42}} & e^{-\delta d_{i43}} & 1 & e^{-\delta d_{i45}}\\
e^{-\delta d_{i51}} & e^{-\delta d_{i52}} & e^{-\delta d_{i53}} & e^{-\delta d_{i54}} & 1
\end{bmatrix},$$



$d_{imm'} = |t_{im} - t_{im'}|$. With the mixed procedure of The SAS® System this model is
analyzed with the following statements.
proc mixed data=YourData;
   class block shoot cover location;
   model nitrate = block shoot cover day;
   random block*shoot*cover;           /* random effect e_ijk  */
   random block*shoot*cover*location;  /* random effect f_ijkl */
   /* repeated measures g_ijklm */
   repeated / subject=block*shoot*cover*location type=sp(exp)(day);
run;

7.2.3 Fixed or Random Effects


Box 7.3 Fixed or Random?

• An effect is random if its levels were chosen by some random mechanism


from a population (list) of possible levels, or, if the levels were not randomly
selected, the effects on the outcome are of a stochastic nature. Otherwise,
the effect is considered fixed.

• For random effects, inferences can be conducted in three different inference


spaces, termed the broad, intermediate, and narrow spaces. Conclusions in
fixed effects models apply only to the narrow inference space.

When appealing to the two-stage concept one assumes that some effects or coefficients of the
population-averaged model vary randomly from cluster to cluster. This requires in theory that
there is a population or universe of coefficients from which the realizations in the data can be
drawn (Longford 1993). In the onion plant density example, it is assumed that intercepts
and/or slopes in the universe of varieties vary at random around the average value of ! and/or
". Conceptually, this does not cause much difficulty if the varieties were selected at random
and stochastic variation between clusters can be reasoned. In many cases the clusters are not
selected at random and the question whether an effect is fixed or random is not clear.
Imagine, for example, that the same plant density experiment is performed at various loca-
tions. If locations were predetermined, rather than randomly selected, can we still attribute
differences in variety performance from location to location to stochastic effects, or are these
fixed effects? Some modelers would argue that location effects are deterministic, fixed effects
because upon repetition of the experiment the same locations would be selected and the same
locational effects should operate on the outcome. Others consider locations as surrogates of
different environments and consider environmental effects to be stochastic in nature.
Repetition of the experiment even at the same locations will produce different outcomes due
to changes in the environmental conditions and locational effects should thus be treated as
random. A similar discrepancy of opinion applies to the nature of seasonal effects. Are the
effects of years considered fixed or random? The years in which an experiment is conducted
are most likely not a random sample from a list of possible years. Experiments are conducted
when experimental areas can be secured, funds, machinery, and manpower are available.
According to the acid-test that declares factors as fixed if their levels were pre-determined,



years should enter the analysis as fixed effects. But if year effects are viewed as stochastic
environmental effects they should enter the model as random effects (see, e.g., Searle 1971,
pp. 382-383 and Searle et al. 1992, pp. 15-16).
In a much cited paper, Eisenhart (1947) introduced fixed and random analysis of variance
models, which he termed Models I and II, a distinction used frequently to this day. He empha-
sized two parallel criteria to aid the modeler in the determination whether effects are fixed or
random.
(i) If upon repetition of the experiment the "same things" (levels of the factor) would be
studied again, the factor is fixed.
(ii) If inferences are to be confined to the factor levels actually employed in the experi-
ment, the factor is fixed. If conclusions are expanded to apply to more general
populations, it is random.

The cited test that determines factors as random if their levels are selected by a random
mechanism falls under the first criterion. The sampling mechanism itself makes the effect ran-
dom (Kempthorne 1975). Searle (1971, p. 383) subscribes to the second criterion, that of con-
fining inferences to the levels at hand. If one is interested in conclusions about varietal per-
formance for the specific years and locations in a multiyear, multilocation variety trial,
location and year effects would be fixed. If conclusions are to be drawn about the population
of locations at which the experiment could have been conducted in particular years, location
effects would be random and seasonal effects would be fixed. Finally, if inferences are to per-
tain to all locations in any season, then both factors would be random. We agree with the
notion implied by Searle's (and Eisenhart's second) criterion that it very much depends on the
context whether an effect is considered random or not. Robinson (1991) concludes similarly
when he states that “The choice of whether a class of effects is to [be] treated as fixed or ran-
dom may vary with the question which we are trying to answer.” His criterion, replacing both
(i) and (ii) above, is to ask whether the effects in question come from a probability distribu-
tion. If they do, they are random, otherwise they are fixed. Robinson's criterion does not
appeal to any sample or inference model and is thus attractive. It is noteworthy that Searle et
al. (1992, p. 16) placed more emphasis on the random sampling mechanism than Searle
(1971, p. 383). The latter reference reads
“In considering these points the important question is that of inference: are inferences going to be
drawn from these data about just these levels of the factor? "Yes"  then the effects are considered
as fixed effects. "No"  then, presumably, inferences will be made not just about the levels
occurring in the data but about some population of levels of the factor from which those in the data
are presumed to have come; and so the effects are considered as being random.”

Searle et al. (1992, p. 16) state, on the other hand,


“In considering these points the important question is that of inference: are the levels of the factor
going to be considered a random sample from a population of values? "Yes"  then the effects are
going to be considered as random effects. "No"  then, presumably, inferences will be made just
about the levels occuring in the data and the effects are considered as fixed effects.”

We emphasize that for the purpose of analysis it is often reasonable to consider effects as
random for some questions, and as fixed for others within the same investigation. Assume
locations were selected at random from a list of possible locations in a variety trial, so that
there is no doubt that they are random effects. One question of interest is which variety is




highest yielding across all possible locations. Another question may be whether varieties A
and B show significant yield differences at the particular locations used. Under Eisenhart's
second criterion one should treat location effects as random for the first analysis and as fixed
for the second analysis, which would upset Eisenhart's first criterion. Fortunately, mixed
models provide a way out of this dilemma. Within the same analysis we can choose with
respect to the random effects different inference spaces, depending on the question at hand
(see §7.3). Even if an effect is random conclusions can be drawn pertaining only to the factor
levels actually used and the effects actually observed (Figure 7.7).

[Figure 7.7 diagram. Genesis of effect: if factor levels are randomly sampled, or if levels are
not randomly sampled but their effect on the outcome is stochastic, the effect is Random; if
levels are not randomly sampled and their effect on the outcome is deterministic, the effect is
Fixed. Inference space: conclusions about all possible levels or average stochastic effects
(Broad Space); conclusions about all levels of some random effects and particular levels of
others (Intermediate Space); conclusions about the levels at hand (Narrow Space).]
Figure 7.7. Genesis of effects as fixed or random and their relationships to broad,
intermediate, and narrow inference spaces (§7.3).

Other arguments have been brought to bear to solve the fixed vs. random debate more or
less successfully. We want to dispense with two of these. The fact that the experimenter does
not know with certainty how a particular treatment will perform at a given location does not
imply a random location effect. This argument would necessarily lead to all effects being con-
sidered random since prior to the experiment none of the effects is known with certainty.
Another line of argument considers those effects random that are not under the experimenter's
control, such as block and environmental effects. Under this premise the only fixed effects
model is that of a completely randomized design and all treatment factors would be fixed.
These criteria are neither practical nor sensible. Considering block and other experimental
effects (apart from treatment effects) as random even if their selection was deterministic, pro-
vided their effect on the outcome has a stochastic nature, yields a reasonable middle ground
in our opinion. Treatment factors are obviously random only when the treatments are chosen
by some random mechanism, for example, when entries are selected at random for a variety
trial from a larger list of possible entries. If treatment levels are predetermined, treatment
effects are fixed. The interested reader can find a wonderful discourse of these and other
issues related to analysis of variance in general and the fixed/random debate in Kempthorne
(1975).

7.3 Choosing the Inference Space


Mixed models have an exciting property that enables researchers to perform inferences that
apply to different populations of effects. This property can defuse the fixed vs. random debate
(§7.2.3). Models in which all effects are fixed do not provide this opportunity. To motivate
the concept of an inference space, we consider a two-factor experiment where the treatment
levels were predetermined (are fixed) and the random factor corresponds to environmental
effects such as randomly chosen years or locations, or predetermined locations with
stochastically varying environment effects. The simple mixed model we have in mind is
$$Y_{ijk} = \mu + \alpha_i + \tau_j + (\alpha\tau)_{ij} + e_{ijk}, \qquad [7.12]$$

where the $\alpha_i$ $(i = 1,\dots,a)$ are random environmental effects and the $\tau_j$ $(j = 1,\dots,t)$ are the
fixed treatment effects, e.g., entries (genotypes) in a variety trial. There are $k$ replications of
each environment $\times$ entry combination and $(\alpha\tau)_{ij}$ represents genotype $\times$ environment inter-
action. We observe that because the $\alpha_i$ are random variables, the interaction is also a random
quantity. If all effects in [7.12] were fixed, inferences about entry performance would apply
environments that could have been chosen. This inference space is termed the narrow space
by McLean, Sanders, and Stroup (1991). The narrow inference space can also be chosen in
the mixed effects model to evaluate and compare the entries. If genotype performance is of
interest in the particular environments in which the experiment was performed and for the
particular genotype ‚ environment interaction, the narrow inference space applies. On other
occasions one might be interested in conclusions about the entries that pertain to the universe
of all possible environments that could have been chosen for the study. Entry performance is
then evaluated relative to potential environmental effects and random genotype ‚ environ-
ment interactions. McLean et al. (1991) term this the broad inference space and conclude
that it is the appropriate reference for inference if environmental effects are hard to specify.
The broad inference space has no counterpart in fixed effects models.
A third inference space, situated between the broad and narrow spaces has been termed
the intermediate space by McLean et al. (1991). Here, one appeals to specific levels of some
random effects, but to the universe of all possible levels with respect to other random effects.
In model [7.12] an intermediate inference space applies if one is interested in genotype per-
formance in specific environments but allows the genotype ‚ environment interaction to
vary at random from environment to environment. For purposes of inferences, one would fix
!3 and allow a!7 b34 to vary. When treatment effects 74 are fixed, the interaction is a random
effect since the environmental effects !3 are random. It is our opinion that the intermediate in-
ference space is not meaningful in this particular model. When focusing on a particular




environmental effect the interaction should be fixed at the appropriate level too. If, however,
the treatment effects were random, too, the intermediate inference space where one focuses
on the performance of all genotypes in a particular environment is meaningful.
In terms of testable hypotheses or "estimable" functions in the mixed model we are con-
cerned with linear combinations of the model terms. To demonstrate the distinction between
the three inference spaces, we take into account the presence of $\mathbf{b}$, not just $\boldsymbol{\beta}$, in specifying
these linear combinations. An "estimable" function is now written as
$$\mathbf{A}\boldsymbol{\beta} + \mathbf{M}\mathbf{b} = \mathbf{L}\begin{bmatrix} \boldsymbol{\beta}\\ \mathbf{b} \end{bmatrix},$$

where $\mathbf{L} = [\mathbf{A},\ \mathbf{M}]$. Since estimation of parameters should be distinguished from prediction
of random variables, we use quotation marks. Setting $\mathbf{M}$ to $\mathbf{0}$, the function becomes $\mathbf{A}\boldsymbol{\beta}$, an
estimable function because it involves only fixed effects and constants. No reference is made to specific ran-
dom effects and thus the inference is broad. By selecting the entries of $\mathbf{M}$ such that $\mathbf{M}\mathbf{b}$ repre-
sents averages over the appropriate random effects, the narrow inference space is chosen and
one should refer to $\mathbf{A}\boldsymbol{\beta} + \mathbf{M}\mathbf{b}$ as a predictable function. An intermediate inference space is
constructed by averaging some random effects, while setting the coefficients of $\mathbf{M}$ pertaining
to other random effects to zero. We illustrate these concepts with an example from Milliken
and Johnson (1992, p. 285).

Example 7.8. Machine Productivity. Six employees are randomly selected from the
work force of a company that has plans to replace the machines in one of its factories.
Three candidate machine types are evaluated. Each employee operates each of the
machines in a randomized order. Milliken and Johnson (1992, p. 286) chose the mixed
model
$$Y_{ijk} = \mu + \alpha_i + \tau_j + (\alpha\tau)_{ij} + e_{ijk},$$

where $\tau_j$ represents the fixed effect of machine type $j$ $(j = 1,\dots,3)$, $\alpha_i$ the random
effect of employee $i$ $(i = 1,\dots,6;\ \alpha_i \sim G(0, \sigma_\alpha^2))$, $(\alpha\tau)_{ij}$ the machine $\times$ employee
interaction (a random effect with mean 0 and variance $\sigma_{\alpha\tau}^2$), and $e_{ijk}$ represents experi-
mental errors associated with employee $i$ operating machine $j$ at the $k$th time. The out-
come $Y_{ijk}$ was a productivity score. The data for this experiment appear in Table 23.1
of Milliken and Johnson (1992) and are reproduced on the CD-ROM.

If we want to estimate the mean of machine type 1, for example, we can do this in a
broad, narrow, or intermediate inference space. The corresponding expected values are
$$\text{Broad:} \quad \text{E}[Y_{i1k}] = \mu + \tau_1$$
$$\text{Narrow:} \quad \text{E}[Y_{i1k}\,|\,\alpha_i, (\alpha\tau)_{i1}] = \mu + \tau_1 + \frac{1}{6}\sum_{i=1}^{6}\alpha_i + \frac{1}{6}\sum_{i=1}^{6}(\alpha\tau)_{i1}$$
$$\text{Intermediate:} \quad \text{E}[Y_{i1k}\,|\,\alpha_i] = \mu + \tau_1 + \frac{1}{6}\sum_{i=1}^{6}\alpha_i$$

Estimates of these quantities are obtained by substituting the estimates $\hat\mu$ and $\hat\tau_1$ and
the best linear unbiased predictors (BLUPs) for the random variables $\alpha_i$ and $(\alpha\tau)_{i1}$




from the mixed model analysis. In §7.4 the necessary details of this estimation and
prediction process are provided. For now we take for granted that the estimates and
BLUPs can be obtained in The SAS® System with the statements

proc mixed data=productivity;


class machine person;
model score = machine / s;
random person machine*person / s;
run;

The /s option on the model statement prints the estimates of all fixed effects, the /s
option on the random statement prints the estimated BLUPs for the random effects. The
latter are reproduced from SAS® output in Table 7.1.

Table 7.1. Best linear unbiased predictors of $\alpha_i$ and $(\alpha\tau)_{ij}$

Employee ($i$)     $\alpha_i$       $(\alpha\tau)_{i1}$   $(\alpha\tau)_{i2}$   $(\alpha\tau)_{i3}$
1               1.0445     -0.7501     1.5000    -0.1142
2              -1.3759      1.5526     0.6069    -2.9966
3               5.3608      1.7776     2.2994    -0.8149
4              -0.0598     -1.0394     2.4174    -1.4144
5               2.5446     -3.4569     2.1521     2.8532
6              -7.5143      1.9163    -8.9757     2.4870
Sum             0           0          0          0

The solutions for the random effects sum to zero across the employees. Hence, when
the solutions are substituted into the formulas above for the narrow and intermediate
means and averaged, we have, for example,
$$\hat\mu + \hat\tau_1 + \frac{1}{6}\sum_{i=1}^{6}\hat\alpha_i = \hat\mu + \hat\tau_1.$$

The broad, narrow, and intermediate estimates of the means will not differ. Provided
that the $\mathbf{D}$ matrix is nonsingular, this will hold in general for linear mixed models. Our
prediction of the average production score for machine 1 does not depend on whether
we refer to the six employees actually used in the experiment or the population of all
company employees from which the six were randomly selected. So wherein lies the
difference? Although the point estimates do not differ, the variability of the estimates
will differ greatly. Since the mean estimate in the intermediate inference space,
$$\hat\mu + \hat\tau_1 + \frac{1}{6}\sum_{i=1}^{6}\hat\alpha_i,$$

involves the random quantities $\hat\alpha_i$, its variance will exceed that of the mean estimate in
the narrow inference space, which is just $\hat\mu + \hat\tau_1$. By the same token the variance of
estimates in the broad inference space will exceed that of the estimates in the interme-
diate space. By appealing to the population of all employees, the additional uncertainty
that stems from the random selection of employees must be accounted for. The pre-



cision of broad and narrow inference will be identical if the random effects variance $\sigma_\alpha^2$
is 0, that is, if there is no heterogeneity among employees with respect to productivity
scores.

Estimates and predictions of various quantities obtained with proc mixed of
The SAS® System are shown in the next table, from which the impact of choosing the
inference space on estimator/predictor precision can be inferred.

Table 7.2. Quantities to be estimated or predicted in the machine productivity example
(intermediate inference applies to specific employees but randomly
varying employee $\times$ machine interaction)

Description          ID    Space          Parameter
Machine 1 Mean       (1)   Broad          $\mu + \tau_1$
                     (2)   Intermediate   $\mu + \tau_1 + \frac{1}{6}\sum_{i=1}^{6}\alpha_i$
                     (3)   Narrow         $\mu + \tau_1 + \frac{1}{6}\sum_{i=1}^{6}\alpha_i + \frac{1}{6}\sum_{i=1}^{6}(\alpha\tau)_{i1}$
Machine 2 Mean       (4)   Broad          $\mu + \tau_2$
                     (5)   Intermediate   $\mu + \tau_2 + \frac{1}{6}\sum_{i=1}^{6}\alpha_i$
                     (6)   Narrow         $\mu + \tau_2 + \frac{1}{6}\sum_{i=1}^{6}\alpha_i + \frac{1}{6}\sum_{i=1}^{6}(\alpha\tau)_{i2}$
Mach. 1 - Mach. 2    (7)   Broad          $\tau_1 - \tau_2$
                     (8)   Narrow         $\tau_1 - \tau_2 + \frac{1}{6}\sum_{i=1}^{6}(\alpha\tau)_{i1} - \frac{1}{6}\sum_{i=1}^{6}(\alpha\tau)_{i2}$
Employee 1 BLUP      (9)                  $\mu + \frac{1}{3}\sum_{j=1}^{3}\tau_j + \alpha_1 + \frac{1}{3}\sum_{j=1}^{3}(\alpha\tau)_{1j}$
Employee 2 BLUP      (10)                 $\mu + \frac{1}{3}\sum_{j=1}^{3}\tau_j + \alpha_2 + \frac{1}{3}\sum_{j=1}^{3}(\alpha\tau)_{2j}$
Empl. 1 - Empl. 2    (11)                 $\alpha_1 - \alpha_2 + \frac{1}{3}\sum_{j=1}^{3}(\alpha\tau)_{1j} - \frac{1}{3}\sum_{j=1}^{3}(\alpha\tau)_{2j}$

The proc mixed code to calculate these quantities is as follows.

proc mixed data=productivity;


class machine person;
model score = machine;
random person machine*person;

estimate '(1) Mach. 1 Mean (Broad) ' intercept 1 machine 1 0 0;


estimate '(2) Mach. 1 Mean (Interm)' intercept 6 machine 6 0 0
| person 1 1 1 1 1 1 /divisor = 6;
estimate '(3) Mach. 1 Mean (Narrow)' intercept 6 machine 6 0 0
| person 1 1 1 1 1 1
machine*person 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0/divisor = 6;

estimate '(4) Mach. 2 Mean (Broad) ' intercept 1 machine 0 1 0;


estimate '(5) Mach. 2 Mean (Interm)' intercept 6 machine 0 6 0
| person 1 1 1 1 1 1 /divisor = 6;
estimate '(6) Mach. 2 Mean (Narrow)' intercept 6 machine 0 6 0
| person 1 1 1 1 1 1
machine*person 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0/divisor = 6 ;




estimate '(7) Mac. 1 vs. Mac. 2 (Broad) ' machine 1 -1;


estimate '(8) Mac. 1 vs. Mac. 2 (Narrow)' machine 6 -6
| machine*person 1 1 1 1 1 1 -1 -1 -1 -1 -1 -1 0 0 0 0 0 0/
divisor = 6 ;

estimate '(9) Person 1 BLUP' intercept 6 machine 2 2 2


| person 6 0 0 0 0 0
machine*person 2 0 0 0 0 0 2 0 0 0 0 0 2 0 0 0 0 0/divisor = 6;
estimate '(10) Person 2 BLUP' intercept 6 machine 2 2 2
| person 0 6 0 0 0 0
machine*person 0 2 0 0 0 0 0 2 0 0 0 0 0 2 0 0 0 0/divisor = 6;
estimate '(11) Person 1 - Person 2'
| person 6 -6 0 0 0 0
machine*person 2 -2 0 0 0 0 2 -2 0 0 0 0 2 -2 0 0 0 0/divisor = 6;
run;

When appealing to the narrow or intermediate inference spaces, coefficients for the ran-
dom effects that are being held fixed are added after the | in the estimate statements. If
no random effects coefficients are specified, the M matrix in the linear combination
$A\beta + Mb$ is set to zero and the inference space is broad. Notice that least squares
means calculated with the lsmeans statement of the mixed procedure are always broad.
The abridged output follows.

Output 7.1.
The Mixed Procedure

Model Information
Data Set WORK.PRODUCTIVITY
Dependent Variable score
Covariance Structure Variance Components
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Containment

Class Level Information


Class Levels Values
machine 3 1 2 3
person 6 1 2 3 4 5 6

Covariance Parameter Estimates


Cov Parm Estimate
person 22.8584
machine*person 13.9095
Residual 0.9246

Estimates

Standard
Label Estimate Error DF t Value Pr >|t|
(1) Mach. 1 Mean (Broad) 52.3556 2.4858 10 21.06 <.0001
(2) Mach. 1 Mean (Interm) 52.3556 1.5394 10 34.01 <.0001
(3) Mach. 1 Mean (Narrow) 52.3556 0.2266 10 231.00 <.0001
(4) Mach. 2 Mean (Broad) 60.3222 2.4858 10 24.27 <.0001
(5) Mach. 2 Mean (Interm) 60.3222 1.5394 10 39.19 <.0001
(6) Mach. 2 Mean (Narrow) 60.3222 0.2266 10 266.15 <.0001
(7) Mac. 1 vs. Mac. 2 (Broad) -7.9667 2.1770 10 -3.66 0.0044
(8) Mac. 1 vs. Mac. 2 (Narrow) -7.9667 0.3205 10 -24.86 <.0001
(9) Person 1 BLUP 60.9064 0.3200 10 190.32 <.0001
(10) Person 2 BLUP 57.9951 0.3200 10 181.22 <.0001
(11) Person 1 - Person 2 2.9113 0.4524 36 6.43 <.0001


The table of Covariance Parameter Estimates displays the estimates of the variance
components of the model: $\hat{\sigma}^2_\alpha = 22.858$, $\hat{\sigma}^2_{\alpha\tau} = 13.909$,
$\hat{\sigma}^2 = 0.925$. Because the data are balanced and proc mixed estimates
variance-covariance parameters by restricted maximum likelihood (by default), these
estimates coincide with the method-of-moment estimates derived from expected mean
squares and reported in Milliken and Johnson (1992, p. 286).

It is seen from the table of Estimates that the means in the broad, intermediate, and
narrow inference spaces are identical; for example, $52.3556$ is the estimate for the mean
production score of machines of type 1, regardless of inference space. The standard
errors of the three estimates are largest in the broad inference space and smallest in the
narrow inference space. The same holds for differences of the means. Notice that if one
were to analyze the data as a fixed effects model, the estimates for (1) through (8)
would be identical. Their standard errors would be incorrect, however. Estimates (9) and
(10) are predictions of random effects, and (11) is the prediction of the difference of two
random effects. If one were to incorrectly specify the model as a fixed effects model,
the estimates and their standard errors would be incorrect for (9) - (11).

7.4 Estimation and Inference


The Laird-Ware model

$$Y_i = X_i\beta + Z_ib_i + e_i, \qquad \mathrm{Var}[Y_i] = Z_iDZ_i' + R_i$$

contains a fair number of unknown quantities that must be calculated from data. Parameters
of the model are $\beta$, $D$, and $R_i$; these must be estimated. The random effects $b_i$ are not param-
eters, but random variables. These must be predicted in order to calculate cluster-specific
trends and perform cluster-specific inferences. If $\hat{\beta}$ is an estimator of $\beta$ and $\hat{b}_i$ is a predictor
of $b_i$, the population-averaged prediction of $Y_i$ is

$$\hat{Y}_i = X_i\hat{\beta}$$

and the cluster-specific prediction is calculated as

$$\hat{Y}_i = X_i\hat{\beta} + Z_i\hat{b}_i.$$
Henderson (1950) derived estimating equations for $\beta$ and $b_i$ known as the mixed model equa-
tions. We derive the equations in §A7.7.1 and their solutions in §A7.7.2. Briefly, the mixed
model equations yield

$$\begin{bmatrix}\hat{\beta}\\ \hat{b}\end{bmatrix} =
\begin{bmatrix}X'R^{-1}X & X'R^{-1}Z\\ Z'R^{-1}X & Z'R^{-1}Z + B^{-1}\end{bmatrix}^{-1}
\begin{bmatrix}X'R^{-1}y\\ Z'R^{-1}y\end{bmatrix}, \qquad [7.13]$$

where the vectors and matrices of the individual clusters were properly stacked and arranged
to eliminate the subscript $i$ and $B$ is a block-diagonal matrix whose diagonal blocks consist of
the matrix $D$ (see §A7.7.1 for details). The solutions are

$$\hat{\beta} = \left(X'V^{-1}X\right)^{-1}X'V^{-1}y \qquad [7.14]$$
$$\hat{b} = BZ'V^{-1}(y - X\hat{\beta}). \qquad [7.15]$$

The estimate $\hat{\beta}$ is a generalized least squares estimate. Furthermore, the predictor $\hat{b}$ is the best
linear unbiased predictor (BLUP) of the random effects $b$ (§A7.7.2). Properties of these ex-
pressions are easily established. For example,

$$\mathrm{E}[\hat{\beta}] = \beta, \qquad \mathrm{Var}[\hat{\beta}] = \left(X'V^{-1}X\right)^{-1}, \qquad \mathrm{E}[\hat{b}] = 0.$$

Since $b$ is a random variable, more important than evaluating $\mathrm{Var}[\hat{b}]$ is the variance of the
prediction error $\hat{b} - b$, which can be derived after some tedious calculations (Harville 1976a,
Laird and Ware 1982) as

$$\mathrm{Var}[\hat{b} - b] = B - BZ'V^{-1}ZB + BZ'V^{-1}X\left(X'V^{-1}X\right)^{-1}X'V^{-1}ZB. \qquad [7.16]$$

Although not a very illuminating expression, it is [7.16] rather than $\mathrm{Var}[\hat{b}]$ that should be
reported. When predicting random variables the appropriate measure of uncertainty is the
mean square prediction error [7.16], not the variance of the predictor.
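For a small data set the quantities in [7.14] through [7.16] can be computed directly. The
following SAS/IML sketch is an added illustration rather than part of the original example;
the matrices X, Z, B, R and the response y are made up, with a single cluster, a random
intercept, and independent errors.

proc iml;
   /* hypothetical data: 4 observations, intercept and slope as fixed
      effects, one random intercept (B is 1 x 1), independent errors */
   X = {1 1, 1 2, 1 3, 1 4};           /* fixed effects design matrix  */
   Z = {1, 1, 1, 1};                   /* random effects design        */
   B = {2};                            /* Var[b]                       */
   R = I(4);                           /* R = sigma^2 I with sigma^2=1 */
   y = {2.1, 3.9, 6.2, 7.8};
   V    = Z*B*t(Z) + R;                /* marginal variance ZBZ' + R   */
   Vi   = inv(V);
   C    = inv(t(X)*Vi*X);              /* (X'V^{-1}X)^{-1}             */
   beta = C*t(X)*Vi*y;                 /* GLS estimate, [7.14]         */
   b    = B*t(Z)*Vi*(y - X*beta);      /* BLUP, [7.15]                 */
   /* mean square prediction error of the BLUP, [7.16]                 */
   M    = B - B*t(Z)*Vi*Z*B + B*t(Z)*Vi*X*C*t(X)*Vi*Z*B;
   print beta b M;
quit;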
Expressions [7.14] and [7.15] assume that the variance-covariance matrix $V$ and hence $D$
and $R$ are known, which they almost never are. Even in the simplest case of independent
and homoscedastic within-cluster errors where $R = \sigma^2 I$, an estimate of the variance $\sigma^2$ is re-
quired. It seems reasonable to use instead of [7.14] and [7.15] a substitution estimator/predic-
tor where $V$ is replaced with an estimate $\tilde{V}$,

$$\hat{\beta} = \left(X'\tilde{V}^{-1}X\right)^{-1}X'\tilde{V}^{-1}Y$$
$$\hat{b} = BZ'\tilde{V}^{-1}(Y - X\hat{\beta}).$$

The variance of $\hat{\beta}$ would then be similarly estimated by substituting $\tilde{V}$ into the expression for
$\mathrm{Var}[\hat{\beta}]$,

$$\widehat{\mathrm{Var}}[\hat{\beta}] = \left(X'\tilde{V}^{-1}X\right)^{-1}. \qquad [7.17]$$

$\widehat{\mathrm{Var}}[\hat{\beta}]$ is a consistent estimator of $\mathrm{Var}[\hat{\beta}]$ if $\tilde{V}$ is consistent for $V$. It is a biased estimator,
however. There are two sources to this bias in that (i) $(X'\tilde{V}^{-1}X)^{-1}$ is a biased estimator of
$(X'V^{-1}X)^{-1}$ and (ii) the variability that arises from estimating $V$ by $\tilde{V}$ is unaccounted for
(Kenward and Roger 1997). Consequently, [7.17] underestimates $\mathrm{Var}[\hat{\beta}]$. The various
approaches to estimation and inference in the linear mixed model to be discussed next depend
on how $V$ is estimated and thus which matrix is used for substitution. The most important
principles are maximum likelihood, restricted maximum likelihood, and estimated generalized
least squares (generalized estimating equations). Bias corrections and bias corrected estima-
tors of $\mathrm{Var}[\hat{\beta}]$ for likelihood-type estimation are discussed in Kackar and Harville (1984),
Prasad and Rao (1990), Harville and Jeske (1992), and Kenward and Roger (1997).


7.4.1 Maximum and Restricted Maximum Likelihood


We collect all unknown parameters in $D$ and $R$ into a parameter vector $\theta$ and call $\theta$ the vector
of covariance parameters (although it may also contain variances). The maximum likelihood
principle chooses as estimates for $\theta$ and $\beta$ those values $\hat{\theta}$ and $\hat{\beta}$ that maximize the joint Gaus-
sian distribution of the $(N \times 1)$ vector $Y$, i.e., $G(X\beta, V(\theta))$. Here, $N$ denotes the total num-
ber of observations across all $n$ clusters, $N = \sum_{i=1}^{n} n_i$. Details of this process can be found in
§A7.7.3. Briefly, the objective function to be minimized, the negative of twice the Gaussian
log-likelihood, is

$$\varphi(\beta, \theta; y) = \ln|V(\theta)| + (y - X\beta)'V(\theta)^{-1}(y - X\beta) + N\ln\{2\pi\}. \qquad [7.18]$$

The problem is solved by first profiling $\beta$ out of the equation. To this end $\beta$ in [7.18] is re-
placed with $(X'V(\theta)^{-1}X)^{-1}X'V(\theta)^{-1}y$ and the resulting expression is minimized with respect
to $\theta$. If derivatives of the profiled log-likelihood with respect to one element of $\theta$ depend on
other elements of $\theta$, the process is iterative. On occasion, some or all covariance parameters
can be estimated in noniterative fashion. For example, if $R = \sigma^2 R^*$, $B = \sigma^2 B^*$, with $R^*$ and
$B^*$ known, then

$$\hat{\sigma}^2 = \frac{1}{N}\left(y - X\hat{\beta}\right)'(ZB^*Z' + R^*)^{-1}\left(y - X\hat{\beta}\right).$$

Upon convergence of the algorithm, the final iterate $\hat{\theta}_M$ is the maximum likelihood estimate
of the covariance parameters, and

$$\hat{\beta}_M = \left(X'V(\hat{\theta}_M)^{-1}X\right)^{-1}X'V(\hat{\theta}_M)^{-1}Y \qquad [7.19]$$

is the maximum likelihood estimator of $\beta$. Since maximum likelihood estimators (MLEs) have
certain optimality properties (for example, they are asymptotically the most efficient estima-
tors), substituting the MLE of $\theta$ in the generalized least squares estimate for $\beta$ has much
appeal. The predictor for the random effects is calculated as

$$\hat{b}_M = B(\hat{\theta}_M)Z'V(\hat{\theta}_M)^{-1}\left(y - X\hat{\beta}_M\right). \qquad [7.20]$$

Maximum likelihood estimators of covariance parameters have a shortcoming. They are
usually negatively biased, that is, too small on average. The reason for this phenomenon is
that they do not take into account the number of fixed effects being estimated. A standard
example illustrates the problem. In the simple model $Y_i = \beta + e_i$ $(i = 1, \ldots, n)$ where the
$e_i$'s are a random sample from a Gaussian distribution with mean $0$ and variance $\sigma^2$, we wish
to find MLEs for $\beta$ and $\sigma^2$. The likelihood for these data is

$$\mathcal{L}(\beta, \sigma^2; y) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}(y_i - \beta)^2\right\}$$

and the values $\hat{\beta}$ and $\hat{\sigma}^2$ that maximize $\mathcal{L}(\beta, \sigma^2; y)$ necessarily minimize

$$-2l(\beta, \sigma^2; y) = \varphi(\beta, \sigma^2; y) = n\ln\{2\pi\} + n\ln\{\sigma^2\} + \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \beta)^2.$$


Setting derivatives with respect to $\beta$ and $\sigma^2$ to zero leads to two equations

$$①\colon\ \partial\varphi(\beta, \sigma^2; y)/\partial\beta = -\frac{2}{\sigma^2}\sum_{i=1}^{n}(y_i - \beta) \equiv 0$$
$$②\colon\ \partial\varphi(\beta, \sigma^2; y)/\partial\sigma^2 = \frac{n}{\sigma^2} - \frac{1}{\sigma^4}\sum_{i=1}^{n}(y_i - \beta)^2 \equiv 0.$$

Solving ① yields $\hat{\beta} = \frac{1}{n}\sum_{i=1}^{n}y_i = \bar{y}$. Substituting for $\beta$ in ② and solving yields
$\hat{\sigma}^2_M = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2$. The estimate of $\beta$ is the familiar sample mean, but the estimate of
the variance parameter is not the sample variance $s^2 = (n-1)^{-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$. Since the
sample variance is an unbiased estimator of $\sigma^2$ under random sampling from any distribution,
we see that $\hat{\sigma}^2_M$ has bias

$$\mathrm{E}\left[\hat{\sigma}^2_M - \sigma^2\right] = \mathrm{E}\left[\frac{n-1}{n}S^2\right] - \sigma^2 = \frac{n-1}{n}\sigma^2 - \sigma^2 = -\frac{1}{n}\sigma^2.$$

If " were known there would be only one estimating equation (ô) and the MLE for 5 # would
be
" 8 " 8
s# œ
5 "a]3  " b# œ "a]3  Ec]3 db# ,
8 3œ" 8 3œ"

s #Q originates in not adjusting the divisor of


which is an unbiased estimator of 5 # . The bias of 5
the sum of squares by the number of estimated parameters in the mean function. An alterna-
tive estimation method is restricted maximum likelihood (REML), also known as residual
maximum likelihood (Patterson and Thompson 1971; Harville 1974, 1977). Here, adjust-
ments are made in the objective function to be minimized that account for the number of esti-
mated mean parameters. Briefly, the idea is as follows (see §A7.7.3 for more details). Rather
than maximizing the joint distribution of Y we focus on the distribution of KY, where K is a
matrix of error contrasts. These contrasts are linear combinations of the observations such
that EcKYd œ 0. Hence we require that the inner product of each row of K with the vector of
observations has expectation !. We illustrate the principle in the simple setting examined
above where ]3 œ " € /3 a3 œ "ß âß 8b. Define a new vector

Ô ]"  ] ×
Ö ] ] Ù
Ua8"‚"b œÖ
Ö
# Ù
Ù [7.21]
ã
Õ ]8"  ] Ø

and observe that U has expectation 0 and variance-covariance matrix given by


"
Ô"  8  8" ⠁ 8" ×
Ö " "  8" ⠁ 8" Ù
VarcUd œ 5 # Ö 8 Ù œ 5 # ŒI8"  " J8"  œ 5 # P.
Ö ã ä ã Ù 8
Õ  "
 "
â " 8" Ø
8 8

Applying Theorem 8.3.4 in Graybill (1969, p. 190), the inverse of this matrix turns out to
have a surprisingly simple form,

$$\mathrm{Var}[U]^{-1} = \frac{1}{\sigma^2}\begin{bmatrix}
2 & 1 & \cdots & 1\\
1 & 2 & \cdots & 1\\
\vdots & & \ddots & \vdots\\
1 & 1 & \cdots & 2
\end{bmatrix} = \frac{1}{\sigma^2}(I_{n-1} + J_{n-1}).$$

Also, $|P| = 1/n$ and some algebra shows that $U'P^{-1}U = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$, the residual sum of
squares. The likelihood for $U$ is called the restricted likelihood of $Y$ because $U$ is restricted to
have mean $0$. It can be written as

$$\mathcal{L}(\sigma^2; u) = \frac{|\sigma^2 P|^{-1/2}}{(2\pi)^{(n-1)/2}}\exp\left\{-\frac{1}{2\sigma^2}u'P^{-1}u\right\}$$

and is no longer a function of the mean $\beta$. Minus twice the restricted log likelihood becomes

$$-2l(\sigma^2; u) = \varphi(\sigma^2; u) = (n-1)\ln\{2\pi\} - \ln\{n\}
+ (n-1)\ln\{\sigma^2\} + \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \bar{y})^2. \qquad [7.22]$$

Setting the derivative of $\varphi(\sigma^2; u)$ with respect to $\sigma^2$ to zero one obtains the estimating equa-
tion that implies the residual maximum likelihood estimate:

$$\frac{\partial\varphi(\sigma^2; u)}{\partial\sigma^2} \equiv 0 \iff \frac{n-1}{\sigma^2} = \frac{1}{\sigma^4}\sum_{i=1}^{n}(y_i - \bar{y})^2$$
$$\hat{\sigma}^2_R = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2.$$

The REML estimator for $\sigma^2$ is the sample variance and hence unbiased.
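The $n$ versus $n-1$ divisor is easy to exhibit with proc mixed itself. The following sketch is
not from the original text and the data are made up: fitting an intercept-only model by ML
and by REML, the Residual estimates in the two Covariance Parameter Estimates tables
should differ exactly by the factor $(n-1)/n$.

data simple;
   input y @@;
   datalines;
4.1 5.3 3.8 6.0 4.9 5.6
;
run;
/* ML: residual variance is the sum of squares about the mean divided by n */
proc mixed data=simple method=ml;
   model y = ;
run;
/* REML (the default): the divisor is n-1, i.e., the sample variance */
proc mixed data=simple method=reml;
   model y = ;
run;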
The choice of $U$ in [7.21] corresponds to a particular matrix $K$ such that $U = KY$. We
can express $U$ formally as $KY$ where

$$K = \left[\,I_{n-1}\ \ 0_{((n-1)\times 1)}\,\right] - \frac{1}{n}\left[\,J_{n-1}\ \ 1_{((n-1)\times 1)}\,\right].$$

If $\mathrm{E}[Y] = X\beta$, $K$ needs to be chosen such that $KY$ contains no term in $\beta$. This is equivalent
to removing the mean and considering residuals. The alternative name of residual maximum
likelihood derives from this notion. Fortunately, as long as $K$ is chosen to be of full row rank
and $KX = 0$, the REML estimates do not depend on the particular choice of error contrasts.
In the simple constant mean model $Y_i = \beta + e_i$ we could define an orthogonal contrast matrix

$$C_{((n-1)\times n)} = \begin{bmatrix}
1 & -1 & 0 & \cdots & 0 & 0\\
1 & 1 & -2 & \cdots & 0 & 0\\
\vdots & & & & & \vdots\\
1 & 1 & 1 & \cdots & 1 & -(n-1)
\end{bmatrix}$$

and a diagonal matrix $D_{((n-1)\times(n-1))} = \mathrm{Diag}\{(i + i^2)^{-1/2}\}$. Letting $K^* = DC$ and $U^* = K^*Y$, then
$\mathrm{E}[U^*] = 0$, $\mathrm{Var}[U^*] = \sigma^2 DCC'D = \sigma^2 I_{n-1}$, $u^{*\prime}u^* = \sum_{i=1}^{n}(y_i - \bar{y})^2$ and minus twice the log
likelihood of $U^*$ is

$$\varphi(\sigma^2; u^*) = (n-1)\ln\{2\pi\} + (n-1)\ln\{\sigma^2\} + \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \bar{y})^2. \qquad [7.23]$$

Apart from the constant $\ln\{n\}$ this expression is identical to [7.22] and minimization of either
function will lead to the same REML estimator of $\sigma^2$. For more details on REML estimation
see Harville (1974) and Searle et al. (1992, Ch. 6.6). Two generic methods for constructing
the K matrix are described in §A7.7.3.
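The error-contrast construction can be checked numerically. The following SAS/IML sketch
is an added illustration (n and y are arbitrary made-up values): it builds the Helmert-type
contrast matrix $C$ and the scaling matrix $D$, and verifies that $K^*X = 0$ and that
$u^{*\prime}u^*$ equals the sum of squares about the mean.

proc iml;
   n = 5;
   y = {3.1, 4.0, 2.6, 5.2, 4.4};
   /* Helmert-type contrasts: row i has i ones followed by -i, then zeros */
   C = j(n-1, n, 0);
   do i = 1 to n-1;
      C[i, 1:i] = 1;
      C[i, i+1] = -i;
   end;
   iv = t(1:(n-1));
   D  = diag(1/sqrt(iv + iv##2));        /* Diag{(i+i^2)^(-1/2)}           */
   Kstar = D*C;
   X  = j(n, 1, 1);                      /* constant mean model            */
   zero = Kstar*X;                       /* should be a vector of zeros    */
   u  = Kstar*y;
   ss_contrast = t(u)*u;                 /* u*'u*                          */
   ss_mean     = t(y - y[:])*(y - y[:]); /* sum of squares about the mean  */
   print zero ss_contrast ss_mean;
quit;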
REML estimates of variance components and covariance parameters have less bias than
maximum likelihood estimates and in certain situations (e.g., certain balanced designs) are
unbiased. In a balanced completely randomized design with subsampling, fixed treatment
effects, and $n$ subsamples per experimental unit, for example, it is well-known that the
observational error mean square and experimental error mean square have expectations

$$\mathrm{E}[MS(OE)] = \sigma^2_o$$
$$\mathrm{E}[MS(EE)] = \sigma^2_o + n\sigma^2_e,$$

where $\sigma^2_o$ and $\sigma^2_e$ denote the observational and experimental error variances, respectively. The
ANOVA method of estimation (Searle et al. 1992, Ch. 4.4) equates mean squares to their
expectations and solves for the variance components. From the above equations we derive the
estimators

$$\hat{\sigma}^2_o = MS(OE)$$
$$\hat{\sigma}^2_e = \frac{1}{n}\{MS(EE) - MS(OE)\}.$$

These estimators are unbiased by construction and identical to the REML estimators in this
case (for an application see §7.6.3). A closer look at $\hat{\sigma}^2_e$ shows that this quantity could
possibly be negative. Likelihood estimators must be values in the parameter space. Since
$\sigma^2_e > 0$, a value $\hat{\sigma}^2_e < 0$ is considered only a solution to the likelihood estimation problem, but
not a likelihood estimate. Unfortunately, to retain unbiasedness, one has to allow for the
possibility of a negative value. One should choose $\max\{\hat{\sigma}^2_e, 0\}$ as the REML estimator in-
stead. While this introduces some bias, it is the appropriate course of action. Corbeil and
Searle (1976) derive solutions for the ML and REML estimates for four standard classifica-
tion models when data are balanced and examine the properties of the solutions. They call the
solutions "ML estimators" or "REML estimators," acknowledging that ignoring the positivity
requirement does not produce true likelihood estimators. Lee and Kapadia (1984) examine the
bias and variance of ML and REML estimators for one of Corbeil and Searle's models for
which the REML solutions are unbiased. This is the balanced two-way mixed model without
interaction,

$$Y_{ij} = \mu + \alpha_i + \beta_j + e_{ij}, \qquad (i = 1, \ldots, a;\ j = 1, \ldots, b).$$

Here, $\alpha_i$ could correspond to the effects of a fixed treatment factor with $a$ levels and $\beta_j$ to the
random effects of a random factor with $b$ levels, $\beta_j \sim G(0, \sigma^2_b)$. Observe that there is only a
single observation per combination of the two factors, that is, the design is nonreplicated. The
experimental errors are assumed independent Gaussian with mean $0$ and variance $\sigma^2$. Table
7.3, adapted from Lee and Kapadia (1984), shows the bias, variance, and mean square error
of the maximum likelihood and restricted maximum likelihood estimators of $\sigma^2$ and $\sigma^2_b$ for
$a = 6$, $b = 10$. ML estimators have the smaller variability throughout, but show non-negli-
gible negative bias, especially if the variability of the random effect is small relative to the
error variability. In terms of the mean square error (Variance + Bias$^2$), REML estimators of
$\sigma^2$ are superior to ML estimators but the reverse is true for estimates of $\sigma^2_b$. Provided $\sigma^2_b$
accounts for at least 50% of the response variability, REML estimators are essentially un-
biased, since then the probability of obtaining a negative solution for $\sigma^2_b$ tends quickly to zero.
Returning to mixed models of the Laird-Ware form, it must be noted that the likelihood
for $KY$ in REML estimation does not contain any information about the fixed effects $\beta$.
REML estimation will produce estimates for $\theta$ only. Once these estimates have been obtained
we again put the substitution principle to work. If $\hat{\theta}_R$ is the REML estimate of $\theta$, the fixed
effects are estimated as

$$\hat{\beta}_R = \left(X'V(\hat{\theta}_R)^{-1}X\right)^{-1}X'V(\hat{\theta}_R)^{-1}y, \qquad [7.24]$$

and the random effects are predicted as

$$\hat{b}_R = B(\hat{\theta}_R)Z'V(\hat{\theta}_R)^{-1}\left(y - X\hat{\beta}_R\right). \qquad [7.25]$$

Because the elements of $\theta$ were estimated, [7.24] is no longer a generalized least squares
(GLS) estimate. Because the substituted estimate $\hat{\theta}_R$ is not a maximum likelihood estimate,
[7.24] is also not a maximum likelihood estimate. Instead, it is termed an Estimated GLS
(EGLS) estimate.

Table 7.3. Bias ($B$), variance ($Var$), and mean square error ($MSE$) of ML and REML
estimates in a balanced, two-way mixed linear model without replication†
(fixed factor A has 6, random factor B has 10 levels)

Estimates of $\sigma^2$:
                                    ML                       REML
$Var[\beta_j]/Var[Y_{ij}]$     $B$     $Var$    $MSE$     $B$     $Var$    $MSE$
        0.1                 -0.099   0.027    0.037   -0.010   0.033    0.034
        0.3                 -0.071   0.017    0.022   -0.011   0.021    0.022
        0.5                 -0.050   0.009    0.011   -0.000   0.011    0.011
        0.7                 -0.030   0.003    0.004   -0.000   0.004    0.004
        0.9                 -0.010   0.000    0.000   -0.000   0.000    0.000

Estimates of $\sigma^2_b$:
                                    ML                       REML
$Var[\beta_j]/Var[Y_{ij}]$     $B$     $Var$    $MSE$     $B$     $Var$    $MSE$
        0.1                 -0.001   0.010    0.010    0.010   0.012    0.012
        0.3                 -0.029   0.031    0.032    0.001   0.039    0.039
        0.5                 -0.050   0.062    0.064    0.000   0.076    0.076
        0.7                 -0.070   0.101    0.106    0.000   0.106    0.125
        0.9                 -0.090   0.151    0.159    0.000   0.187    0.187

† Adapted from Table 1 in Lee and Kapadia (1984). With permission of the International
Biometric Society.


Because REML estimation is based on the likelihood principle and REML estimators
have lower bias than maximum likelihood estimators, we prefer REML for parameter estima-
tion in mixed models over maximum likelihood estimation and note that it is the default
method of the mixed procedure in The SAS® System.

7.4.2 Estimated Generalized Least Squares


Maximum likelihood estimation of $(\beta, \theta)$ and restricted maximum likelihood estimation of $\theta$
are numerically expensive processes. Also, the estimating equations used in these procedures
rely on the distribution of the random effects $b_i$ and the within-cluster errors $e_i$ being
Gaussian. If the model

$$Y_i = X_i\beta + Z_ib_i + e_i$$
$$e_i \sim (0, R_i)$$
$$b_i \sim (0, D)$$

holds with $\mathrm{Cov}[e_i, b_i] = 0$, the generalized least squares estimator

$$\hat{\beta}_{GLS} = \left(\sum_{i=1}^{n}X_i'V_i(\theta)^{-1}X_i\right)^{-1}\sum_{i=1}^{n}X_i'V_i(\theta)^{-1}Y_i
= \left(X'V(\theta)^{-1}X\right)^{-1}X'V(\theta)^{-1}Y \qquad [7.26]$$

can be derived without any further distributional assumptions such as Gaussianity of $b_i$
and/or $e_i$. Only the first two moments of the marginal distribution of $Y_i$, $\mathrm{E}[Y_i] = X_i\beta$ and
$\mathrm{Var}[Y_i] = V_i = Z_iDZ_i' + R_i$, are required. The idea of estimated generalized least squares
(EGLS) is to substitute for $V(\theta)$ a consistent estimator. If $\hat{\theta}$ is consistent for $\theta$ one can simply
substitute $\hat{\theta}$ for $\theta$ and use $\hat{V} = V(\hat{\theta})$. The ML and REML estimates [7.19] and [7.24] are of
this form. Alternatively, one can estimate $D$ and $R$ directly and use $\hat{V} = Z\hat{B}Z' + \hat{R}$. In either
case the estimator of the fixed effects becomes

$$\hat{\beta}_{EGLS} = \left(\sum_{i=1}^{n}X_i'\hat{V}_i^{-1}X_i\right)^{-1}\sum_{i=1}^{n}X_i'\hat{V}_i^{-1}Y_i
= \left(X'\hat{V}^{-1}X\right)^{-1}X'\hat{V}^{-1}Y. \qquad [7.27]$$

The ML [7.19], REML [7.24], and EGLS [7.27] estimators of the fixed effects are of the
same general form, and they differ only in how $V$ is estimated. EGLS is appealing when $V$
can be estimated quickly, preferably with a noniterative method. Vonesh and Chinchilli
(1997, Ch. 8.2.4) argue that in applications with a sufficient number of observations and
when interest lies primarily in $\beta$, little efficiency is lost. Two basic noniterative methods are
outlined in §A7.7.4 for the case where within-cluster observations are uncorrelated and
homoscedastic, that is, $R_i = \sigma^2 I$. The first method estimates $D$ and $\sigma^2$ by the method of
moments and predicts the random effects with the usual formulas such as [7.15], substituting
$\hat{V}$ for $V$. The second method estimates the random effects $b_i$ by regression methods first and
calculates an estimate $\hat{D}$ from the $\hat{b}_i$.


7.4.3 Hypothesis Testing


Testing of hypotheses in linear mixed models proceeds along very similar lines as in the
standard linear model without random effects. Some additional complications arise, however,
since the distribution theory of the standard test statistics is not straightforward. To motivate
the issues, we distinguish three cases:
(i) $\mathrm{Var}[Y] = V$ is completely known;
(ii) $\mathrm{Var}[Y] = \sigma^2 V = \sigma^2 ZBZ' + \sigma^2 R$ is known up to the scalar constant $\sigma^2$;
(iii) $\mathrm{Var}[Y] = V(\theta)$ or $\mathrm{Var}[Y] = \sigma^2 V(\theta^*)$ depends on unknown parameters $\theta = [\sigma^2, \theta^*]$.

Of concern is a testable hypothesis of the same form as in the fixed effects linear model,
namely $H_0\colon A\beta = d$. For cases (i) and (ii) we develop in §A7.7.5 that for Gaussian random
effects $b_i$ and within-cluster errors $e_i$ exact tests exist. Briefly, in case (i) the statistic

$$W = (A\hat{\beta} - d)'\left[A\left(X'V^{-1}X\right)^{-1}A'\right]^{-1}(A\hat{\beta} - d) \qquad [7.28]$$

is distributed under the null hypothesis as a Chi-squared variable with $r(A)$ degrees of
freedom, where $r(A)$ denotes the rank of the matrix $A$. Similarly, in case (ii) we have

$$W = (A\hat{\beta} - d)'\left[A\left(X'V^{-1}X\right)^{-1}A'\right]^{-1}(A\hat{\beta} - d)/\sigma^2 \sim \chi^2_{r(A)}.$$

If the unknown $\sigma^2$ is replaced with

$$\hat{\sigma}^2 = \frac{1}{N - r(X)}\left(y - X\hat{\beta}\right)'V^{-1}\left(y - X\hat{\beta}\right), \qquad [7.29]$$

then

$$F_{obs} = \frac{(A\hat{\beta} - d)'\left[A\left(X'V^{-1}X\right)^{-1}A'\right]^{-1}(A\hat{\beta} - d)}{r(A)\,\hat{\sigma}^2} \qquad [7.30]$$

is distributed as an $F$ variable with $r(A)$ numerator and $N - r(X)$ denominator degrees of
freedom. Notice that $\hat{\sigma}^2$ is not the maximum likelihood estimator of $\sigma^2$. A special case of
[7.30] is when $A$ is a vector of zeros with a $1$ in the $j$th position and $d = 0$. The linear
hypothesis $H_0\colon A\beta = 0$ then becomes $H_0\colon \beta_j = 0$ and $F_{obs}$ can be written as

$$F_{obs} = \frac{\hat{\beta}_j^2}{\mathrm{ese}(\hat{\beta}_j)^2},$$

where $\mathrm{ese}(\hat{\beta}_j)$ is the estimated standard error of $\hat{\beta}_j$. This $F_{obs}$ has one numerator degree of
freedom and consequently

$$t_{obs} = \mathrm{sign}(\hat{\beta}_j) \times \sqrt{F_{obs}} \qquad [7.31]$$

is distributed as a $t$ random variable with $N - r(X)$ degrees of freedom. A $100(1-\alpha)\%$
confidence interval for $\beta_j$ is constructed as

$$\hat{\beta}_j \pm t_{\alpha/2,\,N-r(X)} \times \mathrm{ese}(\hat{\beta}_j).$$

The problematic case is (iii), where the marginal variance-covariance matrix is unknown
and more than just a scalar constant must be estimated from the data. The proposal is to
replace $V$ with $V(\hat{\theta})$ in [7.28] and $\hat{\sigma}^{2*}V(\hat{\theta}^*)$ in [7.30] and to use as test statistics

$$W^* = (A\hat{\beta} - d)'\left[A\left(X'V(\hat{\theta})^{-1}X\right)^{-1}A'\right]^{-1}(A\hat{\beta} - d) \qquad [7.32]$$

if $\mathrm{Var}[Y] = V(\theta)$ and

$$F^*_{obs} = \frac{(A\hat{\beta} - d)'\left[A\left(X'V(\hat{\theta}^*)^{-1}X\right)^{-1}A'\right]^{-1}(A\hat{\beta} - d)}{r(A)\,\hat{\sigma}^{2*}} \qquad [7.33]$$

if $\mathrm{Var}[Y] = \sigma^2 V(\theta^*)$. $W^*$ is compared against cutoffs from a Chi-squared distribution with
$r(A)$ degrees of freedom and $F^*_{obs}$ against cutoffs from an $F$ distribution. Unfortunately, this
substitution has dramatic consequences for the distribution of the resulting statistics. Assume
that the estimator of the covariance parameters being substituted is consistent. One can then
show that, asymptotically, the distribution of $W^*$ is $\chi^2$ with $r(A)$ degrees of freedom, but the
asymptotic distribution of $F^*_{obs}$ is not that of an $F$ random variable. The reason is that $\hat{\sigma}^{2*}$
converges in distribution to $\sigma^2$ and in the limit $F^*_{obs}$ is not the ratio of two independent $\chi^2$
variables divided by their respective degrees of freedom. By comparing $F^*_{obs}$ to cutoffs from
an $F_{r(A),\,N-r(X)}$ distribution, we do not utilize the correct asymptotic distribution. This argu-
ment should lead one to favor [7.32] over [7.33]. It has been established empirically,
however, that the Type-I error rates of tests based on [7.33] are closer to the nominal rates
than those of [7.32], and $p$-values of the Chi-square test will be smaller. There is a heuristic
explanation for this phenomenon since [7.32] essentially corresponds to using an $F$ distribu-
tion with infinitely many denominator degrees of freedom.
In certain balanced cases, for example, in complete split-plot designs, tests based on
[7.32] and [7.33] can be exact. In general, one should anticipate, however, that the tests could
be distorted. When substituting an estimate of $\theta$, $A(X'V(\hat{\theta})^{-1}X)^{-1}A'$ is not an unbiased esti-
mator of the variance of $A\hat{\beta}$, which is the centerpiece in [7.28] and [7.30]. Even if $V(\hat{\theta})$ were
unbiased for $V(\theta)$, $(X'V(\hat{\theta})^{-1}X)^{-1}$ underestimates $\mathrm{Var}[\hat{\beta}]$ since the uncertainty arising from
substituting $\hat{\theta}$ for $\theta$ is not accounted for. Bias corrections and bias corrected estimators of
$\mathrm{Var}[\hat{\beta}]$ were developed by Kackar and Harville (1984), Prasad and Rao (1990), and Harville
and Jeske (1992). Kenward and Roger (1997) combine a bias adjustment with a degree of
freedom correction applied to the $F$ test [7.33] that ensures that the actual Type-I error rate is
close to the nominal rate in complex mixed models and models with complex error structure.
They anticipate the correction to be necessary when sample size is small. But what is a small
sample size? To demonstrate the bias in the various tests, we simulated 1,000 repetitions of a
repeated measures experiment under the following conditions. Four treatments are applied in
a completely randomized design with three replications, and repeated measurements are
collected at times $t = 1, 2, 3, 4, 5$ on all experimental units. No observations are missing and
all units are measured at the same time intervals. The linear mixed model for this experiment
is

$$Y_{ijk} = \mu + \tau_i + e_{ij} + t_k + (\tau t)_{ik} + d_{ijk}, \qquad [7.34]$$

where $\tau_i$ measures the effect of treatment $i$ and $e_{ij}$ are independent experimental errors
associated with replication $j$ of treatment $i$. The terms $t_k$ and $(\tau t)_{ik}$ denote time effects and
treatment $\times$ time interactions. Finally, $d_{ijk}$ are random disturbances among the serial
measurements from an experimental unit. It is assumed that these disturbances are serially
correlated according to

$$\mathrm{Corr}[d_{ijk}, d_{ijk'}] = \exp\left\{-\frac{|t_k - t_{k'}|}{\phi}\right\}. \qquad [7.35]$$

This is known as the exponential correlation model (§7.5). The parameter $\phi$ determines the
strength of the correlation of two disturbances $|t_k - t_{k'}|$ time units apart. Observations from
different experimental units were assumed to be uncorrelated in keeping with the random
assignment of treatments to experimental units. We simulated the experiments for various
values of $\phi$ (Table 7.4).

Table 7.4. Correlations among the disturbances $d_{ijk}$ in model [7.34]
based on the exponential correlation model [7.35]

                        Values of $\phi$ in simulation
$|t_k - t_{k'}|$      1/3        1        2        3        4
        1           0.049    0.368    0.606    0.717    0.778
        2           0.002    0.135    0.368    0.513    0.607
        3           0.000    0.049    0.223    0.368    0.472
        4           0.000    0.018    0.135    0.264    0.368
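The entries of Table 7.4 follow directly from [7.35] and can be recomputed with a few
lines of SAS/IML (added here for illustration):

proc iml;
   phi = {0.33333 1 2 3 4};        /* values of phi in the simulation   */
   h   = t(1:4);                   /* temporal separations |t_k - t_k'| */
   corr = exp(-h*(1/phi));         /* exp{-h/phi}, a 4 x 5 table        */
   print corr;
quit;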

The treatment $\times$ time cell mean structure is shown in Figure 7.8. The treatment means
were chosen so that there was no marginal time effect and no marginal treatment effect. Also,
there was no difference of the treatments at times 2, 4, or 5. In each of the 1,000 realizations
of the experiment we tested the hypotheses
① $H_0$: no treatment main effect
② $H_0$: no time main effect
③ $H_0$: no treatment effect at time 2
④ $H_0$: no treatment effect at time 4.

These null hypotheses are true and at the 5% significance level the nominal Type-I error rate
of the tests should be 0.05. An appropriate test procedure will be close to this nominal rate
when average rejection rates are calculated across the 1,000 repetitions.
The following tests were performed:
• The exact Chi-square test based on [7.28] where the true values of the covariance para-
meters were used. These values were chosen as $\mathrm{Var}[e_{ij}] = \sigma^2_e = 1$, $\mathrm{Var}[d_{ijk}] =
\sigma^2_d = 1$, and $\phi$ according to Table 7.4;
• The asymptotic Chi-square test based on [7.32] where the restricted maximum likeli-
hood estimates of $\sigma^2_e$, $\sigma^2_d$, and $\phi$ were substituted;
• The asymptotic $F$ test based on [7.33] where the restricted maximum likelihood
estimates of $\sigma^2_e$, $\sigma^2_d$, and $\phi$ were substituted;
• The $F$ test based on Kenward and Roger (1997) employing a bias correction in the esti-
mation of $\mathrm{Var}[\hat{\beta}]$ coupled with a degree of freedom adjusted $F$ test.

[Figure 7.8. Treatment $\times$ time cell means in repeated measures simulation based on model
[7.34]; cell means plotted against time $t = 1, \ldots, 5$, with treatment 2 at the top, treatments
3 and 4 at zero, and treatment 1 at the bottom.]

The proc mixed statements that produce these tests are as follows.
/* Analysis with correct covariance parameter values */
/* Exact Chi-square test [7.28] */
proc mixed data=sim noprofile;
  class rep tx t;
  model y = tx t tx*t / Chisq;
  random rep(tx);
  repeated /subject=rep(tx) type=sp(exp)(time);
  /* First parameter is Var[rep(tx)] */
  /* Second parameter is Var[e] */
  /* Last parameter is range, passed here as a macro variable */
  parms (1) (1) (&phi) / hold=1,2,3;
  by repetition;
run;

/* Asymptotic Chi-square and F tests [7.32] and [7.33] */
proc mixed data=sim;
  class rep tx t;
  model y = tx t tx*t / Chisq;
  random rep(tx);
  repeated /subject=rep(tx) type=sp(exp)(time);
  by repetition;
run;

/* Kenward-Roger F tests */
proc mixed data=sim;
  class rep tx t;
  model y = tx t tx*t / ddfm=KenwardRoger;
  random rep(tx);
  repeated /subject=rep(tx) type=sp(exp)(time);
  by repetition;
run;

Table 7.5. Simulated Type-I error rates for exact and asymptotic Chi-square and $F$ tests and
Kenward-Roger adjusted $F$-test (KR-$F$ denotes $F^*_{obs}$ in the Kenward-Roger test and
KR-df the denominator degrees of freedom for KR-$F$)

                 Exact $\chi^2$   Asymp. $\chi^2$   Asymp. $F$
$\phi$   $H_0$     [7.28]          [7.32]          [7.33]      KR-$F$   KR-df
1/3     ①         0.051           0.130           0.051       0.055     8.1
        ②         0.049           0.092           0.067       0.056    24.9
        ③         0.051           0.103           0.077       0.051    12.4
        ④         0.056           0.101           0.085       0.055    12.4
1       ①         0.055           0.126           0.048       0.051     8.3
        ②         0.049           0.091           0.067       0.054    25.6
        ③         0.060           0.085           0.069       0.055    18.1
        ④         0.055           0.081           0.067       0.053    18.1
2       ①         0.054           0.106           0.046       0.055     8.6
        ②         0.049           0.091           0.067       0.056    26.2
        ③         0.055           0.075           0.059       0.054    22.3
        ④         0.049           0.073           0.058       0.055    22.3
3       ①         0.053           0.107           0.039       0.049     8.8
        ②         0.049           0.092           0.070       0.060    26.6
        ③         0.049           0.075           0.053       0.048    24.5
        ④         0.048           0.069           0.051       0.048    24.5
4       ①         0.056           0.105           0.038       0.050     9.0
        ②         0.049           0.090           0.070       0.061    27.1
        ③         0.053           0.068           0.052       0.049    25.8
        ④         0.042           0.066           0.047       0.044    25.8

The results are displayed in Table 7.5. The exact Chi-square test maintains the nominal
Type-I error rate, as it should. The fluctuations around 0.05 are due to simulation variability.
Increasing the number of repetitions will decrease this variability. When REML estimators
are substituted for $\theta$, the asymptotic Chi-square test performs rather poorly. The Type-I errors
are substantially inflated; in many cases they are more than doubled. The asymptotic $F$ test
performs better, but the Type-I errors are typically somewhat inflated. Notice that with
increasing strength of the serial correlation (increasing $\phi$) the inflation is less severe for the
asymptotic test, as was also noted by Kenward and Roger (1997). The bias and degree of
freedom adjusted Kenward-Roger test performs extremely well. The actual Type-I errors are
very close to the nominal error rate of 0.05. Even with sample sizes as small as 60 one should
consider this test as a suitable procedure if exact tests of the linear hypothesis do not exist.

The tests discussed so far are based on the exact ($V(\theta)$ known) or asymptotic ($V(\theta)$
unknown) distribution of the fixed effect estimates $\hat{\beta}$. Tests of any model parameters, $\beta$ or $\theta$,

can also be conducted based on the likelihood ratio principle, provided the hypothesis being
tested is a simple restriction on $(\beta, \theta)$ so that the restricted model is nested within the full
model. If $(\hat{\beta}_M, \hat{\theta}_M)$ are the maximum likelihood estimates in the full model and $(\tilde{\beta}_M, \tilde{\theta}_M)$ are
the maximum likelihood estimates under the restricted model, the likelihood-ratio test statistic

$$\Lambda = 2\left\{l(\hat{\beta}_M, \hat{\theta}_M; y) - l(\tilde{\beta}_M, \tilde{\theta}_M; y)\right\} \qquad [7.36]$$

has an asymptotic $\chi^2$ distribution with degrees of freedom equal to the number of restrictions
imposed. In REML estimation the hypothesis would be imposed on the covariance parameters
only (more on this below) and the (residual) likelihood ratio statistic is

$$\Lambda = 2\left\{l(\hat{\theta}_R; y) - l(\tilde{\theta}_R; y)\right\}.$$

Likelihood ratio tests are our test of choice to test hypotheses about the covariance
parameters (provided the two models are nested). For example, consider again model [7.34]
with equally spaced repeated measurements. The correlation of the random variables $d_{ijk}$ can
also be expressed as

$$\mathrm{Corr}[d_{ijk}, d_{ijk'}] = \rho^{|t_k - t_{k'}|}, \qquad [7.37]$$

where $\rho = \exp\{-1/\phi\}$ in [7.35]. This is known as the first-order autoregressive correlation
model (see §7.5.2 for details about the genesis of this model). For a single replicate the
correlation matrix now becomes

$$\mathrm{Corr}[d_{ij}] = \begin{bmatrix}
1 & \rho & \rho^2 & \rho^3 & \rho^4\\
\rho & 1 & \rho & \rho^2 & \rho^3\\
\rho^2 & \rho & 1 & \rho & \rho^2\\
\rho^3 & \rho^2 & \rho & 1 & \rho\\
\rho^4 & \rho^3 & \rho^2 & \rho & 1
\end{bmatrix}.$$

A test of $H_0\colon \rho = 0$ would address the question of whether the within-cluster disturbances (the
repeated measures errors) are independent. In this case the mixed effects structure would be
identical to that from a standard split-plot design where the temporal measurements comprise
the sub-plot treatments. To test $H_0\colon \rho = 0$ with the likelihood ratio test the model is fit first
with the autoregressive structure and then with an independence structure ($\rho = 0$). Twice the
difference of their restricted log likelihoods is then calculated and compared to the cutoffs
from a Chi-squared distribution with one degree of freedom. We illustrate the process with
one of the repetitions from the simulation experiment. The full model is fit with the mixed
procedure of The SAS® System:
proc mixed data=sim;
class rep tx t;
model y = tx t tx*t;
random rep(tx);
repeated /subject=rep(tx) type=ar(1);
run;

The repeated statement indicates that the combinations of the replication and treatment
variables, which identify the experimental unit, are the clusters which are considered
independent. All observations that share the same replication and treatment values are con-
sidered the within-cluster observations and are correlated according to the autoregressive
model (type=ar(1), Output 7.2). Notice the table of Covariance Parameter Estimates
which reports the estimates of the covariance parameters $\hat{\sigma}^2_e = 1.276$, $\hat{\sigma}^2_d = 0.0658$, and
$\hat{\rho} = 0.2353$. The table of Fit Statistics shows the residual log likelihood $l(\hat{\theta}_R; y) =
-29.3$ as Res Log Likelihood.

Output 7.2.
The Mixed Procedure

Model Information

Data Set WORK.SIM


Dependent Variable y
Covariance Structures Variance Components, Autoregressive
Subject Effect rep(tx)
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Containment

Dimensions
Covariance Parameters 3
Columns in X 30
Columns in Z 12
Subjects 1
Max Obs Per Subject 60
Observations Used 60
Observations Not Used 0
Total Observations 60

Covariance Parameter Estimates


Cov Parm Subject Estimate
rep(tx) 1.2762
AR(1) rep(tx) 0.2353
Residual 0.06580

Fit Statistics
Res Log Likelihood -29.3
Akaike's Information Criterion -32.3
Schwarz's Bayesian Criterion -33.0
-2 Res Log Likelihood 58.6

Fitting the model with an independence structure ($\rho = 0$) is accomplished with the
statements (Output 7.3)
proc mixed data=sim;
  class rep tx t;
  model y = tx t tx*t;
  random rep(tx);
run;

Observe that the reduced model has only two covariance parameters, $\sigma^2_e$ and $\sigma^2_d$. Its
residual log likelihood is $l(\tilde{\theta}_R; y) = -29.6559$. Since one parameter was removed, the
likelihood ratio test compares

$$\Lambda = 2\{-29.3 - (-29.65)\} = 0.7$$

against a $\chi^2_1$ distribution. The $p$-value for this test is $\Pr(\chi^2_1 > 0.7) = 0.4028$ and $H_0\colon \rho = 0$
cannot be rejected. The $p$-value for this likelihood-ratio test can be conveniently computed
with The SAS® System:


data test; p = 1-ProbChi(0.7,1); run; proc print data=test; run;

Output 7.3. The Mixed Procedure

Model Information
Data Set WORK.SIM
Dependent Variable y
Covariance Structure Variance Components
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Containment

Dimensions
Covariance Parameters 2
Columns in X 30
Columns in Z 12
Subjects 1
Max Obs Per Subject 60
Observations Used 60
Observations Not Used 0
Total Observations 60

Covariance Parameter Estimates


Cov Parm Estimate
rep(tx) 1.2763
Residual 0.05798

Fit Statistics
Res Log Likelihood -29.6
Akaike's Information Criterion -31.6
Schwarz's Bayesian Criterion -32.1
-2 Res Log Likelihood 59.3

In the preceding test we used the restricted maximum likelihood objective function for
testing $H_0\colon \rho = 0$. This is justified since the hypothesis was about a covariance parameter. As
discussed in §7.4.1, the restricted likelihood contains information about the covariance
parameters only, not about the fixed effects $\beta$. To test hypotheses about $\beta$ based on the likeli-
hood ratio principle one should fit the model by maximum likelihood. With the mixed
procedure this is accomplished by adding the method=ml option to the proc mixed statement,
e.g.,
proc mixed data=sim method=ml;
  class rep tx t;
  model y = tx t tx*t;
  random rep(tx);
  repeated /subject=rep(tx) type=ar(1);
run;

At times the full and reduced models are not nested, that is, the restricted model cannot
be obtained from the full model by simply constraining or setting to zero some of its param-
eters. For example, to compare whether the correlations between the repeated measurements
follow the exponential model

$$\mathrm{Corr}[d_{ijk}, d_{ijk'}] = \exp\left\{-\frac{|t_k - t_{k'}|}{\phi}\right\}$$

or the spherical model

$$\mathrm{Corr}[d_{ijk}, d_{ijk'}] = \left\{1 - \frac{3}{2}\left(\frac{|t_k - t_{k'}|}{\phi}\right)
+ \frac{1}{2}\left(\frac{|t_k - t_{k'}|}{\phi}\right)^3\right\} I(|t_k - t_{k'}| \le \phi),$$

one cannot nest one model within the other. A different test procedure is needed. The method
commonly used relies on comparing overall goodness-of-fit statistics of the competing
models (Bozdogan 1987, Wolfinger 1993a). The most important ones are Akaike's informa-
tion criterion (AIC, Akaike 1974) and Schwarz' criterion (Schwarz 1978). Both are functions
of the (restricted) log likelihood with penalty terms added for the number of covariance
parameters. In some releases of proc mixed a smaller value of the AIC or Schwarz' criterion
indicates a better fit; in other versions of these two criteria larger values indicate a better fit.
One cannot associate degrees of significance or $p$-values with these measures. They are
interpreted in a greater/smaller is better sense only. For the two models fitted above AIC is
$-32.3$ for the model with autoregressive error terms and $-31.6$ for the model with
independent error terms. On the larger-is-better scale reported in Outputs 7.2 and 7.3, the
AIC criterion leads to the same conclusion as the likelihood-ratio test: the independence
model fits this particular set of data better. We recommend using the AIC or Schwarz
criterion for models with the same fixed effects terms but different, non-nested covariance
structures.
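Both candidate structures are available through the type= option of the repeated statement,
so the non-nested comparison amounts to fitting the model twice and comparing the AIC
values from the two Fit Statistics tables. The sketch below is added for illustration and
reuses the variable names of the simulation data set:

/* exponential within-cluster correlation */
proc mixed data=sim;
  class rep tx t;
  model y = tx t tx*t;
  random rep(tx);
  repeated /subject=rep(tx) type=sp(exp)(time);
run;
/* spherical within-cluster correlation */
proc mixed data=sim;
  class rep tx t;
  model y = tx t tx*t;
  random rep(tx);
  repeated /subject=rep(tx) type=sp(sph)(time);
run;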

7.5 Correlations in Mixed Models

7.5.1 Induced Correlations and the Direct Approach


A feature of mixed models is to induce correlations even if all the random variables in a
model are uncorrelated. While for many applications this is an incidental property of the
model structure, the modeler can also purposefully employ this feature in cases where data
are known to be correlated, as in longitudinal studies where serial correlation among the
repeat measurements is typical. We illustrate the phenomenon of inducing correlations with a
simple example.
Consider a linear regression model where the cluster-specific intercepts vary at random
(see, for example, Figure 7.6b). Casting this model in the Laird-Ware framework, we write
for observation $j$ from cluster $i$,

$$Y_{ij} = (\beta_0 + b_i) + \beta_1 x_{ij} + e_{ij} \qquad [7.38]$$
$$\mathrm{Var}[b_i] = \sigma^2_b, \quad \mathrm{Var}[e_{ij}] = \sigma^2$$
$$i = 1, \ldots, n;\ j = 1, \ldots, n_i.$$

In terms of matrices and vectors, $Y_i = X_i\beta + Z_ib_i + e_i$, the model for the $i$th cluster is
written as

$$\begin{bmatrix} Y_{i1}\\ Y_{i2}\\ \vdots\\ Y_{in_i} \end{bmatrix} =
\begin{bmatrix} 1 & x_{i1}\\ 1 & x_{i2}\\ \vdots & \vdots\\ 1 & x_{in_i} \end{bmatrix}
\begin{bmatrix} \beta_0\\ \beta_1 \end{bmatrix} +
\begin{bmatrix} 1\\ 1\\ \vdots\\ 1 \end{bmatrix} b_i +
\begin{bmatrix} e_{i1}\\ e_{i2}\\ \vdots\\ e_{in_i} \end{bmatrix}.$$


Notice that the $Z_i$ matrix is the first column of the $X_i$ matrix (a random intercept). If the error
terms $e_{ij}$ are homoscedastic and uncorrelated and $\mathrm{Cov}[b_i, e_{ij}] = 0$, the marginal variance-
covariance matrix of the observations from the $i$th cluster is then

$$\mathrm{Var}[Y_i] = \sigma^2_b Z_iZ_i' + \sigma^2 I =
\sigma^2_b\begin{bmatrix}1\\1\\\vdots\\1\end{bmatrix}[\,1\ 1\ \cdots\ 1\,]
+ \sigma^2\begin{bmatrix}1 & 0 & \cdots & 0\\ 0 & 1 & \cdots & 0\\ \vdots & & \ddots & \vdots\\ 0 & 0 & \cdots & 1\end{bmatrix}$$
$$= \begin{bmatrix}\sigma^2_b + \sigma^2 & \sigma^2_b & \cdots & \sigma^2_b\\
\sigma^2_b & \sigma^2_b + \sigma^2 & \cdots & \sigma^2_b\\
\vdots & & \ddots & \vdots\\
\sigma^2_b & \sigma^2_b & \cdots & \sigma^2_b + \sigma^2\end{bmatrix}
= \sigma^2_b J + \sigma^2 I.$$

The correlation between the $j$th and $k$th observation from a cluster is

$$\mathrm{Corr}[Y_{ij}, Y_{ik}] = \rho = \sigma^2_b/(\sigma^2_b + \sigma^2), \qquad [7.39]$$

the ratio between the variability among clusters and the variability of an observation. $\mathrm{Var}[Y_i]$
can then also be expressed as

$$\mathrm{Var}[Y_i] = (\sigma^2_b + \sigma^2)\begin{bmatrix}1 & \rho & \cdots & \rho\\
\rho & 1 & \cdots & \rho\\ \vdots & & \ddots & \vdots\\ \rho & \rho & \cdots & 1\end{bmatrix}. \qquad [7.40]$$

The variance-covariance structure [7.40] is known as compound symmetry (CS), the
exchangeable structure, or the equicorrelation structure, and $\rho$ is also called the intracluster
correlation coefficient. In this model all observations within a cluster are correlated by the
same amount. The structure arises with nested random effects, i.e., $Z_i$ is a vector of ones and
the within-cluster errors are uncorrelated. Hence, subsampling and split-plot designs exhibit
this correlation structure. We term correlations that arise from the nested character of the
random effects ($b_i$ is a cluster-level random variable, $e_{ij}$ is an observation-specific random
variable) induced correlations.
In applications where it is known a priori that observations are correlated within a cluster
one can take a more direct approach. In the model $Y_i = X_i\beta + Z_ib_i + e_i$ assume for the
moment that $b_i = 0$. We are then left with a fixed effects linear model where the errors $e_i$
have variance-covariance matrix $R_i$. If we let $\mathrm{Var}[Y_{ij}] = \phi$ and

$$R_i = \phi\begin{bmatrix}1 & \rho & \cdots & \rho\\
\rho & 1 & \cdots & \rho\\ \vdots & & \ddots & \vdots\\ \rho & \rho & \cdots & 1\end{bmatrix}, \qquad [7.41]$$

then the model also has a compound-symmetric correlation structure. Such a model could
arise with clusters of size four if one draws blood samples from each leg of a heifer and $\rho$
measures the correlation of serum concentration among the samples from a single animal. No
leg takes precedence over any other leg; they are exchangeable, with no particular order. In
this model the within-cluster variance-covariance matrix was targeted directly to capture the
correlations among the observations from the same cluster; we therefore call it direct
modeling of the correlations. Observe that if $\phi = \sigma^2_b + \sigma^2$ there is no difference in the
marginal variability between [7.40] and [7.41]. In §7.6.3 we examine a subsampling design
and show that modeling a random intercept according to [7.40] and modeling the
exchangeable structure [7.41] directly leads to the same inference.
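In proc mixed the two routes correspond to a random statement and a repeated statement,
respectively. The following sketch is added for illustration; the data set and variable names
(clusters, y, x, cluster) are hypothetical:

/* induced: cluster-specific random intercept, marginal structure [7.40] */
proc mixed data=clusters;
  class cluster;
  model y = x;
  random intercept / subject=cluster;
run;
/* direct: exchangeable within-cluster correlation, structure [7.41] */
proc mixed data=clusters;
  class cluster;
  model y = x;
  repeated / subject=cluster type=cs;
run;

The two fits agree with $\phi = \sigma^2_b + \sigma^2$ whenever the common covariance is non-negative;
the direct CS parameterization also permits slightly negative within-cluster correlations,
which the random intercept model cannot accommodate.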
There is also a mixture approach where some correlations are induced through random
effects and the variance-covariance matrix $R_i$ of the within-cluster errors is also structured. In
the linear mixed model with random intercepts above ([7.38]) assume that the measurements
from a cluster are repeated observations collected over time. It is then reasonable to assume
that the measurements from a given cluster are serially correlated, i.e., $R_i$ is not a diagonal
matrix. If it is furthermore sensible to posit that observations close together in time are more
highly correlated than those far apart, one may put

$$R_i = \sigma^2\begin{bmatrix}
1 & \rho & \rho^2 & \rho^3 & \cdots & \rho^{n_i-1}\\
\rho & 1 & \rho & \rho^2 & \cdots & \rho^{n_i-2}\\
\rho^2 & \rho & 1 & \rho & \cdots & \rho^{n_i-3}\\
\rho^3 & \rho^2 & \rho & 1 & \cdots & \rho^{n_i-4}\\
\vdots & & & & \ddots & \vdots\\
\rho^{n_i-1} & \rho^{n_i-2} & \rho^{n_i-3} & \rho^{n_i-4} & \cdots & 1
\end{bmatrix}, \qquad \rho \ge 0.$$

This correlation structure is known as the first-order autoregressive (AR(1)) structure
borrowed from the analysis of time series (see §7.5.2 and §A7.7.6). It is applied frequently if
repeated measurements are equally spaced. For example, if repeated measurements on experi-
mental units in a randomized block design are taken in weeks 0, 2, 4, 6, and 8, the correlation
between any two measurements two weeks apart is $\rho$, between any two measurements four
weeks apart is $\rho^2$, and so forth. Combining this within-cluster correlation structure with
cluster-specific random intercepts leads to a more complicated marginal variance-covariance
structure:

$$\mathrm{Var}[Y_i] = \sigma^2_b J + R_i = \begin{bmatrix}
\sigma^2_b + \sigma^2 & \sigma^2_b + \sigma^2\rho & \sigma^2_b + \sigma^2\rho^2 & \cdots & \sigma^2_b + \sigma^2\rho^{n_i-1}\\
\sigma^2_b + \sigma^2\rho & \sigma^2_b + \sigma^2 & \sigma^2_b + \sigma^2\rho & \cdots & \sigma^2_b + \sigma^2\rho^{n_i-2}\\
\sigma^2_b + \sigma^2\rho^2 & \sigma^2_b + \sigma^2\rho & \sigma^2_b + \sigma^2 & \cdots & \sigma^2_b + \sigma^2\rho^{n_i-3}\\
\vdots & & & \ddots & \vdots\\
\sigma^2_b + \sigma^2\rho^{n_i-1} & \sigma^2_b + \sigma^2\rho^{n_i-2} & \cdots & \cdots & \sigma^2_b + \sigma^2
\end{bmatrix}.$$

Notice that with $\rho > 0$ the within-cluster correlations approach zero with increasing tem-
poral separation and the marginal correlations approach $\sigma^2_b/(\sigma^2_b + \sigma^2)$. Decaying correlations
with increasing separation are typically reasonable. Because mixed models induce correla-
tions through the random effects one can achieve marginal correlations that are functions of a
chosen metric quite simply. If correlations are to depend on time, for example, simply include
a time variable as a column of $Z$. The resulting marginal correlation structure may not be
meaningful, however. We illustrate with an example.

Example 7.5. The growth pattern of an experimental soybean variety is studied. Eight
plots are seeded and the average leaf weight per plot is assessed at weekly intervals fol-
lowing germination. If $t$ measures time since seeding in days, the average growth is
assumed to follow a linear model

$$Y_{ij} = \beta_0 + \beta_1 t_{ij} + \beta_2 t^2_{ij} + e_{ij},$$

where $Y_{ij}$ is the average leaf weight per plant on plot $i$ measured at time $t_{ij}$. The double
subscript for the time variable $t$ allows measurement occasions to differ among plots.
Figure 7.9 shows data for the growth of soybeans on the eight plots simulated after
Figure 1.2 in Davidian and Giltinan (1995). In this experiment, a plot serves as a cluster
and the eight trends differ in their linear gradients (slopes). It is thus reasonable to add
random coefficients to $\beta_1$. The mixed model becomes

$$Y_{ij} = \beta_0 + (\beta_1 + b_{1i})t_{ij} + \beta_2 t^2_{ij} + e_{ij}. \qquad [7.42]$$

[Figure 7.9. Simulated soybean leaf weight profiles fitted to data from eight experimental
plots; leaf weight per plant (g) plotted against days after seeding.]

If the within-cluster errors are independent, reasonable when the leaf weight is obtained
from a random sample of plants from plot $i$ at time $t_{ij}$, the model in matrix formulation
is $Y_i = X_i\beta + Z_ib_i + e_i$, $\mathrm{Var}[e_i] = \sigma^2 I$, with quantities defined as follows:

$$X_i = \begin{bmatrix} 1 & t_{i1} & t^2_{i1}\\ 1 & t_{i2} & t^2_{i2}\\ \vdots & \vdots & \vdots\\ 1 & t_{in_i} & t^2_{in_i} \end{bmatrix}, \quad
\beta = \begin{bmatrix}\beta_0\\ \beta_1\\ \beta_2\end{bmatrix}, \quad
Z_i = \begin{bmatrix} t_{i1}\\ t_{i2}\\ \vdots\\ t_{in_i} \end{bmatrix}, \quad
b_i = b_{1i}, \quad \mathrm{Var}[b_i] = \sigma^2_b.$$

The marginal variance-covariance matrix $\mathrm{Var}[Y_i] = Z_iDZ_i' + \sigma^2 I$ for [7.42] is now

$$\mathrm{Var}[Y_i] = \sigma^2_b\begin{bmatrix}
t^2_{i1} + \sigma^2/\sigma^2_b & t_{i1}t_{i2} & \cdots & t_{i1}t_{in_i}\\
t_{i2}t_{i1} & t^2_{i2} + \sigma^2/\sigma^2_b & \cdots & t_{i2}t_{in_i}\\
\vdots & & \ddots & \vdots\\
t_{in_i}t_{i1} & t_{in_i}t_{i2} & \cdots & t^2_{in_i} + \sigma^2/\sigma^2_b
\end{bmatrix}.$$

The covariance structure depends on the time variable, which is certainly a meaningful
metric for the correlations. Whether the correlations are suitable functions of that metric
can be argued. Between any two leaf weight measurements on the same plot we have

$$\mathrm{Corr}[Y_{ij}, Y_{ij'}] = \frac{\sigma^2_b t_{ij}t_{ij'}}{\sqrt{(\sigma^2_b t^2_{ij} + \sigma^2)(\sigma^2_b t^2_{ij'} + \sigma^2)}}.$$

For illustration let $\sigma^2 = \sigma^2_b = 1$ and the repeated measurements be coded $t_{i1} = 1$,
$t_{i2} = 2$, $t_{i3} = 3$, and so forth. Then $\mathrm{Corr}[Y_{i1}, Y_{i2}] = 2/\sqrt{10} = 0.63$, $\mathrm{Corr}[Y_{i1}, Y_{i3}] =
3/\sqrt{20} = 0.67$, and $\mathrm{Corr}[Y_{i1}, Y_{i4}] = 4/\sqrt{34} = 0.68$. The correlations are not decaying
with temporal separation, they increase. Also, two time points equally spaced apart do
not have the same correlation. For example, $\mathrm{Corr}[Y_{i2}, Y_{i3}] = 6/\sqrt{50} = 0.85$, which
exceeds the correlation between time points 1 and 2. If one were to code time as
$t_{i1} = 0, t_{i2} = 1, t_{i3} = 2, \ldots$, the correlations of any observations with the first time
point would be uniformly zero.
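The correlation matrix implied by the random slope can be computed in a few lines; the
SAS/IML sketch below is an added illustration, not part of the original example, and
reproduces the values quoted above for $\sigma^2 = \sigma^2_b = 1$ and $t = 1, 2, 3, 4$:

proc iml;
   tt    = {1, 2, 3, 4};              /* measurement times                 */
   sig2b = 1;  sig2 = 1;
   V = sig2b*(tt*t(tt)) + sig2*I(4);  /* Var[Y_i] = sig2b*Z*Z' + sig2*I    */
   s = sqrt(vecdiag(V));
   Corr = V / (s*t(s));               /* elementwise: correlation matrix   */
   print Corr;                        /* Corr[1,2]=0.63, Corr[2,3]=0.85    */
quit;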

If correlations are subject to modeling, they should be modeled directly on the within-
cluster level, i.e., through the $R_i$ matrix. Interpretation of the correlation pattern should then
be confined to cluster-specific inference. Trying to pick up correlations by choosing columns
of the $Z_i$ matrix that are functions of the correlation metameter will not necessarily lead to a
meaningful marginal correlation model, or one that can be interpreted with ease.

7.5.2 Within-Cluster Correlation Models


In order to model the within-cluster correlations directly and parsimoniously, we rely on
structures that impose a certain behavior on the within-cluster disturbances while requiring
only a small number of parameters. In this subsection several of these structures are intro-
duced. The modeler must eventually decide which structure is to be used for a final fit of the
model to the data. In our experience it is more important to model the correlation structure in
a reasonable and meaningful way than to model the correlation structure perfectly. When data
are clearly correlated, assuming independence is certainly not a meaningful approach, but one
will find that a rather large class of correlation models will provide similar goodness-of-fit to
the data. The modeler is encouraged to find one member of this class, rather than to go
overboard in an attempt to mold the model too closely to the data at hand. If models are
nested, the likelihood ratio test (§7.4.3) can help to decide which correlation structure fits the
data best. If models are not nested, we rely on goodness-of-fit criteria such as Akaike's
Information Criterion and Schwarz' criterion. If several non-nested models with similar AIC
statistics emerge, we remind ourselves of Ockham's razor and choose the most parsimonious
one, the one with the fewest parameters. Finding an appropriate function that describes the
mean trend, the fixed effects structure, is a somewhat intuitive part of statistical modeling. A
plot of the data suggests a particular linear or nonlinear model which can be fitted to data and
subsequently improved upon. Gaining insight into the correlation pattern of data over time
and/or space is less intuitive. Fortunately, several correlation models have emerged and
proven to work well in many applications. Some of these are borrowed from developments in
time series analysis or geostatistics, others from the theory of designed experiments.


In what follows we assume that the within-cluster variance-covariance matrix is of the
form

$$\mathrm{Var}[e_i] = R_i = R_i(\alpha),$$

where $\alpha$ is a vector of parameters determining the correlations among and the variances of
the elements of $e_i$. In previous notation we labeled $\theta$ the vector of covariance parameters in
$\mathrm{Var}[Y_i] = Z_iDZ_i' + R_i$. Hence $\alpha$ contains only those covariance parameters that are not con-
tained in $\mathrm{Var}[b_i] = D$. In many applications $\alpha$ will be a two-element vector, containing one
parameter to model the within-cluster correlations and one parameter to model the within-
cluster variances.
We would be remiss not to mention approaches which account for serial correlations but
make no assumptions about the structure of $\mathrm{Var}[e_i]$. One is the multivariate repeated measures
approach (Cole and Grizzle 1966, Crowder and Hand 1990, Vonesh and Chinchilli 1997),
sometimes labeled multivariate analysis of variance (MANOVA). This approach is restrictive
if data are unbalanced or missing or covariates are varying with time. The mixed model
approach based on the Laird-Ware model is more general in that it allows clusters of unequal
sizes, unequal spacing of observation times or locations, and missing observations. The
MANOVA approach essentially uses an unstructured variance-covariance matrix, which is
one of the structures open to investigation in the Laird-Ware model (Jennrich and Schluchter
1986).

An $(n \times n)$ covariance matrix contains $n(n+1)/2$ unique elements, and if a cluster
contains 6 measurements, say, up to $6 \cdot 7/2 = 21$ parameters need to be estimated in addition
to the fixed effects parameters and variances of any random effects. Imposing structure on the
variance-covariance matrix beyond an unstructured model requires far fewer parameters. In
the remainder of this section it is assumed for the sake of simplicity that a cluster contains
$n_i = 4$ elements and that measurements are collected in time. Some of these structures will
reemerge when we are concerned with spatial data in §9.
The simplest covariance structure is the independence structure,

$$\mathrm{Var}[e_i] = \sigma^2 I = \sigma^2\begin{bmatrix}1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1\end{bmatrix}. \qquad [7.43]$$

Apart from the scalar $\sigma^2$ no additional parameters need to be estimated. The compound-
symmetric or exchangeable structure

$$R_i(\alpha) = \sigma^2\begin{bmatrix}1 & \rho & \rho & \rho\\ \rho & 1 & \rho & \rho\\ \rho & \rho & 1 & \rho\\ \rho & \rho & \rho & 1\end{bmatrix}, \qquad \alpha = [\sigma^2, \rho], \qquad [7.44]$$

is suitable if measurements are equicorrelated and exchangeable. If there is no particular
ordering among the correlated measurements a compound-symmetric structure may be
reasonable. We reiterate that a split-plot or subsampling design also has a compound-sym-
metric marginal correlation structure. In proc mixed one can thus analyze a split-plot experi-
ment in two equivalent ways. The first is to use independent random effects representing
ment in two equivalent ways. The first is to use independent random effects representing


whole-plot and sub-plot experimental errors. If the whole-plot design is a completely
randomized design, the proc mixed statements are
proc mixed data=yourdata;
class rep A B;
model y = A B A*B;
random rep(A);
run;

It appears that only one random effect has been specified, the whole-plot experimental
error rep(A). Proc mixed will add a second, residual error term automatically, which corre-
sponds to the sub-plot experimental error. The default containment method of assigning
degrees of freedom will ensure that $F$ tests are formulated correctly, that is, whole-plot
effects are tested against the whole-plot experimental error variance and sub-plot effects and
interactions are tested against the sub-plot experimental error variance. If comparisons of the
treatment means are desired, we recommend adding the ddfm=satterth option to the model
statement. This will invoke the Satterthwaite approximation where necessary, for example,
when comparing whole-plot treatments at the same level of the sub-plot factor (see §7.6.5 for
an example):
proc mixed data=yourdata;
class rep A B;
model y = A B A*B / ddfm=satterth;
random rep(A);
run;

As detailed in §7.5.3 this error structure will give rise to a marginal compound-
symmetric structure. The direct approach of modeling a split-plot design is by specifying the
compound-symmetric model through the repeated statement:
proc mixed data=yourdata;
class rep A B;
model y = A B A*B / ddfm=satterth;
repeated / subject=rep(A) type=cs;
run;

In longitudinal or repeated measures studies where observations are ordered along a time scale, the compound symmetry structure is often not reasonable. Correlations for pairs of time points are not the same and hence not exchangeable. The analysis of repeated measures data as split-plot-type designs assuming compound symmetry is, however, very common in practice. In §7.5.3 we examine under which conditions this is an appropriate analysis. The first, rather crude, modification is not to specify anything about the structure of the correlation matrix and to estimate all unique elements of R_i(θ). This unstructured variance-covariance matrix can be expressed as
$$
\mathbf{R}_i(\boldsymbol{\theta}) =
\begin{bmatrix}
\sigma_1^2 & \sigma_{12} & \sigma_{13} & \sigma_{14} \\
\sigma_{21} & \sigma_2^2 & \sigma_{23} & \sigma_{24} \\
\sigma_{31} & \sigma_{32} & \sigma_3^2 & \sigma_{34} \\
\sigma_{41} & \sigma_{42} & \sigma_{43} & \sigma_4^2
\end{bmatrix}, \qquad \boldsymbol{\theta} = [\sigma_1^2, \ldots, \sigma_4^2, \sigma_{12}, \sigma_{13}, \ldots, \sigma_{34}]. \qquad [7.45]
$$

Here, σ_jj′ = σ_j′j is the covariance between observations j and j′ within a cluster. This is not a parsimonious structure; there are n_i(n_i + 1)/2 parameters that need to be estimated.


In many repeated measures data sets where the sequence of temporal observations is short, insufficient information is available to estimate all the correlations with satisfactory precision. Furthermore, there is no guarantee that the correlations will decrease with temporal separation. If that is a reasonable stipulation, other models should be employed. The unstructured model is fit by proc mixed with the type=un option of the repeated statement.
The large number of parameters in the unstructured parameterization can be reduced by introducing constraints. For example, one may assume that all c-step correlations are identical,

$$\rho_{jj'} = \rho_c \quad \text{if } |j - j'| = c.$$

This leads to banded, also called Toeplitz, structures if the diagonal elements are the same. A Toeplitz matrix of order k has k − 1 off-diagonals filled with the same element. A 2-banded Toeplitz parameterization is

$$
\mathbf{R}_i(\boldsymbol{\theta}) = \sigma^2
\begin{bmatrix}
1 & \rho_1 & 0 & 0 \\
\rho_1 & 1 & \rho_1 & 0 \\
0 & \rho_1 & 1 & \rho_1 \\
0 & 0 & \rho_1 & 1
\end{bmatrix}, \qquad \boldsymbol{\theta} = [\sigma^2, \rho_1], \qquad [7.46]
$$

and a 3-banded Toeplitz structure

$$
\mathbf{R}_i(\boldsymbol{\theta}) = \sigma^2
\begin{bmatrix}
1 & \rho_1 & \rho_2 & 0 \\
\rho_1 & 1 & \rho_1 & \rho_2 \\
\rho_2 & \rho_1 & 1 & \rho_1 \\
0 & \rho_2 & \rho_1 & 1
\end{bmatrix}, \qquad \boldsymbol{\theta} = [\sigma^2, \rho_1, \rho_2]. \qquad [7.47]
$$

A 2-banded Toeplitz structure may be appropriate if, for example, measurements are taken at weekly intervals, but correlations do not extend past a period of seven days. In a turfgrass experiment where mowing clippings are collected weekly but a fast-acting growth regulator is applied every ten days, an argument can be made that correlations do not persist over more than two measurement intervals.
Unstructured correlation models can also be banded by setting elements in off-diagonal cells more than k − 1 positions from the main diagonal to zero. The 2-banded unstructured parameterization is
$$
\mathbf{R}_i(\boldsymbol{\theta}) =
\begin{bmatrix}
\sigma_1^2 & \sigma_{12} & 0 & 0 \\
\sigma_{21} & \sigma_2^2 & \sigma_{23} & 0 \\
0 & \sigma_{32} & \sigma_3^2 & \sigma_{34} \\
0 & 0 & \sigma_{43} & \sigma_4^2
\end{bmatrix}, \qquad \boldsymbol{\theta} = [\sigma_1^2, \ldots, \sigma_4^2, \sigma_{12}, \sigma_{23}, \sigma_{34}], \qquad [7.48]
$$

and the 3-banded unstructured is

$$
\mathbf{R}_i(\boldsymbol{\theta}) =
\begin{bmatrix}
\sigma_1^2 & \sigma_{12} & \sigma_{13} & 0 \\
\sigma_{21} & \sigma_2^2 & \sigma_{23} & \sigma_{24} \\
\sigma_{31} & \sigma_{32} & \sigma_3^2 & \sigma_{34} \\
0 & \sigma_{42} & \sigma_{43} & \sigma_4^2
\end{bmatrix}, \qquad \boldsymbol{\theta} = [\sigma_1^2, \ldots, \sigma_4^2, \sigma_{12}, \sigma_{23}, \sigma_{34}, \sigma_{13}, \sigma_{24}]. \qquad [7.49]
$$

The 1-banded unstructured matrix is appropriate for independent observations which differ in


their variances, i.e., heteroscedastic data:


$$
\mathbf{R}_i(\boldsymbol{\theta}) =
\begin{bmatrix}
\sigma_1^2 & 0 & 0 & 0 \\
0 & \sigma_2^2 & 0 & 0 \\
0 & 0 & \sigma_3^2 & 0 \\
0 & 0 & 0 & \sigma_4^2
\end{bmatrix}, \qquad \boldsymbol{\theta} = [\sigma_1^2, \ldots, \sigma_4^2]. \qquad [7.50]
$$

These structures are fit in proc mixed with the following options of the repeated statement:
type=Toep(2) /* 2-banded Toeplitz */
type=Toep(3) /* 3-banded Toeplitz */
type=un(2) /* 2-banded unstructured */
type=un(3) /* 3-banded unstructured */
type=un(1) /* 1-banded unstructured */.
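For concreteness, a minimal sketch of such a fit follows; the data set yourdata and the variables unit (cluster identifier), time, and y are placeholder names, not part of the example above. A 2-banded Toeplitz structure is requested as:

   proc mixed data=yourdata;
      class unit time;
      model y = time;
      repeated time / subject=unit type=toep(2);
   run;

Swapping type=toep(2) for any of the other options listed changes only the assumed covariance structure; the remainder of the program is unaffected.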

We now turn to models where the correlations decrease with temporal separation of the measurements. One of the more popular models is borrowed from the analysis of time series data. Assume that a present observation at time t, Y(t), is related to the immediately preceding observation at time t − 1 through the relationship

$$Y(t) = \rho\,Y(t-1) + e(t). \qquad [7.51]$$

The e(t)'s are uncorrelated, identically distributed random variables with mean 0 and variance σ_e², and ρ is the autoregressive parameter of the time-series model. In the vernacular of time series analysis the e(t) are called the random innovations (random shocks) of the process. This model is termed the first-order autoregressive (AR(1)) time series model since an outcome Y(t) is regressed on the immediately preceding observation.
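Series such as those in Figure 7.10 below are easily generated. The following data step is a sketch only (the data set name, seed, and series length are arbitrary choices) that simulates [7.51] with σ_e² = 0.2 and ρ = 0.5:

   /* sketch: simulate one first-order autoregressive series, model [7.51] */
   data ar1;
      rho = 0.5;                               /* autoregressive parameter  */
      y   = 0;                                 /* start at the process mean */
      do t = 1 to 100;
         y = rho*y + sqrt(0.2)*rannor(45371);  /* innovation variance 0.2   */
         output;
      end;
   run;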

[Figure 7.10 here: three time series traces over t = 0, …, 100; panels a) ρ = 0, b) ρ = 0.5, c) ρ = 1.0.]

Figure 7.10. Realizations of first-order autoregressive time series with σ_e² = 0.2. a) white noise process; b) ρ = 0.5; c) random walk.


Figure 7.10 shows three realizations of first-order autoregressive models with ρ = 0, 0.5, 1.0, and mean 0. As ρ increases, series of positive and negative deviations from the mean become longer, a sign of positive serial autocorrelation. To study the correlation structure implied by this and other models we examine the covariance and correlation function of the process (see §2.5.1 and §9). For a stationary AR(1) process we have −1 < ρ < 1 and the function

$$C(k) = \mathrm{Cov}[Y(t), Y(t-k)]$$

measures the covariance between two observations k time units apart. It is appropriately called the covariance function. Note that C(0) = Cov[Y(t), Y(t)] = Var[Y(t)] is the variance of an observation. Under stationarity, the covariances C(k) depend on the temporal separation only, not on the time origin. Time points spaced five units apart are correlated by the same amount, whether the first time point was a Monday or a Wednesday. The correlation function R(k) is written simply as R(k) = C(k)/C(0).
The covariance and correlation function of the AR(1) time series [7.51] are derived in §A7.7.6. The elementary recursive relationship is C(k) = ρC(k − 1) = ρ^k C(0) for k > 0. Rearranging, one obtains the correlation function as R(k) = C(k)/C(0) = ρ^k. The autoregressive parameter thus measures the strength of the correlation of observations one time unit apart, the lag-one correlation. For longer temporal separation the correlation is a power of ρ where the exponent equals the number of temporal lags. Since the lags k are discrete, so is the correlation function. For positive ρ the correlations step down every time k increases (Figure 7.11); for negative ρ, positive and negative correlations alternate, eventually converging to zero. Negative correlations of adjoining observations are not the norm and are usually indicative of an incorrect model for the mean function. Jones (1993, p. 54) cites as an example a process with circadian rhythm. If daily measurements are taken early in the morning and late at night, it is likely that the daily rhythm affects the response and must be accounted for in the mean function. Failure to do so may result in model errors with negative serial correlation.
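The recursion is immediate from [7.51]: for k > 0, take the covariance of both sides with Y(t − k) and note that the innovation e(t) is uncorrelated with earlier observations,

$$C(k) = \mathrm{Cov}[\rho\,Y(t-1) + e(t),\; Y(t-k)] = \rho\,C(k-1) = \cdots = \rho^k\,C(0).$$

With ρ = 0.7, for example, the implied correlations are R(1) = 0.7, R(2) = 0.49, R(3) = 0.343, the stepped decline visible in Figure 7.11.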

[Figure 7.11 here: correlation C(k)/C(0) plotted against lag k = 0, …, 10 for ρ = 0.7, 0.3, and 0.1.]

Figure 7.11. Correlation functions for first-order autoregressive processes with ρ = 0.1, 0.3, and 0.7.


In terms of the four-element cluster i and its variance-covariance matrix R_i(θ) the AR(1) process is depicted as

$$
\mathbf{R}_i(\boldsymbol{\theta}) = \sigma^2
\begin{bmatrix}
1 & \rho & \rho^2 & \rho^3 \\
\rho & 1 & \rho & \rho^2 \\
\rho^2 & \rho & 1 & \rho \\
\rho^3 & \rho^2 & \rho & 1
\end{bmatrix}, \qquad \boldsymbol{\theta} = [\sigma^2, \rho], \qquad [7.52]
$$

where C(0) = σ². The AR(1) model in longitudinal data analysis dates back to Potthoff and Roy (1964) and is popular for several reasons. The model is parsimonious; the correlation matrix is defined by a single parameter ρ. The model is easy to fit to data. Numerical problems in iterative likelihood or restricted likelihood estimation of θ can often be reduced by specifying an AR(1) correlation model rather than some of the more complicated models below. Missing observations within a cluster are not a problem. For example, if it was planned to take measurements at times 1, 2, 3, 4 but the third measurement was unavailable or destroyed, the row and column associated with the third observation are simply deleted from the correlation matrix:

$$
\mathbf{R}_i(\boldsymbol{\theta}) = \sigma^2
\begin{bmatrix}
1 & \rho & \rho^3 \\
\rho & 1 & \rho^2 \\
\rho^3 & \rho^2 & 1
\end{bmatrix}.
$$

The AR(1) model is fit in proc mixed with the type=ar(1) option of the repeated statement.
The actual measurement times do not enter the correlation matrix in the AR(1) process with discrete lag, only information about whether a measurement occurred after or before another measurement. It is sometimes labeled a discrete autoregressive process for this reason, and implicit is the assumption that the measurements are equally spaced. Sometimes there is no basic interval at which observations are taken and observations are unequally spaced. In this case the underlying metric of sampling within a cluster must be continuous. Note that unequal spacing is not a sufficient condition to distinguish continuous from discrete processes. Even if the measurements are equally spaced, the underlying metric may still be continuous. The leaf area of perennial flowers in a multiyear study collected on a few days throughout the years at irregular intervals should not be viewed as discrete daily data with most observations missing, but as unequally spaced observations collected in continuous time with no observations missing. For unequally spaced time intervals the AR(1) model is not the best choice. Assume measurements were gathered at days 1, 2, 6, 11. The AR(1) model assumes that the correlation between the first and second measurement (spaced one day apart) equals that between the second and third measurement (spaced four days apart). Although there are two pairs of measurements with lag 5, their correlations are not identical. The correlation between the day 1 and day 6 measurement is ρ², that between the day 6 and the day 11 measurement is ρ.
With unequally spaced data the actual measurement times should be taken into account in the correlation model. Denote by t_ij the jth time at which cluster i was observed. We allow these time points to vary from cluster to cluster. The continuous analog of the discrete AR(1) process has correlation function

$$C(k)/C(0) = \exp\left\{-\frac{|t_{ij} - t_{ij'}|}{\phi}\right\} \qquad [7.53]$$

and is called the continuous AR(1) or the exponential correlation model (Diggle 1988, 1990; Jones and Boadi-Boateng 1991; Jones 1993; Gregoire et al. 1995). The lag between two observations is measured as the absolute difference of the measurement times t_ij and t_ij′, k = |t_ij − t_ij′|. Denoting C(0) = σ² yields the covariance function

$$C(k) = \sigma^2 \exp\left\{-\frac{|t_{ij} - t_{ij'}|}{\phi}\right\}. \qquad [7.54]$$

For stationarity, it is required that φ > 0, restricting correlations to be positive (Figure 7.12).


The parameter φ determines the strength of the correlations, but should not be interpreted as a correlation coefficient. Instead it is related to the practical range of the temporal process, that is, the time separation at which the correlations have almost vanished. The practical range is usually chosen to be the point at which C(k)/C(0) = 0.05. In the continuous AR(1) model [7.54] the practical range equals 3φ (Figure 7.12). The range of a stationary stochastic process is the lag at which the correlations are exactly zero. The exponential model achieves this value only asymptotically for |t_ij − t_ij′| → ∞. The within-cluster variance-covariance matrix for the continuous AR(1) model becomes

$$
\mathbf{R}_i(\boldsymbol{\theta}) = \sigma^2
\begin{bmatrix}
1 & e^{-|t_{i1}-t_{i2}|/\phi} & e^{-|t_{i1}-t_{i3}|/\phi} & e^{-|t_{i1}-t_{i4}|/\phi} \\
e^{-|t_{i2}-t_{i1}|/\phi} & 1 & e^{-|t_{i2}-t_{i3}|/\phi} & e^{-|t_{i2}-t_{i4}|/\phi} \\
e^{-|t_{i3}-t_{i1}|/\phi} & e^{-|t_{i3}-t_{i2}|/\phi} & 1 & e^{-|t_{i3}-t_{i4}|/\phi} \\
e^{-|t_{i4}-t_{i1}|/\phi} & e^{-|t_{i4}-t_{i2}|/\phi} & e^{-|t_{i4}-t_{i3}|/\phi} & 1
\end{bmatrix}, \qquad \boldsymbol{\theta} = [\sigma^2, \phi]. \qquad [7.55]
$$

[Figure 7.12 here: correlation C(k)/C(0) plotted against lag |t_ij − t_ij′| = 0, …, 6 for φ = 3, 2, 1.5, and 1.]

Figure 7.12. Correlation functions of continuous AR(1) processes (exponential models).
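The practical range of 3φ quoted above follows from a one-line calculation: set the correlation in [7.54] equal to 0.05 and solve for the lag,

$$\exp\{-k/\phi\} = 0.05 \iff k = \phi\,\ln(20) \approx 3\phi.$$

The same calculation with the squared exponent of the gaussian model introduced below yields k = φ√(ln 20) ≈ φ√3.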

The magnitude of φ depends on the units in which time was measured. Since the temporal lags and φ appear in the exponent of the correlation matrix, numerical overflows or


underflows can occur depending on the temporal units used. For example, if time was measured in seconds and the software package fails to report an estimate of φ, the iterative estimation algorithms can be helped by rescaling the time variable into minutes or hours. In proc mixed of The SAS® System this correlation model is fit with the type=sp(exp)(time) option of the repeated statement. sp() denotes the family of spatial correlation models proc mixed provides; sp(exp) is the exponential model. In the second set of parentheses are listed the numeric variables in the data set that contain the coordinate information. In the spatial context, one would list the longitude and latitude of the sample locations, e.g., type=sp(exp)(xcoord ycoord). In the temporal setting only one coordinate is needed, the time of measurement.
A reparameterization of the exponential model is obtained by setting exp{−1/φ} = ρ:

$$C(k)/C(0) = \rho^{\,|t_{ij} - t_{ij'}|}. \qquad [7.56]$$

SAS® calls this the power model. This terminology is unfortunate because the power model in spatial data analysis is known as a different covariance structure (see §9.2.2). The specification as type=sp(pow)(time) in SAS® as a spatial covariance structure suggests that sp(pow) refers to the spatial power model. Instead, it refers to [7.56]. It is our experience that numerical difficulties encountered when fitting the exponential model in the form [7.54] can often be overcome by changing to the parameterization [7.56]. From [7.56] the close resemblance of the continuous and discrete AR(1) models is readily established. If observations are equally spaced, that is,

$$|t_{ij} - t_{ij'}| = c\,|j - j'|$$

for some constant c, the continuous and discrete autoregressive correlation models produce the same correlation function at lag k.
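A sketch of fitting the power parameterization [7.56] to unequally spaced data follows; yourdata, unit, and y are placeholder names, and day is a numeric variable holding the actual measurement times:

   proc mixed data=yourdata;
      class unit;
      model y = day;
      repeated / subject=unit type=sp(pow)(day);
   run;

Because day enters as a coordinate rather than a classification variable, unequally spaced measurement times are handled automatically.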
A second model for continuous, equally or unequally spaced observations is called the gaussian model (Figure 7.13). It differs from the exponential model only in the square of the exponent,

$$C(k)/C(0) = \exp\left\{-\frac{(t_{ij} - t_{ij'})^2}{\phi^2}\right\}. \qquad [7.57]$$

The name must not imply that the gaussian correlation model deserves similar veneration as the Gaussian probability model. Stein (1999, p. 25) points out that “Nothing could be farther from the truth.” The practical range for the gaussian correlation model is φ√3. For the same practical range as in the exponential model, correlations are more persistent over short ranges; they decrease less rapidly (compare the model with φ = 6/√3 in Figure 7.13 to that with φ = 2 in Figure 7.12). Stochastic processes whose autocorrelation follows the gaussian model are highly continuous and smooth (see §9.2.3). It is difficult to imagine physical processes of this kind. We use lowercase spelling when referring to the correlation model to avoid confusion with the Gaussian distribution. From where does the model get its name? A stochastic process with covariance function

$$C(k) = c\,\exp\{-\beta k^2\},$$

which is a slight reparameterization of [7.57] (β = 1/φ²), has spectral density

$$f(\omega) = \frac{1}{2}\,\frac{c}{\sqrt{\pi\beta}}\,\exp\{-\omega^2/4\beta\},$$

which resembles in functional form the Gaussian probability density function. The gaussian model is fit in proc mixed with the type=sp(gau)(time) option of the repeated statement.

[Figure 7.13 here: correlation C(k)/C(0) plotted against lag |t_ij − t_ij′| = 0, …, 7 for φ = 6·3^(−1/2), 2, 1.5, and 1.]

Figure 7.13. Correlation functions in the gaussian correlation model. The model with φ = 6/√3 has the same practical range as the model with φ = 2 in Figure 7.12.

A final correlation model for data in continuous time with or without equal spacing is the spherical model

$$
C(k)/C(0) =
\begin{cases}
1 - \dfrac{3}{2}\left(\dfrac{|t_{ij}-t_{ij'}|}{\phi}\right) + \dfrac{1}{2}\left(\dfrac{|t_{ij}-t_{ij'}|}{\phi}\right)^3 & |t_{ij}-t_{ij'}| \le \phi \\[1.5ex]
0 & |t_{ij}-t_{ij'}| > \phi.
\end{cases} \qquad [7.58]
$$

In contrast to the exponential and gaussian model, the spherical structure has a true range. At lag φ the correlation is exactly zero and remains zero thereafter (Figure 7.14). The spherical model is less smooth than the gaussian correlation model but more so than the exponential model. The spherical model is probably the most popular model for autocorrelated data in geostatistical applications (see §9.2.2). To Stein (1999, p. 52), this popularity is a mystery that he attributes to the simple functional form and the “mistaken belief that there is some statistical advantage in having the autocorrelation function being exactly 0 beyond some finite distance.” This correlation model is fit in proc mixed with the type=sp(sph)(time) option of the repeated statement.

The exponential, gaussian, and spherical models for processes in continuous time as well as the discrete AR(1) model assume stationarity of the variance of the within-cluster errors. Models that allow for heterogeneous variances, such as the unstructured models, have many parameters. A class of flexible correlation models which allows for nonstationarity of the within-cluster variances and changes in the correlations without parameter proliferation was first conceived by Gabriel (1962) and is known as the ante-dependence models. Both continuous and discrete versions of ante-dependence models exist, each in different orders.

[Figure 7.14 here: correlation C(k)/C(0) plotted against lag |t_ij − t_ij′| = 0, …, 7 for spherical models with φ = 9, 6, 4.5, and 3.]

Figure 7.14. Spherical correlation models with ranges equal to the practical ranges for the
exponential models in Figure 7.12.

Following Kenward (1987) and Machiavelli and Arnold (1994), the discrete version of a first-order ante-dependence model can be expressed as

$$
\mathbf{R}_i(\boldsymbol{\theta}) =
\begin{bmatrix}
\sigma_1^2 & \sigma_1\sigma_2\rho_1 & \sigma_1\sigma_3\rho_1\rho_2 & \sigma_1\sigma_4\rho_1\rho_2\rho_3 \\
\sigma_2\sigma_1\rho_1 & \sigma_2^2 & \sigma_2\sigma_3\rho_2 & \sigma_2\sigma_4\rho_2\rho_3 \\
\sigma_3\sigma_1\rho_2\rho_1 & \sigma_3\sigma_2\rho_2 & \sigma_3^2 & \sigma_3\sigma_4\rho_3 \\
\sigma_4\sigma_1\rho_3\rho_2\rho_1 & \sigma_4\sigma_2\rho_3\rho_2 & \sigma_4\sigma_3\rho_3 & \sigma_4^2
\end{bmatrix}. \qquad [7.59]
$$

Zimmerman and Núñez-Antón (1997) termed it AD(1). For cluster size n_i the discrete AD(1) model contains 2n_i − 1 parameters, but offers nearly the flexibility of a completely unstructured model with n_i(n_i + 1)/2 parameters. Zimmerman and Núñez-Antón (1997) discuss extensions to accommodate continuous correlation processes in ante-dependence models. Their continuous first-order model — termed structured ante-dependence model (SAD(1)) — is given by

$$
\begin{aligned}
\mathrm{Corr}[e_{ij}, e_{ij'}] &= \rho^{\,f(t_{ij'},\boldsymbol{\lambda}) - f(t_{ij},\boldsymbol{\lambda})}, \quad j' > j \\
\mathrm{Var}[e_{ij}] &= \sigma^2 g(t_{ij}, \boldsymbol{\psi}), \quad j = 1, \ldots, n_i,
\end{aligned} \qquad [7.60]
$$

where 0 a>34 ß -b and 1a>34 ß <b are functions of the measurement times or locations which
depend on parameter vectors - and <. Zimmerman and Núñez-Antón advocate choosing 0 a•b
from the family of Box-Cox transformations (see §5.6.2)
ˆ>-34  "‰Î- -Á!
0 a>34 ß -b œ œ .
lne>34 f -œ!

If λ = 1 and g(t_ij, ψ) = 1, the power correlation model results. See Zimmerman and Núñez-


Antón (1997) for higher order ante-dependence models and Machiavelli and Arnold (1994)
for variable-order models. The SADab models alleviate some shortcomings of the stationary
continuous models. In growth studies, variability often increases with time. Heteroscedas-
ticity of the within-cluster residuals is already incorporated in ante-dependence models. Also,
equidistant observations do not necessarily have the same correlation as is implied by the
stationary models discussed above. The discrete first-order ante-dependence model can be fit
with proc mixed as type=ante(1).
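As with the structures above, only the type= option changes; a sketch with the same placeholder names used earlier:

   proc mixed data=yourdata;
      class unit time;
      model y = time;
      repeated time / subject=unit type=ante(1);
   run;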

7.5.3 Split-Plots, Repeated Measures, and the Huynh-Feldt Conditions
On the surface, repeated measures experiments show similarities to split-plot type designs, which frequently prompts research workers to analyze repeated measures data as if it had arisen in a split design. Figure 7.15 shows a single replicate with three treatments (A₁, A₂, A₃) and a four-level split in a genuine split-plot design (Figure 7.15a) and a repeated measures design (Figure 7.15b).

[Figure 7.15 here. a) Replicate I of the split-plot design: whole-plots A1, A3, A2 with sub-plot orders B1 B3 B4 B2; B4 B2 B1 B3; B4 B3 B2 B1. b) Replicate I of the repeated measures design: units A1, A3, A2, each measured at T1 T2 T3 T4.]

Figure 7.15. Replicate in split-plot design with three levels of the whole-plot and four levels of the sub-plot treatment factor (a) and single block in repeated measures design with three treatments and four re-measurements (b).

Both replicates have the same number of observations. For each whole-plot there are four sub-plot treatments in the split-plot design and four remeasurements in Figure 7.15b. The split-plot analysis proceeds by calculating the analysis of variance of the design and then formulating appropriate test statistics to test for factor A main effects, B main effects, and A × B interactions, followed by tests of treatment contrasts or other post-ANOVA procedures. The analysis of variance table for the split-plot design with r replicates is based on the linear mixed model

$$Y_{ijk} = \mu + \rho_j + \alpha_i + e^*_{ij} + \beta_k + (\alpha\beta)_{ik} + e_{ijk},$$

where the ρ_j (j = 1, …, r) are the whole-plot replication (block) effects, the α_i (i = 1, …, a) are the whole-plot treatment effects, e*_ij is the whole-plot experimental error, the β_k (k = 1, …, b) are the sub-plot treatment effects, the (αβ)_ik are the interactions, and e_ijk denotes the sub-plot experimental errors. Letting σ*² denote the whole-plot experimental error variance and σ² the sub-plot experimental error variance, Table 7.6 shows the analysis of variance and expected mean squares.


From the expected mean squares it is seen that the appropriate test statistics for main effects and interaction tests are

① A main effect: F_obs = MSA/MSE(A)
② B main effect: F_obs = MSB/MSE(B)
③ A × B interaction: F_obs = MSAB/MSE(B).

Would it be incorrect to perform a similar analysis in the repeated measures setting, labeling factor B in Table 7.6 as Time and testing

① A main effect: F_obs = MSA/MSE(A)
② Time main effect: F_obs = MSTime/MSE(Time)
③ A × Time interaction: F_obs = MS(A × Time)/MSE(Time)?

Table 7.6. Analysis of variance table of a standard split-plot design (whole-plot factor arranged in a randomized complete block design with r replicates)

  Source       df                 Mean Square   Expected Mean Square
  Replicates   r − 1
  A            a − 1              MSA           σ² + bσ*² + rb Σ_i α_i² / (a − 1)
  Error(A)     (r − 1)(a − 1)     MSE(A)        σ² + bσ*²
  B            b − 1              MSB           σ² + ra Σ_k β_k² / (b − 1)
  A × B        (a − 1)(b − 1)     MSAB          σ² + r Σ_{i,k} (αβ)²_ik / ((a − 1)(b − 1))
  Error(B)     a(r − 1)(b − 1)    MSE(B)        σ²
  Total        arb − 1

First we observe that the sub-plot treatments are randomized to the whole-plots, but the repeated measurements cannot be randomized. Time point T₁ occurs before T₂, which occurs before T₃, and so forth. Is this difference substantial enough to throw off the analysis, though? To approach an answer it is worthwhile to study the correlation pattern that the split-plot design implies. Whole-plot errors are independent due to randomization of the whole-plot treatments and so are sub-plot errors. Independent randomizations to whole- and sub-plots also establish that Cov[e*_ij, e_ijk] = 0. Since sub-plot errors are nested within whole-plots, this is the same setting as in the random intercept model of §7.5.1 and one arrives at a compound-symmetric structure for the observations from the same whole-plot:

$$
\begin{aligned}
\mathrm{Cov}[Y_{ijk}, Y_{ijk'}] &= \mathrm{Cov}\left[e^*_{ij} + e_{ijk},\; e^*_{ij} + e_{ijk'}\right] = \sigma^2_* \\[0.5ex]
\mathrm{Corr}[Y_{ijk}, Y_{ijk'}] &= \frac{\mathrm{Cov}[Y_{ijk}, Y_{ijk'}]}{\sqrt{\mathrm{Var}[Y_{ijk}]\,\mathrm{Var}[Y_{ijk'}]}} = \frac{\sigma^2_*}{\sigma^2_* + \sigma^2}; \\[0.5ex]
\mathrm{Var}[\mathbf{Y}_{ij}] &= \sigma^2_*\mathbf{J}_b + \sigma^2\mathbf{I}_b.
\end{aligned}
$$

Compound symmetry implies exchangeability of the observations within a whole-plot. Exchangeability is appealing since the sub-plot treatments were randomized. No particular

ordering or arrangement of treatments within a whole-plot is given preference. This result sheds light on the appropriateness of a split-plot type analysis for repeated measures data. Only if the correlations between any two time points T_k and T_k′ are identical will the correlation structure be exchangeable. In that case the ordering of time points is not material and the nonrandomized ordering T₁, T₂, T₃, … will lead to a correct analysis under the split-plot model. There are other correlation structures besides compound symmetry for which repeated measures analysis with a split-plot model is appropriate. These structures are defined through conditions on the variance-covariance matrices known as the Huynh-Feldt conditions (Huynh and Feldt 1970). That compound symmetry satisfies the Huynh-Feldt conditions was established by Geisser and Greenhouse (1958). The Huynh-Feldt conditions are more general than compound symmetry, however. Assume the repeated measures analysis is based on the model

$$Y_{ijk} = \mu + \rho_j + \alpha_i + e^*_{ij} + \tau_k + (\alpha\tau)_{ik} + e_{ijk}, \quad k = 1, \ldots, t,$$

where τ_k represents the time effects and (ατ)_ik the treatment × time interactions. A test of H₀: τ₁ = … = τ_t via a regular analysis of variance F test of the form

$$F_{obs} = \frac{MS\,Time}{MSE(Time)}$$

is valid if the variance-covariance matrix of the “sub-plot” errors e_ij = [e_ij1, …, e_ijt]′ can be expressed in the form
$$\mathrm{Var}[\mathbf{e}_{ij}] = \lambda\mathbf{I}_t + \boldsymbol{\gamma}\mathbf{1}_t' + \mathbf{1}_t\boldsymbol{\gamma}',$$

where γ is a vector of parameters and λ is a constant. Similarly, the F test for the whole-plot factor, H₀: α₁ = … = α_a, is valid if the whole-plot errors e*_i = [e*_i1, …, e*_ia]′ have a variance-covariance matrix which can be expressed as

$$\mathrm{Var}[\mathbf{e}^*_i] = \lambda^*\mathbf{I}_a + \boldsymbol{\gamma}^*\mathbf{1}_a' + \mathbf{1}_a\boldsymbol{\gamma}^{*\prime}.$$

The variance-covariance matrix is then said to meet the Huynh-Feldt conditions. We note in passing that a correct analysis via split-plot ANOVA requires that the condition is met for every random term in the model (Milliken and Johnson 1992, p. 325). Two special cases of variance-covariance matrices that meet the Huynh-Feldt conditions are independence and compound symmetry. The combination of λ = σ² and γ = 0 yields the independence structure σ²I; the combination of λ = σ² and γ = ½[σ_t², …, σ_t²]′ yields a compound-symmetric structure σ²I + σ_t²J.
Analyzing repeated measures data with split-plot models implicitly assumes a compound-symmetric or Huynh-Feldt structure, which may not be appropriate. If correlations decay over time, for example, the compound-symmetric model is not a reasonable correlation model. Two different courses of action can then be taken. One relies on making adjustments to the degrees of freedom for test statistics in the split-plot analysis; the other focuses on modeling the variance-covariance structure of the within-cluster errors. We comment on the adjustment method first.
Box (1954b) developed the measure ε for the deviation from the Huynh-Feldt conditions. ε is bounded between (t − 1)⁻¹ and 1, where t is the number of repeated measurements. The degrees of freedom for tests of Time main effects and Treatment × Time interactions are


adjusted by the multiplicative factor ε. The standard F statistic

$$MS\,Time\,/\,MSE(Time)$$

for the Time main effect is not compared against the critical value F_{α; t−1, a(r−1)(t−1)} but against F_{α; ε(t−1), εa(r−1)(t−1)}. Unfortunately, ε is unknown. A conservative approach is to set ε equal to its lower bound. The critical value then would be F_{α; 1, a(r−1)} for the Time main effect test and F_{α; a−1, a(r−1)} for the test of Treatment × Time interactions. The reduction in test power that results from the conservative adjustment is disconcerting. Less power is sacrificed by estimating ε from the data. The estimator is discussed in Milliken and Johnson (1992). Let

$$\hat{g}_{kk'} = \frac{1}{a(r-1)}\sum_{i,j}\left(y_{ijk} - \bar{y}_{ij\cdot} - \bar{y}_{i\cdot k} + \bar{y}_{i\cdot\cdot}\right)\left(y_{ijk'} - \bar{y}_{ij\cdot} - \bar{y}_{i\cdot k'} + \bar{y}_{i\cdot\cdot}\right).$$
+a<  "b 3ß4

Then,

$$\hat{\varepsilon} = \left(\sum_{k=1}^{t}\hat{g}_{kk}\right)^{2} \bigg/ \left((t-1)\sum_{k,k'}\hat{g}_{kk'}^{2}\right). \qquad [7.61]$$

This adjustment factor for the degrees of freedom is less conservative than using the lower bound of ε, but more conservative than an adjustment proposed by Huynh and Feldt (1976). The estimated Box epsilon [7.61] and the Huynh-Feldt adjustment are calculated by proc glm of The SAS® System if the repeated statement of that procedure is used. [7.61] is labeled Greenhouse-Geisser Epsilon on the proc glm output (Greenhouse and Geisser 1959). To decide whether any adjustment is necessary, i.e., whether the dispersion of the data deviates significantly from the Huynh-Feldt conditions, a test of sphericity can be invoked. This test is available through the printE option of the repeated statement in proc glm. If the sphericity test is rejected, a degree of freedom adjustment is deemed necessary.
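In proc glm the repeated measurements enter as separate columns of the data set. A sketch — y1 through y4 are hypothetical variables holding the four measurement occasions and tx a treatment factor — that produces the epsilon estimates and the sphericity test is:

   proc glm data=yourdata;
      class tx;
      model y1-y4 = tx;
      repeated time 4 / printe;
   run;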
This type of repeated measures analysis attempts to coerce the analysis into a split-plot model framework and, if that framework does not apply, uses fudge factors to adjust the end result (critical values or p-values). If the sphericity assumption (i.e., the Huynh-Feldt conditions) is violated, the basic problem from our standpoint is that the statistical model undergirding the analysis is not correct. We prefer a more direct, and hopefully more intuitive, approach to modeling repeated measures data. Compound symmetry is a dispersion/correlation structure that comes about in the split-plot model through nested, independent random components. In a repeated measures setting there is often a priori knowledge or theory about the correlation pattern over time. In a two-year study of an annual crop it is reasonable to assume that within a growing season measurements are serially correlated and that correlations wear off with temporal separation. Second, it may be reasonable to assume that the measurements at the beginning of the second growing season are independent of the responses at the end of the previous season. Rather than relying on a compound-symmetric or Huynh-Feldt correlation structure, we can employ a correlation structure whose behavior is consistent with these assumptions. Of the large number of such correlation models, some were discussed in §7.5. Equipped with a statistical package capable of fitting data with the chosen correlation model and a method to distinguish the goodness-of-fit of competing correlation models, the research worker can develop a statistical model that describes more closely the structure of the data without relying on fudge factors.
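One practical device for such comparisons is to fit the same mean model under several candidate structures and compare the information criteria in the Fit Statistics table; the REML-based criteria are comparable only when the fixed effects are held constant. A sketch with the placeholder names used earlier:

   %macro fitcov(type);
      proc mixed data=yourdata;
         class unit time tx;
         model y = tx time tx*time;
         repeated time / subject=unit type=&type;
      run;
   %mend;
   %fitcov(cs); %fitcov(ar(1)); %fitcov(toep); %fitcov(un);

Among models with the same fixed effects, the structure with the smallest AIC is preferred.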


7.6 Applications
The applications of linear mixed models for clustered data we entertain in this section are as
varied as the Laird-Ware model. We consider traditional growth studies with linear mixed
regression models as well as designed experiments with multiple random effects. §7.6.1 is a
study of apple growth over time where we are interested in predicting population-averaged
and apple-specific growth trends. The empirical BLUPs will play a key role in estimating the
cluster-specific trends. Because measurements on individual apples were collected repeatedly in time, we also pay attention to the possibility of serial correlation in the model residuals.
This application is intended to underscore the two-stage concept in mixed modeling. §7.6.2 to
§7.6.5 are experimental situations where the statistical model contains multiple random terms
for different reasons. In §7.6.2 we analyze data from an on-farm trial where identical experi-
mental designs are laid out on randomly selected farms. The random selection of farms results
in a mixed model containing random experimental errors, random farm effects, and random
interactions. The estimated BLUPs are again key quantities on which inferences involving
particular farms are based.
Subsampling of experimental units also gives rise to clustered data structures and a
nesting of experimental and observational error sources. The liabilities of not recognizing the
subsampling structure are discussed in §7.6.3 along with the correct analysis based on a linear
mixed model. A very special case of an experimental design with a linear mixed model arises
when block effects are random effects, for example, when locations are chosen at random.
The on-farm experiment in §7.6.2 can be viewed as a design of that nature. If blocks are in-
complete in the sense that the size of the block cannot accommodate all treatments, mixed
model analyses are more powerful than fixed model analyses because of their ability to
recover treatment contrasts from comparisons across blocks. This recovery of interblock
information is straightforward with the mixed procedure in SAS® and we apply the
techniques to a balanced incomplete block design (BIB) in §7.6.4. Finally, a common method
for clustering data in experimental designs is the random assignment of treatments to experimental units of different size. The split-type designs that result have a linear mixed model representation. Experimental units can be of different sizes if one group of units is ar-
ranged within a larger unit (splitting) or if units are arranged perpendicular to each other
(stripping). A combination of both techniques that is quite common in agricultural field ex-
periments is the split-strip-plot design that we analyze in §7.6.5.
Modeling the correlations among repeated observations directly is our preferred method
over inducing correlations through random effects or random coefficients. The selection of a
suitable correlation structure based on AIC and other fit criteria is the objective of analyzing a
factorial treatment structure with repeated measurements in §7.6.6. The model we arrive at is
a mixed model because it contains random effects for the experimental errors corresponding
to experimental units and serially correlated observational disturbances observed over time
within each unit.
Many growth studies involve nonlinear models (§5, §8) or polynomials. How to deter-
mine which terms in a growth model are to be made random and which are to be kept fixed is
the topic of §7.6.7, concerned with the water usage of horticultural trees. The comparison of
treatments in a growth study with complex subsampling design is examined in §7.6.8.


7.6.1 Two-Stage Modeling of Apple Growth over Time


At the Winchester Agricultural Experiment Station of Virginia Polytechnic Institute and State University ten apple trees were randomly selected and twenty-five apples were randomly chosen on each tree. We concentrate the analysis on the apples in the largest size class, those whose initial diameter exceeded 2.75 inches. In total there were eighty apples in that size class. Diameters of the apples were recorded in two-week intervals over a twelve-week period. The observed apple diameters are shown for sixteen of the eighty apples in Figure 7.16. The profiles for the remaining apples are very similar to the ones shown. Only those apples that remained on the tree for the entire three-month period have complete data. Apple 14 on tree 1, for example, was measured only on the first occasion, and only three measurements were available for apple 15 on tree 2.

Of interest is modeling the apple-specific and the overall growth trends of apples in this size class. One can treat these data as clustered on two levels; trees are clusters of the first level, consisting of apples within trees. Apples are clusters of the second level, containing the repeated observations over time. It turns out during analysis that tree-to-tree variability is very small; most of the variation in the data is due to apple-specific effects. For this and expository reasons, we focus now on apples as the clusters of interest. At the end of this section we discuss how to modify the analysis to account for tree and apple cluster effects.
[Figure 7.16 here: observed apple diameters (2.5–3.5 inches) plotted against week (2–12) in sixteen panels: Tree 1, Apples 1, 4, 5, 10, 11, 13, 14, 15, 17, 18, 19, 25; Tree 2, Apples 7, 9, 11, 15.]

Figure 7.16. Observed diameters over a 12-week period for 16 of the 80 apples. Data kindly provided by Dr. Ross E. Byers, Alson H. Smith, Jr. AREC, Virginia Polytechnic Institute and State University, Winchester, Virginia. Used with permission.

Figure 7.16 suggests that the trends are linear for each apple. A naïve approach to estimating the population-averaged and cluster-specific trends is as follows. Let

$$Y_{ij} = \beta_{0i} + \beta_{1i}t_{ij} + e_{ij}$$

denote a simple linear regression for the data from apple i = 1, …, 80. We fit this linear regression separately for each apple and obtain the population-averaged estimates of the overall intercept β₀ and slope β₁ by averaging the apple-specific estimates. In this averaging one can calculate equally weighted averages or take the precision of the apple-specific estimates into account. The equally weighted average approach is implemented in SAS® as follows (Output 7.4).
/* variable time is coded as 1,2,...,6 corresponding to weeks 2,4,...,12 */
proc reg data=apples outest=est noprint;
model diam = time;
by tree apple;
run;
proc means data=est noprint;
var intercept time;
output out=PAestimates mean=beta0 beta1 std=sebeta0 sebeta1;
run;
title 'Naive Apple-Specific Estimates';
proc print data=est(obs=20) label;
var tree apple _rmse_ intercept time;
run;
title 'Naive PA Estimates';
proc print data=PAEstimates; run;

Output 7.4.
Naive Apple-Specific Estimates

Root mean
squared
Obs tree apple error Intercept time

1 1 1 0.009129 2.88333 0.010000


2 1 4 0.005774 2.83667 0.030000
3 1 5 0.008810 2.74267 0.016857
4 1 10 0.012383 2.78867 0.028000
5 1 11 0.016139 2.72133 0.026286
6 1 13 0.014606 2.90267 0.024000
7 1 14 . 3.08000 0.000000
8 1 15 0.010511 3.01867 0.032286
9 1 17 0.008944 2.76600 0.022000
10 1 18 0.011485 2.74133 0.023429
11 1 19 0.014508 2.77133 0.036286
12 1 25 0.013327 2.74067 0.028857
13 2 7 0.013663 2.82800 0.026000
14 2 9 0.014557 2.74667 0.021429
15 2 11 0.006473 2.76267 0.022571
16 2 15 0.008165 2.83333 0.010000
17 2 17 0.015228 2.82867 0.019429
18 2 23 0.016446 2.79267 0.028286
19 2 24 0.010351 2.84000 0.025714
20 2 25 0.013801 2.74333 0.024286

Naive PA Estimates

_freq_ beta0 beta1 sebeta0 sebeta1


80 2.83467 0.028195 0.092139 .009202401

Notice that for apple 14 on tree 1, which is the i = 7th apple in the data set, the coefficient estimate β̂₁,₇ is 0 because only a single observation had been collected on that apple. The apple-specific predictions can be calculated from the coefficients in Output 7.4, as can the population-averaged growth trend. We notice, however, that the ability to do this re-


quired estimation of 160 mean parameters, one slope and one intercept for each of eighty apples. Counting the estimation of the residual variance in each model, we have a total of 240 estimated parameters. We can contrast the population-average and the apple-specific predictions easily from Output 7.4. The growth of the average apple is predicted as

$$\tilde{y} = \tilde{\beta}_0 + \tilde{\beta}_1 t = 2.83467 + 0.028195 \times t$$

and that for apple 1 on tree 1, for example, as

$$\tilde{y}_1 = 2.88333 + 0.01 \times t = \left(\tilde{\beta}_0 + 0.04866\right) + \left(\tilde{\beta}_1 - 0.018195\right) \times t.$$

The quantities 0.04866 and −0.018195 are the adjustments made to the population-averaged estimates of intercept and slope to obtain the predictions for the first apple. We use the ˜ notation here because the averages of the apple-specific fixed effects estimates are not the generalized least squares estimates.
Fitting the apple-specific trends by fitting a model to the data from each apple and then averaging the estimates is a two-stage approach, literally. The cluster-specific estimates are obtained in the first stage and the population-average is determined in the second stage. The two-stage concept in the Laird-Ware model framework leads to the same end result, estimates of the population-average trend and estimates of the cluster-specific trend. It does, however, require fewer parameters. Consider the first-stage model

$$Y_{ij} = (\beta_0 + b_{0i}) + (\beta_1 + b_{1i})t_{ij} + e_{ij}, \qquad [7.62]$$

where Y_ij is the measurement taken at time t_ij for apple number i. We assume for now that the e_ij are uncorrelated Gaussian random errors with zero mean and constant variance σ², although the fact that Y_i1, Y_i2, … are repeated measurements on the same apple should give us pause. We will return to this issue later. To formulate the second stage of the Laird-Ware model we postulate that b_0i and b_1i are Gaussian random variables with mean 0 and variances σ₀² and σ₁², respectively. They are assumed not to be correlated with each other and are also not correlated with the error terms e_ij. This is obviously a mixed model of Laird-Ware form with 5 parameters (β₀, β₁, σ₀², σ₁², σ²). With the mixed procedure of The SAS® System, this is accomplished through the statements
proc mixed data=apples;
class tree apple;
model diam = time / s;
random intercept time / subject=tree*apple s;
run;

By default the variance components σ₀², σ₁², and σ² will be estimated by restricted maximum likelihood. The subject=tree*apple option identifies the unique combinations of the data set variables apple and tree as clusters. Both variables are needed here since the apple variable is numbered consecutively starting at 1 for each tree. Technically more correct would be to write subject=apple(tree), since the apple identifiers are nested within trees, but the analysis will be identical to the one above. In writing subject= options in proc mixed, the user must only provide a variable combination that uniquely identifies the clusters. The /s option on the model statement requests a printout of the fixed effects estimates β̂₀ and β̂₁; the same option on the random statement requests a printout of the solutions for the random

effects (the estimated BLUPs b̂₀₁, b̂₁₁, …, b̂₀,₈₀, b̂₁,₈₀). In Output 7.5 we show only a partial printout of the EBLUPs corresponding to the first ten apples.

Output 7.5.
The Mixed Procedure

Model Information
Data Set WORK.APPLES
Dependent Variable diam
Covariance Structure Variance Components
Subject Effect tree*apple
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Containment

Class Level Information


Class Levels Values
tree 10 1 2 3 4 5 6 7 8 9 10
apple 24 1 2 3 4 5 7 8 9 10 11 12 13 14
15 16 17 18 19 20 21 22 23 24 25

Dimensions
Covariance Parameters 3
Columns in X 2
Columns in Z Per Subject 2
Subjects 80
Max Obs Per Subject 6
Observations Used 451
Observations Not Used 29
Total Observations 480

Covariance Parameter Estimates


Cov Parm Subject Estimate
Intercept tree*apple 0.008547
time tree*apple 0.000056
Residual 0.000257

Fit Statistics
-2 Res Log Likelihood -1897.7
AIC (smaller is better) -1891.7
AICC (smaller is better) -1891.6
BIC (smaller is better) -1884.5

Solution for Fixed Effects


Standard
Effect Estimate Error DF t Value Pr > |t|
Intercept 2.8345 0.01048 79 270.44 <.0001
time 0.02849 0.000973 78 29.29 <.0001

Solution for Random Effects


Std Err
Effect tree apple Estimate Pred DF t Value Pr > |t|
Intercept 1 1 0.03475 0.01692 292 2.05 0.0409
time 1 1 -0.01451 0.003472 292 -4.18 <.0001
Intercept 1 4 0.003218 0.01692 292 0.19 0.8493
time 1 4 0.001211 0.003472 292 0.35 0.7274
Intercept 1 5 -0.09808 0.01692 292 -5.80 <.0001
time 1 5 -0.00970 0.003472 292 -2.79 0.0055
Intercept 1 10 -0.04518 0.01692 292 -2.67 0.0080
time 1 10 -0.00061 0.003472 292 -0.17 0.8614
Intercept 1 11 -0.1123 0.01692 292 -6.64 <.0001


Output 7.5 (continued).


time 1 11 -0.00229 0.003472 292 -0.66 0.5106
Intercept 1 13 0.06357 0.01692 292 3.76 0.0002
time 1 13 -0.00326 0.003472 292 -0.94 0.3482
Intercept 1 14 0.2094 0.02010 292 10.42 <.0001
time 1 14 0.001382 0.007486 292 0.18 0.8537
Intercept 1 15 0.1830 0.01692 292 10.82 <.0001
time 1 15 0.003882 0.003472 292 1.12 0.2644
Intercept 1 17 -0.07277 0.01759 292 -4.14 <.0001
time 1 17 -0.00491 0.004216 292 -1.16 0.2450
Intercept 1 18 -0.09474 0.01692 292 -5.60 <.0001
time 1 18 -0.00447 0.003472 292 -1.29 0.1989

The estimates of the variance components are fairly small,

$$\hat{\sigma}_0^2 = 0.0085, \quad \hat{\sigma}_1^2 = 0.000056, \quad \hat{\sigma}^2 = 0.000257,$$
but one should keep in mind that the estimates are scale-dependent. If we model U_ij = 10Y_ij instead of Y_ij, these estimates will be 10² times larger without altering the fit of the model. The fixed effects estimates are β̂₀ = 2.8345 and β̂₁ = 0.02849, fairly close to the population-averaged estimates obtained by averaging the apple-specific fixed effects coefficients earlier (Output 7.4). The Solution for Random Effects table lists the estimated BLUPs. A population-averaged prediction is obtained as

$$\hat{y} = 2.8345 + 0.02849 \times t$$

and the apple-specific prediction for the first apple, for example, as

$$\hat{y}_1 = \left(\hat{\beta}_0 + 0.03475\right) + \left(\hat{\beta}_1 - 0.01451\right) \times t = 2.86925 + 0.01398 \times t.$$
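These predicted values need not be assembled by hand. As a sketch (the output data set names applepred and papred are arbitrary), the outp= option of the model statement writes the apple-specific predictions, which use the EBLUPs, and the outpm= option writes the population-averaged predictions:

   proc mixed data=apples;
      class tree apple;
      model diam = time / s outp=applepred outpm=papred;
      random intercept time / subject=tree*apple s;
   run;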

In fitting this model it was assumed that the e_ij are uncorrelated. This may not be tenable since the measurements from the same apple are taken sequentially in time. To investigate whether there is significant serial correlation we perform a likelihood ratio test. We fit model [7.62] but assume that the e_ij follow a first-order autoregressive model. Since the measurement occasions are equally spaced, this is a reasonable approach. Recall from Output 7.5 that minus twice the restricted (residual) log likelihood of the model with uncorrelated errors is −1897.7. We accomplish fitting a model with AR(1) errors by adding the repeated statement as follows:
proc mixed data=apples noitprint;
class tree apple;
model diam = time / s;
random intercept time / subject=tree*apple;
repeated / subject=tree*apple type=ar(1);
run;

The estimate of the autocorrelation coefficient, the correlation between diameter measurements on the same apple two weeks apart, is ρ̂ = 0.3825 (Output 7.6). It appears fairly substantial, but is adding the autocorrelation to the model a significant improvement? The negative of twice the residual log likelihood in this model is −1910.5 and the likelihood ratio test statistic comparing the models with and without AR(1) correlation is 1910.5 − 1897.7 = 12.8. The p-value for the hypothesis that the autoregressive parameter is zero is


thus Pr(χ²₁ > 12.8) = 0.00035. Adding the AR(1) serial correlation does significantly improve the model. The impact of adding the AR(1) correlations is primarily on the standard errors of all estimated quantities. The population-averaged estimates as well as the BLUPs for the random effects change very little; hence the impact on the predicted values is minor (not necessarily so the impact on the precision of the predicted values).
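The p-value is quickly verified with SAS®'s probchi function; a throwaway data step:

   data _null_;
      p = 1 - probchi(12.8, 1);   /* Pr(chi-square with 1 df exceeds 12.8) */
      put p=;
   run;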

Output 7.6. The Mixed Procedure

Model Information

Data Set WORK.APPLES


Dependent Variable diam
Covariance Structures Variance Components,
Autoregressive
Subject Effects tree*apple, tree*apple
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Containment

Class Level Information

Class Levels Values


tree 10 1 2 3 4 5 6 7 8 9 10
apple 24 1 2 3 4 5 7 8 9 10 11 12 13 14
15 16 17 18 19 20 21 22 23 24 25

Dimensions

Covariance Parameters 4
Columns in X 2
Columns in Z Per Subject 2
Subjects 80
Max Obs Per Subject 6
Observations Used 451
Observations Not Used 29
Total Observations 480

Covariance Parameter Estimates

Cov Parm Subject Estimate


Intercept tree*apple 0.008653
time tree*apple 0.000050
AR(1) tree*apple 0.3825
Residual 0.000365

Fit Statistics

-2 Res Log Likelihood -1910.5


AIC (smaller is better) -1902.5
AICC (smaller is better) -1902.4
BIC (smaller is better) -1893.0

Solution for Fixed Effects

Standard
Effect Estimate Error DF t Value Pr > |t|
Intercept 2.8321 0.01068 79 265.30 <.0001
time 0.02875 0.001017 78 28.28 <.0001


Output 7.6 (continued).


Solution for Random Effects

Std Err
Effect tree apple Estimate Pred DF t Value Pr > |t|

Intercept 1 1 0.02936 0.02049 292 1.43 0.1529


time 1 1 -0.01252 0.004189 292 -2.99 0.0030
Intercept 1 4 0.004805 0.02049 292 0.23 0.8147
time 1 4 0.000845 0.004189 292 0.20 0.8403
Intercept 1 5 -0.1024 0.02049 292 -5.00 <.0001
time 1 5 -0.00814 0.004189 292 -1.94 0.0529
Intercept 1 10 -0.04381 0.02049 292 -2.14 0.0333
time 1 10 -0.00080 0.004189 292 -0.19 0.8489

[Figure 7.17 here: the same sixteen panels as Figure 7.16, showing observed diameters, apple-specific fitted lines, and the population-averaged line plotted against week.]

Figure 7.17. Apple-specific predictions (solid lines) from mixed model [7.62] with AR(1) correlated error terms for the same apples shown in Figure 7.16. Dashed lines show population-averaged prediction; circles are raw data.

Once we have settled on the correlation model for these data we should go back and re-evaluate whether the two random effects are in fact needed. Deleting them in turn, one obtains a residual log likelihood of 947.75 for the model without the random slope, 946.3 for the model without the random intercept, and 946.05 for the model without any random effects. Any one likelihood ratio test against the full model in Output 7.6 is significant. Both random effects remain in the model along with the AR(1) serial correlation.
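The reduced models for these likelihood ratio tests are obtained by trimming the random statement. For example, the model without the random slope (a sketch) is fit as:

   proc mixed data=apples;
      class tree apple;
      model diam = time / s;
      random intercept / subject=tree*apple;
      repeated / subject=tree*apple type=ar(1);
   run;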
The predictions from this model trace the observed growth profiles very closely (Figure
7.17) and the deviation of the solid from the dashed line in Figure 7.17 is an indication of
how strongly a particular apple differs from the average apple. It is clear that most of the
apple-to-apple variation is in the actual size of the apple (heterogeneity in intercepts), not the


growth rate (σ̂₀² was larger than σ̂₁²). Notice that population-average and subject-specific predictions have been obtained for apple 14 on tree 1, although only a single diameter had been measured for this apple.
What about the trees in this study? There were, after all, two levels of random selection. Trees were randomly selected in the orchard and apples were randomly selected from the trees. We can model the data starting with a population-average model, adding tree-to-tree heterogeneity, and then apple-to-apple heterogeneity within trees. The code
proc mixed data=apples update noitprint;
class tree apple;
model diam = time / s;
random intercept time / subject=tree;
random intercept time / subject=apple(tree);
repeated / subject=apple(tree) type=ar(1);
run;

uses the update option to write to the log window what proc mixed is currently doing. Fitting
models with many random effects can be time consuming and it is then helpful to find out
whether the procedure is still processing.
Notice that there are now two more covariance parameters corresponding to a random intercept and a random slope for the trees (Output 7.7). A likelihood ratio test of whether the addition of the two random effects improved the model has test statistic 1917.1 − 1910.5 = 6.6 on two degrees of freedom with p-value Pr(χ²₂ > 6.6) = 0.037. Notice, however, that the variance component for the tree-specific random intercept is practically zero. AIC (smaller is better) is calculated as minus twice the residual log likelihood plus twice the number of covariance parameters. The AIC adjustment was made for only five, not six, covariance parameters: the estimate for the variance of tree-specific random intercepts was set to zero. The data do not support that many random effects. Also, the variances of the random slopes on the tree and the apple level add up to 0.000053, which should be compared to σ̂₁² = 0.00005 in Output 7.6.

Output 7.7.
The Mixed Procedure

Model Information

Data Set WORK.APPLES


Dependent Variable diam
Covariance Structures Variance Components,
Autoregressive
Subject Effects tree, apple(tree),
apple(tree)
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Containment

Class Level Information

Class Levels Values


tree 10 1 2 3 4 5 6 7 8 9 10
apple 24 1 2 3 4 5 7 8 9 10 11 12 13 14
15 16 17 18 19 20 21 22 23 24 25

© 2003 by CRC Press LLC


474 Chapter 7  Linear Mixed Models

Output 7.7 (continued).


Dimensions

Covariance Parameters 6
Columns in X 2
Columns in Z Per Subject 26
Subjects 10
Max Obs Per Subject 72
Observations Used 451
Observations Not Used 29
Total Observations 480

Covariance Parameter Estimates

Cov Parm Subject Estimate

Intercept tree 3.74E-20


time tree 0.000017
Intercept apple(tree) 0.008653
time apple(tree) 0.000036
AR(1) apple(tree) 0.3649
Residual 0.000354

Fit Statistics

-2 Res Log Likelihood -1917.1


AIC (smaller is better) -1907.1
AICC (smaller is better) -1907.0
BIC (smaller is better) -1905.6

Solution for Fixed Effects

Standard
Effect Estimate Error DF t Value Pr > |t|
Intercept 2.8322 0.01066 9 265.58 <.0001
time 0.02870 0.001623 9 17.68 <.0001

7.6.2 On-Farm Experimentation with Randomly Selected Farms
On-farm experimentation is part of the technology transfer from the agricultural experiment station to the farm operation. On-farm trials are different from station experiments in several ways. Often only two treatments are being compared, one representing a current standard, the other a treatment in technology transfer from experimental research. Supplementary treatments are sometimes added on plots smaller than the experimental units to which the main treatments of interest are applied (see, e.g., Petersen 1994). This introduces variance heterogeneity into the data due to differing sizes of experimental units. Expressing outcomes on a per-unit-area basis may not remove these effects entirely due to the difference in border effects between large and small units. Replication within a farm is not necessary for a valid experimental design provided that the treatments of interest are applied on more than one farm so that the farms can serve as block effects. Data from such experiments will exhibit large experimental error variability and low power unless the farms are very similar, the average treatment difference is the same on all farms, and a sufficient number of farms are involved in the study. Due to differences in, for example, cropping history, soil types, and


lacking control over experimental units and conditions, one should anticipate such farm × treatment interaction and allow for replication of the treatments within a farm. In contrast to research station experimentation, where locations at which to apply the treatments are often chosen deliberately to reflect certain conditions of interest or because of availability, the farms for on-farm research are often chosen at random to represent the population of farms (conditions) in the region where technology transfer is to take place. If we think of farms as stochastic locational or environmental effects, these will have to enter any statistical model as random effects (§7.2.3). Treatments, chosen deliberately to reflect current practice and technology to be transferred, are fixed effects, on the contrary. As a consequence, statistical models for the analysis of data from on-farm experimentation are typically mixed models.
Consider the (hypothetical) data in Table 7.7 representing wheat yields from eight on-farm block designs, each with three blocks and two treatments. The farms were selected at random from a list of all farms in the area where the new treatment (B) is to be tested. On each farm we have a randomized block design

$$Y_{ij} = \mu + \tau_i + \rho_j + e_{ij},$$

where τ_i (i = 1, 2) is the effect of the ith treatment and ρ_j (j = 1, 2, 3) denotes the block effects. One could analyze eight separate RCBDs to determine the effectiveness of the treatments by farm. These farm-specific analyses would be powerless since each RCBD has only two degrees of freedom for the experimental error. Also, nothing would be learned about the treatment × farm interaction. A more suitable analysis will combine all the data into a single analysis.

Table 7.7. Data from on-farm trials conducted as randomized block designs with three blocks on each of eight farms

            Block 1           Block 2           Block 3
  Farm     A       B         A       B         A       B
   1     30.86   33.31     30.32   30.94     32.31   35.24
   2     31.39   27.87     30.62   25.25     29.93   21.79
   3     39.22   41.95     38.96   43.38     35.39   41.09
   4     37.19   30.97     36.10   32.55     35.85   33.04
   5     24.98   23.39     22.04   24.50     22.93   23.24
   6     28.06   28.69     27.98   25.68     25.13   25.88
   7     27.82   37.23     25.32   34.45     26.52   32.49
   8     29.41   30.98     26.63   30.71     29.60   30.63

The model for this analysis is that of a replicated RCBD with farm effects, treatment effects, block effects nested within farms (since block 1 on farm 1 is a different physical entity than block 1 on farm 2), and treatment × farm interactions. Since the eight farms are a random sample of farms, their effects enter the model as random (φ_k). The treatment effects (τ_i) are fixed and the interaction between farms and treatments ((φτ)_ik) is random since it involves the random farm effects. The complete model for analysis is


$$Y_{ijk} = \mu + \varphi_k + \tau_i + \rho_{j(k)} + (\varphi\tau)_{ik} + e_{ijk} \qquad [7.63]$$

$$\varphi_k \sim G(0, \sigma^2_{\varphi}), \quad \rho_{j(k)} \sim G(0, \sigma^2_{\rho}), \quad (\varphi\tau)_{ik} \sim G(0, \sigma^2_{\varphi\tau}), \quad e_{ijk} \sim G(0, \sigma^2_e)$$

$$i = 1, 2; \quad j = 1, 2, 3; \quad k = 1, \ldots, 8.$$

Because the farm effects are random it is reasonable to also treat the block effects nested
within farms as random variables. To obtain a test of the treatment effects and estimates of all
variance components by restricted maximum likelihood we use the proc mixed code
proc mixed data=onfarm;
class farm block tx;
model yield = tx /ddfm=satterth;
random farm block(farm) farm*tx;
run;

Only the treatment effect tx is listed in the model statement since it is the only fixed
effect in the model. The Satterthwaite approximation is invoked here because exact tests may
not be available in complex mixed models such as this one.
The largest variance component estimate is $\hat{\sigma}^2_{\varphi} = 19.9979$ (Output 7.8). The variance between farms is twenty times larger than the variation within farms ($\hat{\sigma}^2_{\rho} = 0.9848$). The test for differences among the treatments is not significant ($F_{obs} = 0.30$, $p = 0.6003$). Based on this test one would conclude that the new treatment does not increase or decrease yield over the current standard. It is possible, however, that the marginal treatment effect is masked by an interaction. If, for example, the new treatment (B) outperforms the standard on some farms but performs more poorly than the standard on other farms, it is conceivable that the treatment averages across farms are not very different. To address this question we need to test the significance of the interaction between treatments and farms. Since the interaction terms $(\varphi\tau)_{ik}$ are random variables, the farm*tx effect appears as a covariance parameter, not as a fixed effect. Fitting a reduced model without the interaction, we can calculate a likelihood ratio test statistic to test $H_0\!: \sigma^2_{\varphi\tau} = 0$. The proc mixed code
proc mixed data=onfarm;
class farm block tx;
model yield = tx /ddfm=satterth;
random farm block(farm) ;
run;

produces a -2 Res Log Likelihood of 250.4 (output not shown). The difference between this value and the -2 Res Log Likelihood of 223.8 for the full model is asymptotically the realization of a chi-square random variable with one degree of freedom. The $p$-value of the likelihood ratio test of $H_0\!: \sigma^2_{\varphi\tau} = 0$ is thus $\Pr(\chi^2_1 \geq 26.6) < 0.0001$. There is a significant interaction between farms and treatments.
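
The likelihood ratio computation is easily scripted; a minimal data step sketch using the SAS probchi function:

   data lrt;
      chisq  = 250.4 - 223.8;          /* difference of the -2 Res Log Likelihoods */
      pvalue = 1 - probchi(chisq, 1);  /* one variance component is set to zero    */
   run;
   proc print data=lrt; run;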


Output 7.8.
The Mixed Procedure

Model Information
Data Set WORK.ONFARM
Dependent Variable yield
Covariance Structure Variance Components
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Satterthwaite

Class Level Information


Class Levels Values
farm 8 1 2 3 4 5 6 7 8
block 3 1 2 3
tx 2 A B

Dimensions
Covariance Parameters 4
Columns in X 3
Columns in Z 48
Subjects 1
Max Obs Per Subject 48
Observations Used 48
Observations Not Used 0
Total Observations 48

Covariance Parameter Estimates


Cov Parm Estimate
farm 19.9979
block(farm) 0.9848
farm*tx 9.3377
Residual 1.6121

Fit Statistics
Res Log Likelihood -111.9
Akaike's Information Criterion -115.9
Schwarz's Bayesian Criterion -116.1
-2 Res Log Likelihood 223.8

Type 3 Tests of Fixed Effects


Num Den
Effect DF DF F Value Pr > F
tx 1 7 0.30 0.6003

This interaction would normally be investigated with interaction slices by farm, producing separate tests of the treatment difference for each farm. Since the interaction term is random, this is not possible (slices require fixed effects). However, the best linear unbiased predictors (BLUPs) for the treatment means on each farm can be calculated with the procedure. These are the quantities on which treatment comparisons for a given farm should be based. The statements
estimate 'Blup Farm 1 tx A' intercept 1 tx 1 0 | farm 1 0 0 0 0 0 0 0
farm*tx 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
estimate 'Blup Farm 1 tx B' intercept 1 tx 0 1 | farm 1 0 0 0 0 0 0 0
farm*tx 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0;

for example, estimate the BLUPs for the two treatments on farm 1. Notice the vertical slash after tx 1 0, which narrows the inference space by fixing the farm effects to that of farm 1.


BLUPs for other farms are calculated similarly by shifting the coefficients for farm and
farm*tx effects to the appropriate positions. For example,

estimate 'Blup Farm 3 tx A' intercept 1 tx 1 0 | farm 0 0 1 0 0 0 0 0


farm*tx 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0;
estimate 'Blup Farm 3 tx B' intercept 1 tx 0 1 | farm 0 0 1 0 0 0 0 0
farm*tx 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0;

estimates the treatment BLUPs for the third farm. The coefficients that were shifted compared to the estimate statements for farm 1 are shown in bold. The BLUPs so obtained for the two treatments on all farms follow in Output 7.9. Comparing the Estimate values there with the entries in Table 7.7, it is evident that the EBLUPs are close to the sample means for each treatment calculated across the blocks on a particular farm; the predictors are shrunk slightly toward the overall mean, and the values in Table 7.7 are rounded to two decimal places.
Of interest is of course a comparison of these means by farm. In other words, are there
farms where the new treatment outperforms the current standard and farms where the reverse
is true? Since the treatment effect was not significant in the analysis of the full model but the
likelihood ratio test for the interaction was significant, we almost expect such a relationship.
The following estimate statements contrast the two treatments for each farm (Output 7.10).
estimate 'Tx eff. on Farm 1' tx 1 -1 | farm*tx 1 -1;
estimate 'Tx eff. on Farm 2' tx 1 -1 | farm*tx 0 0 1 -1;
estimate 'Tx eff. on Farm 3' tx 1 -1 | farm*tx 0 0 0 0 1 -1;
estimate 'Tx eff. on Farm 4' tx 1 -1 | farm*tx 0 0 0 0 0 0 1 -1;
estimate 'Tx eff. on Farm 5' tx 1 -1 | farm*tx 0 0 0 0 0 0 0 0 1 -1;
estimate 'Tx eff. on Farm 6' tx 1 -1 | farm*tx 0 0 0 0 0 0 0 0 0 0 1 -1;
estimate 'Tx eff. on Farm 7' tx 1 -1 | farm*tx 0 0 0 0 0 0 0 0 0 0 0 0 1 -1;
estimate 'Tx eff. Farm 8' tx 1 -1 | farm*tx 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 -1;

Output 7.9.
Estimates

Standard
Label Estimate Error DF t Value Pr > |t|

Blup Farm 1 tx A 31.1595 0.9168 29.2 33.99 <.0001


Blup Farm 1 tx B 33.0960 0.9168 29.2 36.10 <.0001
Blup Farm 2 tx A 30.5384 0.9168 29.2 33.31 <.0001
Blup Farm 2 tx B 25.2136 0.9168 29.2 27.50 <.0001
Blup Farm 3 tx A 37.7253 0.9168 29.2 41.15 <.0001
Blup Farm 3 tx B 41.8261 0.9168 29.2 45.62 <.0001
Blup Farm 4 tx A 36.1593 0.9168 29.2 39.44 <.0001
Blup Farm 4 tx B 32.2394 0.9168 29.2 35.17 <.0001
Blup Farm 5 tx A 23.4730 0.9168 29.2 25.60 <.0001
Blup Farm 5 tx B 23.8930 0.9168 29.2 26.06 <.0001
Blup Farm 6 tx A 27.1144 0.9168 29.2 29.58 <.0001
Blup Farm 6 tx B 26.8735 0.9168 29.2 29.31 <.0001
Blup Farm 7 tx A 26.7533 0.9168 29.2 29.18 <.0001
Blup Farm 7 tx B 34.5251 0.9168 29.2 37.66 <.0001
Blup Farm 8 tx A 28.6076 0.9168 29.2 31.20 <.0001
Blup Farm 8 tx B 30.7610 0.9168 29.2 33.55 <.0001

On farms 2 and 4 the old treatment significantly outperforms the new treatment. On farms 3, 7, and 8, however, the new treatment significantly outperforms the current standard (at the 5% significance level). This reversal of the treatment effects masked the treatment main effect. Whereas the recommendation based on the treatment main effect would have been that one may as well stick with the old treatment and not transfer technology from


experiment station research, the analysis of the interaction shows that for farms 3, 7, and 8 (and by implication farms in the target region that are alike) the new treatment holds promise.

Output 7.10.
Estimates

Standard
Label Estimate Error DF t Value Pr > |t|
Tx eff. on Farm 1 -1.9364 1.0117 17.6 -1.91 0.0720
Tx eff. on Farm 2 5.3249 1.0117 17.6 5.26 <.0001
Tx eff. on Farm 3 -4.1009 1.0117 17.6 -4.05 0.0008
Tx eff. on Farm 4 3.9200 1.0117 17.6 3.87 0.0011
Tx eff. on Farm 5 -0.4200 1.0117 17.6 -0.42 0.6831
Tx eff. on Farm 6 0.2409 1.0117 17.6 0.24 0.8145
Tx eff. on Farm 7 -7.7718 1.0117 17.6 -7.68 <.0001
Tx eff. Farm 8 -2.1535 1.0117 17.6 -2.13 0.0477

7.6.3 Nested Errors through Subsampling


Subsampling in designed experiments is the recording of multiple, independent observations on the experimental units. We term the experimental material on which subsamples are collected the observational units to distinguish them from the experimental units to which treatments are assigned. Subsamples are sometimes referred to as pseudo-replications, an unfortunate terminology, because they are not replications in the proper sense. The variation among subsamples from the same experimental unit expresses the homogeneity within the unit; we term this source of variation observational error. The proper error variation with which to compare treatment effects is the variation among experimental units that received the same treatment, termed the experimental error variance. In experimental designs with subsampling, care must be exercised to (i) not consider subsamples as replications of the treatments, (ii) separate experimental from observational error, and (iii) perform tests of hypotheses properly. To demonstrate that subsamples do not substitute for treatment replication, consider a 3 × 2 factorial experiment with six experimental units. Assume that these units are pots containing four plants each. The data structure for such an experiment could be displayed as in Table 7.8.

Table 7.8. Generic data layout of 3 × 2 factorial without replication and
four subsamples per experimental unit

                         Factor Combination
Plant    A1B1    A1B2    A2B1    A2B2    A3B1    A3B2
  1       3.5     5.0     5.5     7.0     5.5     6.0
  2       4.0     5.5     6.0     9.0     4.5     6.5
  3       3.0     4.0     5.0     8.0     8.5     9.5
  4       4.5     3.5     5.0     6.5     7.0     7.0

It is tempting to analyze these data with a two-way factorial analysis of variance based on the linear model

$$Y_{ijk} = \mu_{ij} + e^*_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + e^*_{ijk}, \qquad [7.64]$$


assuming that $e^*_{ijk}$ is the "experimental" error for the $k$th experimental unit receiving level $i$ of factor A and level $j$ of factor B. As will be shown shortly, $e^*_{ijk}$ in [7.64] is the observational error, and the experimental error has been confounded with the treatment means. Since The SAS® System cannot know whether repeated values in the data set that share the same treatment assignment represent subsamples or replicates, an analysis of the data in Table 7.8 with model [7.64] will run without complaint. Using proc glm, significant main effects of factors A and B ($p = 0.0004$ and $0.0144$) and a nonsignificant interaction are inferred (Output 7.11).
data noreps;
input A B plant y;
datalines;
1 1 1 3.5
1 1 2 4.0
1 1 3 3.0
1 1 4 4.5
1 2 1 5.0
1 2 2 5.5
1 2 3 4.0
... and so forth ...
;
run;

proc glm data=noreps;


class A B;
model y = A B A*B;
run; quit;

Output 7.11. The GLM Procedure

Class Level Information


Class Levels Values
A 3 1 2 3
B 2 1 2
Number of observations 24

Dependent Variable: y
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 5 47.34375000 9.46875000 6.94 0.0009
Error 18 24.56250000 1.36458333
Corrected Total 23 71.90625000

R-Square Coeff Var Root MSE y Mean


0.658409 20.09727 1.168154 5.812500

Source DF Type I SS Mean Square F Value Pr > F


A 2 34.56250000 17.28125000 12.66 0.0004
B 1 10.01041667 10.01041667 7.34 0.0144
A*B 2 2.77083333 1.38541667 1.02 0.3821

Source DF Type III SS Mean Square F Value Pr > F


A 2 34.56250000 17.28125000 12.66 0.0004
B 1 10.01041667 10.01041667 7.34 0.0144
A*B 2 2.77083333 1.38541667 1.02 0.3821

Notice that the $F$ statistics on which the $p$-values are based are obtained by dividing the main effects or interaction mean squares by the mean square error of $1.3645$. This mean square error is based on $18$ degrees of freedom, $4 - 1 = 3$ degrees of freedom for the subsamples in each of the $6$ experimental units. This analysis is clearly wrong, since the experimental error in this design has $t(r - 1)$ degrees of freedom, where $t$ denotes the number of treatments and $r$ the number of replications for each treatment. Since each of the $t = 6$ treatments was assigned to only one pot, we have $r = 1$ and $t(r - 1) = 0$. What SAS® terms the Error source in this model is the observational error, and $\hat{\sigma}^2 = 1.3645$ is an estimate of the observational error variance. The correct model for the subsampling design contains separate random terms for experimental and observational error. In the two-factor design we obtain

$$Y_{ijkl} = \mu_{ij} + e_{ijk} + d_{ijkl} \qquad [7.65]$$

$$\text{Var}[e_{ijk}] = \sigma^2_e, \quad \text{Var}[d_{ijkl}] = \sigma^2_d,$$

where $k = 1, \ldots, r$ indexes the replications, $e_{ijk}$ is the experimental error as defined above, and $d_{ijkl}$ is the observational (subsampling) error for subsample $l = 1, \ldots, n$ on replicate $k$. $\sigma^2_e$ and $\sigma^2_d$ are the experimental and observational error variances, respectively. If $r = 1$, as for the data in Table 7.8, the model becomes

$$Y_{ijl} = \mu_{ij} + e_{ij} + d_{ijl} = \mu^*_{ij} + d_{ijl}, \qquad [7.66]$$

and the experimental error is now confounded with the treatments. This is model [7.64] where $e^*_{ijk}$ is replaced with $d_{ijl}$ and $\mu_{ij}$ is replaced with $\mu^*_{ij}$. Because $\mu_{ij}$ and $e_{ij}$ in [7.66] carry the same subscripts, the two sources of variability are confounded. The only random variation that can be estimated is the variance of $d_{ijl}$, the observational error. Finally, the observational error mean square is not the correct denominator for $F$-tests (Table 7.9). The statistic $MS(\text{Treatment})/MS(\text{Obs. Error})$ thus is not a test statistic for the absence of treatment effects ($f(\mu^2_{ij}) = 0$) but for the simultaneous absence of treatment effects and experimental error variance, a nonsensical proposition.

Table 7.9. Expected mean squares in subsampling design without
treatment replications (model [7.66])

Source of Variation                DF           E[MS]
Treatments + Experimental Error    t − 1        σ²_d + nσ²_e + f(μ²_ij)
Observational Error                t(n − 1)     σ²_d

Table 7.10. Expected mean squares in completely randomized design with subsampling
(t denotes number of treatments, r number of replicates, and n number of subsamples)

Source of Variation           DF            E[MS]
Treatments (TX)               t − 1         σ²_d + nσ²_e + f(τ²_i)
Experimental Error (EE)       t(r − 1)      σ²_d + nσ²_e
Observational Error (OE)      tr(n − 1)     σ²_d

In subsampling designs with treatment replication, experimental and observational error variances are estimable and not confounded with effects. Table 7.10 displays the expected mean squares for $t$ treatments in a completely randomized design with $r$ replications per treatment and $n$ subsamples per experimental unit for the linear model

$$Y_{ijk} = \mu + \tau_i + e_{ij} + d_{ijk}, \qquad i = 1, \ldots, t; \; j = 1, \ldots, r; \; k = 1, \ldots, n.$$


Notice that the experimental error degrees of freedom are not affected by the number of subsamples, and that $F_{obs} = MS(TX)/MS(EE)$ is the test statistic for testing treatment effects.

The data in Table 7.11, taken from Steel, Torrie, and Dickey (1997, p. 159), represent a 3 × 2 factorial treatment structure arranged in a completely randomized design with $r = 3$ replicates and $n = 4$ subsamples per experimental unit. From a large group of plants, four were randomly assigned to each of 18 pots. Six treatments were then randomly assigned to the pots such that each treatment was replicated three times. The treatments consisted of all possible combinations of three hours of daylight (8, 12, 16 hrs) and two levels of night temperature (low, high). The outcome of interest was the stem growth of mint plants grown in nutrient solution under the assigned conditions. The experimental units are the pots, since treatments were assigned to those. Stem growth was measured for each plant in a pot; hence there are four subsamples per experimental unit.

Table 7.11. One-week stem growth of mint plants; data from Steel et al. (1997, p. 159)

Low Night Temperature
             8 hrs                  12 hrs                 16 hrs
Plant  Pot 1  Pot 2  Pot 3    Pot 1  Pot 2  Pot 3    Pot 1  Pot 2  Pot 3
  1     3.5    2.5    3.0      5.0    3.5    4.5      5.0    5.5    5.5
  2     4.0    4.5    3.0      5.5    3.5    4.0      4.5    6.0    4.5
  3     3.0    5.5    2.5      4.0    3.0    4.0      5.0    5.0    6.5
  4     4.5    5.0    3.0      3.5    4.0    5.0      4.5    5.0    5.5

High Night Temperature
             8 hrs                  12 hrs                 16 hrs
Plant  Pot 1  Pot 2  Pot 3    Pot 1  Pot 2  Pot 3    Pot 1  Pot 2  Pot 3
  1     8.5    6.5    7.0      6.0    6.0    6.5      7.0    6.0   11.0
  2     6.0    7.0    7.0      5.5    8.5    6.5      9.0    7.0    7.0
  3     9.0    8.0    7.0      3.5    4.5    8.5      8.5    7.0    9.0
  4     8.5    6.5    7.0      7.0    7.5    7.5      8.5    7.0    8.0
Copyright © 1997 by The McGraw-Hill Companies, Inc. Reproduced from Table 7.8 of Steel,
R.G.D., Torrie, J.H., and Dickey, D.A. (1997), Principles and Procedures of Statistics. A
Biometrical Approach, McGraw-Hill, New York, with permission.

The linear model for these data is

$$Y_{ijkl} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + e_{ijk} + d_{ijkl} \qquad [7.67]$$

$$i = 1, \ldots, a = 3; \; j = 1, \ldots, b = 2; \; k = 1, \ldots, r = 3; \; l = 1, \ldots, n = 4$$

$$\text{Var}[e_{ijk}] = \sigma^2_e; \quad \text{Var}[d_{ijkl}] = \sigma^2_d,$$

where the $e_{ijk}$ and $d_{ijkl}$ are zero-mean uncorrelated random variables. The analysis of variance is shown in Table 7.12.


Table 7.12. Analysis of variance of mint plants data

Source                    df
Hours                     a − 1 = 2
Temperature               b − 1 = 1
Hours × Temperature       (a − 1)(b − 1) = 2
Experimental Error        ab(r − 1) = 6(3 − 1) = 12
Observational Error       abr(n − 1) = 18(4 − 1) = 54
Total                     abrn − 1 = 71

The analysis of variance can be obtained with proc glm of The SAS® System (Output
7.12):
proc glm data=mintstems;
class hour night pot;
model growth = hour night hour*night pot(hour*night);
run; quit;

The sequential (Type I) and partial (Type III) sums of squares are identical because the design is orthogonal. Notice that the source denoted Error is again the observational error, as can be seen from the associated degrees of freedom, and that the experimental error is modeled as pot(hour*night). The $F$ statistics calculated by proc glm are obtained by dividing the mean square of a source of variability by the mean square for the Error source; hence they use the observational error mean square as a denominator and are incorrect. The two error mean square estimates in Output 7.12 are

$$\hat{\sigma}^2_d = 0.9340$$

$$\hat{\sigma}^2_d + n\hat{\sigma}^2_e = 2.1527,$$

hence dividing by the observational error mean square is detrimental in two ways. The $F$ statistic is inflated, and the $p$-value is calculated from an $F$ distribution with incorrect (too many) denominator degrees of freedom.
The correct tests can be obtained in two ways with proc glm. One can add a random
statement indicating which terms of the model statement are random variables and The SAS®
System will construct the appropriate test statistics based on the formulas of expected mean
squares. Alternatively one can use the test statement if the correct error term is known. The
two methods lead to the following procedure calls (output not shown).
proc glm data=mintstems;
class hour night pot;
model growth = hour night hour*night pot(hour*night);
random pot(hour*night) / test;
run; quit;

proc glm data=mintstems;


class hour night pot;
model growth = hour night hour*night pot(hour*night);
test h=hour night hour*night e=pot(hour*night);
run; quit;


Output 7.12.
The GLM Procedure

Class Level Information


Class Levels Values
hour 3 8 12 16
night 2 Hig Low
pot 3 1 2 3

Number of observations 72

Dependent Variable: growth


Sum of
Source DF Squares Mean Square F Value Pr > F
Model 17 205.4756944 12.0868056 12.94 <.0001
Error 54 50.4375000 0.9340278
Corrected Total 71 255.9131944

R-Square Coeff Var Root MSE growth Mean


0.802912 16.70696 0.966451 5.784722

Source DF Type I SS Mean Square F Value Pr > F


hour 2 22.2986111 11.1493056 11.94 <.0001
night 1 151.6701389 151.6701389 162.38 <.0001
hour*night 2 5.6736111 2.8368056 3.04 0.0562
pot(hour*night) 12 25.8333333 2.1527778 2.30 0.0186

Source DF Type III SS Mean Square F Value Pr > F


hour 2 22.2986111 11.1493056 11.94 <.0001
night 1 151.6701389 151.6701389 162.38 <.0001
hour*night 2 5.6736111 2.8368056 3.04 0.0562
pot(hour*night) 12 25.8333333 2.1527778 2.30 0.0186

A more elegant approach is to use proc mixed, which is specifically designed for mixed models. The statements to analyze the two-way factorial with subsampling are
proc mixed data=mintstems;
class hour night pot;
model growth = hour night hour*night;
random pot(hour*night);
run; quit;

or
proc mixed data=mintstems;
class hour night pot;
model growth = hour night hour*night;
random intercept / subject=pot(hour*night);
run; quit;

The two versions of proc mixed code differ only in the form of the random statement and yield identical results. The second form explicitly defines the experimental units pot(hour*night) as clusters and the columns of the $\mathbf{Z}_i$ matrix as containing an intercept only. The residual maximum likelihood estimates of the variance components $\sigma^2_e$ and $\sigma^2_d$ are $\hat{\sigma}^2_e = 0.3047$ and $\hat{\sigma}^2_d = 0.9340$, respectively (Output 7.13).


Output 7.13.
The Mixed Procedure

Model Information

Data Set WORK.MINTSTEMS


Dependent Variable growth
Covariance Structure Variance Components
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Containment

Class Level Information


Class Levels Values
hour 3 8 12 16
night 2 Hig Low
pot 3 1 2 3

Dimensions
Covariance Parameters 2
Columns in X 12
Columns in Z 18
Subjects 1
Max Obs Per Subject 72
Observations Used 72
Observations Not Used 0
Total Observations 72

Covariance Parameter Estimates

Cov Parm Subject Estimate


Intercept pot(hour*night) 0.3047
Residual 0.9340
Fit Statistics

Res Log Likelihood -103.9


Akaike's Information Criterion -105.9
Schwarz's Bayesian Criterion -106.8
-2 Res Log Likelihood 207.7

Type 3 Tests of Fixed Effects

Num Den
Effect DF DF F Value Pr > F
hour 2 12 5.18 0.0239
night 1 12 70.45 <.0001
hour*night 2 12 1.32 0.3038

The latter estimate is labeled Residual in the Covariance Parameter Estimates table. Since the data are completely balanced, these estimates are identical to the method-of-moments estimators implied by the analysis of variance. From $\hat{\sigma}^2_d + n\hat{\sigma}^2_e = 2.1527$ and $\hat{\sigma}^2_d = 0.9340$ one obtains the moment estimator of the experimental error variance as

$$\hat{\sigma}^2_e = (2.1527 - 0.9340)/4 = 0.3047.$$
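
The method-of-moments calculation is easily verified in a short data step (mean squares taken from Output 7.12):

   data moments;
      ms_ee  = 2.1527;   /* mean square for pot(hour*night): experimental error */
      ms_oe  = 0.9340;   /* mean square Error: observational error              */
      n      = 4;        /* subsamples per experimental unit                    */
      sig2_e = (ms_ee - ms_oe) / n;   /* = 0.3047                               */
   run;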

Results of the main effects and interaction tests are shown in the Type 3 Tests of Fixed Effects table. The $F$ statistics are identical to those obtained in proc glm if one uses the correct mean square error term there. For example, from Output 7.12 one obtains


$$F_{obs} = \frac{MS(\text{Hour})}{MS(EE)} = \frac{11.1493}{2.1527} = 5.18$$

$$F_{obs} = \frac{MS(\text{Night})}{MS(EE)} = \frac{151.670}{2.1527} = 70.45$$

$$F_{obs} = \frac{MS(\text{Hour} \times \text{Night})}{MS(EE)} = \frac{2.8368}{2.1527} = 1.32.$$

These are the $F$ statistics shown in Output 7.13. Also notice that the denominator degrees of freedom are set to the correct degrees of freedom associated with the experimental error (Table 7.12).
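
These computations can be scripted as well; a minimal data step sketch using the probf function, with the experimental error degrees of freedom from Table 7.12:

   data ftests;
      ms_ee   = 2.1527;                     /* experimental error mean square */
      f_hour  = 11.1493 / ms_ee;
      p_hour  = 1 - probf(f_hour, 2, 12);   /* = 0.0239                       */
      f_night = 151.670 / ms_ee;
      p_night = 1 - probf(f_night, 1, 12);  /* < 0.0001                       */
      f_hxn   = 2.8368 / ms_ee;
      p_hxn   = 1 - probf(f_hxn, 2, 12);    /* = 0.3038                       */
   run;
   proc print data=ftests; run;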
The marginal correlation structure in the subsampling model [7.67] is compound symmetric, because observational errors are nested within experimental errors. Adding the v=list option to the random statement of proc mixed requests a printout of the (estimated) marginal variance-covariance matrices of the clusters (subjects) in list. For example,

random intercept / subject=pot(hour*night) v=1;

requests a printout of the variance-covariance matrix for the first cluster (Output 7.14). It is easy to verify that this matrix is of the form

$$\hat{\sigma}^2_e \mathbf{J}_4 + \hat{\sigma}^2_d \mathbf{I}_4,$$

where $\mathbf{J}_4$ is a $4 \times 4$ matrix of ones and $\mathbf{I}_4$ is the identity matrix.

Output 7.14.
Estimated V Matrix for pot(hour*night) 1 8 High

Row Col1 Col2 Col3 Col4

1 1.2387 0.3047 0.3047 0.3047


2 0.3047 1.2387 0.3047 0.3047
3 0.3047 0.3047 1.2387 0.3047
4 0.3047 0.3047 0.3047 1.2387
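
The structure of this matrix is easy to verify directly; a minimal SAS/IML sketch with the estimates from Output 7.13:

   proc iml;
      sig2_e = 0.3047;                     /* experimental error variance  */
      sig2_d = 0.9340;                     /* observational error variance */
      V = sig2_e*j(4,4,1) + sig2_d*i(4);   /* sigma2_e*J4 + sigma2_d*I4    */
      print V;                             /* reproduces Output 7.14       */
   quit;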

The same analysis can thus be obtained by modeling the marginal variance-covariance matrix $\text{Var}[\mathbf{Y}_i]$ directly as a compound-symmetric matrix. Replacing the random statement with a repeated statement and choosing the appropriate covariance structure (type=cs), the statements
proc mixed data=mintstems noitprint;
class hour night pot;
model growth = hour night hour*night;
repeated / sub=pot(hour*night) type=cs r=1;
run; quit;

lead to the same results as in Output 7.13 and Output 7.14; only the covariance parameter Intercept in Output 7.13 has been renamed CS (Output 7.15).


Output 7.15.
The Mixed Procedure

Model Information

Data Set WORK.MINTSTEMS


Dependent Variable growth
Covariance Structure Compound Symmetry
Subject Effect pot(hour*night)
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Between-Within

Class Level Information

Class Levels Values

hour 3 8 12 16
night 2 Hig Low
pot 3 1 2 3

Dimensions

Covariance Parameters 2
Columns in X 12
Columns in Z 0
Subjects 18
Max Obs Per Subject 4
Observations Used 72
Observations Not Used 0
Total Observations 72

Estimated R Matrix for pot(hour*night) 1 8 Hig

Row Col1 Col2 Col3 Col4

1 1.2387 0.3047 0.3047 0.3047


2 0.3047 1.2387 0.3047 0.3047
3 0.3047 0.3047 1.2387 0.3047
4 0.3047 0.3047 0.3047 1.2387

Covariance Parameter Estimates

Cov Parm Subject Estimate


CS pot(hour*night) 0.3047
Residual 0.9340

Fit Statistics

Res Log Likelihood -103.9


Akaike's Information Criterion -105.9
Schwarz's Bayesian Criterion -106.8
-2 Res Log Likelihood 207.7

Type 3 Tests of Fixed Effects


Num Den
Effect DF DF F Value Pr > F
hour 2 12 5.18 0.0239
night 1 12 70.45 <.0001
hour*night 2 12 1.32 0.3038


7.6.4 Recovery of Inter-Block Information in Incomplete Block Designs
Incompleteness of block designs can have many causes. Some are by design, others by accident. Among the accidental causes are destruction or loss of experimental units and discarding of erroneous measurements. Frequently incompleteness is a design feature, arising when the size of the blocks is such that not all treatments can be accommodated. Since calculations for incomplete designs are considerably more involved than for completely balanced designs, experimental plans were developed that ensure some sort of balance in the treatment allocation, to reduce the computational burden and/or to ensure a certain precision in treatment comparisons. Incomplete block designs in this category are known as balanced incomplete block designs (BIBs), partially balanced incomplete block designs (PBIBs), and various special cases thereof, such as the lattice designs (see, e.g., Yates 1936, 1940; Bose and Nair 1939; Cochran and Cox 1957 as some of the historically significant references on the subject). We will not discuss the various forms of incomplete block designs here in detail, but rather the basic issues that come to bear when not all treatments are allocated in every block. Hoshmand (1994, Ch. 4.3) discusses various forms of agronomic lattice designs, which are special cases of BIBs or PBIBs.

To illustrate the problem that arises in incomplete block designs, consider the following treatment layout in a BIB with $t = 5$ treatments in $b = 10$ blocks of size $k = 3$.

Table 7.13. A balanced incomplete block design (BIB) (treatments that
appear in a particular block are marked ×)

                   Treatment
Block      A     B     C     D     E
  1              ×     ×           ×
  2        ×           ×     ×
  3              ×           ×     ×
  4                    ×     ×     ×
  5        ×     ×           ×
  6        ×     ×                 ×
  7        ×     ×     ×
  8              ×     ×     ×
  9        ×           ×           ×
 10        ×                 ×     ×

This design is balanced in two ways. Each treatment is replicated the same number of times (6) throughout the experiment, and each pair of treatments appears the same number of times (3) within a block. For example, treatments B and C appear in blocks 1, 7, and 8; treatments A and B appear in blocks 5, 6, and 7. As a result, all treatment comparisons will be made with the same precision in the experiment. However, because of the incompleteness, block and treatment effects are not orthogonal. Whether block effects are removed or not prior to assessing treatment effects is critical. To see this, consider a comparison of treatments A and B. The naïve approach is to base this comparison on the two arithmetic averages $\overline{y}_A$ and $\overline{y}_B$. Their difference is not an estimate of the treatment effect, however, since these are


averages calculated over different blocks. $\overline{y}_A$ is calculated from information in blocks 2, 5, 6, 7, 9, 10 and $\overline{y}_B$ from information in blocks 1, 3, 5, 6, 7, 8. The difference $\overline{y}_A - \overline{y}_B$ carries not only information about differences between the treatments but also about block effects. To obtain a fair comparison of the treatments unaffected by the block effects, the treatment sum of squares must be adjusted for the block effects, and treatment means cannot be estimated as arithmetic averages. A statistical model for the design in Table 7.13 is $Y_{ij} = \mu + \rho_i + \tau_j + e_{ij}$, where the $\rho_i$ are block effects ($i = 1, \ldots, 10$), the $\tau_j$ are treatment effects ($j = 1, \ldots, 5$), and the $e_{ij}$ are independent experimental errors with mean $0$ and variance $\sigma^2$. The only difference between this linear model and one for a randomized complete block design is that not all combinations $ij$ are possible. The appropriate estimate of the mean of the $j$th treatment in the incomplete design is $\hat{\mu} + \hat{\tau}_j$, where carets denote least squares estimates. $\hat{\mu} + \hat{\tau}_j$ is also known as the least squares mean for treatment $j$. In fact, these estimates are always appropriate; in a balanced design it turns out that the least squares estimates are identical to the arithmetic averages. The question thus should not be when one should use the least squares means for treatment comparisons, but when one can rely on arithmetic means.
We illustrate the effect of nonorthogonality with data from a balanced incomplete block design reported by Cochran and Cox (1957, p. 448). Thirteen hybrids of corn were arranged in a field experiment in blocks of size $k = 4$ such that each pair of treatments appeared once in a block throughout the experiment and each treatment is replicated four times. This arrangement requires $b = 13$ blocks.

Table 7.14. Experimental layout of BIB in Cochran and Cox (1957, p. 448)†
(showing yield of corn in pounds per plot; each block contains four of the 13
hybrids, with yields listed in order of increasing hybrid number)

Block    Yields of the four hybrids present
  1      25.3   19.9   29.0   24.6
  2      23.0   19.8   33.3   22.7
  3      16.2   19.3   31.7   26.6
  4      27.3   27.0   35.6   17.4
  5      23.4   30.5   30.8   32.4
  6      30.6   32.4   27.2   32.8
  7      34.7   31.1   25.7   30.5
  8      34.4   32.4   33.3   36.9
  9      38.2   32.9   37.3   31.3
 10      28.7   30.7   26.9   35.3
 11      36.6   31.1   31.1   28.4
 12      31.8   33.7   27.8   41.1
 13      30.3   31.5   39.3   26.7

Cochran, W.G. and Cox, G.M. (1957), Experimental Design, 2nd Edition. Copyright © 1957 by
John Wiley and Sons, Inc. This material is used by permission of John Wiley and Sons, Inc.

We obtain the analysis of variance for these data with proc glm (Output 7.16).
proc glm data=cornyld;
class block hybrid;
model yield = block hybrid;
lsmeans hybrid / stderr;
means hybrid;
run; quit;


Output 7.16. The GLM Procedure

Class Level Information

Class Levels Values


block 13 1 2 3 4 5 6 7 8 9 10 11 12 13
hybrid 13 1 2 3 4 5 6 7 8 9 10 11 12 13

Number of observations 52

Dependent Variable: yield corn yield in pounds per plot

Sum of
Source DF Squares Mean Square F Value Pr > F
Model 24 1017.929231 42.413718 2.13 0.0298
Error 27 538.217500 19.933981
Corrected Total 51 1556.146731

R-Square Coeff Var Root MSE yield Mean


0.654134 14.99302 4.464749 29.77885

Source DF Type I SS Mean Square F Value Pr > F


block 12 689.3842308 57.4486859 2.88 0.0109
hybrid 12 328.5450000 27.3787500 1.37 0.2378

Source DF Type III SS Mean Square F Value Pr > F


block 12 475.2650000 39.6054167 1.99 0.0677
hybrid 12 328.5450000 27.3787500 1.37 0.2378

Least Squares Means

Standard
hybrid yield LSMEAN Error Pr > |t|

1 33.0019231 2.4586721 <.0001


2 28.2711538 2.4586721 <.0001
3 30.2173077 2.4586721 <.0001
4 28.1019231 2.4586721 <.0001
5 29.9557692 2.4586721 <.0001
6 27.1019231 2.4586721 <.0001
7 29.7250000 2.4586721 <.0001
8 33.7173077 2.4586721 <.0001
9 29.0173077 2.4586721 <.0001
10 28.0250000 2.4586721 <.0001
11 24.5250000 2.4586721 <.0001
12 30.0865385 2.4586721 <.0001
13 35.3788462 2.4586721 <.0001

Level of ------------yield------------
hybrid N Mean Std Dev

1 4 35.3250000 2.75121185
2 4 29.8000000 2.40277617
3 4 30.0000000 6.92194578
4 4 28.0500000 5.50424079
5 4 30.7250000 2.55783111
6 4 28.0750000 6.08187197
7 4 31.7750000 6.57133929
8 4 31.8000000 3.38526218
9 4 28.1000000 2.25831796
10 4 28.1750000 8.00848508
11 4 22.4250000 5.01489448
12 4 27.9000000 4.06939799
13 4 34.9750000 6.09555849


What SAS® terms Type I SS and Type III SS are sequential and partial sums of squares, respectively. Sequential sums of squares are the sum of squares contributions of sources given that the variability of the previously listed sources has been accounted for. The sequential block sum of squares of 689.38 is the sum of squares among block averages, and the sequential hybrid sum of squares of 328.54 is the contribution of the treatment variability after adjusting for block effects. Inferences about treatment effects are to be based on the partial sums of squares. The nonorthogonality of this design is evidenced by the fact that the Type I SS and the Type III SS differ. In an orthogonal design, the two sets of sums of squares would be identical. Whenever the design is nonorthogonal, great care must be exercised to estimate treatment means properly. The list of least squares means shows the estimates $\hat{\mu} + \hat{\tau}_j$ that are adjusted for the block effects. Notice that all least squares means have the same standard error, since every treatment is replicated the same number of times. The final part of the output shows the result of the means statement. These are the arithmetic sample averages of the observations for a particular treatment, which do not estimate treatment means unbiasedly unless every treatment appears in every block. One must not base treatment comparisons on these quantities in a nonorthogonal design. The column Std Dev is the standard deviation of the four observations for each treatment. It is not the standard deviation of a treatment mean based on the analysis of variance.
An analysis of an incomplete block design such as the proc glm analysis above is termed an intra-block analysis; it obtains treatment information by comparing block-adjusted least squares estimates. Yates (1936, 1940) coined the term, along with the term inter-block analysis for an analysis that also recovers treatment information contained in the block totals (averages). In incomplete block designs, contrasts of block averages also contain contrasts among the treatments. To see this, consider blocks 1 and 3 in Table 7.13. The first block contains treatments B, C, and E; the third block contains treatments B, D, and E. If $\overline{Y}_{1\cdot}$ denotes the average in block 1 and $\overline{Y}_{3\cdot}$ the average in block 3, then we have

$$E\big[\overline{Y}_{1\cdot}\big] = \mu + \rho_1 + \tfrac{1}{3}(\tau_B + \tau_C + \tau_E)$$

$$E\big[\overline{Y}_{3\cdot}\big] = \mu + \rho_3 + \tfrac{1}{3}(\tau_B + \tau_D + \tau_E).$$

The difference of the block averages contains information about the treatments, namely, $E[\overline{Y}_{1\cdot} - \overline{Y}_{3\cdot}] = \rho_1 - \rho_3 + \tfrac{1}{3}(\tau_C - \tau_D)$. Unfortunately, this is not just a contrast among treatments, but involves the effects of the two blocks. The solution to uncovering the inter-block information is to let the block effects be random (with mean $0$), since then $E[\overline{Y}_{1\cdot} - \overline{Y}_{3\cdot}] = 0 - 0 + \tfrac{1}{3}(\tau_C - \tau_D) = \tfrac{1}{3}(\tau_C - \tau_D)$, a contrast between treatment effects. The linear mixed model for the incomplete block design now becomes

$$Y_{ij} = \mu + \rho_i + \tau_j + e_{ij}, \qquad e_{ij} \sim (0, \sigma^2), \quad \rho_i \sim (0, \sigma^2_{\rho}),$$

where the $e_{ij}$ and $\rho_i$ are independent. The term $\rho_1 - \rho_3 + \tfrac{1}{3}(\tau_C - \tau_D)$ now represents the conditional (narrow inference space) comparison of the two block means, and the unconditional (broad inference space) comparison is $E[\overline{Y}_{1\cdot} - \overline{Y}_{3\cdot}] = E\big[E[\overline{Y}_{1\cdot} - \overline{Y}_{3\cdot} \mid \boldsymbol{\rho}]\big] = \tfrac{1}{3}(\tau_C - \tau_D)$.

For the corn hybrid experiment of Cochran and Cox (1957, p. 448) the inter-block
analysis is carried out with the following proc mixed statements.


proc mixed data=cornyld;


class block hybrid;
model yield = hybrid;
random block;
lsmeans hybrid ;
estimate 'hybrid 1 broad ' intercept 1 hybrid 1;
estimate 'hybrid 1 narrow' intercept 13 hybrid 13 |
block 1 1 1 1 1 1 1 1 1 1 1 1 1 / divisor=13;
estimate 'hybrid 1 in block 1' intercept 1 hybrid 1 | block 1;
estimate 'hybrid 1 in block 2' intercept 1 hybrid 1 | block 0 1;
estimate 'hybrid 1 vs hybrid2' hybrid 1 -1;
run; quit;

The inter-block analysis is invoked by moving the block term from the model statement to the random statement. The estimate statements are not necessary unless one wants to estimate treatment means in the narrow or intermediate inference spaces (§7.3). The lsmeans statement requests block-adjusted estimates of the hybrid means in the broad inference space. In Output 7.17 we notice that the $F$ statistic for hybrid differences in the mixed analysis ($F_{obs} = 1.67$) has changed from the intra-block analysis in proc glm ($F_{obs} = 1.37$). This reflects the additional treatment information recovered by the inter-block analysis. Furthermore, the estimates of the treatment means have changed as compared to the least squares means reported by proc glm. The additional information recovered from block averages surfaces here again. In the example of §7.3 it was noted that the estimates of factor means would be the same if all factors had been considered fixed, and that only the standard errors would differ between the fixed effects and mixed effects analyses. That statement was correct there because the design was completely balanced and hence orthogonal. In a nonorthogonal incomplete block design, both the estimates of the treatment means and their standard errors differ between the fixed effects and mixed effects analyses.

Output 7.17. The Mixed Procedure

Model Information
Data Set WORK.CORNYLD
Dependent Variable yield
Covariance Structure Variance Components
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Containment

Class Level Information


Class Levels Values
block 13 1 2 3 4 5 6 7 8 9 10 11 12 13
hybrid 13 1 2 3 4 5 6 7 8 9 10 11 12 13

Dimensions
Covariance Parameters 2
Columns in X 14
Columns in Z 13
Subjects 1
Max Obs Per Subject 52
Observations Used 52
Observations Not Used 0
Total Observations 52

Covariance Parameter Estimates


Cov Parm Estimate
block 6.0527
Residual 19.9340


Output 7.17 (continued).


Fit Statistics
Res Log Likelihood -126.8
Akaike's Information Criterion -128.8
Schwarz's Bayesian Criterion -129.4
-2 Res Log Likelihood 253.6

Type 3 Tests of Fixed Effects


Num Den
Effect DF DF F Value Pr > F
hybrid 12 27 1.67 0.1293

Estimates
Standard
Label Estimate Error DF t Value Pr > |t|
hybrid 1 broad 34.1712 2.4447 27 13.98 <.0001
hybrid 1 narrow 34.1712 2.3475 27 14.56 <.0001
hybrid 1 in block 1 32.6735 2.9651 27 11.02 <.0001
hybrid 1 in block 2 31.2751 2.9651 27 10.55 <.0001
hybrid 1 in block 7 34.1635 2.6960 27 12.67 <.0001
hybrid 1 vs hybrid2 5.1305 3.3331 27 1.54 0.1354

Least Squares Means


Standard
Effect hybrid Estimate Error DF t Value Pr > |t|
hybrid 1 34.1712 2.4447 27 13.98 <.0001
hybrid 2 29.0406 2.4447 27 11.88 <.0001
hybrid 3 30.1079 2.4447 27 12.32 <.0001
hybrid 4 28.0758 2.4447 27 11.48 <.0001
hybrid 5 30.3429 2.4447 27 12.41 <.0001
hybrid 6 27.5917 2.4447 27 11.29 <.0001
hybrid 7 30.7568 2.4447 27 12.58 <.0001
hybrid 8 32.7523 2.4447 27 13.40 <.0001
hybrid 9 28.5556 2.4447 27 11.68 <.0001
hybrid 10 28.1005 2.4447 27 11.49 <.0001
hybrid 11 23.4680 2.4447 27 9.60 <.0001
hybrid 12 28.9860 2.4447 27 11.86 <.0001
hybrid 13 35.1756 2.4447 27 14.39 <.0001

The first two estimate statements produce the hybrid 1 estimate in the broad and narrow inference spaces and show that the lsmeans statement operates in the broad inference space. The third through fifth estimate statements show how to estimate the hybrid mean in a particular block. Notice that hybrid 1 did not appear in blocks 1 or 2 in the experiment but did appear in block 7. Nevertheless, we are able to predict how well the hybrid would have done in blocks 1 and 2, although this prediction is less precise than the prediction of the hybrid's performance in blocks where the hybrid was observed. In an intra-block analysis where blocks are fixed, it is not possible to differentiate a hybrid's performance by block.
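
The 'hybrid 1 in block 7' estimate in Output 7.17 follows the same pattern as the block 1 and block 2 statements shown above; presumably it was produced by a statement of the form

   estimate 'hybrid 1 in block 7' intercept 1 hybrid 1 | block 0 0 0 0 0 0 1;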

7.6.5 A Split-Strip-Plot Experiment for Soybean Yield


An experiment was conducted at the Tidewater Agricultural Research and Extension Center
in Suffolk, Virginia to investigate how soybean yield response depended on soybean cultivar,
row spacing, and plant population. The three factors and their levels considered in the experi-
ment were
• Cultivar (AG3601, AG3701, AG4601, AG4701)


• Plant Population (60, 120, 180, 240, 300 thousand per acre)
• Row Spacing (9 and 18 inches).

[Figure 7.18. Experimental layout for a single block in the soybean yield experiment. Cultivars (varieties) were assigned to the large experimental units (plots); row spacings were assigned to columns and population densities to rows, i.e., to perpendicular strips within the plots, so that the strip intersections form the experimental units for the row spacing × population combinations.] The experiment was brought to our attention and the data were made kindly available by Dr. David Holshouser, Tidewater Agricultural Research and Extension Center, Virginia Polytechnic Institute and State University. Used with permission.

Although the experiment was conducted in four site-years, we consider only a single site-year here. At each site four replications of the cultivars were arranged in a randomized block design. Because of technical limitations, it was decided to apply the row spacings and population densities in strips within a cultivar experimental unit (plot). It was determined at random which side (strip) of the plot received the 9-inch spacing. Then the population densities were assigned randomly to five strips running perpendicular to the row spacing strips. Figure 7.18 displays a schematic layout of one of the four blocks in the experiment.

The factors Row Spacing and Population Density are a split of the experimental unit to which a cultivar is assigned, but they are not arranged within it as a randomized 2 × 5 factorial. Considering the cultivar experimental units, Row Spacing and Population Density form a strip-plot (split-block) design with 16 blocks (replications). Each cultivar experimental unit serves as a replicate for the split-block design of the other two factors. We call this design a split-strip-plot design.
There are experimental units of four different sizes in this experiment; hence the linear model will contain four different experimental error sources of variability, associated with the plots, the columns, the rows, and their intersections. Before engaging in an analysis of data from a complex experiment such as this, it is helpful to develop the source of variability and degree of freedom decomposition. Correct specification of the programming statements can then be more easily checked. As is good practice for designs with a split, the whole-plot and sub-plot analyses of variance can be developed separately. On the whole-plot (Cultivar) level we have a simple randomized complete block design of four treatments in four blocks (Table 7.15). The sub-plot source and degree of freedom decomposition regards each experimental unit in the whole-plot design as a replicate. Hence, there are $ar - 1 = 15$ replicate degrees of freedom for the sub-plot analysis, which is a strip-plot (split-block) design.

Table 7.15. Whole-plot analysis of variance in soybean yield example

Source        df
Block         r − 1 = 3
Cultivar      a − 1 = 3
Error(1)      (r − 1)(a − 1) = 9
Total         ar − 1 = 15

Table 7.16. Sub-plot analysis of variance in soybean yield example

Source                    df
Replicate                 ar − 1 = 15
Row Spacing               b − 1 = 1
Error(2*)                 (ar − 1)(b − 1) = 15
Population                c − 1 = 4
Error(3*)                 (ar − 1)(c − 1) = 60
Row Sp. × Population      (b − 1)(c − 1) = 4
Error(4*)                 (ar − 1)(b − 1)(c − 1) = 60
Total                     arbc − 1 = 159

Upon combining the whole-plot and sub-plot analyses, the Replicate source in Table 7.16 is replaced with the whole-plot decomposition in Table 7.15. Furthermore, interactions between the whole-plot factor Cultivar and all sub-plot factors are added. The degrees of freedom for the interactions are removed from the corresponding sub-plot errors Error(2*) through Error(4*). The degrees of freedom for Error(2), for example, are obtained as

$$df_{\text{Error}(2^*)} - df_{\text{Cultivar} \times \text{Row Sp.}} = (ar - 1)(b - 1) - (a - 1)(b - 1) = (ar - a)(b - 1) = a(r - 1)(b - 1),$$

and similarly for the other sub-plot error terms. The linear model for this experiment has as
many terms as there are rows in Table 7.17. In two steps the model can be defined as
]3456 œ .345 € <6 € /36" € /346
# $
€ /356 %
€ /3456
[7.68]
.345 œ . € !3 € "4 € a!" b34 € #5 € a!# b35 € a"# b45 € a!"# b345 .

$\mu_{ijk}$ denotes the mean of the treatment combination of the $i$th cultivar ($i = 1, \ldots, 4$), $j$th row spacing ($j = 1, 2$), and $k$th population ($k = 1, \ldots, 5$). It is decomposed into a grand mean ($\mu$), main effects of Cultivar ($\alpha_i$), Row Spacing ($\beta_j$), and Population ($\gamma_k$), and their respective interactions. The first line of model [7.68] expresses the observation $Y_{ijkl}$ as a sum of the mean $\mu_{ijk}$ and various random components. $r_l$ is the random effect of the $l$th block (whole-plot replication), assumed $G(0, \sigma^2_r)$. $e^{(1)}_{il}$ is the experimental error on the whole-plot, assumed $G(0, \sigma^2_1)$; $e^{(2)}_{ijl}$ is the experimental error on a row spacing strip, assumed $G(0, \sigma^2_2)$; $e^{(3)}_{ikl}$ is the experimental error on a population density strip, assumed $G(0, \sigma^2_3)$; and finally, $e^{(4)}_{ijkl}$ is the experimental error on the intersection of perpendicular strips, assumed $G(0, \sigma^2_4)$ (see Figure 7.18). All random components are independent by virtue of the independent randomizations.

Table 7.17. Sources of variability and degrees of freedom in soybean yield example

Source                               df
Block                                r − 1 = 3
Cultivar                             a − 1 = 3
Error(1)                             (r − 1)(a − 1) = 9
Row Spacing                          b − 1 = 1
Cultivar × Row Sp.                   (a − 1)(b − 1) = 3
Error(2)                             a(r − 1)(b − 1) = 12
Population                           c − 1 = 4
Cultivar × Population                (a − 1)(c − 1) = 12
Error(3)                             a(r − 1)(c − 1) = 48
Row Sp. × Population                 (b − 1)(c − 1) = 4
Cultivar × Row Sp. × Population      (a − 1)(b − 1)(c − 1) = 12
Error(4)                             a(r − 1)(b − 1)(c − 1) = 48
Total                                arbc − 1 = 159

We consider the blocks random in this analysis for two reasons. First, we posit that the blocks are only a small subset of the possible conditions over which inferences are to be drawn. Second, the Block × Cultivar interaction serves as the experimental error term on the whole-plot level. How can this interaction be random if the Cultivar and Block factors are fixed? Some research workers adopt the viewpoint that this apparent inconsistency should not be of concern; a block*cultivar term would be used in the SAS® code only to generate the necessary error term. We do remind the reader, however, that treating Block × Cultivar interactions as random and blocks as fixed corresponds to choosing an intermediate inference space. As discussed in §7.3, this choice results in smaller standard errors (and $p$-values) compared to the broad inference space in which random effects are allowed to vary. Treating the blocks as random in the analysis could be viewed as a somewhat conservative approach.

In a three-factor experiment it is difficult to road-map the analysis from main effect and interaction tests to contrasts, multiple comparisons, slices, etc. Whether marginal mean comparisons are meaningful depends on which factors interact. The first step in the analysis is thus to produce tests of the main effects and interactions. The proc mixed statements


proc mixed data=soybeanyield;


/* rep = whole-plot replication variable */
/* tpop = target population density */
class rep cultivar tpop rowspace;
model yield = cultivar
rowspace rowspace*cultivar
tpop tpop*cultivar
rowspace*tpop rowspace*tpop*cultivar / ddfm=satterth;
random rep
rep*cultivar
rep*cultivar*rowspace
rep*cultivar*tpop;
run;

accomplish that (Output 7.18).


Notice that all random terms in model [7.68] are listed in the random statement and only fixed effects appear in the model statement of the procedure. Furthermore, the error term with the most subscripts ($e^{(4)}_{ijkl}$) and the constant term ($\mu$) do not need to be specified. Altogether there are eleven effects specified between the model and the random statements (compare to Table 7.17 and model [7.68]). In split-type designs, great care must be exercised to ensure that the inference is as accurate as possible. For example, in a regular split-plot design no exact test exists to compare whole-plot treatment means for a given level of the sub-plot treatment, even if the design is balanced. In split-block (strip-plot) designs where two factors A and B are applied perpendicular to each other, no exact test exists to compare two A means at the same level of B or two B means at the same level of A. In these cases we rely on approximate procedures to calculate test statistics, degrees of freedom, and $p$-values. We choose Satterthwaite's method (Satterthwaite 1946) for split-type designs throughout. If data are balanced and all treatment factors are fixed, the tests for main effects and interactions in split-plot and split-block models are exact and do not require further approximations. When data are unbalanced and/or some treatment factors are fixed while others are random, exact $F$-tests cannot necessarily be constructed and the Satterthwaite approximation again becomes important. In proc mixed this approximation is invoked with the ddfm=satterth option of the model statement. In the soybean trial, six yield observations were missing, reducing the total number of observations from 160 to 154. If no observations were missing, the Satterthwaite approximation would not be necessary to test main effects and interactions.

Output 7.18.
The Mixed Procedure

Model Information
Data Set WORK.SOYBEANYIELD
Dependent Variable YIELD
Covariance Structure Variance Components
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Satterthwaite

Class Level Information


Class Levels Values
REP 4 1 2 3 4
CULTIVAR 4 AG3601 AG3701 AG4601 AG4701
TPOP 5 60 120 180 240 300
ROWSPACE 2 9 18


Output 7.18 (continued).


Dimensions

Covariance Parameters 5
Columns in X 90
Columns in Z 132
Subjects 1
Max Obs Per Subject 160
Observations Used 154
Observations Not Used 6
Total Observations 160

Covariance Parameter Estimates

Cov Parm Estimate

REP 3.0368
REP*CULTIVAR 0.4524
REP*CULTIVAR*ROWSPACE 1.2442
REP*CULTIVAR*TPOP 2.4215
Residual 3.9276

Fit Statistics

Res Log Likelihood -302.6


Akaike's Information Criterion -307.6
Schwarz's Bayesian Criterion -306.1
-2 Res Log Likelihood 605.3

Type 3 Tests of Fixed Effects

Num Den
Effect DF DF F Value Pr > F

CULTIVAR 3 9.16 8.77 0.0047


ROWSPACE 1 11.3 3.72 0.0795
CULTIVAR*ROWSPACE 3 11.3 6.01 0.0108
TPOP 4 46.9 32.08 <.0001
CULTIVAR*TPOP 12 46.8 1.26 0.2749
TPOP*ROWSPACE 4 45.7 1.06 0.3870
CULTIVAR*TPOP*ROWSPAC 12 45.6 2.60 0.0100

The estimates of the variance components are shown in the Covariance Parameter Estimates table as $\hat{\sigma}^2_r = 3.037$, $\hat{\sigma}^2_1 = 0.452$, $\hat{\sigma}^2_2 = 1.244$, $\hat{\sigma}^2_3 = 2.421$, and $\hat{\sigma}^2_4 = 3.928$. The denominator degrees of freedom for the $F$ statistics in the Type 3 Tests of Fixed Effects table were adjusted by the Satterthwaite procedure because of the missing observations. For complete data we would have expected denominator degrees of freedom of 9, 12, 12, 48, 48, 48, and 48. At the 5% significance level the three-way interaction, the Population Density main effect, the Cultivar × Row Spacing interaction, and the Cultivar main effect are significant. Because of the significance of the three-way interaction, the two-way interactions that appear nonsignificant may be masked, and similarly for the Row Spacing main effect.
The next step in the analysis is to investigate the interaction pattern more closely. Because of the significance of the three-way interaction, we start there. Figure 7.19 shows the estimated three-way cell means (least squares means) for the Cultivar × Population × Row Spacing combinations. Since the factor Population Density is quantitative, trends of soybean yield in density are investigated later via regression contrasts. The Row Spacing effect for a given population density and cultivar combination and the Cultivar effect for a given density and row spacing can be assessed by slicing the three-way interaction. To this end, add the statement
lsmeans cultivar*rowspace*tpop / slice=(cultivar*tpop tpop*rowspace);

to the proc mixed code (Output 7.19). The first block of tests in the Tests of Effect Slices table compares 9-inch vs. 18-inch row spacing for cultivar AG3601 at the various population densities, the second block does the same for cultivar AG3701, and so forth. The last block compares cultivars for a given combination of population density and row spacing.
[Figure 7.19. Three-way least squares means for factors Cultivar, Row Spacing, and Population Density. Four panels (AG3601, AG3701, AG4601, AG4701) plot yield (bushels/acre) against target population (in thousand/acre); the solid line refers to 9-inch spacing, the dashed line to 18-inch spacing.]

For cultivar AG3601 it is striking that there is no spacing effect at 60,000 plants per acre ($p = 0.8804$), but there are significant spacing effects at all greater population densities. This effect is also visible in Figure 7.19. For the other cultivars the row spacing effects are absent, with two exceptions: AG4601 and AG4701 at 120,000 plants per acre ($p = 0.0030$ and $0.0459$, respectively). The last block of tests reveals that at 9-inch spacing there are significant differences among the cultivars at every population density (e.g., $p = 0.0198$ at 60,000 plants per acre). For the wider row spacing, cultivar effects are mostly absent.

Output 7.19.
Tests of Effect Slices

Num Den
Effect CULTIVAR TPOP ROWSPACE DF DF F Value Pr > F

CULTIVA*TPOP*ROWSPAC AG3601 60 1 54.4 0.02 0.8804


CULTIVA*TPOP*ROWSPAC AG3601 120 1 45.7 7.66 0.0081
CULTIVA*TPOP*ROWSPAC AG3601 180 1 45.7 11.59 0.0014
CULTIVA*TPOP*ROWSPAC AG3601 240 1 51.6 13.19 0.0006
CULTIVA*TPOP*ROWSPAC AG3601 300 1 45.7 14.51 0.0004
CULTIVA*TPOP*ROWSPAC AG3701 60 1 45.7 0.23 0.6321
CULTIVA*TPOP*ROWSPAC AG3701 120 1 54.4 0.04 0.8426
CULTIVA*TPOP*ROWSPAC AG3701 180 1 51.6 0.85 0.3622
CULTIVA*TPOP*ROWSPAC AG3701 240 1 45.7 1.12 0.2960
CULTIVA*TPOP*ROWSPAC AG3701 300 1 45.7 0.58 0.4501

© 2003 by CRC Press LLC


500 Chapter 7  Linear Mixed Models

Output 7.19 (continued).


CULTIVA*TPOP*ROWSPAC AG4601 60 1 45.7 0.76 0.3885
CULTIVA*TPOP*ROWSPAC AG4601 120 1 45.7 9.86 0.0030
CULTIVA*TPOP*ROWSPAC AG4601 180 1 45.7 0.30 0.5890
CULTIVA*TPOP*ROWSPAC AG4601 240 1 45.7 2.00 0.1639
CULTIVA*TPOP*ROWSPAC AG4601 300 1 45.7 0.18 0.6766
CULTIVA*TPOP*ROWSPAC AG4701 60 1 45.7 0.05 0.8166
CULTIVA*TPOP*ROWSPAC AG4701 120 1 45.7 4.21 0.0459
CULTIVA*TPOP*ROWSPAC AG4701 180 1 45.7 0.60 0.4410
CULTIVA*TPOP*ROWSPAC AG4701 240 1 45.7 0.04 0.8407
CULTIVA*TPOP*ROWSPAC AG4701 300 1 45.7 1.55 0.2199
CULTIVA*TPOP*ROWSPAC 60 9 3 76.5 3.49 0.0198
CULTIVA*TPOP*ROWSPAC 120 9 3 76.5 5.54 0.0017
CULTIVA*TPOP*ROWSPAC 180 9 3 79.2 7.19 0.0003
CULTIVA*TPOP*ROWSPAC 240 9 3 79.1 9.00 <.0001
CULTIVA*TPOP*ROWSPAC 300 9 3 76.5 4.23 0.0081
CULTIVA*TPOP*ROWSPAC 60 18 3 80.1 3.68 0.0154
CULTIVA*TPOP*ROWSPAC 120 18 3 80.3 1.39 0.2530
CULTIVA*TPOP*ROWSPAC 180 18 3 79.2 2.39 0.0750
CULTIVA*TPOP*ROWSPAC 240 18 3 79.1 0.34 0.7982
CULTIVA*TPOP*ROWSPAC 300 18 3 76.5 0.68 0.5660

To investigate the nature of the yield dependency on Population Density, the information
in the table of slices is very helpful. It suggests that for cultivars AG3701, AG4601, and
AG4701 it is not necessary to distinguish trends among row spacings. Determining the nature
of the trend averaged across row spacings for these cultivars is sufficient. The levels of the
factor Population Density are equally spaced and we use the standard orthogonal polynomial
coefficients to test for linear, quadratic, cubic, and quartic trends. The following twelve
contrast statements are added to the proc mixed code to test trends for AG3701, AG4601,
and AG4701 across row spacings (Output 7.20).
contrast 'AG3701 quartic  ' tpop  1 -4  6 -4  1
         cultivar*tpop 0 0 0 0 0  1 -4  6 -4  1;
contrast 'AG3701 cubic    ' tpop -1  2  0 -2  1
         cultivar*tpop 0 0 0 0 0 -1  2  0 -2  1;
contrast 'AG3701 quadratic' tpop  2 -1 -2 -1  2
         cultivar*tpop 0 0 0 0 0  2 -1 -2 -1  2;
contrast 'AG3701 linear   ' tpop -2 -1  0  1  2
         cultivar*tpop 0 0 0 0 0 -2 -1  0  1  2;

contrast 'AG4601 quartic  ' tpop  1 -4  6 -4  1
         cultivar*tpop 0 0 0 0 0 0 0 0 0 0  1 -4  6 -4  1;
contrast 'AG4601 cubic    ' tpop -1  2  0 -2  1
         cultivar*tpop 0 0 0 0 0 0 0 0 0 0 -1  2  0 -2  1;
contrast 'AG4601 quadratic' tpop  2 -1 -2 -1  2
         cultivar*tpop 0 0 0 0 0 0 0 0 0 0  2 -1 -2 -1  2;
contrast 'AG4601 linear   ' tpop -2 -1  0  1  2
         cultivar*tpop 0 0 0 0 0 0 0 0 0 0 -2 -1  0  1  2;

contrast 'AG4701 quartic  ' tpop  1 -4  6 -4  1
         cultivar*tpop 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  1 -4  6 -4  1;
contrast 'AG4701 cubic    ' tpop -1  2  0 -2  1
         cultivar*tpop 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1  2  0 -2  1;
contrast 'AG4701 quadratic' tpop  2 -1 -2 -1  2
         cultivar*tpop 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  2 -1 -2 -1  2;
contrast 'AG4701 linear   ' tpop -2 -1  0  1  2
         cultivar*tpop 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -2 -1  0  1  2;
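As an aside, these coefficients need not be copied from a table of orthogonal polynomials. If SAS/IML is available, the orpol function generates them; a minimal sketch (orpol returns orthonormal columns, which are proportional to the integer-valued coefficients used in the contrast statements above):

proc iml;
   /* equally spaced target populations (in thousands per acre) */
   levels = {60, 120, 180, 240, 300};
   /* columns: constant, linear, quadratic, cubic, quartic */
   coef = orpol(levels, 4);
   print coef[format=8.4];
quit;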

Output 7.20.
Contrasts

Num Den
Label DF DF F Value Pr > F

AG3701 quartic 1 49 1.14 0.2899


AG3701 cubic 1 48.9 0.02 0.9004
AG3701 quadratic 1 46.9 0.18 0.6710
AG3701 linear 1 46.6 27.53 <.0001

AG4601 quartic 1 45.8 0.71 0.4050


AG4601 cubic 1 45.8 4.08 0.0493
AG4601 quadratic 1 45.8 0.55 0.4610
AG4601 linear 1 45.8 34.46 <.0001

AG4701 quartic 1 45.8 1.13 0.2933


AG4701 cubic 1 45.8 4.75 0.0346
AG4701 quadratic 1 45.8 4.61 0.0371
AG4701 linear 1 45.8 42.61 <.0001

Yield is a linearly increasing function of population density for AG3701. Cultivar
AG4601 shows a slight cubic effect in addition to a linear term and AG4701 shows poly-
nomial terms up to the third order. To model the yield response (Y) as a function of popu-
lation (x), one would thus choose the models

    AG3701:  Y = β₀ + β₁x + e
    AG4601:  Y = β₀ + β₁x + β₂x³ + e
    AG4701:  Y = β₀ + β₁x + β₂x² + β₃x³ + e.

The contrast statements to discern the row-spacing specific trends for cultivar AG3601
are more involved (Output 7.21):
contrast "AG3601 quart., 9inch" tpop 1 -4 6 -4 1
cultivar*tpop 1 -4 6 -4 1
tpop*rowspace 1 0 -4 0 6 0 -4 0 1 0
cultivar*tpop*rowspace 1 0 -4 0 6 0 -4 0 1 0;

contrast "AG3601 cubic , 9inch" tpop -1 2 0 -2 1


cultivar*tpop -1 2 0 -2 1
tpop*rowspace -1 0 2 0 0 0 -2 0 1 0
cultivar*tpop*rowspace -1 0 2 0 0 0 -2 0 1 0;

contrast "AG3601 quadr., 9inch" tpop 2 -1 -2 -1 2


cultivar*tpop 2 -1 -2 -1 2
tpop*rowspace 2 0 -1 0 -2 0 -1 0 2 0
cultivar*tpop*rowspace 2 0 -1 0 -2 0 -1 0 2 0;

contrast "AG3601 linear, 9inch" tpop -2 -1 0 1 2


cultivar*tpop -2 -1 0 1 2

© 2003 by CRC Press LLC


502 Chapter 7  Linear Mixed Models

tpop*rowspace -2 0 -1 0 0 0 1 0 2 0
cultivar*tpop*rowspace -2 0 -1 0 0 0 1 0 2 0;

contrast "AG3601 quart.,18inch" tpop 1 -4 6 -4 1


cultivar*tpop 1 -4 6 -4 1
tpop*rowspace 0 1 0 -4 0 6 0 -4 0 1
cultivar*tpop*rowspace 0 1 0 -4 0 6 0 -4 0 1;

contrast "AG3601 cubic ,18inch" tpop -1 2 0 -2 1


cultivar*tpop -1 2 0 -2 1
tpop*rowspace 0 -1 0 2 0 0 0 -2 0 1
cultivar*tpop*rowspace 0 -1 0 2 0 0 0 -2 0 1;

contrast "AG3601 quadr.,18inch" tpop 2 -1 -2 -1 2


cultivar*tpop 2 -1 -2 -1 2
tpop*rowspace 0 2 0 -1 0 -2 0 -1 0 2
cultivar*tpop*rowspace 0 2 0 -1 0 -2 0 -1 0 2;

contrast "AG3601 linear,18inch" tpop -2 -1 0 1 2


cultivar*tpop -2 -1 0 1 2
tpop*rowspace 0 -2 0 -1 0 0 0 1 0 2
cultivar*tpop*rowspace 0 -2 0 -1 0 0 0 1 0 2;

For this cultivar at 9-inch row spacing, yield depends on population density in quadratic
and linear fashion. At 18-inch row spacing, yield is not responsive to changes in the
population density. The slight yield increase at 18-inch spacing (Figure 7.19) is evident in the
marginally significant linear trend (p = 0.0526).

Output 7.21.
Contrasts

Num Den
Label DF DF F Value Pr > F

AG3601 quart., 9inch 1 80.3 0.01 0.9037


AG3601 cubic , 9inch 1 80.9 0.42 0.5196
AG3601 quadr., 9inch 1 79.6 5.14 0.0261
AG3601 linear, 9inch 1 79.7 30.40 <.0001

AG3601 quart.,18inch 1 80.8 0.12 0.7350


AG3601 cubic ,18inch 1 81.6 0.97 0.3271
AG3601 quadr.,18inch 1 84.2 0.10 0.7519
AG3601 linear,18inch 1 85.9 3.86 0.0526

This analysis of the three-way interaction leads to the overall conclusion that only for
cultivar AG3601 is row spacing of importance for a given population density. Does this con-
clusion prevail when yields are averaged across different population densities? A look at the
significant Cultivar × Rowspace interaction confirms this. Figure 7.20 shows the correspond-
ing two-way least squares means and the p-values from slicing this interaction by cultivar
(bottom margin) and by spacing (right margin). Significant differences exist among varieties
for 9-inch spacing (p < 0.0001) but not for 18-inch spacing (p = 0.4204). Averaged across
the population densities, only variety AG3601 shows a significant yield difference between the
two row spacings (p = 0.0008). The p-values in Figure 7.20 were obtained with the statement
(Output 7.22)


[Figure 7.20: Yield (bushels/acre) least squares means for 9-inch and 18-inch row spacing by variety (AG3601, AG3701, AG4601, AG4701), with slice p-values shown in the margins.]

Figure 7.20. Two-way Cultivar × Row Spacing least squares means. p-values from slices of
the two-way interaction are shown in the margins.

Output 7.22. Tests of Effect Slices

Num Den
Effect CULTIVAR TPOP ROWSPACE DF DF F Value Pr > F
CULTIVAR*ROWSPACE AG3601 1 11.8 19.80 0.0008
CULTIVAR*ROWSPACE AG3701 1 11.8 0.03 0.8759
CULTIVAR*ROWSPACE AG4601 1 10.8 1.29 0.2804
CULTIVAR*ROWSPACE AG4701 1 10.8 0.42 0.5323
CULTIVAR*ROWSPACE 9 3 17.3 14.79 <.0001
CULTIVAR*ROWSPACE 18 3 17.7 0.99 0.4204

Finally, we can ask which varieties differ significantly from each other at
9-inch row spacing. The previous slice shows that the question is not of interest at 18-inch
spacing. The statement

compares all Cultivar × Row Spacing combinations in pairs and produces many comparisons
that are not of interest. The trimmed output that follows shows only those comparisons where
factor spacing was held fixed at 9 inches. Cultivar AG3601 is significantly higher yielding
(Estimates are positive) than any of the other cultivars.

Output 7.23. Differences of Least Squares Means

CULTIVAR ROWSPACE _CULTIVAR _ROWSPACE Estimate Pr > |t|


AG3601 9 AG3701 9 6.7783 <.0001
AG3601 9 AG4601 9 7.3617 <.0001
AG3601 9 AG4701 9 5.3167 0.0004
AG3701 9 AG4601 9 0.5834 0.6406
AG3701 9 AG4701 9 -1.4616 0.2500
AG4601 9 AG4701 9 -2.0450 0.1117
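The p-values in Output 7.23 are unadjusted for multiplicity. If protection of the familywise error rate across the many pairwise comparisons is desired, proc mixed can adjust the least squares means comparisons directly; one possible variant:

lsmeans cultivar*rowspace / diff adjust=tukey;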


7.6.6 Repeated Measures in a Completely Randomized Design


Water was leached through soil columns to examine secondary minerals formed from
weathering, in particular to observe changes in the weathering products of biotite. Eighteen
columns were filled with silt- and clay-sized biotite. 2.5 grams of surface material were added
to each column. The surface material was collected from the A horizon of spruce and
hardwood forests or consisted of washed quartz sand. The columns were kept at a constant
temperature of 4°C and three times per week were treated with either 5 or 20 ml of water,
simulating two different rainfall rates. Three replicate columns were available for each Rain-
fall Rate × Surface Material combination and arranged on a rack in a completely randomized
design. At days 5, 21, 40, and 57, water leaching through the columns was sampled. This is a
repeated measures study with 4 repeated observations. The basic experimental design is a
completely randomized design with a 2 × 3 factorial treatment structure of factors Rainfall
Rate (5 ml, 20 ml) and Surface Material (spruce, hardwood, sand). Table 7.18 shows the pH
values of the leachate at the four sampling dates for the eighteen experimental units and
Figure 7.21 shows the estimated means over time for the six treatments (obtained from the
analysis that follows).

Table 7.18. Repeated measures leachate data for 2 × 3 factorial in a CRD†

                  Rainfall Rate 5 ml            Rainfall Rate 20 ml
 Day  Rep     Sand   Spruce  Hardwood       Sand   Spruce  Hardwood
  5    1      6.08    5.76     6.51         6.40    5.69     6.75
       2      6.50    5.18     6.45         6.49    4.97     6.99
       3      6.54    5.52     6.60         6.57    5.29     6.94
 21    1      5.26    5.80     6.00         6.05    5.96     6.00
       2      6.24    5.57     6.08         6.16    5.54     6.36
       3      6.02    5.40     6.00         5.96    5.45     6.39
 40    1      5.86    6.10     6.34         6.51    6.35     6.76
       2      6.40    6.23     6.16         6.51    6.17     6.66
       3      6.06    5.13     6.02         6.48    5.97     6.56
 57    1      6.07    6.24     6.38         6.65    5.93     6.11
       2      6.38    5.74     6.03         6.35    5.84     6.60
       3      6.50    5.39     5.99         6.89    6.09     6.54

† Data kindly provided by Dr. Lucian W. Zelazny and Mr. Ryan Reed, Department of Crop and Soil
Environmental Sciences, Virginia Polytechnic Institute and State University (see Reed 2000).
Used with permission.

The basic model for the analysis of these data comprises fixed effects for the Surface and
Rainfall Rate effects, temporal effects, and all possible two-way interactions and one three-
way interaction. Random effects are associated with the replicates, stemming from the random
assignment of treatments to the columns, and with within-column disturbances over time. The
model can be expressed as

    Y_ijkl = μ + α_i + β_j + (αβ)_ij + e_ijk + τ_l + (ατ)_il + (βτ)_jl + (αβτ)_ijl + d_ijkl    [7.69]

where
    α_i is the effect of the ith surface type (i = 1, ..., 3)
    β_j is the effect of the jth rainfall rate (j = 1, 2)
    e_ijk is the experimental error associated with replicate (column) k (k = 1, ..., 3),
        assumed independent Gaussian with mean 0 and variance σ²_e
    τ_l is the effect of the lth time point (l = 1, ..., 4)
    d_ijkl is a random disturbance associated with the lth repeated measurement on the kth
        replicate.

The remaining terms in model [7.69] denote interactions between the various factors in
obvious fashion.

[Figure 7.21: estimated mean pH vs. day of measurement (5, 21, 40, 57); panels by Leachate (rainfall rate) 1 and 2 and surface (Hardwood, Sand, Spruce).]
Figure 7.21. Leachate × Surface means by day of measurement.

If this basic model is accepted, the first question to address is that of a possible
correlation model for the d_ijkl from the same column. We fit seven dif-
ferent correlation models and compare their respective AIC and Schwarz criteria (Table
7.19). These criteria point to several correlation models. The AR(1), exponential, gaussian,
and compound symmetric models have similar fit statistics. The AIC criterion is calculated as
the residual log likelihood minus the number of covariance parameters (“larger is better”
version). The unstructured models appear to fit the data well as judged by the negative of the
residual log likelihood (which we try to minimize). Their large number of covariance parame-
ters carries a hefty penalty in the determination of AIC, however.

The power model [7.56] is a reparameterization of the exponential model [7.54]. Hence,
their fit statistics are identical. The AR(1) and the exponential model are identical if the data
are equally spaced. For the data considered here, the measurement intervals of 16, 19, and 17
days are almost identical, which explains the small difference in AIC between the two models.
The ante-dependence model is also penalized substantially because of its many covariance
parameters. Should we ignore the possibility of serial correlation and continue with a com-
pound symmetry model, akin to a split-plot design where the whole-plot factor is a 2 × 3 fac-
torial? Fortunately, the power and compound symmetry models are nested. A test of
H₀: ρ = 0 in the power model can be carried out as a likelihood-ratio test. The test statistic is
LR = 24.0 − 21.6 = 2.4, and the probability that a Chi-squared random variable with one de-
gree of freedom exceeds 2.4 is Pr(χ²₁ ≥ 2.4) = 0.121. At the 5% level we cannot reject the
hypothesis. To continue with a compound symmetry model implies acceptance of
H₀, and in our opinion the p-value is not large enough to rule out the possibility of a Type II
error. We are not sold on independence of the repeated measurements. For the gaussian
model [7.57] the likelihood ratio statistic would be even greater, but the two models are not
directly nested. The gaussian model approaches compound symmetry as the parameter φ
approaches 0; at exactly 0 the model is no longer defined. Table 7.19 suggests that the
gaussian model is the model of choice for these data. Because of its high degree of regularity
at short lag distances, we do not recommend it in general for most repeated measures data
(see comments in §7.5.2). Since it outperforms the other parsimonious models, we use it here.

Table 7.19. Akaike's information criterion and Schwarz' criterion for various covariance
models (the last column contains the number of covariance parameters)

Covariance Model           AIC    Schwarz   −2 Res. Log Likelihood   Parameters
Compound Symmetry        −14.0     −14.9            24.0                  2
Unstructured             −18.0     −22.9            13.9                 11
Unstructured(2)          −15.9     −19.4            15.7                  8
Exponential, [7.54]      −13.8     −15.1            21.6                  3
Power, [7.56]            −13.8     −15.1            21.6                  3
Gaussian, [7.57]         −13.2     −14.5            20.3                  3
AR(1), [7.52]            −13.9     −15.2            21.8                  3
Antedependence, [7.59]   −17.5     −21.0            19.0                  8

The unstructured model led to a second derivative matrix of the log likelihood function
which was not positive definite. It is not considered further for these data.

The proc mixed code to analyze this repeated measures design follows. The model state-
ment contains the main effects and interactions of the three factors Rainfall Rate, Surface Ma-
terial, and Time. The random statement identifies the experimental errors e_ijk and the
repeated statement instructs proc mixed to treat the observations from the same replicate as
correlated according to the gaussian covariance model. The r=1 and rcorr=1 options of the
repeated statement request a printout of the R matrix and the correlation matrix for the first
subject in the data set. The initial sorting of the data set is good practice to ensure that the ob-
servations are ordered by increasing time of measurement within each experimental unit. The
variable t created in the data step is used in the type=sp(gau)(t) option to represent temporal
distance between repeated observations on the same experimental unit. It measures time in
weeks since the initial measurement (the variable time is recorded in days).

proc sort data=Leachate; by surface leachate rep time; run;
data Leachate; set Leachate; t = (time - 5)/7; run;

proc mixed data=Leachate noitprint;
   class rep surface rainfall time;
   model ph = surface rainfall surface*rainfall
              time time*surface time*rainfall time*surface*rainfall;
   random rep(surface*rainfall);
   repeated / subject=rep(surface*rainfall) type=sp(gau)(t) r=1 rcorr=1;
run;

The restricted maximum likelihood algorithm converged to the covariance parameter
estimates σ̂²_e = 0.01997, σ̂²_d = 0.05029, and φ̂ = 2.5870 (Output 7.24).

Output 7.24.
The Mixed Procedure

Model Information
Data Set WORK.LEACHATE
Dependent Variable PH
Covariance Structures Variance Components, Spatial Gaussian
Subject Effect REP(SURFACE*RAINFAL)
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Containment

Class Level Information


Class Levels Values
REP 3 1 2 3
SURFACE 3 hardwd sand spruce
RAINFALL 2 1 2
TIME 4 5 21 40 57

Dimensions
Covariance Parameters 3
Columns in X 60
Columns in Z 18
Subjects 1
Max Obs Per Subject 72
Observations Used 72
Observations Not Used 0
Total Observations 72

Estimated R Matrix for Subject 1


Row Col1 Col2 Col3 Col4
1 0.05029 0.02304 0.001200 0.000013
2 0.02304 0.05029 0.01673 0.000967
3 0.001200 0.01673 0.05029 0.02083
4 0.000013 0.000967 0.02083 0.05029

Estimated R Correlation Matrix for Subject 1


Row Col1 Col2 Col3 Col4
1 1.0000 0.4581 0.02386 0.000263
2 0.4581 1.0000 0.3326 0.01922
3 0.02386 0.3326 1.0000 0.4143
4 0.000263 0.01922 0.4143 1.0000

Covariance Parameter Estimates


Cov Parm Subject Estimate
REP(SURFACE*RAINFAL) 0.01997
SP(GAU) REP(SURFACE*RAINFAL) 2.5870
Residual 0.05029
Fit Statistics
Res Log Likelihood -10.2
Akaike's Information Criterion -13.2
Schwarz's Bayesian Criterion -14.5
-2 Res Log Likelihood 20.3

Output 7.24 (continued).


Type 3 Tests of Fixed Effects

Num Den
Effect DF DF F Value Pr > F

SURFACE 2 12 19.13 0.0002


RAINFALL 1 12 6.09 0.0296
SURFACE*RAINFALL 2 12 0.56 0.5875
TIME 3 36 15.57 <.0001
SURFACE*TIME 6 36 8.84 <.0001
RAINFALL*TIME 3 36 1.79 0.1665
SURFACE*RAINFAL*TIME 6 36 0.48 0.8203

The covariance between the first and second remeasurement disturbances
(t = (21 − 5)/7 = 2.285 weeks) is estimated as

    Cov[d_ijk1, d_ijk2] = 0.05029 × exp{−2.285²/2.587²} = 0.02305

and the correlation is exp{−2.285²/2.587²} = 0.458. These values can be found in the first
row, second column of the Estimated R Matrix for Subject 1 and the Estimated R
Correlation Matrix for Subject 1. The continuity of the gaussian covariance model near
the origin lets correlations decay rather quickly. The correlation between the first and third
measurement (t = 5.0 weeks) is only 0.0238. The estimated practical range of the correlation
process is √3 × φ̂ = 4.48 weeks. After four and a half weeks, leachate pHs from the same
soil column are essentially uncorrelated.
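These quantities can be verified from the covariance parameter estimates in Output 7.24; a sketch:

data _null_;
   sigma2 = 0.05029;                          /* residual (within-column) variance */
   phi    = 2.5870;                           /* SP(GAU) parameter estimate        */
   lag    = (21 - 5)/7;                       /* weeks between days 5 and 21       */
   cov    = sigma2 * exp(-(lag**2)/(phi**2)); /* gaussian covariance model         */
   corr   = cov / sigma2;
   range  = sqrt(3) * phi;                    /* practical range in weeks          */
   put cov= 8.5 corr= 6.4 range= 5.2;
run;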
The variance component for the experimental error is rather small. If the fixed effects
part of the model contains numerous effects and the within-cluster correlations are also
modeled, one will frequently find that the algorithm fails to provide estimates for some or all
effects in the random statement. The likelihood solutions for the covariance parameters of
these effects are outside the permissible range. After dropping the random statement (or indi-
vidual effects in the random statement) and retaining the repeated statement, the algorithm
then often converges successfully. While this is a reasonable approach, one should be
cautioned that it effectively alters the statistical model being fit. Dropping the random state-
ment here corresponds to the assumption that σ²_e ≡ 0, i.e., there is no experimental error
variability associated with the experimental unit and all stochastic variation arises from the
within-cluster process.
For some covariance models, combining a random and repeated statement is impossible,
since the effects are confounded. For example, compound symmetry is implied by two nested
random effects. In the absence of a repeated statement a residual error is always added to the
model and hence the statements
proc mixed data=whatever;
class ...;
model y = ... ;
random rep(surface*rainfall);
run;

and
proc mixed data=whatever;
class ...;
model y = ... ;
repeated / subject=rep(surface*rainfall) type=cs;
run;

will fit the same model. Combining the random and repeated statement with type=cs will
lead to aliasing of one of the variance components.
Returning to the example at hand, we glean from the table of Type 3 Tests of Fixed
Effects in Output 7.24 that the Surface × Time interaction, the Time main effect, the
Rainfall main effect, and the Surface main effect are significant at the 5% level.

[Figure 7.22: estimated mean pH vs. day of measurement (5, 21, 40, 57) for the Hardwood, Sand, and Spruce surfaces.]

Figure 7.22. Surface least squares means by day of measurement. Estimates of the marginal
surface means are shown along the vertical axis with the same symbols as the trends over
time.

The next step is to investigate the significant effects, starting with the two-way interac-
tion. The presence of the interaction is not surprising considering the graph of the two-way
least squares means (Figure 7.22). After studying the interaction pattern we can also conduct
comparisons of the marginal surface means, since the trends over time do not criss-cross
wildly. The proc mixed code that follows requests least squares means and their differences
for the two rainfall levels (there will be only one difference), slices of the surface*time
interaction by either factor, and differences of the Surface × Time least squares means that are
shown in Figure 7.22.
ods listing close;
proc mixed data=Leachate;
   class rep surface rainfall time;
   model ph = surface rainfall surface*rainfall
              time time*surface time*rainfall time*surface*rainfall;
   random rep(surface*rainfall);
   repeated / subject=rep(surface*rainfall) type=sp(gau)(t) r=1 rcorr=1;
   lsmeans rainfall / diff;
   lsmeans surface*time / slice=(time surface) diff;
   ods output diffs=diffs;
   ods output slices=slices;
run;
ods listing;

title "Tests of Effect Slices";
proc print data=slices; run;

/* process least squares means differences */
data diffs; set diffs;
   if upcase(effect)="SURFACE*TIME" then do;
      if (surface ne _surface) and (time ne _time) then delete;
   end;
   LSD05 = tinv(0.975,df)*stderr;
   if probt > 0.05 then sig=' ';
   else if probt > 0.025 then sig = '*';
   else if probt > 0.01 then sig = '**';
   else sig = '***';
run;
proc sort data=diffs; by descending rainfall time _time surface _surface; run;

title "Least Squares Means Differences";
proc print data=diffs noobs;
   var surface rainfall time _surface _rainfall _time
       estimate LSD05 probt sig;
run;

The ods listing close statement prior to the proc mixed call suppresses printing of
procedural output to the screen. Instead, the output of interest (slices and least squares means
differences) is saved to data sets (diffs and slices) with the ods output statement. We do
this for two reasons: (1) parts of the proc mixed output, such as the Dimensions table and the
Type 3 Tests of Fixed Effects, have already been studied above; (2) the set of least
squares means differences for interactions contains many comparisons not of interest. For
example, a comparison of the spruce surface at day 5 vs. the sand surface at day 40 is hardly mean-
ingful (it is also not a simple effect!). The data step following the proc mixed code processes
the data set containing the least squares means differences. It deletes two-way mean com-
parisons that do not correspond to simple effects, calculates the least significant differences
(LSD) at the 5% level for all comparisons, and indicates the significance of the comparisons
at the 5% (*), 2.5% (**), and 1% (***) levels.
The slices show that at any time point there are significant differences in pH among the
three surfaces (Output 7.25). From Figure 7.22 we surmise that these differences are probably
not between the sand and hardwood surfaces, but between these surfaces and the spruce
surface. The least squares means differences confirm that. The first observation among the
least squares means differences compares the marginal Rainfall means; the next three compare
the marginal Surface means that are graphed along the vertical axis of Figure 7.22. Compari-
sons of the Surface × Time means start with the fifth observation. For example, the difference
between estimated means of hardwood and sand surfaces at day 5 is 0.2767 with an LSD of
0.3103. At any given time point hardwood and sand surfaces are not statistically different at
the 5% level, whereas the spruce surface leads to significantly higher acidity of the leachate.
Notice that at any point in time the least significant difference (LSD) to compare surfaces is
0.3103. The least significant differences to compare time points for a given surface depend on
the temporal separation, however. If the repeated measurements were uncorrelated, there
would be only one LSD to compare surfaces at a given time point and one LSD to compare
time points for a given surface.


Output 7.25.
Tests of Effect Slices
Num Den
Obs Effect SURFACE TIME DF DF FValue ProbF

1 SURFACE*TIME 5 2 36 40.38 <.0001


2 SURFACE*TIME 21 2 36 5.87 0.0062
3 SURFACE*TIME 40 2 36 4.21 0.0228
4 SURFACE*TIME 57 2 36 8.03 0.0013
5 SURFACE*TIME hardwd _ 3 36 15.13 <.0001
6 SURFACE*TIME sand _ 3 36 10.92 <.0001
7 SURFACE*TIME spruce _ 3 36 7.19 0.0007

Least Squares Means Differences

SURFACE  RAINFALL  TIME   _SURFACE  _RAINFALL  _TIME   Estimate    LSD05    Probt  sig

1 _ 2 _ -0.2339 0.20643 0.0296 *


hardwd _ _ sand _ _ 0.09542 0.25282 0.4269
hardwd _ _ spruce _ _ 0.6637 0.25282 <.0001 ***
sand _ _ spruce _ _ 0.5683 0.25282 0.0004 ***
hardwd _ 5 sand _ 5 0.2767 0.31038 0.0790
hardwd _ 5 spruce _ 5 1.3050 0.31038 <.0001 ***
sand _ 5 spruce _ 5 1.0283 0.31038 <.0001 ***
hardwd _ 5 hardwd _ 21 0.5683 0.19330 <.0001 ***
sand _ 5 sand _ 21 0.4817 0.19330 <.0001 ***
spruce _ 5 spruce _ 21 -0.2183 0.19330 0.0279 *
hardwd _ 5 hardwd _ 40 0.2900 0.25944 0.0295 *
sand _ 5 sand _ 40 0.1267 0.25944 0.3287
spruce _ 5 spruce _ 40 -0.5867 0.25944 <.0001 ***
hardwd _ 5 hardwd _ 57 0.4317 0.26256 0.0020 ***
sand _ 5 sand _ 57 -0.04333 0.26256 0.7398
spruce _ 5 spruce _ 57 -0.4700 0.26256 0.0009 ***

hardwd _ 21 sand _ 21 0.1900 0.31038 0.2225


hardwd _ 21 spruce _ 21 0.5183 0.31038 0.0017 ***
sand _ 21 spruce _ 21 0.3283 0.31038 0.0387 *
hardwd _ 21 hardwd _ 40 -0.2783 0.21452 0.0124 **
sand _ 21 sand _ 40 -0.3550 0.21452 0.0019 ***
spruce _ 21 spruce _ 40 -0.3683 0.21452 0.0013 ***
hardwd _ 21 hardwd _ 57 -0.1367 0.26005 0.2936
sand _ 21 sand _ 57 -0.5250 0.26005 0.0002 ***
spruce _ 21 spruce _ 57 -0.2517 0.26005 0.0574

hardwd _ 40 sand _ 40 0.1133 0.31038 0.4638


hardwd _ 40 spruce _ 40 0.4283 0.31038 0.0082 ***
sand _ 40 spruce _ 40 0.3150 0.31038 0.0469 *
hardwd _ 40 hardwd _ 57 0.1417 0.20097 0.1614
sand _ 40 sand _ 57 -0.1700 0.20097 0.0948
spruce _ 40 spruce _ 57 0.1167 0.20097 0.2468

hardwd _ 57 sand _ 57 -0.1983 0.31038 0.2032


hardwd _ 57 spruce _ 57 0.4033 0.31038 0.0123 **
sand _ 57 spruce _ 57 0.6017 0.31038 0.0004 ***


7.6.7 A Longitudinal Study of Water Usage in Horticultural Trees
In this application we fit mixed polynomial models to describe the change over time in water
usage during production of landscape trees. Four groups of trees, consisting of two age
classes and two species, were of interest to the investigators. Ten trees were randomly
selected for each Species × Age Class combination. We label the groups simply as Age 1, 2
and Species 1, 2. Over a period of approximately 4.5 months the water usage of the trees was
assessed regularly, creating a longitudinal time series for each tree. Although it was originally
intended to measure water usage every fifth day, slight variations in the measurement in-
tervals created an unequally spaced longitudinal sequence. For the four groups of trees the
water usage averaged across trees shows distinct trends (Figure 7.23). For species 1, water
usage throughout the growing season is quadratic or possibly cubic with a maximum at
approximately day 250. There seems to be little difference between older (Age class 2) and
younger trees (Age class 1) for species 1. For the second species the trends appear quadratic
but with very different maxima. Not only do young trees vary little in their water usage over
time, there is a sharp drop-off for older trees of species 2 at about day 250 (Figure 7.23).
[Figure 7.23: mean water usage vs. day of measurement (150 to 300) in four panels: Age 1/Species 1, Age 1/Species 2, Age 2/Species 1, Age 2/Species 2.]

Figure 7.23. Age- and species-specific averages for water usage data. The dashed line
represents an exploratory loess fit to suggest a parametric polynomial model. Data kindly
made available by Dr. Roger Harris and Dr. Robert Witmer, Department of Horticulture,
Virginia Polytechnic Institute and State University. Used with permission.

A model for the age- and species-specific trends over time based on Figure 7.23 could be
the following. Let Y_ijk denote the water usage at time t_k of the average tree in age group i
(i = 1, 2) for species j (j = 1, 2). Define a_i to be a binary regressor (dummy) variable taking
on value 1 if an observation is from age group 1 and 0 otherwise. Similarly, define s_j as a
dummy variable taking on value 1 for species 1 and value 0 for species 2. Then,

    E[Y_ijk] = β₀ + β₁a_i + β₂s_j + β₃a_i s_j + β₄t_k + β₅t_k² + β₆a_i t_k + β₇a_i t_k²
             + β₈s_j t_k + β₉s_j t_k² + β₁₀a_i s_j t_k + β₁₁a_i s_j t_k².    [7.70]

This model appears highly parameterized, but it really is not. It allows for separate linear and
quadratic trends among the four groups. The intercepts, linear, and quadratic slopes are
constructed from β₀ through β₁₁ according to Table 7.20.

Table 7.20. Intercepts and gradients for age and species groups in model [7.70]

Group              a_i  s_j  Intercept           Linear Gradient        Quadratic Gradient
Age 1, Species 1    1    1   β₀ + β₁ + β₂ + β₃   β₄ + β₆ + β₈ + β₁₀     β₅ + β₇ + β₉ + β₁₁
Age 1, Species 2    1    0   β₀ + β₁             β₄ + β₆                β₅ + β₇
Age 2, Species 1    0    1   β₀ + β₂             β₄ + β₈                β₅ + β₉
Age 2, Species 2    0    0   β₀                  β₄                     β₅

Model [7.70] is our notion of a population-average model that permits inference about
the effects of age, species, and their interaction. It does not accommodate the sequence of
measurements for individual trees, however. First, if the parameters of
model [7.70] were varied on a tree-by-tree basis, the total number of parameters in the mean function
alone would be 10 × 12 = 120, an unreasonable number. Second, not all of the trees will de-
viate significantly from the population average, and finding out which ones are different is a time-
consuming exercise in a model with that many parameters. The hierarchical, two-stage mixed
model idea comes to our rescue. Rather than modeling tree-to-tree variability within each
group through a large number of fixed effects, we allow one or more of the parameters in
model [7.70] to vary at random among trees. This approach is supported by the random selec-
tion of trees within each age class and for each species. The BLUPs of these random effects
can then be used to (i) assess whether a tree differs significantly in its water usage from the
group average and (ii) produce tree-specific predictions of water usage. To decide which of
the polynomial parameters to vary at random, we first focus on a single group, trees of
species 2 at age 2. The observed water usage over time for the 10 trees in this group is shown
in Figure 7.24. If we adopt a quadratic response model for the average tree in this group, we
can posit random coefficients for the intercept, the linear, and the quadratic gradient. It is un-
likely that the model will support all three parameters being random.
Either a random intercept, a random linear slope, or both seem possible. Nevertheless, we
commence modeling with the largest possible model

    Y_mk = (β₀ + b_m0) + (β₁ + b_m1)t_k + (β₂ + b_m2)t_k² + e_mk,    [7.71]

where m denotes the tree (m = 1, 2, ..., 10), k the time point, and b_m0 through b_m2 are random
effects/coefficients assumed independent of the model disturbances e_mk and independent of
each other. The variances of these random effects are denoted σ₀² through σ₂², respectively.
The SAS® proc mixed code to fit model [7.71] to the data from age group 2, species 2 is as
follows (Output 7.26).

data age2sp2; set wateruse(where=((age=2) and (species=2))); t=time/100; run;

proc mixed data=age2sp2 covtest noitprint;
   class treecnt;
   model wu = t t*t / s;
   random intercept t t*t / subject=treecnt;
run;

[Figure 7.24: observed water usage vs. day of measurement (160 to 280) for the ten trees of species 2 in age group 2, one panel per tree.]

Figure 7.24. Longitudinal observations for ten trees of species 2 in age group 2. The
trajectories suggest quadratic effects with randomly varying slope or intercept.

Output 7.26.

The Mixed Procedure

Model Information

Data Set WORK.AGE2SP2


Dependent Variable wu
Covariance Structure Variance Components
Subject Effect treecnt
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Containment

Class Level Information

Class Levels Values


treecnt 10 31 32 33 34 35 36 37 38 39 40

Dimensions

Covariance Parameters 4
Columns in X 3
Columns in Z Per Subject 3
Subjects 10
Max Obs Per Subject 25
Observations Used 239
Observations Not Used 0
Total Observations 239


Output 7.26 (continued).


Covariance Parameter Estimates

Standard Z
Cov Parm Subject Estimate Error Value Pr Z
Intercept treecnt 0.2691 0.1298 2.07 0.0190
t treecnt 0 . . .
t*t treecnt 0 . . .
Residual 0.1472 0.01382 10.65 <.0001

Fit Statistics
-2 Res Log Likelihood 262.5
AIC (smaller is better) 266.5
AICC (smaller is better) 266.6
BIC (smaller is better) 267.1

The time variable (measured in days since Jan 01) has been rescaled to allow estimates of
the fixed effects and BLUPs to be shown in the output with sufficient decimal places. The
estimates for the variance components σ₁² and σ₂² are zero (Output 7.26). This is an indication
that not all three random effects can be supported by these data, as was already anticipated.
The -2 Res Log Likelihood of 262.5 is the same value one would achieve in a model where
only the intercept varies at random. Fitting models with either σ₂² = 0 or σ₁² = 0 also leads to
zero estimates for the linear or quadratic variance component (apart from the intercept). We
interpret this as evidence that the data support only one of the parameters being random, not
that the intercept being random necessarily provides the best fit. Next, all models with a single
random effect are investigated, as well as the purely fixed effects model:

    Y_mk = (β₀ + b_m0) + β₁t_k + β₂t_k² + e_mk
    Y_mk = β₀ + (β₁ + b_m1)t_k + β₂t_k² + e_mk
    Y_mk = β₀ + β₁t_k + (β₂ + b_m2)t_k² + e_mk
    Y_mk = β₀ + β₁t_k + β₂t_k² + e_mk.

The -2 Res Log Likelihoods of these models are, respectively, 262.5, 280.5, 314.6, and
461.8. The last model is the fixed effects model without any random effects, and likelihood
ratio tests can be constructed to test the significance of any of the random components.

    H₀: σ₀² = 0    Λ = 461.8 − 262.5 = 199.3    p < 0.0001
    H₀: σ₁² = 0    Λ = 461.8 − 280.5 = 181.3    p < 0.0001
    H₀: σ₂² = 0    Λ = 461.8 − 314.6 = 147.2    p < 0.0001.

Incorporating any of the random effects provides a significant improvement in fit over a
purely fixed effects model. The largest improvement (smallest -2 Res Log Likelihood) is
obtained with a randomly varying intercept.
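The three likelihood ratio statistics and p-values can be computed in one short data step (a sketch; the χ²₁ reference distribution follows the text):

data _null_;
   fixed = 461.8;                     /* -2 Res Log Lik, no random effects       */
   do m2rll = 262.5, 280.5, 314.6;    /* random intercept, linear, quadratic fit */
      lambda = fixed - m2rll;
      p = 1 - probchi(lambda, 1);
      put lambda= 6.1 p= pvalue8.4;
   end;
run;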
So far it has been assumed that the model disturbances e_mk are uncorrelated. Since the
data are longitudinal in nature, it is conceivable that residual serial correlation remains even
after inclusion of one or more random effects. To check this possibility, we fit a model with
an exponential correlation structure (since the measurement times are not quite equally
spaced) via


proc mixed data=age2sp2 noitprint covtest;
   class treecnt;
   model wu = t t*t;
   random intercept / subject=treecnt s;
   repeated / subject=treecnt type=sp(exp)(time);
run;

Another drop in minus twice the residual log likelihood, from 262.5 to 251.3, is confirmed
in Output 7.27. The difference of Λ = 262.5 − 251.3 = 11.2 is significant with a p-value of
Pr(χ²₁ ≥ 11.2) = 0.0008. Adding an autoregressive correlation structure for the e_mk thus signifi-
cantly improved the model fit. It should be noted that the p-value from the likelihood ratio
test differs from the p-value of 0.0067 reported by proc mixed in the Covariance Parameter
Estimates table. The p-value reported there is for a Wald-type test statistic obtained by
comparing a Z statistic against an asymptotic standard Gaussian distribution. The test
statistic is simply the estimate of the covariance parameter divided by its standard error. Esti-
mates of variances and covariances are usually far from Gaussian-distributed, and the Wald-
type tests for covariance parameters produced by the covtest option of the proc mixed state-
ment are not very reliable. We prefer the likelihood ratio test whenever possible.

Output 7.27.
The Mixed Procedure

Model Information
Data Set WORK.AGE2SP2
Dependent Variable wu
Covariance Structures Variance Components,
Spatial Exponential
Subject Effects treecnt, treecnt
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Containment

Class Level Information


Class Levels Values
treecnt 10 1 2 3 4 5 6 7 8 9 10

Dimensions
Covariance Parameters 3
Columns in X 3
Columns in Z Per Subject 1
Subjects 10
Max Obs Per Subject 25
Observations Used 239
Observations Not Used 0
Total Observations 239

Covariance Parameter Estimates


Standard Z
Cov Parm Subject Estimate Error Value Pr Z
Intercept treecnt 0.2656 0.1301 2.04 0.0206
SP(EXP) treecnt 3.7945 0.8175 4.64 <.0001
Residual 0.1541 0.01624 9.49 <.0001

Fit Statistics
-2 Res Log Likelihood 251.3
AIC (smaller is better) 257.3
AICC (smaller is better) 257.4
BIC (smaller is better) 258.2


Output 7.27 (continued).


Solution for Fixed Effects
Standard
Effect Estimate Error DF t Value Pr > |t|
Intercept -11.2233 1.0989 9 -10.21 <.0001
t 12.7121 0.9794 227 12.98 <.0001
t*t -2.9137 0.2149 227 -13.56 <.0001

Solution for Random Effects


Std Err
Effect treecnt Estimate Pred DF t Value Pr > |t|
Intercept 1 0.1930 0.1877 227 1.03 0.3049
Intercept 2 0.3425 0.1877 227 1.82 0.0694
Intercept 3 -0.1988 0.1881 227 -1.06 0.2917
Intercept 4 0.4539 0.1918 227 2.37 0.0188
Intercept 5 -0.6414 0.1877 227 -3.42 0.0007
Intercept 6 0.3769 0.1877 227 2.01 0.0458
Intercept 7 0.8410 0.1877 227 4.48 <.0001
Intercept 8 -0.5528 0.1877 227 -2.95 0.0036
Intercept 9 -0.4453 0.1877 227 -2.37 0.0185
Intercept 10 -0.3689 0.1918 227 -1.92 0.0556

The /s options on the model and random statements of proc mixed yield printouts of the
fixed effects estimates and the BLUPs. We obtain β̂₀ = −11.22, β̂₁ = 12.71, and
β̂₂ = −2.914 for the fixed effects estimates. Water usage of the average tree in this group is
thus predicted as ŷ = −11.22 + 12.71t − 2.914t². The best linear unbiased predictors of
b = [b_10, b_20, ..., b_m0] are displayed as Solutions for Random Effects. For example, the tree-
specific prediction for tree #1 is

    ŷ = −11.22 + 0.1930 + 12.71t − 2.914t².

The BLUPs are significantly different from zero (at the 5% level) for trees 4, 5, 6, 7, 8, and 9.
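To draw curves such as those in Figure 7.25, the fixed effects estimates and a tree's BLUP can be combined in a data step. A sketch for tree #1 (estimates from Output 7.27; the data set and variable names here are ours):

data predict;
   b0 = -11.2233; b1 = 12.7121; b2 = -2.9137;   /* fixed effects estimates  */
   blup1 = 0.1930;                              /* intercept BLUP of tree 1 */
   do day = 150 to 300 by 5;
      t = day/100;
      pa    = b0 + b1*t + b2*t*t;   /* population-average prediction */
      tree1 = pa + blup1;           /* tree-specific prediction      */
      output;
   end;
run;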

[Figure 7.25: observed and predicted water usage vs. day of measurement (150 to 300) for the ten trees of species 2 at age 2, one panel per tree.]

Figure 7.25. Predictions from random intercept model with continuous AR(1) errors for
species 2 at age 2. Population-average fit shown as a dashed line, cluster-specific predictions
shown as solid lines.


The tree-specific trends show a noticeable deviation from the population-averaged pre-
diction for trees that have a significant BLUP (Figure 7.25). For any of these trees it is
obvious that the tree-specific prediction provides a much better fit to the data than the popu-
lation average.
The previous proc mixed runs and inferences were for one of the four groups only. Next
we need to combine the data across the age and species groups which entails adding a mixed
effects structure to model [7.70]. Based on what was learned from investigating the Age 2,
Species 2 group, it is tempting to add a random intercept and an autoregressive structure for
the within-tree disturbances and leave it at that. The mixed model analyst will quickly notice
when dealing with combined data sets that random effects that may not be estimable for a
subset of the data can successfully be estimated based on a larger set of data. In this appli-
cation it turned out that upon combining the data from the four groups not only a random
intercept, but also a random linear slope could be estimated. The statements
proc mixed data=wateruse noitprint;
class age species treecnt;
model wu = age*species age*species*t age*species*t*t / s;
estimate 'Intercpt Age1-Age2 = Age Main' age*species 1 1 -1 -1;
estimate 'Intercpt Sp1 -Sp2 = Sp. Main' age*species 1 -1 1 -1;
estimate 'Intercpt Age*Sp = Age*Sp. ' age*species 1 -1 -1 1;
random intercept t /subject=age*species*treecnt s;
repeated /subject=age*species*treecnt type=sp(exp)(time);
run;

fit the model with random intercept and random (linear) slope and an autoregressive correla-
tion structure for the within-tree errors (Output 7.28). The subject= options of the random
and repeated statements identify the units that are to be considered uncorrelated in the
analysis. Observations with different values of the variables age, species, and treecnt are
considered to be from different clusters and hence uncorrelated. Any set of observations with
the same values of these variables is considered correlated. The use of the age and species
variables in the class statement allows a more concise expression of the fixed effects part of
model [7.70]. The term age*species fits the four intercepts, the term age*species*t the four
linear slopes, and so forth. The first two estimate statements compare the intercept estimates
between ages 1 and 2, and species 1 and 2. These are inquiries into the Age or Species main
effect. The third estimate statement tests for the Age × Species interaction averaged across
time.
This model achieves a -2 Res Log Likelihood of 995.8. Removing the repeated state-
ment and treating the repeated observations on the same tree as uncorrelated, twice the
negative of the residual log likelihood becomes 1074.2. The likelihood ratio statistic
1074.2 − 995.8 = 78.4 indicates highly significant temporal correlation among the repeated
measurements. The estimate of the correlation parameter is φ̂ = 4.7217. Since the measure-
ment times are coded in days, this estimate implies that water usage exhibits temporal correla-
tions over 3φ̂ = 14.16 days. Although there are 953 observations in the data set, notice that
the test statistics for the fixed effects estimates are associated with 36 degrees of freedom, the
number of clusters (subjects) minus the number of estimated covariance parameters.
The tests for main effects and interactions at the end of the output suggest no differences
in the intercepts between ages 1 and 2, but differences in intercepts between the species.
Because of the significant interaction between Age Class and Species, we decide to retain all
four fixed effects intercepts in model [7.70]. Similar tests can be performed to determine
whether the linear and quadratic gradients differ among the species and ages.

Output 7.28. (abridged)


The Mixed Procedure

Model Information

Data Set WORK.WATERUSE


Dependent Variable wu
Covariance Structures Variance Components,
Spatial Exponential
Subject Effects age*species*treecnt,
age*species*treecnt
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Containment

Class Level Information

Class Levels Values

age 2 1 2
species 2 1 2
treecnt 40 1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33
34 35 36 37 38 39 40

Covariance Parameter Estimates

Cov Parm Subject Estimate


Intercept age*species*treecnt 0.1549
t age*species*treecnt 0.02785
SP(EXP) age*species*treecnt 4.7217
Residual 0.1600

Fit Statistics

-2 Res Log Likelihood 995.8


AIC (smaller is better) 1003.8
AICC (smaller is better) 1003.8
BIC (smaller is better) 1010.5

Solution for Fixed Effects

Standard
Effect age species Estimate Error DF t Value Pr > |t|

age*species 1 1 -14.5695 1.1780 36 -12.37 <.0001


age*species 1 2 -6.7496 1.1779 36 -5.73 <.0001
age*species 2 1 -12.3060 1.1771 36 -10.45 <.0001
age*species 2 2 -11.3629 1.1771 36 -9.65 <.0001

t*age*species 1 1 14.4083 1.0572 36 13.63 <.0001


t*age*species 1 2 7.9204 1.0571 36 7.49 <.0001
t*age*species 2 1 13.5380 1.0563 36 12.82 <.0001
t*age*species 2 2 12.8370 1.0563 36 12.15 <.0001

t*t*age*species 1 1 -2.9805 0.2316 36 -12.87 <.0001


t*t*age*species 1 2 -1.8449 0.2316 36 -7.97 <.0001
t*t*age*species 2 1 -2.8567 0.2314 36 -12.34 <.0001
t*t*age*species 2 2 -2.9403 0.2314 36 -12.71 <.0001


Output 7.28 (continued).


Solution for Random Effects

Std Err
Effect age species treecnt Estimate Pred DF t Value Pr > |t|
Intercept 1 1 1 -0.04041 0.2731 869 -0.15 0.8824
t 1 1 1 0.04677 0.1180 869 0.40 0.6921
Intercept 1 1 2 -0.03392 0.2738 869 -0.12 0.9014
t 1 1 2 0.07863 0.1181 869 0.67 0.5057
Intercept 1 1 3 0.08246 0.2731 869 0.30 0.7628
t 1 1 3 -0.05849 0.1180 869 -0.50 0.6204
Intercept 1 1 4 0.07395 0.2731 869 0.27 0.7866
t 1 1 4 -0.01493 0.1180 869 -0.13 0.8994
Intercept 1 1 5 -0.1568 0.2731 869 -0.57 0.5660
t 1 1 5 -0.07856 0.1180 869 -0.67 0.5059

(and so forth for all trees and Age ‚ Species combinations)

Estimates

Standard
Label Estimate Error DF t Value Pr > |t|
Intercpt Age1-Age2 = Age Main 2.3498 2.3551 36 1.00 0.3251
Intercpt Sp1 -Sp2 = Sp. Main -8.7630 2.3551 36 -3.72 0.0007
Intercpt Age*Sp = Age*Sp. -6.8769 2.3551 36 -2.92 0.0060

7.6.8 Cumulative Growth of Muskmelons in a Subsampling Design
To study the cumulative yield of muskmelons under various irrigation and mulch application
strategies, an experiment was conducted between 1997 and 1999 at the Tidewater Agricultural
Research and Extension Center, Virginia Polytechnic Institute and State University. On plots
of size 6 feet by 25 feet, melons were grown under the following four treatments.
• Nonirrigated without mulch or row cover;
• Drip/trickle irrigation without mulch or row cover;
• Drip/trickle irrigation with black plastic mulch;
• Drip/trickle irrigation with red plastic mulch.

There were four replicate plots for each treatment. In 1997, mature melons were harvested on
each plot at days 81, 84, 88, and 91. The average yield per plot was converted into yield per
hectare and added to the previous yield. Figure 7.26 shows the mean cumulative yields in
tons × ha⁻¹ for the four treatments in 1997. The cumulative yield increases over time since it
is obtained by adding positive yield figures. In the absence of mulching the cumulative
yields are considerably lower compared to the red or black mulch applications. There seems
to be little difference in yields between the two plastic mulch types.
These data display several interesting features. The experimental units are the plots to
which a particular treatment was applied. The mature melons harvested at a particular day
represent observational units. The number of matured melons varies from plot to plot and
hence the number of subsamples from which the yield per hectare is calculated differs. If the

variability of melon weights is homogeneous across all plots, the variability of observed
yields per hectare increases with the number of melons harvested. Also, the cumulative yields,
which are the focus of the analysis (Figure 7.26), are not independent, even if the muskmelons
matured independently on a given plot. The cumulative yield at day 88 is the sum of the
cumulative yield at day 84 and the yield observed at day 88. To build a statistical model for
these data, the two issues of subsampling and correlated responses must be kept in mind.

[Figure 7.26: cumulative yield (t ha⁻¹) vs. days after planting (80 to 90) for the treatments Not irrigated, Irrigated, Irrig. Black Mulch, and Irrig. Red Mulch.]

Figure 7.26. Average cumulative yields vs. days after planting in muskmelon study for 1997.
The data for this experiment were kindly provided by Dr. Norris L. Powell, Tidewater
Agricultural Research and Extension Center, Virginia Polytechnic Institute and State
University. Used with permission.

The investigators were particularly interested in the following questions and hypotheses:
(1) In the absence of mulch, is there a benefit of irrigation?
(2) Is there a difference in cumulative yields between red and black plastic mulch?
(3) Is there a benefit of mulching beyond irrigation?
(4) What is the mulch effect?
(5) Can mulching shorten the growing period, i.e., is the yield of mulched plots at the
    beginning of the observation period comparable to the yield of unmulched plots at the
    end of the observation period?

These questions have a temporal component; in the case of (1), for example, there may be
a beneficial effect late in the season rather than early in the season. In building a statistical
model for this experiment we commence by formulating a model for the yield per hectare on
a given plot at a given point in time (Y_ijk):

    Y_ijk = μ_ijk + e_ij + d_ijk.    [7.72]

In [7.72], μ_ijk denotes the mean of the ith treatment (i = 1, ..., 4) on replicate (plot) j at the
kth harvesting day after planting (k = 1, ..., 4). The e_ij are experimental errors associated

with replicate (plot) j of treatment i, and d_ijk is the subsampling error from harvesting n_ijk
melons on replicate j of treatment i at time k. If σ_m² is the variability of melon weights
(converted to hectare basis) and Var[e_ij] = σ² is the experimental error variability, then

    Var[Y_ijk] = σ² + n_ijk σ_m².
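A brief sketch of where this variance formula comes from, under the assumption that d_ijk is the sum of n_ijk independent melon-weight deviations, each with variance σ_m² (on a hectare basis) and independent of e_ij:

    Var[Y_ijk] = Var[e_ij] + Var[d_ijk] = σ² + n_ijk σ_m².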

Model [7.72] is a relatively straightforward mixed model, but the Y_ijk are not the outcome of
interest. Rather, we wish to analyze the cumulative yields

    U_ij1 = Y_ij1
    U_ij2 = Y_ij1 + Y_ij2
        ⋮
    U_ijk = Σ_{p=1..k} Y_ijp.

That the U_ijk are correlated even if the Y_ijk are independent is easy to establish. For example,
Cov[U_ij1, U_ij2] = Cov[Y_ij1, Y_ij1 + Y_ij2] = Var[Y_ij1]. The accumulation of yields results in an
accumulation of the experimental errors and the subsampling errors. The resulting correlation
structure is rather complicated and difficult to code in a statistical computing package.
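The general pattern behind this accumulation can be sketched as follows. If the Y_ijp were independent, then for k < k′

    Cov[U_ijk, U_ijk′] = Cov[Σ_{p=1..k} Y_ijp, Σ_{q=1..k′} Y_ijq] = Σ_{p=1..k} Var[Y_ijp] = kσ² + σ_m² Σ_{p=1..k} n_ijp,

and the shared experimental error e_ij adds further covariance on top of this.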
As an alternative, we choose the following route: an unstructured variance-covariance matrix for the
accumulated d_ijk is combined with a random effect for the experimental errors. The mean
model is a cubic response model with different trends for each treatment. This mean model
will produce the same estimates of treatment means at days 81, 84, 88, and 91 as a block
design analysis but also allows for estimating treatment differences at other time points. The
comparisons (1) through (5) are coded with the estimate statement of proc mixed. The variable t in
the code that follows is 80 days less than time after planting. The first harvesting time thus
coincides with t = 1. Divisors in estimate statements are used here to ensure that the resulting
estimates are properly scaled. For example, a statement such as

estimate 'A1+A2 vs. B1+B2' tx 1 1 -1 -1;

compares the sum of treatment means for A1 and A2 to the sum of B1 and B2. Using a divisor
has no effect on the significance level of the comparison, and the statement

estimate 'Average(A1 A2) vs. Average(B1 B2)' tx 1 1 -1 -1 / divisor=2;

will produce the same p-value. The estimate itself will be the difference of sums in the first
case and the difference of averages in the second case.
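The coefficient patterns in the estimate statements that follow are simply the polynomial terms evaluated at the t value for a given day. Writing the mean of treatment i at time t as (the β notation is ours; the four terms correspond to tx, tx*t, tx*t*t, and tx*t*t*t)

    μ_i(t) = β_i0 + β_i1 t + β_i2 t² + β_i3 t³,

a comparison at day 88 (t = 8) uses the coefficients 1, 8, 8² = 64, and 8³ = 512, and one at day 91 (t = 11) uses 1, 11, 121, and 1331.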
proc mixed data=melons97 convh=0.002 noitprint;
   class rep tx;
   model cumyld = tx tx*t tx*t*t tx*t*t*t / noint;
   random rep*tx;  /* experimental error */
   repeated / subject=rep*tx type=un;

   /* (1) at days 81, 84, 88, 91 corresponding to t=1, 4, 8, 11 */
   estimate 'Bare vs. Irrig (81)' tx 1 -1 0 0 tx*t 1 -1 0 0
            tx*t*t 1 -1 0 0       tx*t*t*t 1 -1 0 0;
   estimate 'Bare vs. Irrig (84)' tx 1 -1 0 0 tx*t 4 -4 0 0
            tx*t*t 16 -16 0 0     tx*t*t*t 64 -64 0 0;
   estimate 'Bare vs. Irrig (88)' tx 1 -1 0 0 tx*t 8 -8 0 0
            tx*t*t 64 -64 0 0     tx*t*t*t 512 -512 0 0;
   estimate 'Bare vs. Irrig (91)' tx 1 -1 0 0 tx*t 11 -11 0 0
            tx*t*t 121 -121 0 0   tx*t*t*t 1331 -1331 0 0;

   /* (2) at days 81, 84, 88, 91 corresponding to t=1, 4, 8, 11 */
   estimate 'IrrBl vs. IrrRed (81)' tx 0 0 1 -1 tx*t 0 0 1 -1
            tx*t*t 0 0 1 -1       tx*t*t*t 0 0 1 -1;
   estimate 'IrrBl vs. IrrRed (84)' tx 0 0 1 -1 tx*t 0 0 4 -4
            tx*t*t 0 0 16 -16     tx*t*t*t 0 0 64 -64;
   estimate 'IrrBl vs. IrrRed (88)' tx 0 0 1 -1 tx*t 0 0 8 -8
            tx*t*t 0 0 64 -64     tx*t*t*t 0 0 512 -512;
   estimate 'IrrBl vs. IrrRed (91)' tx 0 0 1 -1 tx*t 0 0 11 -11
            tx*t*t 0 0 121 -121   tx*t*t*t 0 0 1331 -1331;

   /* (3) at days 81, 84, 88, 91 corresponding to t=1, 4, 8, 11 */
   estimate 'Irrig vs. Mulch (81)' tx 0 2 -1 -1 tx*t 0 2 -1 -1
            tx*t*t 0 2 -1 -1      tx*t*t*t 0 2 -1 -1 / divisor=2;
   estimate 'Irrig vs. Mulch (84)' tx 0 2 -1 -1 tx*t 0 8 -4 -4
            tx*t*t 0 32 -16 -16   tx*t*t*t 0 128 -64 -64 / divisor=2;
   estimate 'Irrig vs. Mulch (88)' tx 0 2 -1 -1 tx*t 0 16 -8 -8
            tx*t*t 0 128 -64 -64  tx*t*t*t 0 1024 -512 -512 / divisor=2;
   estimate 'Irrig vs. Mulch (91)' tx 0 2 -1 -1 tx*t 0 22 -11 -11
            tx*t*t 0 242 -121 -121 tx*t*t*t 0 2662 -1331 -1331 / divisor=2;

   /* (4) at days 81, 84, 88, 91 corresponding to t=1, 4, 8, 11 */
   estimate 'Mulch vs. NoMulch (81)' tx 1 1 -1 -1 tx*t 1 1 -1 -1
            tx*t*t 1 1 -1 -1      tx*t*t*t 1 1 -1 -1 / divisor=2;
   estimate 'Mulch vs. NoMulch (84)' tx 1 1 -1 -1 tx*t 4 4 -4 -4
            tx*t*t 16 16 -16 -16  tx*t*t*t 64 64 -64 -64 / divisor=2;
   estimate 'Mulch vs. NoMulch (88)' tx 1 1 -1 -1 tx*t 8 8 -8 -8
            tx*t*t 64 64 -64 -64  tx*t*t*t 512 512 -512 -512
            / divisor=2;
   estimate 'Mulch vs. NoMulch (91)' tx 1 1 -1 -1 tx*t 11 11 -11 -11
            tx*t*t 121 121 -121 -121 tx*t*t*t 1331 1331 -1331 -1331
            / divisor=2;

   /* (5) */
   estimate 'Irrig(91) - Mulch(81)' tx 0 2 -1 -1 tx*t 0 22 -1 -1
            tx*t*t 0 242 -1 -1    tx*t*t*t 0 2662 -1 -1 / divisor=2;
run;

The parameters of the unstructured variance-covariance matrix of the accumulated sub-
sampling errors reflect the differences in variability (Output 7.29). The variance at the initial
measurement occasion (day 81), shown as UN(1,1), is considerably larger than the other
variances (UN(2,2) through UN(4,4)). This reflects the fact that, especially on the plastic
mulch-treated plots, the most melons were harvested at day 81. From the results of
the estimate statements it can be seen that the separation between nonirrigated and irrigated
plots ((1)) increases over time, but the difference in cumulative yields never achieves signifi-
cance at the 5% level (even if the comparison were one-sided). Similarly, there is no
difference between red and black mulch cumulative yields at any point in time ((2)). Applying
either plastic mulch type in addition to irrigation raises cumulative melon yields by 25.8 to
29.5 tons × ha⁻¹. These differences are highly significant ((3)) and amplified even more
strongly if the two no-mulch treatments are combined ((4)). Interestingly, the initial yield of
the mulched plots was 5.56 tons per hectare higher than the final yield 91 days after planting
on the irrigated plots, although this difference was not significant (p = 0.458; (5)).


Output 7.29. (abridged) The Mixed Procedure


Model Information
Data Set WORK.MELONS97
Dependent Variable CUMYLD
Covariance Structures Variance Components, Unstructured
Subject Effect REP*TX
Estimation Method REML
Residual Variance Method None
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Containment

Class Level Information


Class Levels Values
REP 4 1 2 3 4
TX 4 BareSoil Irrig IrrigBlack IrrigRed

Covariance Parameter Estimates


Cov Parm Subject Estimate
REP*TX 64.2255
UN(1,1) REP*TX 146.67
UN(2,1) REP*TX 36.3238
UN(2,2) REP*TX 38.2103
UN(3,1) REP*TX 18.8947
UN(3,2) REP*TX 15.9693
UN(3,3) REP*TX 19.0087
UN(4,1) REP*TX 24.4520
UN(4,2) REP*TX 14.1106
UN(4,3) REP*TX 29.5125
UN(4,4) REP*TX 50.5624

Fit Statistics
Res Log Likelihood -190.3
Akaike's Information Criterion -201.3
Schwarz's Bayesian Criterion -205.5
-2 Res Log Likelihood 380.5

Estimates
Standard
Label Estimate Error DF t Value Pr > |t|
(1)
Bare vs. Irrig (81) 1.6425 10.2688 36 0.16 0.8738
Bare vs. Irrig (84) -5.9495 7.1567 36 -0.83 0.4113
Bare vs. Irrig (88) -4.8272 6.4511 36 -0.75 0.4592
Bare vs. Irrig (91) -10.3727 7.5759 36 -1.37 0.1794
(2)
IrrBl vs. IrrRed (81) 4.5288 10.2688 36 0.44 0.6618
IrrBl vs. IrrRed (84) 0.7070 7.1567 36 0.10 0.9219
IrrBl vs. IrrRed (88) 2.8887 6.4511 36 0.45 0.6570
IrrBl vs. IrrRed (91) 0.7420 7.5759 36 0.10 0.9225
(3)
Irrig vs. Mulch (81) -25.8126 8.8931 36 -2.90 0.0063
Irrig vs. Mulch (84) -28.7948 6.1979 36 -4.65 <.0001
Irrig vs. Mulch (88) -29.5604 5.5868 36 -5.29 <.0001
Irrig vs. Mulch (91) -26.3408 6.5609 36 -4.01 0.0003
(4)
Mulch vs. NoMulch (81) -24.9914 7.2611 36 -3.44 0.0015
Mulch vs. NoMulch (84) -31.7695 5.0605 36 -6.28 <.0001
Mulch vs. NoMulch (88) -31.9740 4.5616 36 -7.01 <.0001
Mulch vs. NoMulch (91) -31.5271 5.3570 36 -5.89 <.0001
(5)
Irrig(91) - Mulch(81) -5.5644 7.4202 36 -0.75 0.4582



Chapter 8

Nonlinear Models for Clustered Data

“Although this may seem a paradox, all exact science is dominated
by the idea of approximation.” Bertrand Russell.

8.1 Introduction
8.2 Nonlinear and Generalized Linear Mixed Models
8.3 Toward an Approximate Objective Function
8.3.1 Three Linearizations
8.3.2 Linearization in Generalized Linear Mixed Models
8.3.3 Integral Approximation Methods
8.4 Applications
8.4.1 A Nonlinear Mixed Model for Cumulative Tree Bole Volume
8.4.2 Poppy Counts Revisited — a Generalized Linear Mixed Model
for Overdispersed Count Data
8.4.3 Repeated Measures with an Ordinal Response





8.1 Introduction
Box 8.1 NLMMs and GLMMs

• Models for data that are clustered or otherwise call for the inclusion of
random effects do not necessarily have a linear mean function.

• Nonlinear mixed models (NLMMs) arise when nonlinear mean functions as in §5 are applied to clustered data.

• Generalized linear mixed models (GLMMs) arise when clustered data are
modeled where the (conditional) response has a distribution in the
exponential family.

Chapters 5 and 6 discussed general nonlinear models and generalized linear models (GLM) for independent data. It was emphasized in §6 that GLMs are special cases of nonlinear models where a linear predictor is placed inside a nonlinear function. Chapter 7 digressed from the nonlinear model theme by introducing linear models for clustered data where the observations within a cluster are possibly correlated. To capture cluster-to-cluster as well as within-cluster variability we appealed to the idea of randomly varying cluster effects which gave rise to the Laird-Ware model

$$\mathbf{Y}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{Z}_i\mathbf{b}_i + \mathbf{e}_i.$$

Recall that the b_i are random effects or coefficients, with mean 0 and variance-covariance matrix D, that vary across clusters, and the e_i are the within-cluster errors. Throughout Chapter 7 it was assumed that the mean function is linear. There are situations, however, where the model calls for the inclusion of random effects and the mean function is nonlinear.
Figure 8.1 shows the cumulative bole volume profiles of three yellow poplar (Liriodendron tulipifera L.) trees that are part of a data set of 336 randomly selected trees. The volume of a bole was obtained by felling the tree, delimbing the bole, and cutting it into four-foot-long sections. The volume of each section was determined by geometric principles assuming a circular shape of the bole cross-section and accumulated with the volumes of lower sections. This process is repeated to the top of the tree bole if total-bole volume is the desired response variable, or to the point where the bole diameter has tapered to the merchantable diameter, if merchantable volume is the response variable. If d_ij is the cross-sectional diameter of the bole of the ith tree at the jth height of measurement, then r_ij = 1 - d_ij/max(d_ij) is termed the complementary diameter. It is zero at the stump and approaches one at the tree tip.

The trees differ considerably in size (total height and breast height diameter), hence their total cumulative volumes differ greatly. The general shapes of the tree profiles are similar, however, suggesting a sigmoidal increase of cumulative volume with increasing complementary diameter. These data are furthermore clustered. Each of the n = 336 trees represents a cluster of observations. Since the individual bole segments were cut at equal four-foot intervals, the number of observations within a cluster (n_i) varies from tree to tree. A cumulative bole volume model for these data will have a nonlinear mean function that captures the sigmoidal behavior of the response and account for differences in size and shape of the trees



through random effects capturing tree-to-tree variability. It will be a nonlinear mixed model
combining concepts of §5 and §7. These data are modeled in application §8.4.1.

[Figure 8.1 near here: cumulative volume (ft³) plotted against complementary diameter (1 - dij/max(dij)) for trees No. 5, No. 151, and No. 308.]

Figure 8.1. Cumulative volume profiles of three yellow poplar (Liriodendron tulipifera L.)
trees as a function of the complementary bole diameter. Data kindly provided by Dr. David
Loftis, USDA Forest Service, originally collected by Dr. Donald E. Beck (see Beck 1963).

Mixed models with nonlinear mean function also come about when generalized linear models are applied to data with clustered structure. Recall the Hessian fly experiment of §6.7.2 where 16 varieties were arranged in a randomized block design with four blocks and the outcome was the proportion of plants on an experimental unit infested with the Hessian fly. The model applied there was

$$\log\left\{\frac{\pi_{ij}}{1-\pi_{ij}}\right\} = \mu + \tau_i + \rho_j,$$

where π_ij is the probability of infestation for variety i in block j, and τ_i, ρ_j are the treatment and block effects, respectively. Expressed as a statistical model for the observed data this model can be written as

$$Y_{ij} = \frac{1}{1+\exp\{-\mu-\tau_i-\rho_j\}} + e_{ij}, \qquad [8.1]$$

where Y_ij is the proportion of infested plants and e_ij is a random variable with mean 0 and variance π_ij(1 - π_ij)/n_ij (a so-called shifted binomial random variable). If the blocks were not predetermined but selected at random, then the ρ_j are random effects and model [8.1] is a nonlinear mixed model. It is nonlinear because the linear predictor μ + τ_i + ρ_j is inside a nonlinear function, and it is a mixed model because of fixed treatment effects and random block effects.
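With random block effects, a model of this form can be fit by the integral approximation methods discussed in §8.3.3. The following proc nlmixed sketch shows the structure for an intercept-plus-random-block version of [8.1]; the data set and variable names (y = infested plants, n = plants per unit, block) are assumptions, as are the starting values, and the fixed variety effects would be added to the linear predictor with if/else coding as in §8.4.2.

proc nlmixed data=hessianfly;
   parms mu=0 s2b=0.1;              /* placeholder starting values        */
   linp = mu + b;                   /* linear predictor; add variety      */
                                    /* effects tau_i here                 */
   pi   = 1/(1+exp(-linp));         /* inverse logit link                 */
   model y ~ binomial(n, pi);       /* conditional Binomial distribution  */
   random b ~ normal(0, s2b) subject=block;
run;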
The problem of overdispersion in generalized linear models was discussed in §6.6.
Different strategies for modeling overdispersed data were (i) the addition of extra scale





parameters that alter the dispersion but not the mean of the response, (ii) parameter mixing, and (iii) generalized linear mixed models. An extra scale parameter was used to model the Hessian fly data in §6.7.2. For the poppy count data in §6.7.8 parameter mixing was applied. Recall that Y_ij denoted the number of poppies for treatment i in block j. It was assumed that given λ_ij, the average number of poppies per unit area, the counts Y_ij|λ_ij were Poisson distributed and that λ_ij was a Gamma-distributed random variable. This led to a Negative Binomial model for the poppy counts Y_ij. Because this distribution is in the exponential family of distributions (§6.2.1), the model could be fit as a standard generalized linear model.

The parameter mixing approach is intuitive because a quantity assumed to be fixed in a reference model is allowed to vary at random, thereby introducing more uncertainty in the marginal distribution of the response and accounting for the overdispersion in the data. The generalized linear mixed model approach is equally intuitive. Let Y_ij denote the poppy count for treatment i in block j and assume that Y_ij|d_ij are Poisson distributed with mean

$$\lambda_{ij} = \exp\{\mu + \tau_i + \rho_j + d_{ij}\}.$$

The model for the log intensity of poppy counts appears to be a classical model for a randomized block design with error term d_ij. In fact, we specify that the d_ij are independent Gaussian random variables with mean 0 and variance σ². Compare this to the standard Poisson model without overdispersion, Y_ij ~ Poisson(exp{μ + τ_i + ρ_j}). The uncertainty in d_ij increases the uncertainty in Y_ij. The model

$$Y_{ij}|d_{ij} \sim \text{Poisson}\left(e^{\mu+\tau_i+\rho_j+d_{ij}}\right) \qquad [8.2]$$

is also a parameter mixing model. If the distribution of the d_ij is carefully chosen, one can average over it analytically to derive the marginal distribution of Y_ij on which inference is based. In §6.7.8 the distribution of λ_ij was chosen as Gamma because the marginal distribution then had a known form. In other situations the marginal distribution may be difficult to obtain or intractable. We then start with a generalized linear mixed model such as [8.2] and approximate the marginal distribution (see §8.3.2). The poppy count data are modeled with a generalized linear mixed model in §8.4.2.
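The overdispersion induced by [8.2] can be made explicit. Since exp{d_ij} is lognormal with E[exp{d_ij}] = exp{σ²/2}, iterating expectations gives (writing η_ij = μ + τ_i + ρ_j)

$$\mathrm{E}[Y_{ij}] = \mathrm{E}\left[\mathrm{E}[Y_{ij}|d_{ij}]\right] = e^{\eta_{ij}+\sigma^2/2}$$
$$\mathrm{Var}[Y_{ij}] = \mathrm{E}\left[\mathrm{Var}[Y_{ij}|d_{ij}]\right] + \mathrm{Var}\left[\mathrm{E}[Y_{ij}|d_{ij}]\right] = \mathrm{E}[Y_{ij}] + \mathrm{E}[Y_{ij}]^2\left(e^{\sigma^2}-1\right),$$

so the marginal variance exceeds the marginal mean whenever σ² > 0.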

8.2 Nonlinear and Generalized Linear Mixed Models


Denote as Y_ij the jth response from cluster i (i = 1,…,n; j = 1,…,n_i). In §5 we used the general notation Y_i = f(x_i, θ) + e_i to denote a nonlinear model with mean parameters θ. To reflect the clustered nature of the data and the involvement of random effects we now write

$$Y_{ij} = f(\mathbf{x}_{ij}, \boldsymbol{\theta}, \mathbf{b}_i) + e_{ij}. \qquad [8.3]$$

The vector x_ij is a vector of regressor or design variables and θ is a vector of fixed effects. The vector b_i denotes a vector of random effects (or coefficients) modeling the cluster-to-cluster heterogeneity. As in §7, the b_i are zero mean random variables with variance-covariance matrix D. The e_ij are within-cluster errors with mean 0 which might be correlated. Because computational difficulties in fitting nonlinear models are greater than those in fitting linear models it is commonly assumed that the within-cluster errors are uncorrelated, but this


is not necessary. The function f(·) can be any nonlinear function. A special case of model [8.3] is a generalized linear mixed model (GLMM) where f(·) is the inverse of a link function and the variation of the e_ij is modeled to reflect the variance of the appropriate conditional distribution in the exponential family (see §6.2.1). For a GLMM we assume that the linear predictor has the form of a Laird-Ware model and write

$$Y_{ij} = g^{-1}\left(\mathbf{x}_{ij}'\boldsymbol{\beta} + \mathbf{z}_{ij}'\mathbf{b}_i\right) + e_{ij}. \qquad [8.4]$$

When combining the responses for a particular cluster in a vector Y_i = [Y_i1,…,Y_in_i]', models [8.3] and [8.4] will be written as

$$\mathbf{Y}_i = \mathbf{f}(\mathbf{x}, \boldsymbol{\theta}, \mathbf{b}_i) + \mathbf{e}_i$$
$$\mathbf{Y}_i = \mathbf{g}^{-1}(\mathbf{X}_i\boldsymbol{\beta} + \mathbf{Z}_i\mathbf{b}_i) + \mathbf{e}_i.$$

The variance-covariance matrix of the within-cluster error vector e_i will be denoted R_i to keep with the notation in §7.

The involvement of the random effects b_i inside the inverse link function g⁻¹(·) or the general nonlinear function f(·) poses a particular complication for nonlinear mixed models. In the Laird-Ware model Y_i = X_iβ + Z_ib_i + e_i it is easy to obtain the conditional and marginal means and variances because of the linearity of the model:

$$\mathrm{E}[\mathbf{Y}_i|\mathbf{b}_i] = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{Z}_i\mathbf{b}_i \qquad \mathrm{Var}[\mathbf{Y}_i|\mathbf{b}_i] = \mathbf{R}_i$$
$$\mathrm{E}[\mathbf{Y}_i] = \mathbf{X}_i\boldsymbol{\beta} \qquad \mathrm{Var}[\mathbf{Y}_i] = \mathbf{Z}_i\mathbf{D}\mathbf{Z}_i' + \mathbf{R}_i.$$

Maximum (or restricted) maximum likelihood estimation is based on the marginal distribution of Y_i, which turns out to be multivariate Gaussian if the b_i and e_i are Gaussian-distributed. Even when b_i and e_i are Gaussian, the distribution of Y_i in the general model Y_i = f(x, θ, b_i) + e_i is not necessarily Gaussian since Y_i is no longer a linear combination of Gaussian random variables. Even deriving the marginal mean and variance of Y_i proves to be a difficult undertaking. If the distribution of the within-cluster errors e_i is known, finding the conditional distribution of Y_i|b_i is simple. The marginal distribution of Y_i, which is key in statistical inference, remains elusive in the nonlinear mixed model. The various approaches put forth in the literature differ in the technique and rationale applied to approximate this marginal distribution.
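To see why even the marginal mean is problematic, note that the expectation cannot be passed inside a nonlinear function:

$$\mathrm{E}[\mathbf{Y}_i] = \mathrm{E}_{\mathbf{b}_i}\left[\mathbf{f}(\mathbf{x}, \boldsymbol{\theta}, \mathbf{b}_i)\right] \ne \mathbf{f}(\mathbf{x}, \boldsymbol{\theta}, \mathrm{E}[\mathbf{b}_i]) = \mathbf{f}(\mathbf{x}, \boldsymbol{\theta}, \mathbf{0})$$

unless f is linear in b_i; by Jensen's inequality the two sides differ systematically when f is convex or concave in the random effects.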

8.3 Toward an Approximate Objective Function


Box 8.2 Estimation Approaches

• There are three basic approaches to parameter estimation in a NLMM or GLMM: individual estimates methods, linearization-based methods, and integral approximation methods. Only the latter two are considered here.

• A linearization method approximates the nonlinear mixed model by a Taylor series to arrive at a pseudo-model which is typically of the Laird-Ware form. Assuming Gaussianity of the pseudo-response, estimation proceeds by ML or REML (as in §7).





• Integral approximation methods use quadrature or Monte Carlo integration to calculate the marginal distribution of the data and maximize its likelihood.

Applying first principles, the joint distribution of Y_i and b_i can be written as

$$f_{y,b}(\mathbf{y}_i, \mathbf{b}_i) = f_{y|b}(\mathbf{y}_i|\mathbf{b}_i)\, f_b(\mathbf{b}_i),$$

where f_{y,b}, f_{y|b}, and f_b denote the joint, conditional, and marginal probability density (mass) functions, respectively. The marginal distribution of Y_i is obtained by integrating this joint density over the distribution of the random effects,

$$f_y(\mathbf{y}_i) = \int f_{y|b}(\mathbf{y}_i|\mathbf{b}_i)\, f_b(\mathbf{b}_i)\, d\mathbf{b}_i. \qquad [8.5]$$

If the distribution f_y(y_i) is known, the maximum likelihood principle can be invoked to obtain estimates of the unknown parameters. Unfortunately, this distribution is usually intractable. The relevant approaches to arrive at a solution to this problem can be classified into two broad categories, linearization and integral approximation methods. Linearization methods replace the nonlinear mixed model with an approximate linear model. They are also called pseudo-data methods since the response being modeled is not Y_i but a function thereof. Some linearization methods are parametric in the sense that they assume a distribution for the pseudo-response. Other linearization methods are semi-parametric in the sense that they estimate the parameters without distributional assumptions beyond the first two moments of the pseudo-response. Methods of integral approximation assume a particular distribution for the random effects b_i and for the conditional distribution of Y_i|b_i and approximate the integral [8.5] by numerical techniques (quadrature methods or Monte Carlo integration).
Before proceeding we need to point out that linearization and integral approximation
techniques are not the only possible methods for estimating the parameters of a nonlinear
mixed model. For example, if the random effects b3 enter the model linearly, Vonesh and
Carter (1992) apply iteratively reweighted least squares estimation akin to that in a general-
ized linear model. Parameter estimates in nonlinear mixed models can also be obtained in a
two-stage approach known as the individual estimates method. In the first stage the non-
linear model is fit to each cluster separately and in the second stage the cluster-specific esti-
mates are combined (averaged) to arrive at population-averaged values. Success of inference
based on individual estimates depends on having sufficient measurements on each cluster to
estimate the nonlinear response reliably and on methods for combining the individual esti-
mates into population average estimates efficiently. One advantage of the linearization
methods is not to depend on the ability to fit the model to each cluster separately. Clusters
that do not contribute information about the entire response profiles, for example, because of
missing values, will nevertheless contribute to estimation in linearization methods. Earlier
applications of the individual estimates method suffered from not borrowing strength across
clusters in the estimation process (see, e.g., Korn and Whittemore 1997 and Biging 1985).
Davidian and Giltinan (1993) improved the method considerably by fitting the nonlinear
model separately to each subject but using weight matrices that are estimated across subjects.
For more details on two-stage methods that build on individual estimates derived from each
cluster the reader is referred to Ch. 5. in Davidian and Giltinan (1995) and references therein.



The distinction between inference based on linearization and inference based on individual
estimates goes back to Sheiner and Beal (1980) who pioneered procedures for fitting non-
linear mixed effects models.
In our experience, most nonlinear mixed models were until recently fit by one of the
linearization techniques. High-dimensional quadrature methods are computationally demand-
ing and were not readily available in standard statistical software packages. The %nlinmix
macro distributed by SAS Institute (www.sas.com), for example, implements linearization
methods. The nlmixed procedure, a recent addition to The SAS® System, implements the
integral approximation method.

8.3.1 Three Linearizations


Box 8.3 Linearizations

• A population-averaged (PA) linearization expands the model f(x, θ, b) or g⁻¹(x, β, b) around an estimate θ̂ of θ (β̂ of β) and the mean of the random effects.

• A subject-specific (SS) or cluster-specific linearization expands the model around an estimate θ̂ of θ (β̂ of β) and a current predictor b̂ of b.

• SS expansions are more accurate but more sensitive to model misspecification.

The Gauss-Newton method (§A5.10.3) for fitting a nonlinear model (for independent data) starts with a least squares objective function to be minimized, the residual sum of squares S(θ) = (y - f(x, θ))'(y - f(x, θ)). It then approximates the mean function f(x, θ) by a first-order Taylor series about θ̂ and substitutes the approximate model back into S(θ). This yields an approximated residual sum of squares to be minimized. In nonlinear mixed models a similar rationale can be applied. Approximate the conditional mean function by a Taylor series about some value chosen for θ and some value chosen for b_i. This leads to an approximate linear mixed model whose parameters are then estimated by one of the techniques from §7. What are sometimes termed the first- and second-order expansion methods (Littell et al. 1996) differ in whether the mean is expanded about the mean of the b_i, E[b_i] = 0, or the estimated BLUP b̂_i.

For the discussion that follows we consider the stacked form of the model to eliminate the cluster subscript i. Let Y = [Y'_1,…,Y'_n]', e = [e'_1,…,e'_n]' and denote the model vector as

$$\mathbf{f}(\mathbf{x}, \boldsymbol{\theta}, \mathbf{b}) = \begin{bmatrix} f(\mathbf{x}_{11}, \boldsymbol{\theta}, \mathbf{b}_1) \\ f(\mathbf{x}_{12}, \boldsymbol{\theta}, \mathbf{b}_1) \\ \vdots \\ f(\mathbf{x}_{1n_1}, \boldsymbol{\theta}, \mathbf{b}_1) \\ f(\mathbf{x}_{21}, \boldsymbol{\theta}, \mathbf{b}_2) \\ \vdots \\ f(\mathbf{x}_{nn_n}, \boldsymbol{\theta}, \mathbf{b}_n) \end{bmatrix}.$$





If b_i has mean 0 and variance-covariance matrix D, then we refer to b = [b'_1,…,b'_n]' as the vector of random effects across all clusters, a random variable with mean 0 and variance-covariance matrix B = Diag{D}. In the sequel we present the rationale behind the various linearization methods. Detailed formulas and derivations can be found in §A8.5.1.

The first linearization method expands f(x, θ, b) about a current estimate θ̂ of θ and the mean of b, E[b] = 0. Littell et al. (1996, p. 463) term this the approximate first-order method. We prefer to call it the population-average (PA) expansion. A subject-specific (SS) expansion is obtained as a linearization about θ̂ and a predictor b̂ of b, commonly chosen to be the estimated BLUP. Littell et al. (1996, p. 463) refer to it as the approximate second-order method. Finally, the generalized estimating equations of Zeger, Liang and Albert (1988) can be adapted for the case of a nonlinear mixed model with continuous response based on an expansion about E[b] = 0 only (see §A8.5.2). We term this case the GEE expansion. The three linearizations lead to the approximate models shown in Table 8.1.

Table 8.1. Approximate models based on linearizations

Type   Expansion About   Pseudo-Response                        Model
PA     θ̂, E[b]           Y* = Y - f(x, θ̂, 0) + X*θ̂              Y* = X*θ + Z*b + e
SS     θ̂, b̂              Ỹ = Y - f(x, θ̂, b̂) + X̃θ̂ + Z̃b̂           Ỹ = X̃θ + Z̃b + e
GEE    E[b]              not necessary                           Y = f(x, θ, 0) + Żb + e

The matrices X*, Z*, X̃, and Z̃ are matrices of derivatives of the function f(x, θ, b) defined as follows:

$$\mathbf{X}^* = \left.\frac{\partial\,\mathbf{f}(\mathbf{x},\boldsymbol{\theta},\mathbf{b})}{\partial\boldsymbol{\theta}'}\right|_{\hat{\boldsymbol{\theta}},\,\mathbf{0}} \qquad \mathbf{Z}^* = \left.\frac{\partial\,\mathbf{f}(\mathbf{x},\boldsymbol{\theta},\mathbf{b})}{\partial\mathbf{b}'}\right|_{\hat{\boldsymbol{\theta}},\,\mathbf{0}}$$

$$\tilde{\mathbf{X}} = \left.\frac{\partial\,\mathbf{f}(\mathbf{x},\boldsymbol{\theta},\mathbf{b})}{\partial\boldsymbol{\theta}'}\right|_{\hat{\boldsymbol{\theta}},\,\hat{\mathbf{b}}} \qquad \tilde{\mathbf{Z}} = \left.\frac{\partial\,\mathbf{f}(\mathbf{x},\boldsymbol{\theta},\mathbf{b})}{\partial\mathbf{b}'}\right|_{\hat{\boldsymbol{\theta}},\,\hat{\mathbf{b}}} \qquad \dot{\mathbf{Z}} = \left.\frac{\partial\,\mathbf{f}(\mathbf{x},\boldsymbol{\theta},\mathbf{b})}{\partial\mathbf{b}'}\right|_{\mathbf{0}}$$

Starred matrices are evaluated at θ̂ and E[b] = 0; matrices with tildes are evaluated at θ̂ and b̂. From the last column of Table 8.1 it is seen that for the PA and SS expansions, the linearized model is a linear mixed model of the Laird-Ware form where the matrices X and Z have been replaced by the respective derivative matrices.
The correspondence to estimating the parameters in a regular nonlinear model is worth pointing out. There, the model is linearized as

$$\mathbf{Y} \approx \mathbf{f}(\mathbf{x}, \hat{\boldsymbol{\theta}}) + \frac{\partial\,\mathbf{f}(\mathbf{x},\boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\,(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}) + \mathbf{e} = \mathbf{f}(\mathbf{x}, \hat{\boldsymbol{\theta}}) + \mathbf{F}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}) + \mathbf{e},$$

which yields an approximate linear model

$$\mathbf{Y}^* = \mathbf{Y} - \mathbf{f}(\mathbf{x}, \hat{\boldsymbol{\theta}}) + \mathbf{F}\hat{\boldsymbol{\theta}} = \mathbf{F}\boldsymbol{\theta} + \mathbf{e}.$$

After an estimate of θ is obtained, the pseudo-response Y* and the derivative matrix F are updated and the linear model for the next iteration is obtained. This process continues until a convergence criterion is met.



In a nonlinear mixed model the approximate model is a linear mixed model. Given some starting values for θ in the PA expansion and θ and b in the SS expansion, the linear mixed model is fit and new estimates and EBLUPs are obtained. The pseudo-response and the derivative matrices are updated and the next linear mixed model is fit. This process continues until some convergence criterion is met. Notice that with the PA expansion the EBLUPs b̂ are not needed, whereas in the SS expansion the pseudo-response depends on them. In a PA expansion it is sufficient to obtain the EBLUPs at the end of the iterative process. If the number of clusters is large, this can speed up the estimation process.
If the random effects b and the within-cluster errors e are Gaussian distributed, the marginal distribution of Y* or Ỹ is also Gaussian and the linearized models can be fit with the mixed procedure of The SAS® System. This is the implementation behind the %nlinmix macro available from SAS Institute (www.sas.com). Either the maximum or restricted maximum likelihood principle can then be employed. Fitting a series of nonlinear mixed effects models, Gregoire and Schabenberger (1996b) noted that the restricted log likelihood was uniformly larger in the SS expansion than in the PA expansion, indicating a better fit in the former. Although some care must be exercised in comparing likelihoods across models that differ in their response, it is clear that the SS expansion provides a better approximation to the nonlinear mixed model problem than the PA expansion. The PA approximation, on the other hand, allows both cluster-specific and population-averaged inference and is relatively robust to model misspecification. Only cluster-specific inference is supported by the SS expansion, which also suffers from greater sensitivity to model misspecification and potential convergence problems. The estimates from a PA expansion are similar to those of Sheiner and Beal (1980, 1985). Although these authors employ an expansion about E[b], the estimates are not identical since Sheiner and Beal use an extended least squares criterion. The estimates obtained from an SS expansion are identical to the estimates of Lindstrom and Bates (1990). Their estimation algorithm differs from the one outlined here, however. For more details on these computational alternatives see Lindstrom and Bates (1988, 1990) and Wolfinger (1993b).
The process of fitting nonlinear mixed models based on linearizations and the Gaussian error assumption is doubly iterative. The estimation of the covariance parameters in a Gaussian linear mixed model by ML or REML is an iterative process (see §7) and once estimates have been obtained, the components of the linearized mixed model are updated and the process is repeated. Computation time required for these models can be formidable. Estimation algorithms where estimates of the covariance parameters are obtained in noniterative fashion are thus appealing. Relaxing the assumption of Gaussian distributions for b and e is also of great interest. One such method is based on an extension of the generalized estimating equations (GEEs) of Liang and Zeger (1986), Zeger and Liang (1986), and Zeger, Liang, and Albert (1988) to nonlinear mixed models (details in §A8.5.2). The GEE expansion in Table 8.1 gives rise to the model Y = f(x, θ, 0) + Żb + e. Assuming only that b ~ (0, B) and e ~ (0, R), the marginal mean and variance of Y are approximated as f(x, θ, 0) and V̇ = ŻBŻ' + R, respectively. Estimates of θ are obtained by solving the estimating equation

$$U(\boldsymbol{\theta}; \mathbf{y}, \dot{\mathbf{V}}) = \frac{\partial\,\mathbf{f}(\mathbf{x},\boldsymbol{\theta},\mathbf{0})'}{\partial\boldsymbol{\theta}}\, \dot{\mathbf{V}}^{-1}\left(\mathbf{y} - \mathbf{f}(\mathbf{x},\boldsymbol{\theta},\mathbf{0})\right) \equiv \mathbf{0}. \qquad [8.6]$$

Since V̇ contains unknown quantities, it must be estimated from the data before estimates of θ can be calculated. For the case where R = σ²I, Schabenberger (1994) derives method of




moment estimators for the unknowns in V̇. Because these estimates depend on the current solution θ̂, the process remains iterative. After an update of θ̂, given a current estimate of V̇, the estimate of V̇ is recomputed, followed by another update of θ̂ and so forth. Because the moment estimators are noniterative this procedure usually converges rather quickly. The performance of the method of moments estimators will improve with increasing number of clusters. The estimates so obtained can be used as starting values for a parametric fit of the nonlinear mixed model based on a PA or SS expansion assuming Gaussian pseudo-response. Gregoire and Schabenberger (1996a, 1996b) note that the predicted cluster-specific profiles from a GEE and REML fit were nearly identical and indistinguishable on a graph of plotted response curves. These authors conclude that the GEE estimation method constitutes a viable estimation approach in its own right and is more than just a vehicle to produce starting values for the other linearization methods.

8.3.2 Linearization in Generalized Linear Mixed Models


With generalized linear mixed models the analyst faces simplifying and complicating circum-
stances. A simplification arises because of the linearity of the predictor. A complication arises
because the variance of the responses for non-normal data can depend on the mean itself and
the assumption of Gaussian errors is not tenable. Of the various linearization methods pro-
posed to fit a generalized linear mixed model the most significant in our opinion is the
pseudo-likelihood method of Wolfinger and O'Connell (1993) which subsumes other methods
such as those proposed by Breslow and Clayton (1993). This approach has been coded in the
%glimmix macro available from SAS Institute (www.sas.com). It is by far not the only
approach, however, and much research has taken place in this important area. Schabenberger
and Gregoire (1996), without being exhaustive, describe eight subject-specific and two popu-
lation-averaged methods for estimating the parameters of a generalized linear mixed model,
most of which invoke linearizations at some stage. In this subsection we discuss the pseudo-
likelihood approach of Wolfinger and O'Connell (1993) to show the relationship of their
linearization method to those of the previous subsection and to discuss the additional approxi-
mations that are involved because the response may be non-Gaussian.
We commence by formulating a generalized linear mixed model as a transformation of a Laird-Ware model. Define the linear predictor vector for the observations from cluster i as η_i = X_iβ + Z_ib_i. The first stage of the Laird-Ware model reckons the response conditional on the random effects. In the case of a generalized linear mixed model with link function g(·) we obtain

$$g(\mathrm{E}[\mathbf{Y}_i|\mathbf{b}_i]) = g(\boldsymbol{\mu}_i^b) = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{Z}_i\mathbf{b}_i.$$

To incorporate the stochastic nature of the response the observational model is expressed as random deviations of Y_i|b_i from its mean,

$$\mathbf{Y}_i|\mathbf{b}_i = \mathbf{g}^{-1}(\mathbf{X}_i\boldsymbol{\beta} + \mathbf{Z}_i\mathbf{b}_i) + \mathbf{e}_i. \qquad [8.7]$$

The conditional distribution of the Y_ij|b_i is chosen as an exponential family member (this can be relaxed) with variance function Var[e_ij] = φ h(μ). In many situations the scale parameter φ will be one (see Table 6.2), although the method of Wolfinger and O'Connell allows for estimation of an extra scale parameter. The variance-covariance matrix of e_i can be expressed as Var[e_i] = A^{1/2}(μ_i^b) C_i A^{1/2}(μ_i^b), where A is a diagonal matrix containing the variance functions and C_i is a within-cluster correlation matrix. If, conditional on the random effects, the observations from a cluster are uncorrelated, then C_i = I and Var[e_i] = A(μ_i^b).
The problem is to get from [8.7] to the marginal distribution of Y_i. The approach by Wolfinger and O'Connell (1993) considers three separate approximations to accomplish that. The analytic approximation expands g⁻¹(X_iβ + Z_ib_i) into a Taylor series about β̂ and b̂ (or β̂ and E[b]). Substituting back into [8.7] yields an approximated residual vector ė_i|b_i. The moment approximation equates the mean and variance of ė_i|b_i with the mean and variance of e_i|b_i. Finally, the probabilistic approximation assumes ė_i|b_i to be Gaussian-distributed (Lindstrom and Bates 1990, Laird and Louis 1982). This yields an approximate linear mixed model which is still conditioned on b_i (as in the first stage of the Laird-Ware model). Since the distribution of the pseudo-model is Gaussian, however, finding the mean and variance of the marginal distribution (which is also Gaussian) is straightforward. The details of the three-step linearization/approximation can be found in §A8.5.3. From the standpoint of linearizing the model this approach is not very different from the linearizations in §8.3.1 and the approach can be carried out with PA, SS, or GEE expansions (see Schabenberger and Gregoire 1996). Compared to a nonlinear mixed model for continuous response the extra step required is the probabilistic approximation that invokes the Gaussian framework. The end result is a vector of pseudo-responses that follows a Gaussian linear mixed model. If Δ_i is the diagonal matrix of first derivatives of the mean with respect to the linear predictor, the vector of pseudo-responses for cluster i is

$$\boldsymbol{\epsilon}_i = \boldsymbol{\Delta}_i^{-1}\left\{\mathbf{Y}_i|\mathbf{b}_i - \mathbf{g}^{-1}(\hat{\boldsymbol{\eta}}_i)\right\} + \mathbf{X}_i\hat{\boldsymbol{\beta}} + \mathbf{Z}_i\hat{\mathbf{b}}_i. \qquad [8.8]$$

The conditional and marginal distributions of ε_i are

$$\boldsymbol{\epsilon}_i|\mathbf{b}_i \sim G\left(\mathbf{X}_i\boldsymbol{\beta} + \mathbf{Z}_i\mathbf{b}_i,\; \boldsymbol{\Delta}_i^{-1}\mathbf{A}_i^{1/2}\mathbf{C}_i\mathbf{A}_i^{1/2}\boldsymbol{\Delta}_i^{-1}\right)$$
$$\boldsymbol{\epsilon}_i \sim G\left(\mathbf{X}_i\boldsymbol{\beta},\; \mathbf{Z}_i\mathbf{D}\mathbf{Z}_i' + \boldsymbol{\Delta}_i^{-1}\mathbf{A}_i^{1/2}\mathbf{C}_i\mathbf{A}_i^{1/2}\boldsymbol{\Delta}_i^{-1}\right). \qquad [8.9]$$

The role of the within-cluster error dispersion matrix R is now played by Δ_i⁻¹A_i^{1/2}C_iA_i^{1/2}Δ_i⁻¹. Because of the linearity of the linear predictor the Z and X matrices in [8.9] are the same matrices as in model [8.7]. They do not depend on the expansion locus and/or current solutions of the fixed and random effects. The linear predictor η̂_i and the gradient matrix Δ_i are evaluated at the current solutions, however. The process of fitting a generalized linear mixed model based on this linearization is thus again doubly iterative. The parameters in D and C are estimated iteratively by maximum or restricted maximum likelihood as in any Gaussian linear mixed model. Once these estimates and updates of β̂ have been obtained, the pseudo-mixed model is recalculated for the next iteration and the process is repeated. Noniterative methods of estimating the covariance parameters are available. For details see Schabenberger and Gregoire (1996).
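To indicate what a fit based on this linearization looks like in practice, here is a sketch of a %glimmix macro call for a binomial GLMM with random block effects. The data set and variable names are hypothetical, and the argument list (stmts=, error=) reflects the macro interface as we understand it, so consult the macro documentation before use.

%glimmix(data=hessianfly,
   stmts=%str(
      class block variety;
      model y/n = variety;   /* events/trials syntax; logit link is */
                             /* the default for binomial errors     */
      random block;          /* random block effects                */
   ),
   error=binomial
);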

8.3.3 Integral Approximation Methods


The initial problem that led to the various linearization methods was the need to obtain the
marginal distribution





$$f_y(\mathbf{y}_i) = \int f_{y|b}(\mathbf{y}_i|\mathbf{b}_i)\, f_b(\mathbf{b}_i)\, d\mathbf{b}_i, \qquad [8.10]$$

which is needed to calculate the likelihood for the entire data. Assuming that clusters are independent, this likelihood is simply the product f_y(y) = ∏ⁿᵢ₌₁ f_y(y_i). The linearizations combined with a Gaussian assumption for the errors e_i lead to pseudo-models where f_{y|b}(y*_i|b_i), f_{y|b}(ỹ_i|b_i), or f_{ε|b}(ε_i|b_i) are Gaussian. Consequently, the marginal distributions f_y(y*_i), f_y(ỹ_i), and f_ε(ε_i) are Gaussian if b_i ~ G(0, D) and the problem of calculating the high-dimensional integral in [8.10] is defused. The linearization methods rest on the assumption that the values which maximize the approximate likelihoods

$$f_y(\mathbf{y}^*) = \prod_{i=1}^{n} f_y(\mathbf{y}_i^*) \qquad f_y(\tilde{\mathbf{y}}) = \prod_{i=1}^{n} f_y(\tilde{\mathbf{y}}_i) \qquad f_\epsilon(\boldsymbol{\epsilon}) = \prod_{i=1}^{n} f_\epsilon(\boldsymbol{\epsilon}_i)$$

are close to the values which maximize f_y(y) = ∏ⁿᵢ₌₁ f_y(y_i).


A more direct approach is to avoid linearization altogether and to compute the integral [8.10]. A closed form solution is usually not available if the random effects enter the model in nonlinear form, but the integral can be approximated. This approach has many merits. It is the likelihood of the data, not some pseudo-data, that is being maximized. If the conditional distribution f_{y|b}(y_i|b_i) is not Gaussian as, for example, in a generalized linear mixed model, the linearization methods coerce the pseudo-data into a Laird-Ware model; the assumption that the pseudo-data follow a Gaussian distribution may be tenuous. After all, these are transformations of residuals whose distribution may be far from Gaussian. Because the integral approximation method is independent of f_{y|b}(y_i|b_i), any conditional distribution can be accommodated. A loose comparison of linearization vs. integral approximation methods can be drawn by recalling the difference between the Gauss-Newton and Newton-Raphson methods for fitting a nonlinear model (in the nonclustered case, §A5.10.3 and §A5.10.5). The objective there is to minimize the residual sum of squares S(θ), akin to the objective here to maximize the likelihood of the data. The Gauss-Newton (GN) method starts with a linearization of the nonlinear model, substitutes the linear pseudo-model into the sum of squares and minimizes the result. The Newton-Raphson (NR) method approximates the residual sum of squares (the objective function) and seeks the optimum of the approximated S(θ). Since the NR method targets the objective function directly while GN targets the model, the former will usually provide a better approximation. The downsides of the NR method are the need for second derivatives and possibly increased computing time. This is offset somewhat by NR requiring fewer iterations. A similar tradeoff exists between linearizations and integral approximations. The latter target the objective function directly and can be made highly accurate. They may require considerable computing resources, however, especially if the number of random effects is large.
Gaussian quadrature methods are numerical devices for approximating an integral, essentially replacing the integral with a weighted sum. In contrast to simple integral approximation rules (e.g., the extended trapezoidal rule) which evaluate the integrand at equally spaced intervals, quadrature rules evaluate the function at unequally spaced intervals (nodes). Nodes and weights are chosen so that the result is exact for a particular class of polynomial functions. Different variations of quadrature yield exact results for different classes of functions. If the integrand can be expressed as w(x)f(x), where f(x) is a polynomial in x, then Gauss-Legendre quadrature yields exact results for f(x) provided the number of nodes is at least one less than the order of the polynomial. Gauss-Hermite quadrature is exact for functions of the form exp{-x²}f(x), Gauss-Laguerre quadrature for functions of the form x^α exp{-x}f(x), and so forth (see Press et al. 1992, Ch. 4.5 for an excellent discussion and comparison). By choosing the quadrature variant carefully, a high degree of accuracy can be obtained with only a few nodes. Fewer than ten nodes are often sufficient to achieve good results. This is important because the evaluation of a p-dimensional integral requires n^p function evaluations if n is the number of nodes sufficient for a one-dimensional integral. A second complication with quadrature in several dimensions is the specification of the nodes which must lie inside the volume of integration. A computationally efficient method of estimating a high-dimensional integral is through importance sampling, a variation of Monte Carlo integration. Gauss-Hermite quadrature and importance sampling to approximate the integral [8.10] for a nonlinear mixed model are discussed in Pinheiro and Bates (1995) and implemented in the nlmixed procedure of The SAS® System. This procedure performs sophisticated adaptive versions of quadrature and importance sampling. The quadrature adaptation consists of centering the nodes at the current estimates of the random effects and of scaling the nodes by their variances. The number of quadrature nodes is also chosen adaptively by proc nlmixed. When the relative change between log likelihood calculations is less than 0.0001, the lesser number of quadrature points is used. In the case of a single quadrature point, the procedure performs the Laplace approximation as described in Wolfinger (1993b).
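The adaptive defaults can be overridden through procedure options. The sketch below fixes the number of quadrature nodes with the qpoints= option on a minimal random-intercept Poisson model; the data set and variable names (y, id) and the starting values are hypothetical.

proc nlmixed data=mydata qpoints=9;    /* 9 Gauss-Hermite nodes per     */
   parms b0=0 s2u=1;                   /* random effect; placeholders   */
   lambda = exp(b0 + u);               /* conditional Poisson mean      */
   model y ~ poisson(lambda);
   random u ~ normal(0, s2u) subject=id;
run;

With a single random effect per cluster the integral is one-dimensional; with p random effects the same grid requires 9^p function evaluations per cluster and likelihood evaluation, which is why few, well-placed nodes matter.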

8.4 Applications
We have given linearization methods a fair amount of discussion, although we prefer integral
approximations. Until recently, most statistical software capable of fitting nonlinear mixed
models relied on linearizations. Only a few specialized packages performed integral
approximations. The %nlinmix and %glimmix macros distributed by SAS Institute
(http://ftp.sas.com/techsup/download/stat/) perform the SS and PA linearizations discussed in
§8.3.1 and §8.3.2. The text by Littell et al. (1996) gives numerous examples on their usage.
With Release 8.0 of The SAS® System the nlmixed procedure has become available.
Although it has been used in previous chapters to fit various nonmixed models it was
specifically designed to fit nonlinear and generalized linear mixed models by integral
approximation methods. For models that can be fit by either the %nlinmix/%glimmix macros
and the nlmixed procedure, we prefer the latter. Although integral approximations are
computationally intensive, the nlmixed procedure is highly efficient and converges reliably
and faster (in our experience) than the linearization-based macros. It furthermore allows the
optimization of a general log-likelihood function which opens up the possibility of modeling
mixed models with conditional distributions that are not in the exponential family or not al-
ready coded in the procedure. Among the conditional distributions currently (as of Release
8.01) available in proc nlmixed are the Gaussian, Bernoulli, Binomial, Gamma, Negative Bi-

© 2003 by CRC Press LLC




nomial, and Poisson distribution. Among the limitations of the procedure is the restriction to
one level of random effects nesting. Only one level of clustering is possible but multiple ran-
dom effects at the cluster level are permitted. The random effects distribution is restricted to
Gaussian. Since the basic syntax of the procedure resembles that of proc nlin, the user has to
supply starting values for all parameters and the coding of classification variables can be
tedious (see §8.4.2 and §8.4.3 for applications). Furthermore, there is no support for modeling
within-cluster correlations in the conditional distributions. Since proc mixed provides this
possibility through the repeated statement and the linearization algorithms essentially call a
linear mixed model procedure repeatedly, random effects and within-cluster correlations can
be accommodated in linearization approaches. In that case the %glimmix and %nlinmix macro
should be used. It has been our experience, however, that after modeling the heterogeneity
across clusters through random effects the data do not support further modeling of within-
cluster correlations in many situations. The combination of random effects and a nondiagonal
R matrix in nonlinear mixed models appears to invite convergence troubles. Finally, it should
be noted that the integral [8.10] is that of the marginal likelihood of y, not that of Ky, say.
The nlmixed procedure performs approximate maximum likelihood inference and no REML
alternative is available. A REML approach is conceivable if, for example, one were to
approximate the distribution of u3 œ Ky3 ,

0C au3 b œ ( 0Cl, au3 lb3 b0, ab3 b. b3 .

Open questions are the nature of the conditional distribution if y3 lb3 is distributed in the expo-
nential family, the transformation Ky3 that yields EcKy3 d œ 0, and how to obtain the fixed
effects estimates. Another REML approach would be to assume a distribution for ) and inte-
grate over the distributions of b3 and ) (Wolfinger 2001, personal communication). Only
when the number of fixed effects is large would the difference in bias between REML and
ML estimation likely be noticeable. And with a large number of fixed effects quadrature
methods are then likely to prove too cumbersome computationally.
For the applications that follow we have chosen a longitudinal study and two designed
experiments. The yellow poplar cumulative tree-bole volume data are longitudinal in nature.
Instead of a temporal metric, the observations were collected along a spatial metric, the tree
bole. In §8.4.1 the cumulative bole volume, a continuous response, is modeled with a non-
linear volume-ratio model and population-averaged vs. tree-specific predictions are
compared. The responses for applications §8.4.2 and §8.4.3 are not Gaussian and not con-
tinuous. In §8.4.2 the poppy count data is revisited and the overdispersion is modeled with a
Poisson/Gaussian mixing model. In contrast to the Poisson/Gamma mixing model of §6.7.8
this model does not permit the analytic derivation of the marginal distribution, and we use a
generalized linear mixed model approach with integral approximation. The Poisson/Gaussian
model can be thought of as an extension of the Poisson generalized linear model to the mixed
model framework and to clustered data. In §8.4.3 we extend the proportional odds model for
ordinal response to the clustered data framework. There we analyze the data from a repeated
measures experiment where the outcome was a visual rating of plant quality.




8.4.1 A Nonlinear Mixed Model for Cumulative Tree Bole Volume
Since the introduction of regression methods into forestry more than 60 years ago, one of the
pressing questions (to forest biometricians) concerns the (merchantable) volume of standing
trees. This has commonly been addressed by fitting linear or nonlinear regression models. Be-
cause of differences across and within species due to physiographic, climatic, management,
and other effects, these volume equations are typically fit separately to sample data from
regional populations of a tree species. Although the goal is to predict the woody volume of
standing trees, the equations are fitted to measurements conducted on felled trees. A tree is
felled and delimbed and cut into short sections measuring at most several feet. The volume of
each section is determined by geometric principles and accumulated with the volume of the
lower sections. This process is carried out until the tree bole has reached a threshold diameter
marking the limits of merchantability or to the top of the bole if total-bole volume is the re-
sponse of interest. Obviously, the consecutive measurements on sections of a particular tree
bole are not independent. Even if they were, the process of accumulating sections of the tree
bole would induce correlations among the cumulative volume measurements of a tree.
The strategy to fit a new bole-volume equation when the upper-bole merchantability diameter changes due to changes in milling technology or market conditions is a costly endeavor. Burkhart (1977) suggested a modeling approach where the woody volume V_d up to a bole diameter d is expressed as the product of the total-bole volume (V_0) and the ratio R_d of merchantable volume to total volume,

$$V_d = V_0 R_d. \qquad [8.11]$$

Such models are referred to as volume-ratio models (Newberry and Burk 1985, Avery and Burkhart 1994). Gregoire and Schabenberger (1996b) cite the growing number of applications of these models in forestry (Golden et al. 1982, Knoebel et al. 1983, Van Deusen et al. 1981, Newberry and Burk 1985, Amateis and Burkhart 1987, Bailey 1994). The correlations among measurements collected on a single tree bole were not taken into account until the work by Gregoire and Schabenberger (1996a) who employed nonlinear mixed models. In their models the correlations in the marginal distribution of observations from the same tree bole were induced by random effects that varied across trees. Although a direct approach of modeling the within-cluster correlation structure is more appealing, it is not clear how to model the correlations among accumulated observations. Using random effects to capture tree-to-tree variability furthermore allows modeling the volume of the individual tree bole (cluster-specific predictions) as well as the volume of the average tree (population-averaged predictions).
Figure 8.2 depicts the cumulative volume profiles for n = 336 yellow poplar (Liriodendron tulipifera L.) trees. The trees were felled and measured for the purpose of developing a bole-volume equation for the Appalachian region of the southeastern United States (Beck 1963). The trees vary greatly in the cumulative volume profiles (Figure 8.2), which is partly due to the differences in tree size. The total volume ranges from 0.02 to 259.8 ft³, total tree height ranges from 12.0 to 138.0 ft, and diameter at breast height from 0.7 to 30.0 inches (Table 8.2). Breast height, which is typically 4.5 feet above ground, is the customary height at which to measure a tree's reference diameter. Differences in tree bole




shape are apparent after adjusting for tree size and plotting the relative cumulative volumes V_d/V_0 (Figure 8.3).

[Figure 8.2 near here: cumulative volume (ft³) plotted against complementary diameter r_ij in nine panels that group trees by total height, from 12 to 74 ft up to 120 to 139 ft.]

Figure 8.2. Cumulative volume profiles for yellow poplar trees graphed against the complementary diameter r_ij = 1 - d_ij/max(d_ij); d_ij denotes the cross-sectional bole diameter of tree i at the jth height of measurement (i = 1,…,n = 336). Trees are grouped by total tree height (ft).

[Figure 8.3 near here: relative cumulative volume V_d/V_0 plotted against complementary diameter r_ij in nine panels that group trees by total height.]

Figure 8.3. Relative cumulative volume V_d/V_0 for 336 yellow poplar trees graphed against complementary diameter. Trees are grouped by total tree height (ft).




Table 8.2. Descriptive statistics for n = 336 yellow poplar trees

Variable                             Symbol       Mean    Std. Dev.   Min    Max
Diameter at Breast Height (inches)  D_i          13.22       6.51     0.7   30.0
Total Height (feet)                 H_i          90.82      26.60    12.0  138.0
Max. Diameter (inches)              max{d_ij}    15.06       7.66     1.0   35.2
Total Volume (cubic feet)           V_0i         54.35      54.04     0.02 259.8
Number of Sections                  n_i          19.75                3.0   32.0

To develop models for V_0 and R_d with the intent to fit the two simultaneously while accounting for tree-to-tree differences in size and shape of the volume profiles, we start with a model for the total bole volume V_0. A simple model relating V_0 to easily obtainable tree size variables is

$$V_{i0} = \beta_0 + \beta_1 \frac{D_i^2 H_i}{1000} + e_i.$$

For the yellow poplar data this model fits very well (Figure 8.4), although there is some evidence of heteroscedasticity. The variation in total bole volume for small trees is less than that for larger trees. An ordinary least squares fit yields β̂_0 = 1.0416, β̂_1 = 2.2806, and R² = 0.99. The regressor was scaled by the factor 1,000 so that the estimates are of similar magnitude.
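This ordinary least squares fit is easily reproduced. The sketch below assumes a data set with one record per tree; the data set and variable names (dbh, totht, totvol) are assumptions patterned after the nlmixed code shown later in this section.

data trees;
   set ypoplar1;                /* hypothetical one-record-per-tree set  */
   x = dbh*dbh*totht/1000;      /* scaled regressor D^2*H/1000           */
run;
proc reg data=trees;
   model totvol = x;            /* V0 = beta0 + beta1*(D^2*H/1000) + e   */
run;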

[Figure 8.4 near here: scatter plot of total tree bole volume V_0 (ft³) against (D²H)×1000⁻¹ with fitted regression line.]

Figure 8.4. Simple linear regression of total tree volume V_0 against D²H. The data set to which this regression is fit contains n = 336 observations, one per tree (R² = 0.99).

To develop a model for the ratio term R_d, one can think of R_d as a mathematical switch-on function in the terminology of §5.8.6. Since these functions range between 0 and 1 and switch-on behavior is usually nonlinear, a good place to start the search for a ratio model is with the cumulative distribution function (cdf) of a continuous random variable. Gregoire and Schabenberger (1996b) modified the cdf of a Type-I extreme value random variable (Johnson, Kotz and Balakrishnan 1995, Ch. 22)

$$\Pr(X \le x) = \exp\{-\exp\{-(x-\xi)/\theta\}\}.$$

Letting t = d/D, they used

$$R_d = \exp\{-\beta_2\, t' \exp\{\beta_3 t\}\}, \qquad [8.12]$$

where t' = t/1,000. The R_d term is always positive and tends to one as d → 0. The logical constraints (V_d ≥ 0, V_d ≤ V_0, and V_{d=0} = V_0) any reasonable volume-ratio model must obey are thus guaranteed. The fixed effects volume-ratio model for the cumulative volume of the ith tree up to diameter d_j now becomes

$$V_{id_j} = V_{i0} R_{id_j} = \left(\beta_0 + \beta_1 \frac{D_i^2 H_i}{1000}\right) \exp\left\{-\beta_2\, t'_{ij} \exp\{\beta_3 t_{ij}\}\right\} + e_{ij}. \qquad [8.13]$$

The yellow poplar data are modeled in Gregoire and Schabenberger (1996b) with nonlinear mixed models based on linearization methods and generalized estimating equations. Here we fit the same basic model selected by these authors as superior from a number of models that differ in the type and number of random effects, based on quadrature integral approximation methods with proc nlmixed. Tree-to-tree heterogeneity is accounted for as variability in size, reflected in variations in total volume, and as variability in shape of the volume profile. The former calls for inclusion of random tree effects in the total volume component V_{i0}, the latter for random effects in the ratio term. The model selected by Gregoire and Schabenberger (1996b) was

$$V_{id_j} = V_{i0} R_{id_j} = \left(\beta_0 + \{\beta_1 + b_{1i}\} \frac{D_i^2 H_i}{1000}\right) \exp\left\{-\{\beta_2 + b_{2i}\}\, t'_{ij} \exp\{\beta_3 t_{ij}\}\right\} + e_{ij},$$

where the b_{1i} model random slopes in the total-volume equation and the b_{2i} model the rate of change and point of inflection in the ratio terms. The variances of these random effects are denoted σ₁² and σ₂², respectively. The within-cluster errors e_ij are assumed homoscedastic and uncorrelated Gaussian random variables with mean 0 and variance σ².
The model is fit in proc nlmixed with the statements that follow. The starting values were chosen as the converged iterates from the REML fit based on linearization. The conditional distribution f_{y|b}(y|b) is specified in the model statement of the procedure. In contrast to other procedures in The SAS® System, proc nlmixed uses syntax to denote distributions akin to our mathematical formalism. The statement V_{id_j}|b ~ G(V_{i0}R_{id_j}, σ²) is translated into model cumv ~ normal(TotV*R,resvar);. The random statement specifies the distribution of b_i. Since there are two random effects in the model, two means must be specified, two variances and one covariance. The statement

random u1 u2 ~ normal([0,0],[varu1,0,varu2]) subject=tn;

is the translation of

$$\mathbf{b}_i \sim G\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \mathbf{D} = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}\right); \quad i = 1,\ldots,n; \quad \mathrm{Cov}[\mathbf{b}_i, \mathbf{b}_j] = \mathbf{0}.$$

The predict statements calculate predicted values for each observation in the data set. The
first of the two statements evaluates the mean without considering the random effects. This is




the approximate population-average mean response after taking a Taylor series of the model about E[b]. The second predict statement calculates the cluster-specific predictions.
proc nlmixed data=ypoplar tech=newrap;
parms beta0=0.25 beta1=2.3 beta2=2.87 beta3=6.7 resvar=4.8
varu1=0.023 varu2=0.245; /* resvar = sigma^2, varu1 = sigma_1^2, varu2 = sigma_2^2 */
X = dbh*dbh*totht/1000;
TotV = beta0 + (beta1+u1)*X;
R = exp(-(beta2+u2)*(t/1000)*exp(beta3*t));
model cumv ~ normal(TotV*R,resvar);
random u1 u2 ~ normal([0,0],[varu1,0,varu2]) subject=tn out=EBlups;
predict (beta0+beta1*X)*exp(-beta2*t/1000*exp(beta3*t)) out=predPA;
predict TotV*R out=predB;
run;

Output 8.1. The NLMIXED Procedure

Specifications
Data Set WORK.YPOPLAR
Dependent Variable cumv
Distribution for Dependent Variable Normal
Random Effects u1 u2
Distribution for Random Effects Normal
Subject Variable tn
Optimization Technique Newton-Raphson
Integration Method Adaptive Gaussian Quadrature
Dimensions
Observations Used 6636
Observations Not Used 0
Total Observations 6636
Subjects 336
Max Obs Per Subject 32
Parameters 7
Quadrature Points 1

Parameters
b0 b1 b2 b3 resvar varu1 varu2 NegLogLike
0.25 2.3 2.87 6.7 4.8 0.023 0.245 15535.9783

Iteration History
Iter Calls NegLogLike Diff MaxGrad Slope
1 18 15532.1097 3.868562 9.49243 -7.48218
2 27 15532.0946 0.015093 0.021953 -0.0301
3 36 15532.0946 3.317E-7 2.185E-7 -6.64E-7
NOTE: GCONV convergence criterion satisfied.

Fit Statistics
-2 Log Likelihood 31064
AIC (smaller is better) 31078
AICC (smaller is better) 31078
BIC (smaller is better) 31105

Parameter Estimates
Standard
Parameter Estimate Error DF t Value Pr > |t| Alpha Lower Upper
b0 0.2535 0.1292 334 1.96 0.0506 0.05 -0.00070 0.5078
b1 2.2939 0.01272 334 180.38 <.0001 0.05 2.2689 2.3189
b2 2.7529 0.06336 334 43.45 <.0001 0.05 2.6282 2.8775
b3 6.7480 0.02237 334 301.69 <.0001 0.05 6.7040 6.7920
resvar 4.9455 0.08923 334 55.42 <.0001 0.05 4.7700 5.1211
varu1 0.02292 0.00214 334 10.69 <.0001 0.05 0.0187 0.0271
varu2 0.2302 0.02334 334 9.86 <.0001 0.05 0.1843 0.2761




Since the starting values are the converged values of a linearization followed by REML estimation, the integral approximation method converges quickly after only three iterations (Output 8.1). The estimates of the fixed effects are β̂ = [0.2535, 2.2939, 2.7529, 6.7480] and those of the covariance parameters are σ̂² = 4.9455, σ̂₁² = 0.02292, and σ̂₂² = 0.2302. Notice that the degrees of freedom equal the number of clusters minus the number of random effects in the model (apart from e_ij). The asymptotic 95% confidence intervals for the variances of the random effects do not include zero, and based on this evidence one would conclude that the inclusion of the random effects improved the model fit. A better test can be obtained by fitting the models without random effects or with only one random effect and comparing minus twice the log likelihoods (Table 8.3).

Table 8.3. Minus twice log likelihoods for various models fit to the yellow poplar data (models differ only in the number of random effects)

Model   Random Effects    -2 Log Likelihood
(1)     b_1 and b_2       31,064 (Output 8.1)
(2)     b_1               35,983
(3)     b_2               39,181
(4)     none              43,402 (Output 8.2)

The model with two random effects has the smallest value for minus twice the log likelihood and is a significant improvement over any of the other models. Note that (4) is the purely fixed effects model which fits only a population-average curve and does not take into account any clustering. Fit statistics and parameter estimates for model (4) are shown in Output 8.2. Since this model (incorrectly) assumes that all observations are independent, its degrees of freedom are no longer equal to the number of clusters minus the number of covariance parameters. The estimate of the residual variation is considerably larger than in the random effects model (1). Residuals are measured against the population average in model (4) and against the tree-specific predictions in model (1).
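The comparison of models (1) and (4) can be cast as a likelihood ratio test using the -2 log likelihoods from Table 8.3. A sketch follows; note that because the two variance components are on the boundary of the parameter space under the null hypothesis, the nominal chi-square p-value is conservative.

data lrt;
   chisq = 43402 - 31064;           /* model (4) minus model (1)       */
   df    = 2;                       /* two additional variance params  */
   p     = 1 - probchi(chisq, df);  /* nominal (conservative) p-value  */
run;
proc print data=lrt; run;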

Output 8.2. (abridged)


The NLMIXED Procedure

Fit Statistics

-2 Log Likelihood 43402


AIC (smaller is better) 43412
AICC (smaller is better) 43412
BIC (smaller is better) 43446

Parameter Estimates

Standard
Parameter Estimate Error DF t Value Pr > |t|

b0 1.4693 0.1672 6636 8.79 <.0001


b1 2.2430 0.005112 6636 438.79 <.0001
b2 4.1712 0.2198 6636 18.98 <.0001
b3 6.2777 0.05930 6636 105.86 <.0001
resvar 40.5512 0.7040 6636 57.60 <.0001




We selected four trees (5, 151, 279, and 308) from the data set to show the difference between the population-averaged and cluster-specific predictions (Figure 8.5). The trees vary appreciably in size and total volume. The population average fits fairly well to tree #279 and the lower part of the bole of tree #151. For the medium to large sized tree #5 the PA predictions overestimate the cumulative volume in the tree bole. For the large tree #308, the population average overestimates the volume in the lower parts of the tree bole where most of the valuable timber is accrued. Except for the smallest tree, the tree-specific predictions provide an excellent fit to the data. An operator that processes high-grade timber in a sawmill where adjustments of the cutting tools on a tree-by-tree basis are feasible would use the tree-specific cumulative volume profiles to maximize the output of high-quality lumber. If adjustments to the saws on an individual tree basis are not economically feasible, because the timber is of lesser quality, for example, one can use the population-average profiles to determine the settings.

[Figure 8.5 near here: cumulative volume (ft³) vs. complementary diameter r_ij = 1 - d_ij/max(d_ij) in four panels: tree 5 (D = 23 in, H = 108 ft, V_0 = 125.8 ft³), tree 151 (D = 18.9 in, H = 91 ft, V_0 = 73.64 ft³), tree 279 (D = 7 in, H = 26 ft, V_0 = 7.59 ft³), and tree 308 (D = 23.2 in, H = 134 ft, V_0 = 166.8 ft³).]

Figure 8.5. Population-averaged (dashed line) and cluster-specific (solid line) predictions for four of the 336 trees. Panel headings are the tree identifiers.

8.4.2 Poppy Counts Revisited — a Generalized Linear Mixed Model for
Overdispersed Count Data
In §6.7.8 we analyzed the poppy count data of Mead et al. (1993, p. 144), which represent
counts obtained in a randomized block design with six treatments and four blocks (see Table
6.24). An analysis as a generalized linear model for Poisson data with linear predictor

    η_ij = μ + τ_i + ρ_j;    i = 1, ..., 6; j = 1, ..., 4,

where τ_i denotes treatment and ρ_j block effects, showed considerable overdispersion. The
overdispersion problem was tackled there by assuming that Y_ij, the poppy count for treatment
i in block j, was not a Poisson random variable with mean λ_ij = exp{η_ij}, but that λ_ij was a
Gamma random variable. The conditional distribution Y_ij | λ_ij was modeled as a Poisson(λ_ij)
random variable. This construction allowed the analytic derivation of the marginal probability
mass function

    p(y) = ∫ p(y|λ) f(λ) dλ,                                              [8.14]

which turns out to follow the Negative Binomial law. Since this distribution is a member of
the exponential family (Table 6.2), the resulting model could be fit as a generalized linear
model. We used proc nlmixed to estimate the parameters of the model, not because this was a
mixed model, but because of a problem associated with the dist=negbin option of proc
genmod (in the SAS® release we used; it has subsequently been corrected) and in anticipa-
tion of fitting the model that follows.
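
For completeness, here is a sketch of the calculation behind [8.14] under Gamma mixing; the
parameterization of the Gamma density by its mean μ and a shape parameter k is an assump-
tion chosen to resemble the parameter labeled k in §6.7.8:

    p(y) = \int_0^\infty \frac{e^{-\lambda}\lambda^{y}}{y!}\,
           \frac{(k/\mu)^k \lambda^{k-1} e^{-k\lambda/\mu}}{\Gamma(k)}\,d\lambda
         = \frac{\Gamma(y+k)}{\Gamma(k)\,\Gamma(y+1)}
           \left(\frac{k}{\mu+k}\right)^{k}
           \left(\frac{\mu}{\mu+k}\right)^{y}, \qquad y = 0, 1, 2, \ldots,

which is the Negative Binomial probability mass function.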
An alternative approach to the Poisson/Gamma mixing procedure is to assume that the
linear predictor is a linear mixed model

    η_ij = μ + τ_i + ρ_j + d_ij,                                          [8.15]

where the d_ij are independent Gaussian random variables with mean 0 and variance σ_d². These
additional random variables introduce extra variability into the system associated with the
experimental units. Conditional on d_ij, the poppy counts are again modeled as Poisson ran-
dom variables. The marginal distribution of the counts in the Poisson/Gaussian mixing model
is elusive, however. The integral [8.14] cannot be evaluated in closed form, but it can be
approximated by the methods of §8.3.3. This generalized linear mixed model becomes

    Y_ij | λ_ij ~ Poisson(λ_ij)
    λ_ij = exp{μ + τ_i + ρ_j + d_ij}
    d_ij ~ G(0, σ²).
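
To sketch the quadrature idea of §8.3.3 as it applies here: with a Gaussian mixing density,
substituting d = \sqrt{2}\,\sigma u in [8.14] gives

    p(y) = \int_{-\infty}^{\infty} p(y \mid d)\,
           \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-d^2/(2\sigma^2)}\,dd
         = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty}
           p\big(y \mid \sqrt{2}\,\sigma u\big)\,e^{-u^2}\,du
         \approx \frac{1}{\sqrt{\pi}} \sum_{q=1}^{Q}
           w_q\, p\big(y \mid \sqrt{2}\,\sigma u_q\big),

where u_q and w_q are the Gauss–Hermite abscissas and weights. Adaptive quadrature, the
default of proc nlmixed, additionally centers and scales the nodes for each cluster.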

The proc nlmixed code to fit this model is somewhat lengthy, because treatment and
block effects are classification variables and must be coded inside the procedure. The block
of if .. else .. statements in the code below sets up the linear predictor for the various
combinations of block and treatment effects. The last level of either factor is set to zero and
its effect is absorbed into the intercept. This parameterization coincides with that of proc
genmod. The variance of the d_ij is not estimated directly, because σ² is bounded below by
zero. Instead, we estimate the logarithm of the standard deviation, which can range over the
real line (parameter logsig).
proc nlmixed data=poppies df=14;
parameters intcpt=3.4 bl1=0.3 bl2=0.3 bl3=0.3
tA=1.5 tB=1.5 tC=1.5 tD=1.5 tE=1.5
logsig=0;
if block=1 then linp = intcpt + bl1;
else if block=2 then linp = intcpt + bl2;
else if block=3 then linp = intcpt + bl3;
else if block=4 then linp = intcpt;
if treatment = 'A' then linp = linp + tA;
else if treatment = 'B' then linp = linp + tB;
else if treatment = 'C' then linp = linp + tC;
   else if treatment = 'D' then linp = linp + tD;
   else if treatment = 'E' then linp = linp + tE;
   else if treatment = 'F' then linp = linp;

   lambda = exp(linp + d);

   estimate 'A Lsmean' intcpt+0.25*bl1+0.25*bl2+0.25*bl3+tA;
   estimate 'B Lsmean' intcpt+0.25*bl1+0.25*bl2+0.25*bl3+tB;
   estimate 'C Lsmean' intcpt+0.25*bl1+0.25*bl2+0.25*bl3+tC;
   estimate 'D Lsmean' intcpt+0.25*bl1+0.25*bl2+0.25*bl3+tD;
   estimate 'E Lsmean' intcpt+0.25*bl1+0.25*bl2+0.25*bl3+tE;
   estimate 'F Lsmean' intcpt+0.25*bl1+0.25*bl2+0.25*bl3;
   estimate 'sigma^2' exp(2*logsig);

   model count ~ poisson(lambda);
   random d ~ normal(0,exp(2*logsig)) subject=plot;
run;

The statement lambda = exp(linp + d); calculates the conditional Poisson mean λ_ij.
Notice that the random effect d does not appear in the parameters statement. Only the disper-
sion of d_ij is a parameter of the model. The model statement informs the procedure that the
counts are modeled (conditionally) as Poisson random variables and the random statement
determines the distribution of the random effects. Only the normal() keyword can be used in
the random statement. The first argument of normal() defines the mean of d_ij, the second the
variance σ². Since we are also interested in the estimate of σ², this value is calculated in the
last of the estimate statements. The other estimate statements calculate the "least squares"
means of the treatments on the log scale, averaged across the random effects. The
subject=plot option identifies the experimental unit as the cluster, which yields a model
with a single observation per cluster (Dimensions table in Output 8.3). The degrees of
freedom were set to coincide with the Negative Binomial analysis in §6.7.8. The Poisson
model without overdispersion had 15 deviance degrees of freedom. The Negative Binomial
model estimated one additional parameter (labeled k there). Similarly, the Poisson/Gaussian
model adds one parameter, the variance of the d_ij.
The initial negative log likelihood of 145.18 calculated from the starting values improved
during the 21 iterations that followed. The converged negative log likelihood was 116.89. The
important question is whether the addition of the random variables d_ij to the linear predictor
improved the model over the standard Poisson generalized linear model. If H₀: σ² = 0 can be
rejected, the Poisson/Gaussian model is superior. From the result of the last estimate state-
ment it is seen that the approximate 95% confidence interval for σ² is [0.0260, 0.1582], and
one would conclude that there is extra variation among the experimental units beyond that
accounted for by the Poisson law. A better approach is to fit a Poisson model with linear pre-
dictor η_ij = μ + τ_i + ρ_j and to compare the log likelihoods of the two models with a likeli-
hood ratio test. For the reduced model one obtains a negative log likelihood of 204.978. The
likelihood ratio test statistic Λ = 409.95 − 233.8 = 176.15 is highly significant.
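
A sketch of the reduced-model fit follows; this code is not in the text. It simply removes the
random effect d and the random statement from the program above, so that the reported log
likelihood contains the same constants as the mixed model's. The df=15 setting matches the
deviance degrees of freedom of the Poisson GLM.

proc nlmixed data=poppies df=15;
   parameters intcpt=3.4 bl1=0.3 bl2=0.3 bl3=0.3
              tA=1.5 tB=1.5 tC=1.5 tD=1.5 tE=1.5;
   if block=1 then linp = intcpt + bl1;
   else if block=2 then linp = intcpt + bl2;
   else if block=3 then linp = intcpt + bl3;
   else linp = intcpt;
   if treatment = 'A' then linp = linp + tA;
   else if treatment = 'B' then linp = linp + tB;
   else if treatment = 'C' then linp = linp + tC;
   else if treatment = 'D' then linp = linp + tD;
   else if treatment = 'E' then linp = linp + tE;
   lambda = exp(linp);        /* no random effect d in the reduced model */
   model count ~ poisson(lambda);
run;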
To compare the Poisson/Gaussian model against the Negative Binomial model in §6.7.8, a
likelihood ratio test cannot be employed because the Poisson/Gamma and the Poisson/Gaussian
models are not nested. For comparisons among non-nested models AIC can be used. The
information criteria for the two models are very close (Poisson/Gamma: AIC = 253.1;
Poisson/Gaussian: AIC = 253.8). Note that since both models have the same number of
parameters, the difference of their AIC values equals twice the difference of their log likeli-
hoods. From a statistical point of view either model may be chosen.


Output 8.3.
The NLMIXED Procedure

Specifications
Data Set WORK.POPPIES
Dependent Variable count
Distribution for Dependent Variable Poisson
Random Effects d
Distribution for Random Effects Normal
Subject Variable plot
Optimization Technique Dual Quasi-Newton
Integration Method Adaptive Gaussian
Quadrature
Dimensions
Observations Used 24
Observations Not Used 0
Total Observations 24
Subjects 24
Max Obs Per Subject 1
Parameters 10
Quadrature Points 1

Parameters
intcpt bl1 bl2 bl3 tA tB tC tD tE logsig NegLogLike
3.4 0.3 0.3 0.3 1.5 1.5 1.5 1.5 1.5 0 145.186427

Iteration History

Iter Calls NegLogLike Diff MaxGrad Slope


1 2 141.890476 3.29595 11.45812 -110.573
2 3 135.687876 6.2026 22.94238 -87.5166
3 6 131.122483 4.565393 34.73824 -429.497
... and so forth ...
20 33 116.899994 7.962E-7 0.001561 -1.85E-6
21 35 116.899994 6.486E-8 0.000319 -1.72E-7

NOTE: GCONV convergence criterion satisfied.

Fit Statistics

-2 Log Likelihood 233.8


AIC (smaller is better) 253.8
AICC (smaller is better) 270.7
BIC (smaller is better) 265.6

Parameter Estimates

Standard
Parameter Estimate Error DF t Value Pr > |t| Alpha Lower Upper

intcpt 3.0246 0.2171 14 13.93 <.0001 0.05 2.5589 3.4903


bl1 0.3615 0.1929 14 1.87 0.0820 0.05 -0.05227 0.7753
bl2 0.2754 0.1927 14 1.43 0.1749 0.05 -0.1379 0.6887
bl3 0.7707 0.1897 14 4.06 0.0012 0.05 0.3639 1.1775
tA 2.6272 0.2363 14 11.12 <.0001 0.05 2.1205 3.1339
tB 2.5930 0.2363 14 10.97 <.0001 0.05 2.0862 3.0998
tC 0.9900 0.2418 14 4.09 0.0011 0.05 0.4713 1.5087
tD 0.9256 0.2422 14 3.82 0.0019 0.05 0.4061 1.4452
tE 0.05305 0.2526 14 0.21 0.8367 0.05 -0.4887 0.5948
logsig -1.1925 0.1673 14 -7.13 <.0001 0.05 -1.5512 -0.8337


Output 8.3 (continued).


Additional Estimates
Standard
Label Estimate Error DF t Value Pr > |t| Alpha Lower Upper

A Lsmean 6.0037 0.1538 14 39.04 <.0001 0.05 5.6739 6.3336


B Lsmean 5.9695 0.1538 14 38.80 <.0001 0.05 5.6396 6.2995
C Lsmean 4.3665 0.1625 14 26.87 <.0001 0.05 4.0180 4.7150
D Lsmean 4.3021 0.1631 14 26.38 <.0001 0.05 3.9524 4.6519
E Lsmean 3.4295 0.1793 14 19.13 <.0001 0.05 3.0451 3.8140
F Lsmean 3.3765 0.1794 14 18.82 <.0001 0.05 2.9918 3.7612
sigma^2 0.09209 0.0308 14 2.99 0.0098 0.05 0.0260 0.1582

Of further interest is a comparison of the parameter estimates and the estimates of the
treatment means among the various models that have been fit to these data (Output 8.4; SAS®
code on CD-ROM). We consider
   ① a Poisson generalized linear model;
   ② a Poisson model with multiplicative overdispersion factor;
   ③ a Poisson/Gaussian mixing model;
   ④ a Poisson/Gamma mixing model.

Output 8.4.
                 Poi     Poi/Gauss  Poi/Gam  |   Poi     Poi/OD  Poi/Gauss  Poi/Gam
 effect          Est¹      Est²      Est³    | StdErr⁴  StdErr⁵   StdErr⁶   StdErr⁷

 intcpt         3.286     3.025     3.043    |  .0915    .3783     .2171     .2121
 bl1            .3736     .3615     .3858    |  .0452    .1867     .1929     .1856
 bl2            .2270     .2754     .2672    |  .0466    .1926     .1927     .1864
 bl3            .2940     .7707     .8431    |  .0459    .1898     .1897     .1902
 tA             2.504     2.627     2.630    |  .0895    .3700     .2363     .2322
 tB             2.459     2.593     2.618    |  .0897    .3707     .2363     .2326
 tC             .9440     .9900     .9618    |  .1014    .4193     .2418     .2347
 tD             .8536     .9256     .9173    |  .1028    .4248     .2422     .2371
 tE             .1120     .0530     .0806    |  .1184    .4896     .2526     .2429

 A Lsmean       6.014     6.004     6.048    |  .0247    .1021     .1538     .1510
 B Lsmean       5.969     5.970     6.036    |  .0253    .1044     .1538     .1516
 C Lsmean       4.454     4.366     4.379    |  .0537    .2221     .1625     .1585
 D Lsmean       4.363     4.302     4.335    |  .0562    .2323     .1631     .1597
 E Lsmean       3.622     3.430     3.498    |  .0814    .3365     .1793     .1729
 F Lsmean       3.510     3.377     3.417    |  .0861    .3559     .1794     .1742
 k               .         .        11.44    |   .        .         .        3.750
 sigma^2         .        .0921      .       |   .        .        .0308      .

 ¹: estimates in Poisson models ① and ②; standard errors in ⁴ and ⁵, respectively.
 ²: estimates in Poisson/Gaussian mixing model ③; standard errors in ⁶.
 ³: estimates in Poisson/Gamma mixing model ④; standard errors in ⁷.

A multiplicative overdispersion factor does not alter the estimates, only their precision.
Comparing columns 4 and 5 of Output 8.4 shows the extent to which the regular GLM over-
states the precision of the estimates. Estimates of the parameters as well as the treatment
means (on the log scale) are close for all methods. The standard errors of the two mixing
models are also very close.
An interesting aspect of the generalized linear mixed model is the ability to predict the
cluster-specific responses. Since each experimental unit serves as a cluster (of size one), this
corresponds to predicting the plot-specific poppy counts. If all effects in the model were
fixed, the term d_ij would represent the block × treatment interaction. The model would be
saturated with a deviance of zero and predicted counts would coincide with the observed
counts. In the generalized linear mixed model the d_ij are random variables and only a single
degree of freedom is lost to the estimation of their variance (compared to 3 × 5 = 15 degrees of
freedom for a fixed effects interaction). After calculating the predictors d̂_ij of the random
effects, cluster-specific predictions of the counts are obtained as

    Ŷ_ij | d_ij = exp{μ̂ + τ̂_i + ρ̂_j + d̂_ij}.

In proc nlmixed this is accomplished by adding the statement

    predict exp(linp + d) out=sspred;

to the code above. Taking a Taylor series of exp{μ + τ_i + ρ_j + d_ij} about E[d_ij] = 0, the
marginal average E[Y_ij] can be approximated as exp{μ + τ_i + ρ_j} and estimated as
exp{μ̂ + τ̂_i + ρ̂_j}. These PA predictions can be obtained with the nlmixed statement

    predict exp(linp) out=papred;
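
A display like Figure 8.6 can then be produced from the two output data sets. This is a mini-
mal sketch; the predict statement of proc nlmixed names its prediction variable Pred, and
both data sets carry the input variables in input order, so a one-to-one merge suffices:

data both;
   merge sspred(rename=(pred=ss_pred))
         papred(keep=pred rename=(pred=pa_pred));
run;
proc gplot data=both;
   plot ss_pred*count pa_pred*count / overlay;  /* predicted vs. observed */
run; quit;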

[Figure 8.6: scatter plot of predicted counts (0 to 600) against observed counts (0 to 500),
with separate symbols for the subject-specific Poisson/Gaussian, population-average
Poisson/Gaussian, and population-average Poisson/Gamma predictions; the points for
treatment A and treatment B in block 3 are labeled.]

Figure 8.6. Predicted vs. observed poppy counts in Poisson/Gaussian and Poisson/Gamma
(= Negative Binomial) mixing models.

The population-averaged and cluster-specific predictions are plotted against the observed
poppy counts in Figure 8.6. The cluster-specific predictions for the Poisson/Gaussian model
are very close to the 45° line, but the model is not saturated; the predictions do not reproduce
the data. There are still fourteen degrees of freedom left! The Negative Binomial model based
on Poisson/Gamma mixing does not provide the opportunity to predict poppy counts at the
plot level because the conditional distribution is not involved at any stage of estimation. The
PA predictions of the two mixing models are very similar, as is expected from the agreement
of their parameter estimates (Output 8.4). The predicted values for treatments A and B in
block 3 do not concur well with the observed counts, however. There appears to be a block ×
treatment interaction, which is even more evident when plotting the PA predicted counts by
treatment against blocks (Figure 8.7).

[Figure 8.7: predicted counts (0 to 600) plotted against blocks 1 through 4, with profiles
labeled A, B, C/D, and E/F.]

Figure 8.7. Population-averaged predicted cell means in Poisson/Gaussian mixing model.

8.4.3 Repeated Measures with an Ordinal Response


The proportional odds model (POM) of McCullagh (1980) was introduced in §6.5 as an ex-
tension of generalized linear models to model an ordinal response. Recall that the POM is a
special case of a cumulative link model where the probability that an observation falls into
category 4 or below is modeled. In the case of a logit link with only two categories (a binary
response) the POM reduces to a standard logistic regression or classification model. As with
any other response, repeated measures are a frequent occurrence in agronomic investigations.
They give rise to clustered data structures with correlations among the repeat observations on
the same experimental unit that must be accounted for in the analysis, whether the response is
univariate or multivariate (as is the case for ordinal data).
The data in Table 8.4 stem from an experiment studying turfgrass quality for five
varieties. The varieties were applied independently to seventeen (varieties 2 and 3) or
eighteen (varieties 1, 4, and 5) plots. The plots were visited in May, July, and September of
the growing season and turf quality was rated on a three-point ordinal scale as low, medium,
or excellent.


Table 8.4. Repeated measures of turfgrass ratings in three categories
(Numbers represent the number of plots on which a particular rating was observed)

           No. of         May               July            September
 Variety   Plots    Low  Med.  Exc.   Low  Med.  Exc.   Low  Med.  Exc.
    1        18      4    10    4      1    9     8      0    12    6
    2        17      2    11    4      0    7    10      0     9    8
    3        17      2    11    4      2    8     7      2    11    4
    4        18      8     7    3      4    8     6      4    13    1
    5        18      1    11    6      3    4    11      3     6    9

It appears that the probability of observing a low rating decreases over time, and the proba-
bility of excellent turf quality appears to be largest in July. Varietal differences in the rating
distributions seem to be minor. To confirm the presence or absence of varietal effects, trends
over time, and possibly variety × time interactions, a proportional odds model containing
these effects is fit. A standard model ignoring the possibility of correlations over time can be
fit with the genmod procedure in The SAS® System (see §§6.5.2, 6.7.4, and 6.7.5 for
additional code examples):
data counts;
input rating $ variety month count;
datalines;
low 1 5 4
med 1 5 10
xce 1 5 4
low 1 7 1
med 1 7 9
xce 1 7 8
med 1 9 12
xce 1 9 6
and so forth ...
;;
run;

proc genmod data=counts;


class variety;
model rating = variety month month*month variety*month
/ link=cumlogit dist=multinomial type3;
freq count;
run;

This model incorporates both linear and quadratic time effects because the data in Table 8.4
suggest that the rating distributions in July may differ from those in May or September.
The term variety*month models differences in the linear slopes among the varieties.
The fit achieves a −2 log likelihood of 483.12 (Output 8.5). From the LR Statistics
For Type 3 Analysis table it is seen that only the linear and quadratic effects in time appear to
be significant. Varietal differences in the rating distributions appear to be absent
(p = 0.7299) and trends over time appear not to differ among the five varieties (p = 0.7810).


Output 8.5.
The GENMOD Procedure

Model Information
Data Set WORK.COUNTS
Distribution Multinomial
Link Function Cumulative Logit
Dependent Variable rating
Frequency Weight Variable count
Observations Used 42
Sum Of Frequency Weights 264
Probabilities Modeled Pr( Low Ordered Values of rating )

Class Level Information


Class Levels Values
variety 5 1 2 3 4 5

Response Profile
Ordered Ordered
Level Value Count
1 low 36
2 med 137
3 xce 91

Criteria For Assessing Goodness Of Fit


Criterion DF Value Value/DF
Log Likelihood -241.5596
Algorithm converged.

Analysis Of Parameter Estimates


Standard Wald 95% Chi- Pr >
Parameter DF Estimate Error Confidence Limits Square ChiSq

Intercept1 1 7.1408 3.2141 0.8413 13.4403 4.94 0.0263


Intercept2 1 9.9000 3.2451 3.5398 16.2603 9.31 0.0023
variety 1 1 1.5787 1.6553 -1.6657 4.8232 0.91 0.3402
variety 2 1 1.2614 1.6611 -1.9943 4.5171 0.58 0.4476
variety 3 1 0.0813 1.6658 -3.1835 3.3462 0.00 0.9611
variety 4 1 1.7832 1.6690 -1.4879 5.0543 1.14 0.2853
variety 5 0 0.0000 0.0000 0.0000 0.0000 . .
month 1 -2.8467 0.9331 -4.6756 -1.0178 9.31 0.0023
month*month 1 0.1975 0.0658 0.0686 0.3264 9.02 0.0027
month*variety 1 1 -0.1559 0.2299 -0.6065 0.2947 0.46 0.4977
month*variety 2 1 -0.1773 0.2321 -0.6322 0.2777 0.58 0.4450
month*variety 3 1 0.0810 0.2332 -0.3760 0.5380 0.12 0.7282
month*variety 4 1 -0.0328 0.2312 -0.4860 0.4204 0.02 0.8871
month*variety 5 0 0.0000 0.0000 0.0000 0.0000 . .
Scale 0 1.0000 0.0000 1.0000 1.0000

NOTE: The scale parameter was held fixed.

LR Statistics For Type 3 Analysis


Chi-
Source DF Square Pr > ChiSq
variety 4 2.03 0.7299
month 1 10.07 0.0015
month*month 1 9.22 0.0024
month*variety 4 1.75 0.7810


Since this model does not account for correlations over time, it is difficult to say whether
these findings persist once the temporal correlations are incorporated, in particular because
modeling the correlations through random effects will change not only the standard
error estimates but the estimates of the model coefficients themselves. Positive autocorrela-
tion leads to overdispersed data, and one approach to remedy the situation is to formulate a
mixed proportional odds model where, given some random effects, the data follow a POM,
and to perform maximum likelihood inference based on the marginal distribution of the data.
This indirect approach of modeling correlations (see §7.5.1 for the distinction between direct
and induced correlation models) is reasonable in models for correlated data where the mean
function is nonlinear.
Using the same regressors and fixed effects as in the previous fit, we now add a random
effect that models the plot-to-plot variability. This is reasonable since treatments have been
assigned at random to plots, because extra variation is likely to be related to excess variation
among the experimental units, and because the plots have been remeasured (they are the
clusters). This fit is obtained with the nlmixed procedure in SAS®. We note in passing that
because proc nlmixed uses an integral approximation based on quadrature, this modeling
approach is identical to the one put forth by Jansen (1990) for ordinal data with over-
dispersion and by Hedeker and Gibbons (1994) for clustered ordinal data.
The data must be set up differently for the nlmixed run, however. The counts data
set used with proc genmod lists for all varieties and months the number of plots that were
assigned a particular rating (variable count). The data set CountProfiles used to fit the
mixed model variety of the POM contains the number of response profiles over time. The
first three observations show one unique response profile for variety 1: a low rating in May
was followed by two medium ratings in July and September. Two of the 18 plots for this
variety exhibited that particular response profile (variable count). The remaining triplets of
observations in the data set CountProfiles give the response profiles for this and the other
varieties. The sub variable identifies the clusters for this study, corresponding to the plots. It
works in conjunction with the replicate statement of proc nlmixed. The first triplet of
observations is identified as belonging to the same plot (cluster), and the value of the count
variable determines that there are two plots (experimental units) with this response profile.
data CountProfiles;
label rating = '1=low, 2=medium, 3=excellent';
input rating variety month sub count;
datalines;
1 1 5 1 2
2 1 7 1 2
2 1 9 1 2
1 1 5 2 1
2 1 7 2 1
3 1 9 2 1
1 1 5 3 1
3 1 7 3 1
2 1 9 3 1
and so forth ...
;;
run;
proc nlmixed data=CountProfiles;
parms i1=7.14 i2=9.900 /* cutoffs */
v1=1.57 v2=1.26 v3=0.08 v4=1.783 /* variety effects */
m=-2.85 m2=0.197 /* month and month^2 slope */
mv1=-0.15 mv2=-0.17 mv3=0.08 mv4=-0.03 /* Variety spec. slopes */
sd=1; /* standard deviation of random plot errors */


   if variety=1 then linp = v1 + m*month + m2*month*month + mv1*month;
   else if variety=2 then linp = v2 + m*month + m2*month*month + mv2*month;
   else if variety=3 then linp = v3 + m*month + m2*month*month + mv3*month;
   else if variety=4 then linp = v4 + m*month + m2*month*month + mv4*month;
   else linp = m*month + m2*month*month;

   linp = linp + ploterror;

   /* Now build the category probabilities */
   if (rating=1) then do;
      catprob = 1/(1+exp(-i1-linp));
   end; else if (rating=2) then do;
      catprob = 1/(1+exp(-i2-linp)) - 1/(1+exp(-i1-linp));
   end; else catprob = 1 - 1/(1+exp(-i2-linp));

   /* Now build the log-likelihood function */
   if (catprob > 1e-8) then ll=log(catprob); else ll=-1e100;
   model rating ~ general(ll);
   random ploterror ~ normal(0,sd*sd) subject=sub;
   replicate count;
run;

The block of if .. else .. statements sets up the linear predictor apart from the two
cutoffs (parameters i1 and i2) needed to model a three-category ordinal response and the
random plot effect. The latter is added in the linp = linp + ploterror; statement. The
second block of if .. then .. else .. statements calculates the category probabilities
from which the multinomial log likelihood is built. Should a category probability be very
small, a log-likelihood contribution of −10¹⁰⁰ is assigned to avoid computational inaccuracies
when taking the logarithm of a quantity close to zero. The random statement models the plot
errors as Gaussian random variables with mean zero and variance σ² = sd*sd. In the vernac-
ular of mixing models, this is a Multinomial/Gaussian model. The replicate statement
identifies the variable in the data set which indicates the number of response profiles for a
particular variety. This statement must not be confused with the repeated statement of the
mixed procedure. As starting values for proc nlmixed the estimates from Output 8.5 were
chosen. The starting value for the standard deviation of the random plot errors was guessed.
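
In symbols, the three catprob branches above compute the cumulative-logit category
probabilities; with η denoting linp (including the random plot effect) and α₁ ≤ α₂ the
cutoffs i1 and i2,

    \Pr(Y = 1) = \frac{1}{1+\exp\{-\alpha_1-\eta\}}, \qquad
    \Pr(Y = 2) = \frac{1}{1+\exp\{-\alpha_2-\eta\}}
               - \frac{1}{1+\exp\{-\alpha_1-\eta\}}, \qquad
    \Pr(Y = 3) = 1 - \frac{1}{1+\exp\{-\alpha_2-\eta\}}.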
The Dimensions table shows that the data have been set up properly (Output 8.6).
Although there are three observations in each response profile, the replicate statement uses
only the last observation in each profile to determine the number of plots that have the partic-
ular profile. The number of clusters is correctly determined as 88 and the number of repeated
measurements as three (Max Obs Per Subject). The adaptive quadrature determined that three
quadrature points provided sufficient accuracy in the integration problem.
The procedure required thirty-four iterations until further updates did not provide an im-
provement in the log likelihood. The −2 log likelihood at convergence of 456.0 is consider-
ably less than that of the independence model (483.1, Output 8.5). The difference of 27.1 is
highly significant (Pr(χ²₁ > 27.1) < 0.0001), an improvement over the independence model
brought about only by the inclusion of the random plot errors.
From the 95% confidence bounds on the parameters it is seen that the linear and quadrat-
ic time effects (m and m2) are significant; their bounds do not include zero. The confidence
interval for the standard deviation of the plot errors also does not include zero, supporting the
finding obtained by the likelihood ratio test that the inclusion of the random plot errors is a
significant improvement of the model.


Output 8.6.
The NLMIXED Procedure

Specifications
Data Set WORK.COUNTPROFILES
Dependent Variable rating
Distribution for Dependent Variable General
Random Effects ploterror
Distribution for Random Effects Normal
Subject Variable sub
Replicate Variable count
Optimization Technique Dual Quasi-Newton
Integration Method Adaptive Gaussian Quadrature

Dimensions
Observations Used 129
Observations Not Used 0
Total Observations 129
Subjects 88
Max Obs Per Subject 3
Parameters 13
Quadrature Points 3

Parameters
i1 i2 v1 v2 v3 v4 m
7.14 9.9 1.57 1.26 0.08 1.783 -2.85
m2 mv1 mv2 mv3 mv4 sd NegLogLike
0.197 -0.15 -0.17 0.08 -0.03 1 233.22016

Iteration History
Iter Calls NegLogLike Diff MaxGrad Slope
1 5 233.216559 0.003601 35.97236 -213.078
2 8 232.974984 0.241575 197.669 -0.05738
3 10 232.018721 0.956263 11.48055 -2.90352
ã
33 66 227.995816 0.000395 0.018264 -0.00075
34 68 227.995816 1.656E-7 0.003131 -3.05E-7

NOTE: GCONV convergence criterion satisfied.

Fit Statistics
-2 Log Likelihood 456.0
AIC (smaller is better) 482.0
AICC (smaller is better) 485.2
BIC (smaller is better) 514.2

Parameter Estimates
Standard
Parameter Estimate Error DF t Value Pr > |t| Alpha Lower Upper

i1 9.0840 3.6818 87 2.47 0.0156 0.05 1.7660 16.402


i2 12.8787 3.7694 87 3.42 0.0010 0.05 5.3866 20.371
v1 2.2926 1.9212 87 1.19 0.2360 0.05 -1.5260 6.111
v2 1.8926 1.9490 87 0.97 0.3342 0.05 -1.9812 5.766
v3 0.1835 1.9142 87 0.10 0.9239 0.05 -3.6213 3.988
v4 2.2680 1.9586 87 1.16 0.2500 0.05 -1.6249 6.161
m -3.6865 1.0743 87 -3.43 0.0009 0.05 -5.8217 -1.551
m2 0.2576 0.07556 87 3.41 0.0010 0.05 0.1074 0.408
mv1 -0.2561 0.2572 87 -1.00 0.3221 0.05 -0.7673 0.255
mv2 -0.2951 0.2631 87 -1.12 0.2650 0.05 -0.8179 0.228
mv3 0.08752 0.2562 87 0.34 0.7335 0.05 -0.4218 0.597
mv4 -0.03720 0.2595 87 -0.14 0.8864 0.05 -0.5530 0.479
sd 1.5680 0.2784 87 5.63 <.0001 0.05 1.0146 2.121


The confidence bounds for the varieties and the variety × month interaction terms
include zero, and at first glance one would conclude that there are no varietal effects at work.
Because of the coding of the classification variables, v1, for example, does not measure the
intercept for variety 1, but rather the difference between the intercepts of varieties 1 and 5. To
test the significance of various effects we consider the model whose output is shown in Output
8.6 as the full model and fit various reduced versions of it (Table 8.5).
The likelihood ratio test statistics (Λ) and p-values represent comparisons to the full
model ①. Removing the variety × month interaction from the model does not significantly
impair the fit (p = 0.508), but removing any other combination of effects in addition to the
interaction does worsen the model. Based on these results one could adopt model ② as the
new full model. Since variety effects have not been removed by themselves, one can test their
significance by comparing the −2 log likelihoods of models ② and ⑤. The p-value is calcu-
lated as

    p = Pr(χ²₄ > 471.7 − 459.3) = Pr(χ²₄ > 12.4) = 0.014.

Variety effects are significant in model ②.
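
The p-values in Table 8.5 below are simple chi-square tail probabilities and can be repro-
duced with the PROBCHI function; a minimal sketch with values hard-coded from the table:

data pvals;
   p_variety     = 1 - probchi(471.7 - 459.3, 4);  /* model 5 vs. model 2 */
   p_interaction = 1 - probchi(459.3 - 456.0, 4);  /* model 2 vs. model 1 */
run;
proc print data=pvals; run;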

Table 8.5. −2 log likelihoods for various mixed models fitted to the repeated
measures turf ratings (All models contain a random plot effect)

                       Fixed Effects
 Model   included                   dropped               −2logL   df†    Λ       p
 ①       Variety, t, t², Variety×t  —                      456.0    —     —       —
 ②       Variety, t, t²             Variety×t              459.3    4     3.3   0.508
 ③       Variety, t                 Variety×t, t²          472.8    5    16.8   0.005
 ④       Variety                    Variety×t, t, t²       475.7    6    19.7   0.003
 ⑤       t, t²                      Variety, Variety×t     471.7    8    15.7   0.046

 †: df denotes the number of degrees of freedom dropped compared to the full model ①.

Output 8.7.
                           Parameter Estimates

                  Standard
 Parameter Estimate   Error  DF  t Value  Pr > |t|  Alpha    Lower    Upper

 i1          9.6333  3.5009  87     2.75    0.0072   0.05   2.6748   16.591
 i2         13.3522  3.5946  87     3.71    0.0004   0.05   6.2075   20.497
 v1          0.4999  0.6644  87     0.75    0.4538   0.05  -0.8206    1.821
 v2         -0.1523  0.6778  87    -0.22    0.8227   0.05  -1.4995    1.195
 v3          0.7909  0.6720  87     1.18    0.2424   0.05  -0.5448    2.127
 v4          1.9887  0.6803  87     2.92    0.0044   0.05   0.6365    3.341
 m          -3.7288  1.0575  87    -3.53    0.0007   0.05  -5.8307   -1.627
 m2          0.2538 0.07494  87     3.39    0.0011   0.05   0.1048    0.403
 sd          1.5205  0.2729  87     5.57    <.0001   0.05   0.9781    2.063


The model finally selected is ② and its parameter estimates are shown in Output 8.7.
From these estimates the variety-specific probability distributions over time can be calculated
(Figure 8.8). Perhaps surprisingly, the drop in low-rating probabilities is less striking than
appears in Table 8.4. Except for variety 4, the probability of receiving a low rating remains
constant throughout the three-month period. Excellent ratings are most common around July,
but only for varieties 2 and 5 is excellent turf quality in that period more likely than medium
quality.
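
The curves in Figure 8.8 can be reproduced from Output 8.7 with a short data step. This is a
sketch, with estimates hard-coded from the output and the category probabilities evaluated at
a plot effect of zero:

data catprobs;
   array v{5} _temporary_ (0.4999 -0.1523 0.7909 1.9887 0);  /* v5 = 0 */
   i1 = 9.6333; i2 = 13.3522; m = -3.7288; m2 = 0.2538;
   do variety = 1 to 5;
      do month = 5 to 9 by 0.1;
         linp  = v{variety} + m*month + m2*month*month;
         p_low = 1/(1+exp(-i1-linp));
         p_med = 1/(1+exp(-i2-linp)) - p_low;
         p_exc = 1 - p_low - p_med;
         output;
      end;
   end;
   keep variety month p_low p_med p_exc;
run;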

[Figure 8.8: five panels, one per variety, showing Pr(low rating), Pr(medium rating), and
Pr(excellent rating) as functions of month (5 through 9).]

Figure 8.8. Change in category probabilities over time by varieties.

[Figure 8.9: five panels, one per variety, showing the logits of the cumulative probabilities
of a low rating and of at most a medium rating as functions of month (5 through 9).]

Figure 8.9. Logits of cumulative predicted probabilities.


The linear portion of the model is best interpreted on the logit scale (Figure 8.9). The
final model contains varietal differences and linear and quadratic time effects. Variety 4 in
particular has elevated intercepts compared to the other entries. From Output 8.7 we see that
the confidence interval [0.6365, 3.341] for coefficient v4 does not contain zero. Since variety
5 is the benchmark, this implies that varieties 4 and 5 are significantly different in the
elevation of the lines in Figure 8.9. The statements to fit the selected model with proc
nlmixed and to obtain pairwise comparisons of the variety effects (intercepts) follow below.
Results of the pairwise comparisons are shown in Output 8.8. At the 5% significance level
variety 4 differs significantly in the rating probability distributions from varieties 1, 2,
and 5 (p = 0.0284, 0.0027, and 0.0044, respectively).

proc nlmixed data=CountProfiles;
   parms i1=5.024 i2=7.793 v1=1.61 v2=0.88 v3=0.337 v4=2.155
         m=-2.34 m2=0.168 sd=1;
   v5 = 0;
   array vv{5} v1-v5;

   linp = vv{variety};
   linp = linp + m*month + m2*month*month + ploterror;

   if (rating=1) then catprob = 1/(1+exp(-i1-linp));
   else if (rating=2) then catprob = 1/(1+exp(-i2-linp)) - 1/(1+exp(-i1-linp));
   else catprob = 1 - 1/(1+exp(-i2-linp));
   if (catprob > 1e-8) then ll=log(catprob); else ll=-1e100;
   model rating ~ general(ll);
   random ploterror ~ normal(0,sd*sd) subject=sub;
   replicate count;
   /* pairwise treatment comparisons */
   estimate 'v1-v2' v1-v2;
   estimate 'v1-v3' v1-v3;
   estimate 'v1-v4' v1-v4;
   estimate 'v1-v5' v1;
   estimate 'v2-v3' v2-v3;
   estimate 'v2-v4' v2-v4;
   estimate 'v2-v5' v2;
   estimate 'v3-v4' v3-v4;
   estimate 'v3-v5' v3;
   estimate 'v4-v5' v4;
run;

Output 8.8.
                          Additional Estimates

              Standard
 Label  Estimate   Error  DF  t Value  Pr > |t|  Alpha    Lower    Upper

 v1-v2    0.6522  0.6716  87     0.97    0.3342   0.05  -0.6826   1.9871
 v1-v3   -0.2910  0.6630  87    -0.44    0.6618   0.05  -1.6087   1.0268
 v1-v4   -1.4887  0.6680  87    -2.23    0.0284   0.05  -2.8164  -0.1610
 v1-v5    0.4999  0.6644  87     0.75    0.4538   0.05  -0.8206   1.8205
 v2-v3   -0.9432  0.6806  87    -1.39    0.1693   0.05  -2.2959   0.4095
 v2-v4   -2.1410  0.6929  87    -3.09    0.0027   0.05  -3.5182  -0.7638
 v2-v5   -0.1523  0.6778  87    -0.22    0.8227   0.05  -1.4995   1.1949
 v3-v4   -1.1977  0.6698  87    -1.79    0.0772   0.05  -2.5291   0.1336
 v3-v5    0.7909  0.6720  87     1.18    0.2424   0.05  -0.5448   2.1267
 v4-v5    1.9887  0.6803  87     2.92    0.0044   0.05   0.6365   3.3409

Chapter 9

Statistical Models for Spatial Data

Space. The Frontier. Finally!

9.1 Changing the Mindset
    9.1.1 Samples of Size One
    9.1.2 Random Functions and Random Fields
    9.1.3 Types of Spatial Data
    9.1.4 Stationarity and Isotropy — the Built-in Replication Mechanism
          of Random Fields
9.2 Semivariogram Analysis and Estimation
    9.2.1 Elements of the Semivariogram
    9.2.2 Parametric Isotropic Semivariogram Models
    9.2.3 The Degree of Spatial Continuity (Structure)
    9.2.4 Semivariogram Estimation and Fitting
9.3 The Spatial Model
9.4 Spatial Prediction and the Kriging Paradigm
    9.4.1 Motivation of the Prediction Problem
    9.4.2 The Concept of Optimal Prediction
    9.4.3 Ordinary and Universal Kriging
    9.4.4 Some Notes on Kriging
    9.4.5 Extensions to Multiple Attributes
9.5 Spatial Regression and Classification Models
    9.5.1 Random Field Linear Models
    9.5.2 Some Philosophical Considerations
    9.5.3 Parameter Estimation
9.6 Autoregressive Models for Lattice Data
    9.6.1 The Neighborhood Structure
    9.6.2 First-Order Simultaneous and Conditional Models
    9.6.3 Parameter Estimation
    9.6.4 Choosing the Neighborhood Structure
9.7 Analyzing Mapped Spatial Point Patterns
    9.7.1 Introduction
    9.7.2 Random, Aggregated, and Regular Patterns — the Notion of
          Complete Spatial Randomness
    9.7.3 Testing the CSR Hypothesis in Mapped Point Patterns
    9.7.4 Second-Order Properties of Point Patterns
9.8 Applications
    9.8.1 Exploratory Tools for Spatial Data —
          Diagnosing Spatial Autocorrelation with Moran's I
    9.8.2 Modeling the Semivariogram of Soil Carbon
    9.8.3 Spatial Prediction — Kriging of Lead Concentrations
    9.8.4 Spatial Random Field Models — Comparing C/N Ratios among
          Tillage Treatments
    9.8.5 Spatial Random Field Models — Spatial Regression of
          Soil Carbon on Soil N
    9.8.6 Spatial Generalized Linear Models — Spatial Trends in the
          Hessian Fly Experiment
    9.8.7 Simultaneous Spatial Autoregression — Modeling Wiebe's
          Wheat Yield Data
    9.8.8 Point Patterns — First- and Second-Order Properties of a
          Mapped Pattern

9.1 Changing the Mindset

9.1.1 Samples of Size One


We could call this section Introduction to Statistical Models for Spatial Data, but the entire
chapter is a mere introduction (superficial, at best) to the topic of statistical analysis of spatial
data. We will only scratch the surface of many important issues such as the analysis of lattice
data, cokriging or the modeling of spatial point patterns. Some of the methods of spatial data
analysis are discussed only by way of application in §9.8 without a detailed precursor. On the
other hand this section is hopefully more than an introduction to what follows later in this
chapter. What we are trying to achieve is a change in mindset, a way of looking at data that
differs substantively from any of the viewpoints we have taken in previous chapters. This is
necessary not only to convey the special standing statistical techniques for spatial data should
be awarded in the research worker's toolbox but also to underline the differences in subject
matter origin and mathematical-statistical content from the methods discussed so far. No other
area of statistical endeavor promises to impact the plant and soil scientist's approach to data
collection and analysis as spatial statistics does. And no other area requires tools that are
further removed from the statistical topics to which students, scholars, and research workers
have been traditionally exposed.
Observing spatial data entails the recording of an attribute of interest and the attribute's
location. Parting with notation used earlier, we denote the attribute of interest being measured
by Z and the location at which we observe this attribute as s. A spatial observation is then de-
noted Z(s), the observation of attribute Z at location s. The bold-faced vector notation s is
used to emphasize that s typically contains multidimensional coordinates. The case we will
consider throughout this chapter is where s is a point in ℝ², two-dimensional Euclidean space,
and the elements of s represent longitude and latitude in the plane. As an example, consider
yield monitoring a corn field, which may give rise to 1,000 spatially referenced observations.
The data consist of Z(s₁), ..., Z(s₁₀₀₀), where Z(s_i) denotes the corn yield at location s_i.
Should we think of these 1,000 observations as a (random) sample of corn yields of size
n = 1,000? First we note that the sample locations were not chosen at random, since the com-
bine collects samples at systematic intervals. Second, it is (fairly) obvious that the 1,000 ob-
servations cannot possibly be independent, as a random sample would imply. If you were
given the information that Z(s₅) is 135 bushels per acre, how surprised would you be to find
out that the next observation, Z(s₆), collected only a few feet from Z(s₅), was 142 bushels
per acre? We would not be surprised at all. In fact, we would be surprised if Z(s₆) = 21
bushels per acre. This phenomenon is sometimes referred to as Tobler's law of geography:
"Everything is related to everything else, but near things are more related than distant things."
Tobler's law of geography (Tobler 1970) instructs that we should expect relationships
between spatially distributed quantities and that the strength of the relationships is a function
of their spatial separation. In the sequel we will define and model numerous mathematical
forms of this sentiment. But there is a deeper issue to be considered here.
When the biomass of a random sample of fifty plants is observed, fifty realizations from
a univariate distribution, the distribution of plant biomass, are obtained. If we measure not
only the total plant biomass but the above- and below-ground biomass, we obtain fifty
realizations from a bivariate distribution. The below- and above-ground biomass of a single
plant are a sample of size one from this bivariate distribution. If Y₁ denotes above-ground and
Y₂ below-ground biomass, the realized value of this single observation may be y = [1.6, 0.5]′
(in appropriate units).

[Figure 9.1: panel (a) shows a univariate random variable, Y ~ G(0, 1), as a density curve
with a single realization y = 1.43; panel (b) shows a bivariate random variable [Y₁, Y₂]′
distributed bivariate Gaussian with means 0, variances 1, and covariance 0.3, with a single
realization [y₁, y₂]′ = [1.6, 0.5]′; panel (c) shows a spatial random field {Z(s): s ∈ D ⊆ ℝ²},
where the surface drawn is itself the single realization.]

Figure 9.1. Univariate (a) and bivariate (b) random variables and the realization of a
stochastic process in ℝ² (c). In panels (a) and (b), the graph represents the distribution
(process) from which a single realization is drawn. In panel (c) the graph is the realization.

What is obtained by measuring crop yield at 200 locations in a wheat field, by
recording the locations of a group of red-cockaded woodpeckers during the month of July, by
recording the locations of craters on the surface of the moon, or the proportion of farmers
employing site-specific management by county in Virginia? The answer, perhaps surprisingly,
is: a single realization of an n-dimensional random variable. Notice the difference between one
realization of above- and below-ground biomass of a plant and measuring lime requirement at
two spatial locations. In the former case two different attributes are observed, whereas in the
latter the same attribute is measured at different locations.
To emphasize this viewpoint of spatial data, recall the notion of clustering in data from
§2.4 and §2.6. Assume the data consist of k clusters of size n_i, so that the total number of obser-
vations is Σᵢ₌₁ᵏ n_i = n. We make the implicit assumptions that observations from different
clusters are uncorrelated but that observations from the same cluster may be correlated, for
example, because they represent repeated measures. The response vector Y_i for the ith cluster
is an (n_i × 1) vector and the observed response y_i is a single realization from an n_i-dimen-
sional distribution. The case of unclustered data is a special case of this structure with n_i = 1,
and the number of observations is equal to the number of clusters. In that case, y_i is a single
realization from a univariate (1-dimensional) distribution. Spatial data are also a special case of
a clustered structure (k = 1). They represent a single cluster, so to speak.
Figure 9.1 puts these notions into perspective. The graphs shown in panels (a) and (b)
represent the particular population or distribution from which individual realizations are
drawn. In the case of spatial data (panel c), the figure represents the realization itself (the
draw) that is obtained. To summarize, spatial observations Z(s₁), ..., Z(sₙ) are not the same
variable Z observed n times over, but the variables Z(s₁), Z(s₂), ..., Z(sₙ) observed once.
But what kind of distribution or process do we draw from to generate the realization in Figure
9.1c?

9.1.2 Random Functions and Random Fields


Box 9.1  Random Fields

• A set of spatial data is considered a realization of a random experiment. For
  any outcome ω of the experiment, a single realization of Z(s) is obtained.
  This is the realization of a random field, a stochastic process.

• Z(s₀) is a random variable by considering the distribution of all possible
  realizations at the location s₀.

• When a random field is sampled, samples are drawn from one particular
  realization of the random experiment.

Consider Z(s) as a function of the spatial coordinates s that are elements of a set D, which
we call the domain. For now assume that D is a continuous set. To incorporate stochastic
behavior (randomness), Z(s) is considered the realization of a random experiment. To make
the dependence on the random experiment explicit we use the notation Z(s, ω) for the time
being, where ω is the outcome of a particular experiment. Hence, we are really concerned
with a function of two variables, s and ω. This is called a random function because the surface
obtained depends on a random experiment. Figure 9.2 shows four realizations of a random
function where the domain is the rectangle (0, 5) × (0, 5). Imagine a soil sample is poured
from a bucket onto a flat surface and the depth of the soil is measured. This may be the reali-
zation in the upper left panel of Figure 9.2. Now put the soil back in the bucket and pour it
again. This produces the realization in the upper right panel of the figure, and so forth. Every
pouring ω constitutes a random experiment; the result is a function Z(·, ω).

Figure 9.2. Four realizations of a random function in ℝ². The domain D is continuous and
given by the rectangle (0, 5) × (0, 5). Realizations at s₀ = [2, 3] are shown as dots.
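
Displays like Figure 9.2 can be reproduced by simulating a Gaussian random field. Here is a
minimal sketch using the SIM2D procedure of SAS/STAT (unconditional simulation); the
covariance form, scale, range, and seed are illustrative assumptions, and the syntax should be
checked against the documentation of your release:

proc sim2d outsim=realizations;
   grid x=0 to 5 by 0.25 y=0 to 5 by 0.25;   /* the domain (0,5) x (0,5) */
   simulate numreal=4 form=exp scale=1.0 range=1.0 seed=2301;
   mean 0;                                   /* constant mean surface    */
run;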

The mechanism by which we consider the attribute at a particular location s₀ to be a ran-
dom variable is to imagine all possible realizations (all possible outcomes ω) of the random
experiment at that location. In Figure 9.2 four realizations at s₀ = [2, 3] are shown as dots.
The distribution of the attribute at s₀ over all possible realizations of the random function is
the probability distribution of the random variable Z(s₀, ·).
In what follows we will ignore the explicit dependence of the random function on the
random experiment ω and simply refer to Z(s), the attribute at location s, which is a random
variable by this mechanism. It is important to note, however, that the randomness of Z(s)
stems from a super-population model that is alluded to in Figure 9.2. This has several
important ramifications. Whether we sample the spatial attribute by randomly placing sample
locations in D or with a systematic grid has no bearing on the randomness of Z(s). A spatial
attribute is not considered random because we performed random sampling. Even if we
observed Z(s) everywhere in D with an exhaustive sampling procedure (which is impossible
if D is continuous), we would be assessing only one realization of the random function
Z(s, ω). The sample is drawn from a single panel of Figure 9.2. Again, this underlines the
notion that a sample Z(s₁), ..., Z(sₙ) is a sample of size one.

In mathematical statistics random functions are known as stochastic processes. Those
processes where D is two- or higher-dimensional are also called random fields. This nomen-
clature has nothing to do with the agricultural notion of a field, although the theory of random
fields is fruitfully applied there. The upper-case/lower-case distinction of random variables
that we have maintained so far is somewhat difficult to uphold for random functions without
making notation too cumbersome. What do we mean by Z(s)? The random realizations of the
function or the random variable at location s? Similarly, does z(s) represent the realization at
location s or the realization of the function itself? In what follows we suppress the explicit de-
pendence of the random function on ω and use notation that is common in the stochastic proc-
ess literature:

    {Z(s): s ∈ D ⊆ ℝ²}                                                    [9.1]

denotes a spatial random field with a two-dimensional domain. The attribute of interest, Z, is
a stochastic process with domain (or index set) D, which itself is a subset of ℝ². When we
have in mind the random variable at s, we use Z(s) and denote its realization as z(s). The
vector of all observations is denoted Z(s) = [Z(s₁), ..., Z(sₙ)]′. Definition [9.1] is quite
abstract, but it can be fleshed out by considering various types of spatial data.

9.1.3 Types of Spatial Data

Box 9.2  Spatial Data Types

• Three categories of spatial data are distinguished: geostatistical data, lattice
  data, and spatial point patterns.

• The domain D is fixed and continuous for geostatistical data, fixed and
  discrete for lattice data, and a random set for point data.

• The scientific questions raised differ substantially among the three data
  types, and specific tools have been developed to address these questions.
  Some of the tools are transitive in that they can be applied to any of the data
  types; others are particular to a specific spatial data structure.

Many practitioners associate spatial data analysis with terms like geostatistics and methods
such as kriging. Geostatistical data are only one of many spatial data types, which can be de-
fined through the domain D of the random field [9.1]. In the case of geostatistical data the do-
main is a fixed, continuous set; the number of locations at which observations can be made is
not countable. Between any two sample locations s_i and s_j an infinite number of additional
samples can be placed in theory. Furthermore, there is nothing random about the locations
themselves. Examples of geostatistical data are measuring the electrical conductivity of soil,
yield monitoring a field, and sampling the ore grade of a rock formation. Figure 9.3 (left
panel) shows 72 locations on a shooting range at which the lead concentration was measured.
The shooter location is at coordinate x = 100, y = 0.
Because of the continuity of D, geostatistical data are also referred to as spatial data with
continuous variation. This does not imply that the attribute Z is continuous. The nature of the
attribute Z as discrete or continuous does not alter the nature of the spatial data type. Whether
one is interested in the presence/absence of a microbial species in a series of soil samples
(Z is binary), the soil pH (Z is continuous), or the number of macropores (Z is a count), the
data are geostatistical unless there is only a countable number of sample locations and/or the
domain changes at random from realization to realization of the random function. Whether
data are collected on regular grids, irregular grids, or by random sampling of locations also
has no bearing on the nature of the spatial data type. The continuity of the fixed domain D is
what matters, not how it is sampled. Figure 9.3 (right panel) shows the locations at which
wheat was sampled for determination of the deoxynivalenol concentration in a Michigan field
in 2000. The basic layout is systematic, consisting of four transects with equally spaced
sample intervals along the transects. At every other transect location a cluster of samples is
collected by branching 2, 6, and 8 feet perpendicular to the transect direction. Since samples
could have been placed anywhere within the field, the data are geostatistical.
[Figure 9.3: two scatter plots of sampling locations; left panel, X-coordinate 0 to 200 and
Y-coordinate 0 to 300; right panel, X-coordinate 0 to 250 and Y-coordinate 0 to 150.]

Figure 9.3. Left panel: sampling locations on shooting range at which lead concentrations
were measured. Right panel: Sample grid for collecting wheat kernels in a field with wheat
scab. Lead data kindly provided by Dr. James R. Craig and Dr. Donald Rimstidt, Dept. of
Geological Sciences, Virginia Polytechnic Institute and State University.

Data collected in a systematic sample scheme such as a grid, unaligned grid, or a
transect-type sample are sometimes incorrectly termed lattice data. Lattice data are a spatial data
type where D is a fixed and discrete (and hence countable) set of locations. The number of
locations at which measurements can be made might be infinite with lattice data; the key is
that the possible sample locations can be enumerated. Examples are attributes recorded by
county, city block, or census tract and information obtained from pixel images. When data are
collected by census tract or county, there is no space defined between these discrete units at
which observations can be made. Whether the units are aligned and shaped regularly, as pixels
or experimental units in a field experiment, or irregularly shaped, such as counties, does not
matter for the classification as lattice data. An example of lattice data is shown in Figure 9.4,
which depicts the number of sudden infant deaths in North Carolina between 1974 and 1978
relative to the number of live births. This famous data set appears in Cressie (1993) and is
analyzed in great depth there. A case of lattice data of particular interest to agronomists arises
as data from planned field experiments. The discrete spatial units are the experimental units,
which are typically arranged in some regular fashion and equally sized (Figure 9.5).

[Figure 9.4: a choropleth map of North Carolina counties with legend classes < 2.0,
2.0–3.0, 3.0–4.0, and > 4.0.]

Figure 9.4. Sudden infant deaths (SIDs) in North Carolina 1974-1978. Shown are the
number of SIDs relative to the number of live births. These data are included with
S+SpatialStats® .

[Figure 9.5: a lattice of 20 rows by 25 columns; the area of the square in each cell is
proportional to grain yield.]

Figure 9.5. Grain yield data of Mercer and Hall (1911) at Rothamsted Experimental Station.
The area of the squares in each lattice cell is proportional to the grain yield. From Table 6.1
of Andrews and Herzberg (1985). Used with permission.

Geostatistical and lattice data have in common that the domain is fixed, not random. To
develop the idea of a random domain, consider Z(s) to be crop yield and define an indicator
variable U(s) which takes on the value 1 if the yield is below some threshold level c and 0
otherwise,

    U(s) = 1 if Z(s) < c, and U(s) = 0 otherwise.

If Z(s) is geostatistical, so is U(s). The random function U(s) now returns the values 1 and 0
instead of the continuous attribute yield.
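
As a concrete illustration, the indicator transformation is a one-line data step; the data set
name, variable names, and threshold below are hypothetical:

data indicator;
   set yieldmap;      /* hypothetical data set with coordinates x, y and yield z */
   u = (z < 100);     /* U(s) = 1 if Z(s) < c (here c = 100), else U(s) = 0      */
run;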

[Figure 9.6: point locations plotted by longitude (−100 to 600) and latitude (−200 to 400).]

Figure 9.6. Locations in a corn field where yield per unit area is less than some threshold
value.

Now imagine throwing away all the points where the yield is above the threshold and retain-
ing only those locations for which U(s) = 1. Define a new domain D* which consists of
those points where Z(s) < c. Since Z(s) is random, so is D*. We have replaced the attribute
Z(s) (and U(s)) with a degenerate random variable whose domain D* consists of the loca-
tions at which we observe the event of interest, and the focus has switched from studying the
attribute itself to studying the locations (Figure 9.6). Such processes are termed point
processes or point patterns.

[Figure 9.7: a mapped point pattern plotted by longitude and latitude, each ranging from 0
to about 4000.]

Figure 9.7. Spatial distribution of a group of red cockaded woodpeckers in the Fort Bragg
area of North Carolina. Data kindly made available by Dr. Jeff Walters, Dept. of Biology,
Virginia Polytechnic Institute and State University. Used with permission.

Most point patterns do not arise through this transformation of an underlying random
function; they are observed directly. The emergence of plants, the distribution of seeds, the
location of macropores in soil, and the location of scab-infected wheat kernels on a truckload of
kernels are examples of this spatial data type (Figure 9.7). With geostatistical and lattice data,
statistical modeling focuses on the Z(·) process (since D is not random), whereas with point
data, we focus on the D process.
The distinction between the three types of spatial data is not always clear-cut. Geostatis-
tical data can be converted into lattice data by integrating the data over finite regions. By
changing the conceptual index set D, one can move from lattice data to geostatistical data
(consider D continuous, not discrete). Aggregating a point pattern over regions results in lat-
tice data. Lattice data, in some sense, are not as refined as geostatistical or point data, since
they can be obtained by reduction of the other spatial data types (integration of geostatistical
data or enumeration of events in a point pattern). The questions of interest can vary greatly
among the three data types, however, and many statistical tools are specific to a particular
data type.
With geostatistical data a frequent application is to produce continuous surfaces (maps)
of the attribute Z based on samples taken at some locations. The kriging methods developed
for this purpose (§9.4) predict Z(s₀) at some unsampled location s₀ based on the sample
Z(s₁), ..., Z(sₙ). For a mapped point pattern, where all events within an area of interest have
been located, the issue of predicting at unobserved locations never arises; there are no unob-
served locations by definition. Point pattern analyses usually commence by raising the ele-
mentary question, "Are the events (points) distributed completely at random?" and, if they are
not, whether the events are more aggregated or more regularly distributed than expected under a
completely random placement. One approach to addressing this question is to study the distri-
bution of nearest-neighbor distances (§9.7.3) in the observed point pattern. For geostatistical
data, examining the distance between a sample point and its nearest neighbor is nonsensical,
since the sample locations are placed according to a sample design. The issue of where events
are located never surfaces with geostatistical or lattice data.
The various methods for modeling and analyzing geostatistical, lattice, and point data
have in common that they rely heavily on stochastic properties of the spatial random field. If
what we observe must be considered a sample of size one, how can we possibly learn any-
thing about the variation of Z(s) or the covariance between Z(s_i) and Z(s_j), for example?
Certain assumptions, such as having obtained a random sample, are made frequently in classi-
cal statistics since they provide the underpinnings of the stochastic structure that enables
analysis.
If ]" ß âß ]8 are a random sample from a population with mean . and variance 5 # , for
example, then we automatically know that ] and W # are unbiased for . and 5 # , respectively.
Random sampling provides the needed replication mechanism without which estimation of
variation is difficult. Similar assumptions are made with spatial data but rather than targeting
the random mechanism by which the observations are obtained they focus on the (internal)
stochastic structure of the random field. Rather than replication of the data, we are looking for
replication in the data. These assumptions are summarized under the headings stationarity
and isotropy.

© 2003 by CRC Press LLC


9.1.4 Stationarity and Isotropy — the Built-in Replication Mechanism of Random Fields

Box 9.3 Stationarity and Isotropy

• Stationarity is a property of self-replication of a stochastic process. It implies the lack of importance of absolute coordinates.

• The three important varieties of stationarity are strict, second-order, and intrinsic stationarity.

• Whereas in a stationary random field absolute coordinates are immaterial, the orientation (angle) of coordinate differences matters. Stationary random fields in which the orientation of coordinate differences is not of consequence are called isotropic.

Stationarity in simple terms means that the random field looks similar in different parts of the domain D; it replicates itself. Consider two observations Z(s) and Z(s + h). The vector h is a displacement by which we move from location s to location u = s + h; it is referred to as the lag vector (or lag for short). If the random field is self-replicating, the stochastic properties of Z(s) and Z(s + h) should be similar. For example, to estimate the covariance between locations a lag h apart, we might consider all pairs (Z(s_i), Z(s_i + h)) in the estimation process, regardless of where s_i is located. Stationarity is the absence of an origin; the spatial process has reached a state of equilibrium. Stationarity assumptions are also made for time series data. There it means that it does not matter when a time shift is considered in terms of absolute time, only how large the time shift is. In a stationary time series one can talk about a difference of two days without worrying that the first occasion was a Saturday. In the spatial context stationarity means the lack of importance of absolute coordinates. There are different degrees of stationarity, however, and before we can make the various definitions more precise a few comments are in order.
Since Z(s) is a random variable, it has moments and a distribution. For example, E[Z(s)] = μ(s) is the mean of Z(s) at location s and Var[Z(s)] is its variance. Analysts and practitioners often refer to Gaussian random fields, but some care is necessary to be clear about what is assumed to follow the Gaussian law. A Gaussian random field is defined as a random function whose finite-dimensional distributions are multivariate Gaussian. That is, the cumulative distribution function

    Pr(Z(s₁) ≤ z₁, …, Z(s_k) ≤ z_k)                                        [9.2]

is that of a k-variate Gaussian distribution for all k. By the properties of the multivariate Gaussian distribution (§3.7) this implies that any Z(s_i) is a univariate Gaussian random variable. The reverse is not true: that Z(s_i) is Gaussian does not imply that [9.2] is a multivariate Gaussian probability. Chilès and Delfiner (1999, p. 17) point out that this leap of faith is sometimes made. The spatial distribution of Z(s) is defined by the multivariate cumulative distribution function [9.2], not the marginal distribution of Z(s).

The first, and most restrictive, definition of stationarity is strong (or strict) stationarity. It implies that

    Pr(Z(s₁) ≤ z₁, …, Z(s_k) ≤ z_k) = Pr(Z(s₁ + h) ≤ z₁, …, Z(s_k + h) ≤ z_k),   [9.3]

meaning that the spatial distribution is invariant under translation of the coordinates by the vector h. The random field repeats itself throughout the domain. Geometrically, this implies that the spatial distribution is invariant under a shift of the coordinate system. As the name suggests, strong stationarity is a very strict condition, more restrictive than what is required for many of the statistical methods that follow. Two important versions of stationarity, second-order (weak) and intrinsic stationarity, are defined in terms of moments of Z(s).
A random field is second-order stationary if E[Z(s)] = μ and Cov[Z(s), Z(s + h)] = C*(h). The first assumption states that the mean of the random field is constant and does not depend on location. The second assumption states that the covariance between two observations is only a function of their spatial separation. The function C*(h) is called the covariance function or the covariogram of the spatial process. If a random field is strictly stationary it is also second-order stationary, but the reverse is not necessarily true. The reasons are similar to those that disallow inferring the distribution of a random variable from its mean and variance alone. An exception is the Gaussian case: just as zero covariance implies independence for Gaussian random variables, a Gaussian random field that is second-order stationary is also strictly stationary.
Imagine that we wish to estimate the covariance function C*(h) in a second-order stationary process. For a lag vector h = [−35.35, 35.35]′, all pairs of points that are exactly separated by h can be utilized (Figure 9.8).


Figure 9.8. The notion of second-order stationarity. Pairs of points separated by the same lag vector (here, h = [−35.35, 35.35]′) provide built-in replication to assess the spatial dependency for the particular choice of h.

While stationarity reflects the lack of importance of absolute coordinates, the direction in which the lag vector h is assessed still plays an important role. We could not combine with the pairs of observations in Figure 9.8 those pairs whose lag vector is h = [−35.35, −35.35]′; they are oriented perpendicular to the rays shown in the figure. The condition by which the random field is also invariant under rotation and reflection is known as isotropy. In a second-order stationary random field with isotropic covariogram the covariance between any two points Z(s) and Z(s + h) is only a function of the Euclidean distance ||h|| between the two points,

    Cov[Z(s), Z(s + h)] = C(||h||).

The Euclidean distance is defined as follows. Let s = [x, y]′, where x and y are the longitude and latitude, respectively. The Euclidean distance (Figure 9.9) between s₁ and s₂ is then ||s₁ − s₂|| = √((x₁ − x₂)² + (y₁ − y₂)²). Random fields that are stationary but not isotropic are called anisotropic.

[Figure: points A = (2, 2), B = (2, 10), C = (9, 8), and D = (10, 2) plotted in the (x, y) plane, with ||s_A − s_C|| = √((2 − 9)² + (2 − 8)²) = 9.219.]

Figure 9.9. Euclidean distance between points A = (x = 2, y = 2) and C = (9, 8). The Euclidean distance between A and B and between A and D is 8.

Note that we have distinguished between the covariogram C*(h) of the second-order stationary random field and the covariogram C(||h||), which is also isotropic, since these are different functions. In the sequel we will often refer to only C(·); whether the function depends on h or on ||h|| is sufficient to distinguish the general from the isotropic case.
In a process with isotropic covariance function it does not matter how the lag vector between pairs of points is oriented, only that the Euclidean distance between pairs of points is the same (Figure 9.10). To visualize the difference between second-order stationary random fields with isotropic and anisotropic covariance functions, realizations were simulated with proc sim2d in The SAS® System. The statements

proc sim2d outsim=RandomFields;
   grid x=1 to 10 by 1 y=1 to 10 by 1;
   simulate numreal=1 angle=90 range=5 scale=0.75 ratio=0.3 form=gaussian;
   simulate numreal=1 angle= 0 range=5 scale=0.75 ratio=1   form=gaussian;
run;

generate two sets of spatial data on a 10 × 10 grid. The first simulate statement generates a random field in which data points are correlated up to Euclidean distance 5√3 = 8.66 in the East-West direction, but correlated up to a much smaller distance in the North-South direction (the genesis of this value, called the spatial range, is discussed in §9.2.1). The second simulate statement generates a random field with isotropic covariance, i.e., the spatial dependence develops similarly in all directions (Figure 9.11).
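For readers working outside SAS, a comparable simulation can be sketched directly from the definition of a Gaussian random field: build a covariance matrix from an isotropic covariance function and multiply its Cholesky factor with standard normal draws. The snippet below is a minimal illustration in Python with NumPy, not a reproduction of proc sim2d; the grid, the scale 0.75, and the practical range 5√3 mirror the isotropic SAS call above, and the function names are ours.

import numpy as np

def gaussian_cov(d, scale=0.75, alpha=5.0):
    # Isotropic "gaussian" covariance C(d) = scale*exp(-(d/alpha)^2),
    # the parameterization in which the practical range is sqrt(3)*alpha.
    return scale * np.exp(-(d / alpha) ** 2)

# 10 x 10 grid of locations, matching the grid statement above
x, y = np.meshgrid(np.arange(1, 11), np.arange(1, 11))
s = np.column_stack([x.ravel(), y.ravel()])          # (100, 2) coordinates

# Matrix of pairwise Euclidean distances ||s_i - s_j||
d = np.sqrt(((s[:, None, :] - s[None, :, :]) ** 2).sum(axis=2))

# Covariance matrix; a tiny diagonal jitter keeps the factorization stable
Sigma = gaussian_cov(d) + 1e-6 * np.eye(len(s))

# One realization: Z = L e with L the Cholesky factor and e ~ N(0, I)
rng = np.random.default_rng(1)
z = np.linalg.cholesky(Sigma) @ rng.standard_normal(len(s))
print(z.reshape(10, 10).round(2))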


Figure 9.10. The notion of isotropy. Pairs of points separated by the same Euclidean distance (here, ||h|| = 50) provide built-in replication to assess spatial dependency at that lag. Orientation of the distance vector is immaterial.

[Figure: two simulated surfaces on the 10 × 10 grid; left panel titled "Geometric Anisotropic Gaussian Random Field," right panel titled "Isotropic Gaussian Random Field."]

Figure 9.11. Anisotropic (left) and isotropic (right), stationary Gaussian random fields.

The anisotropic field changes its values in the East-West direction more slowly than in the North-South direction. For the isotropic random field the spatial dependency between data points develops in the same fashion in all directions. If one were to estimate the covariance between two points ||h|| = 3 distance units apart, for example, the covariance must be estimated separately in the North-South and the East-West directions in the anisotropic case. In the isotropic case any pair of points will do, provided their Euclidean distance is 3 units, regardless of the orientation of the points.
From the existence of the covariogram in a second-order stationary random field we can derive an interesting property. Since Cov[Z(s), Z(s + h)] = C(h) does not depend on absolute coordinates and Cov[Z(s), Z(s)] = Var[Z(s)] = C(0), it follows that the variance of the attribute is constant and does not depend on location. A second-order stationary spatial process thus has a constant mean, constant variance, and a covariance function that does not depend on absolute coordinates. Such a process can be thought of as the spatial equivalent of a random sample in classical statistics, which gives rise to independent random variables with the same mean and dispersion.
In time series analysis, stationarity is just as important as with spatial data. A frequent device employed to turn a nonstationary series into a stationary one is differencing. Let Y(t) denote an observation in a time series at time t and consider the random walk Y(t) = Y(t − 1) + e(t), where the e(t) are independent random variables with mean 0 and variance σ². It is easy to show that the random walk is not second-order stationary. We have E[Y(t)] = E[Y(t − k)], but the variance is not constant (Var[Y(t)] = tσ²) and the covariance does depend on the origin, Cov[Y(t), Y(t − k)] = (t − k)σ². While Y(t) is not stationary, the first differences Y(t) − Y(t − 1) are second-order stationary. A similar device is used in spatial statistics when the increments Z(s) − Z(s + h) are second-order stationary. This form of stationarity is called intrinsic stationarity. It is usually defined as follows: if E[Z(s)] = μ and

    ½ Var[Z(s) − Z(s + h)] = γ(h),                                          [9.4]

then Z(s) is said to be intrinsically stationary or to satisfy the intrinsic hypothesis. The function γ(h) is called the semivariogram of Z(s). If the semivariogram is a function of the Euclidean distance ||h||, then it is furthermore isotropic.
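The variance calculation for the random walk is easy to check by simulation. The short sketch below (Python with NumPy; purely illustrative) generates many random-walk realizations and shows that Var[Y(t)] grows with t while the variance of the first differences stays flat at σ².

import numpy as np

rng = np.random.default_rng(42)
sigma, T, reps = 1.0, 100, 20000

# reps independent random walks Y(t) = Y(t-1) + e(t), with Y(0) = 0
e = rng.normal(0.0, sigma, size=(reps, T))
y = np.cumsum(e, axis=1)

# Var[Y(t)] = t*sigma^2: the variance across realizations grows linearly in t
print(y.var(axis=0)[[9, 49, 99]])              # approx. 10, 50, 100

# First differences recover e(t): constant variance sigma^2
print(np.diff(y, axis=1).var(axis=0).mean())   # approx. 1.0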
The semivariogram and covariogram are parameters of the spatial process and play a critical role in the geostatistical method of spatial data analysis. Statisticians are used to working with covariances, while the semivariogram is more frequently used in the geostatistical literature. Both are important ingredients of the kriging methods for spatial prediction. In a second-order stationary random field a simple relationship between γ(h) and C(h) can be used to derive one from the other,

    γ(h) = C(0) − C(h),                                                     [9.5]

and kriging predictors can be written in terms of semivariances or covariances. However, a process that is intrinsically stationary need not be second-order stationary; the class of intrinsic processes is larger. Care should be exercised when calculating C(h) as C(0) − γ(h). If the process is intrinsic but not second-order stationary, C(h) is not a parameter.
We want to emphasize that it is because of the factor ½ that γ(h) is termed the semivariogram and 2γ(h) is termed the variogram. Statistical packages and the spatial statistics literature are not consistent in this terminology. The variogram procedure in The SAS® System, for example, estimates the semivariogram but denotes it as the variogram on the output. Kaluzny et al. (1998, p. 68) acknowledge in the S+SpatialStats® user manual that γ(h) is the semivariogram but refer to it in the manual as the variogram for conciseness. Chilès and Delfiner (1999, p. 32) acknowledge that γ(h) is "also called" the semivariogram but refer to γ(h) as the variogram for simplicity and because this terminology "tends to become established." Assume a second-order stationary random field whose isotropic covariogram approaches 0 as ||h|| → ∞. Then γ(||h||) approaches the constant C(0), which is the variance of an observation. A graph of the semivariogram will thus provide a simple estimate of Var[Z(s)] as the asymptote of the semivariogram. The asymptote of the variogram estimates twice the variance of an observation, and there is nothing concise, simple, or established about missing by a factor of 2. We refer to γ(h) as the semivariogram and to 2γ(h) as the variogram throughout.
The semivariogram is a structural tool that can convey a great deal about the nature and structure of spatial dependency in a random field. It is also a parameter of the process that must be estimated from the data. Estimating a semivariogram is usually a two-step process: (i) derive an empirical estimate of the semivariogram from the data and (ii) fit a theoretical semivariogram model to the empirical estimate. Because of the importance of the semivariogram in spatial statistics the next section is devoted to semivariogram analysis and estimation.

9.2 Semivariogram Analysis and Estimation

Box 9.4 Semivariogram

• The semivariogram of a spatial process is one half of the variance of the difference between observations.

• The semivariogram conveys information about the spatial structure and the degree of continuity of a random field (§9.2.1, §9.2.3).

• Theoretical models (§9.2.2) are fit to the data to arrive at an estimated semivariogram that satisfies the properties needed for subsequent analysis (§9.2.4).

9.2.1 Elements of the Semivariogram

A valid covariance function must satisfy certain properties. It must be even in the sense that C(h) = C(−h), since we must have Cov[Z(s), Z(s + h)] = Cov[Z(s + h), Z(s)]. Furthermore, covariance functions must be non-negative definite, i.e.,

    Σᵢ Σⱼ αᵢ αⱼ C(sᵢ − sⱼ) ≥ 0,                                             [9.6]

for all constants α₁, …, α_n and spatial locations. This condition guarantees that the variances of spatial predictions are non-negative. As a consequence it can be shown by the Cauchy-Schwarz inequality that |C(h)| ≤ C(0), and we have already established that C(0) ≥ 0, since C(0) is the variance of an observation. In practice, we often consider only covariance functions that have the following additional properties: they are positive and decrease monotonically with spatial separation. For some critical distance ||h_c||, the covariance function is then either identically zero or it approaches 0 as lag distance increases. Similar conditions and properties arise for valid semivariograms. Semivariograms have the evenness property γ(h) = γ(−h) and pass through the origin, γ(0) = 0, since Var[Z(s) − Z(s + 0)] = Var[0] = 0. Finally, a valid semivariogram must be conditionally negative definite, i.e.,

    Σᵢ Σⱼ αᵢ αⱼ γ(sᵢ − sⱼ) ≤ 0,                                             [9.7]

for any number of spatial locations and constants α₁, …, α_n such that Σᵢ αᵢ = 0 (Cressie 1993, p. 86). When the additional conditions (positive, monotonically decreasing) are imposed on the covariance function, the semivariogram of a second-order stationary random field takes on a very characteristic shape (Figure 9.12). It rises from the origin monotonically to an upper asymptote called the sill of the semivariogram. The sill corresponds to Var[Z(s)] = C(0). When the semivariogram meets the asymptote, the covariance C(h) is zero, since γ(h) = C(0) − C(h). The distance at which this occurs is called the range of the semivariogram. In Figure 9.12 the semivariogram approaches the sill only asymptotically. In this case the practical range is defined as the lag distance at which the semivariogram achieves 95% of the sill; here, the practical range is ||h|| = 15.
Observations that are spatially separated by more than the range are uncorrelated (or practically uncorrelated when separated by more than the practical range). Spatial autocorrelation exists only for pairs of points separated by less than the (practical) range. The more quickly the semivariogram rises from the origin to the sill, the more quickly autocorrelations decline.

[Figure: semivariogram γ(||h||) rising from the origin toward its sill, with the sill and practical range marked.]

Figure 9.12. Semivariogram of a second-order stationary process with positive covariance function for which C(||h||) → 0 as ||h|| increases. The semivariogram has sill 10 and practical range 15.

An intrinsically but not second-order stationary random field has a semivariogram that does not reach an upper asymptote. The semivariance may increase with spatial separation as in Figure 9.13. Obviously, there is no range defined in this case. The increase of the semivariogram with ||h|| cannot be arbitrary, however. It must rise more slowly than ||h||² (this check is sometimes referred to as the test of the intrinsic hypothesis), because 2γ(||h||)/||h||² must approach 0 as ||h|| → ∞.

[Figure: semivariogram γ(||h||) increasing steadily without reaching a sill.]

Figure 9.13. Semivariogram of an intrinsically but not second-order stationary process.

By definition we have γ(0) = 0, but many data sets do not seem to comply with that property. Figure 9.14 shows an estimate of the semivariogram for the Mercer and Hall grain yield data plotted in Figure 9.5 (p. 569). The estimated semivariogram values are represented by the dots and a nonparametric loess smooth of these values was also added to the graph. The semivariogram appears to reach, or at least to approach, an asymptote with increasing lag distance; the sill is approximately 0.18.
[Figure: empirical semivariogram values (about 0 to 0.20) against lag distance (0 to 15), with a loess smooth.]

Figure 9.14. Classical semivariogram estimator (see §9.2.4) for Mercer and Hall grain data (connected dots). Dashed line is loess smooth of the semivariogram estimator.

Notice that the empirical semivariogram commences at ||h|| = 1, since this is the smallest lag between experimental units. We do not recommend smoothing semivariogram estimates with standard nonparametric procedures because the resulting fit may not have the required properties; it may not be conditionally negative definite (nonparametric semivariogram estimation is discussed in §9.2.4). The loess smooth was added to the figure only to suggest an overall trend in the semivariogram estimate. By connecting the dots of the semivariogram estimates it appears that the trend could pass through the origin as is required. However, the loess smooth of the empirical semivariogram indicates otherwise. Extrapolation below ||h|| = 1 suggests an intercept of the semivariogram around 0.13.
This phenomenon is quite common in applications, namely, γ(h) → θ₀ ≠ 0 as ||h|| → 0. How can this happen? How can we have a positive variance of the observation differences at the same location? One possible explanation is measurement error. If a measurement at location s cannot be repeated without error, then repeat observations at s will exhibit variability; call this variance component σ_e². A second explanation is that there is a spatial process η(s) operating at lag distances shorter than the smallest lag observed in the data set. This microscale process has sill σ_η². Then, if the measurement error and microscale process are independent,

    θ₀ = σ_e² + σ_η².

If either of the two components is not zero, the semivariogram exhibits a discontinuity at the origin. The magnitude of this discontinuity is called the nugget effect. The term stems from the idea that ore nuggets are dispersed throughout a larger body of rock but at distances smaller than the smallest sample distance. If a semivariogram has nugget θ₀ and sill C(0), the difference C(0) − θ₀ is called the partial sill of the semivariogram. The practical range is then defined as the lag distance at which the semivariogram has achieved θ₀ + 0.95(C(0) − θ₀) (Figure 9.15).

[Figure: two semivariograms with sill 10 and practical range 15, one starting at the origin and one with a nugget θ₀ = 4.]

Figure 9.15. Semivariogram of a second-order stationary process with and without nugget effect. The no-nugget semivariogram (lower line) has sill 10 and practical range 15; the nugget semivariogram has θ₀ = 4, partial sill C(0) − θ₀ = 6, and the same practical range.

In the presence of a nugget effect the relationship between semivariogram and covariogram must be slightly altered. In the no-nugget model we can put Var[Z(s_i)] = C(0) = σ². For the nugget model define Var[Z(s_i)] = C(0) = θ₀ + θ_s, where θ_s is the partial sill (and σ² = Var[Z(s_i)] = θ₀ + θ_s). The semivariogram can now be expressed as γ(h) = θ₀ + θ_s f(h) (see §9.2.2 for some f(h)). Then

    C(h) = θ_s (1 − f(h)),    ||h|| > 0
         = θ₀ + θ_s,          h = 0.                                        [9.8]

In the presence of a nugget effect a useful statistic is the Relative Structured Variability (RSV). It measures the relative elevation of the semivariogram over the nugget effect in percent:

    RSV = (θ_s / (θ_s + θ₀)) × 100% = (1 − θ₀/(θ_s + θ₀)) × 100%.           [9.9]

One interpretation of the RSV is as the degree to which variability is spatially structured. The unstructured part of the variability is due to measurement error and/or microscale variability. The larger the RSV, the more efficient geostatistical prediction will be compared to methods of prediction that ignore spatial information, and the greater the continuity of the spatial process (see §9.2.3).
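As a quick numerical illustration (our own; the parameter values are those of Figure 9.15), the RSV in [9.9] is a one-line computation once nugget and partial sill are known:

def rsv(nugget, partial_sill):
    """Relative Structured Variability [9.9], in percent."""
    return 100.0 * partial_sill / (partial_sill + nugget)

# For the nugget semivariogram of Figure 9.15 (theta_0 = 4, partial sill 6):
print(rsv(nugget=4.0, partial_sill=6.0))   # 60.0 percent spatially structured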

9.2.2 Parametric Isotropic Semivariogram Models

Estimation of the semivariogram by parametric statistical methods requires the selection of a semivariogram model γ(h; θ) that is fit to data. θ is a vector of parameters that is estimated from the data by direct or indirect methods. We consider those methods as indirect that process the data Z(s₁), …, Z(s_n) in some form, for example by averaging squared differences {Z(s_i) − Z(s_i + h)}², and then fit the semivariogram model to these summaries. Functions that serve as semivariogram models must be conditionally negative definite, and a relatively small number of such functions is used in practice. We introduce the key models in this subsection; many more semivariogram models can be found in, e.g., Journel and Huijbregts (1978), Cressie (1993), Stein (1999), and Chilès and Delfiner (1999). The models that follow are isotropic; θ₀ denotes the nugget and θ_s the sill parameter, provided the model is second-order stationary. The models are all valid in ℝ²; some are valid for higher-dimensional problems. Note that a semivariogram that is valid in ℝ^d is also valid in ℝ^s, where s < d. Since we are concerned with two-dimensional random fields in this chapter we do not further comment on the valid dimensions of any of the semivariogram models.

Nugget-Only Model
The nugget-only model is the semivariogram of a white-noise process, where the Z(s_i) behave like a random sample, all having the same mean and variance, with no correlations among them. The model is void of spatial structure; the relative structured variability is zero. The nugget-only model is of course second-order stationary and a valid semivariogram in any dimension. Nugget-only models are not that uncommon, although analysts keen on applying techniques from the spatial statistics toolbox such as kriging are usually less enthusiastic when a nugget-only model is obtained. A nugget-only model is an appropriate model if the smallest sample distance in the data is greater than the range of the spatial process. Sampling an attribute on a regular grid whose spatial range is unknown may invariably lead to a nugget-only model if grid points are spaced too far apart.

    γ(h; θ_s) = 0,      h = 0
              = θ_s,    h ≠ 0

[Plot: nugget-only semivariogram with θ_s = 9.]

Linear Model
The linear model is intrinsically stationary with parameters θ₀ and β, both of which must be positive. Covariances or semivariances usually do not change linearly over a large range, but linear change of the semivariogram near the origin is often reasonable. If a linear semivariogram model is found to fit the data in practice, it is possible that one has observed the initial increase of a second-order stationary model that is linear or close to linear near the origin but failed to collect samples far enough apart to capture the range and sill of the semivariogram. A second-order stationary semivariogram model that behaves linearly near the origin is the spherical model.

    γ(h; θ) = 0,               h = 0
            = θ₀ + β||h||,     h ≠ 0

[Plot: linear semivariogram with θ₀ = 5, β = 1.]

Spherical Model
The spherical model is one of the most popular semivariogram models in applied spatial statistics for second-order stationary random fields. Its two main characteristics are linear behavior near the origin and the fact that at distance α the semivariogram meets the sill and remains flat thereafter. This sometimes creates a visible kink at ||h|| = α. The spherical semivariogram thus has a range α, rather than a practical range. The popularity of the spherical model in the geostatistical literature is a mystery to Stein (1999, p. 52), who argues that perhaps "there is a mistaken belief that there is some statistical advantage in having the autocorrelation function being exactly zero beyond some finite distance" (α). The fact that γ(h; θ) is only once differentiable at ||h|| = α can lead to problems in likelihood estimation that relies on derivatives. Stein (1999, p. 52) concludes that the spherical model is a poor substitute for the exponential model (see next). He recommends using the square of the spherical model, γ(h; θ)², instead of γ(h; θ), since the former provides two derivatives on (0, ∞). The behavior of the squared spherical semivariogram near the origin is not linear, however (see §9.2.3 on the effect of the near-origin behavior).

    γ(h; θ) = 0,                                               ||h|| = 0
            = θ₀ + θ_s {(3/2)(||h||/α) − (1/2)(||h||/α)³},     0 < ||h|| ≤ α
            = θ₀ + θ_s,                                        ||h|| > α

[Plot: spherical semivariogram with θ₀ = 3, θ_s = 10, α = 25.]

Exponential Model
The second-order stationary exponential model is a very useful model that has been found to fit spatial data in varied applications well. It approaches the sill θ_s asymptotically as ||h|| → ∞. In the parameterization shown below the parameter α is the practical range of the semivariogram (Figure 9.12 is an exponential semivariogram without nugget). Often the model can be found in a parameterization where the exponent is −||h||/α. The practical range then corresponds to 3α. The SAS® System and S+SpatialStats® use this parameterization. For the same range and sill as the spherical model, the exponential model rises more quickly from the origin and yields autocorrelations at short lag distances smaller than those of the spherical model. A random field with an exponential semivariogram is less regular (less continuous) on short distances than one with a spherical semivariogram (§9.2.3).

    γ(h; θ) = 0,                                  h = 0
            = θ₀ + θ_s {1 − exp(−3||h||/α)},      h ≠ 0

[Plot: exponential semivariogram with θ₀ = 3, θ_s = 10, α = 25.]

The covariance function of the exponential model without nugget effect is

    C(h) = θ_s exp{−3||h||/α},    ||h|| > 0
         = θ_s,                   h = 0.

It is easily seen that this is a special case of the covariance model introduced in §7.5.2 for modeling the within-cluster correlations in repeated measures data (see formula [7.54] on p. 457). There it was referred to as the continuous AR(1) model because of its relationship to a continuous first-order autoregressive time series. The extra constant 3 was not used there, since the temporal range is usually of less interest when modeling clustered repeated measures data than the range for spatial data. The temporal separation |t_ij − t_ij′| between two observations from the same cluster is now replaced by the Euclidean distance ||h||. For the exponential semivariogram to be valid we need to have θ₀ ≥ 0, θ_s ≥ 0, and α ≥ 0.

Gaussian Model
This model exhibits quadratic behavior near the origin and produces short-range correlations that are higher than for any of the other second-order stationary models with the same (practical) range. Notice that the only difference between the gaussian and exponential semivariogram is the square in the exponent.

    γ(h; θ) = 0,                                     h = 0
            = θ₀ + θ_s {1 − exp(−3(||h||/α)²)},      h ≠ 0

[Plot: gaussian semivariogram with θ₀ = 3, θ_s = 10, α = 25.]

This is a fairly subtle difference that has considerable implications. The gaussian model is the most continuous near the origin of the models considered here. In fact, it is infinitely differentiable near 0. This implies a very smooth, regular spatial process (see §9.2.3). It is so smooth that knowing the value at 0 and the values of all partial derivatives determines the values in the random field at any arbitrary location. Such smoothness is unrealistic for most processes.
The name should not imply that this semivariogram model deserves similar veneration in spatial statistics as is awarded rightfully to the Gaussian distribution in classical statistics. The name stems from the fact that a stochastic process with covariance function

    C(t) = c exp{−αt²}

has spectral density

    f(ω) = (c / (2√(πα))) exp{−ω²/(4α)},

which resembles in functional form the Gaussian probability density function. Furthermore, one should not assume that the semivariogram of a Gaussian random field (see §9.1.4 for the definition) is necessarily of this form. It most likely will not be.
As for the exponential model, the parameter α is the practical range, and The SAS® System and S+SpatialStats® again drop the factor 3 in the exponent. In their parameterization the practical range is √3·α.
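The second-order stationary models above differ only in the function f(h) multiplying the partial sill (cf. [9.8]), so it is natural to code each as a function of lag distance. The sketch below (Python/NumPy, with helper names of our own choosing) implements the spherical, exponential, and gaussian semivariograms in the parameterizations used in this section, where α is the (practical) range.

import numpy as np

def spherical(h, nugget, psill, alpha):
    # gamma = nugget + psill*{1.5(h/a) - 0.5(h/a)^3} for 0 < h <= a, flat beyond
    h = np.asarray(h, dtype=float)
    r = np.minimum(h / alpha, 1.0)
    g = nugget + psill * (1.5 * r - 0.5 * r ** 3)
    return np.where(h == 0.0, 0.0, g)

def exponential(h, nugget, psill, alpha):
    # practical range alpha: gamma reaches 95% of the sill at h = alpha
    h = np.asarray(h, dtype=float)
    g = nugget + psill * (1.0 - np.exp(-3.0 * h / alpha))
    return np.where(h == 0.0, 0.0, g)

def gaussian_sv(h, nugget, psill, alpha):
    h = np.asarray(h, dtype=float)
    g = nugget + psill * (1.0 - np.exp(-3.0 * (h / alpha) ** 2))
    return np.where(h == 0.0, 0.0, g)

# The exponential model with theta_0 = 3, theta_s = 10, alpha = 25, evaluated
# at the practical range, returns 3 + 0.95*10:
print(exponential(25.0, 3.0, 10.0, 25.0))   # approx. 12.5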

Power Model
This is an intrinsically stationary model, but only for 0 ≤ λ < 2. Otherwise the variogram would increase faster than ||h||², which is in violation of the intrinsic hypothesis. The parameter β furthermore must be positive. The linear semivariogram model is a special case of the power model with λ = 1. Note that the covariance model that proc mixed in The SAS® System terms the power model is a reparameterization of the exponential model, and not the power model shown here.

    γ(h; θ) = 0,                 h = 0
            = θ₀ + β||h||^λ,     h ≠ 0

[Plot: power semivariogram with θ₀ = 3, β = 2, λ = 1.3.]

Wave (Cardinal Sine) Model

The semivariogram models discussed thus far permit only positive autocorrelation. This implies that within the range a large (small) value Z(s) is likely to be paired with a large (small) value at Z(s + h). A semivariogram that permits positive and negative autocorrelation is the wave (or cardinal sine) semivariogram. It fluctuates about the sill θ_s and the fluctuations decrease with increasing lag. At lag distances where the semivariogram is above the sill the spatial correlation is negative; it is positive when the semivariogram drops below the sill. All parameters of the model must be positive.

    γ(h; θ) = 0,                                      h = 0
            = θ₀ + θ_s {1 − α sin(||h||/α)/||h||},    h ≠ 0

[Plot: wave semivariogram with θ₀ = 0.25, θ_s = 1.5, α = 25·π/180.]

The term ||h||/α is best measured in radians. In the figure above we have chosen α = 25·π/180. The practical range is the value where the peaks/valleys of the covariogram are no greater than 0.05·C(0), approximately π·6.5·α. A process with a wave semivariogram has some form of periodicity.

9.2.3 The Degree of Spatial Continuity (Structure)

The notion of spatial structure of a random field is related to its degree of smoothness or continuity. The slower the increase of the semivariogram near the origin, the more the process is spatially structured, the smoother it is. Prediction of Z(s) at unobserved locations is easier (more precise) if a process is smooth. With increasing irregularity (= lack of structure) less information can be gleaned about the process by considering the values at neighboring locations. The greatest absence of structure is encountered when there is a discontinuity at the origin, a nugget effect. Consequently, the nugget-only model is completely spatially unstructured. For the same sill and range, the exponential semivariogram rises faster than the spherical, and hence the former is less spatially structured than the latter. The correctness of statistical inferences, on the other hand, depends increasingly on the correctness of the semivariogram model as processes become smoother. Semivariograms with quadratic behavior near the origin (the gaussian model, for example) are more continuous than semivariograms that behave close to linear near the origin (spherical, exponential).
Figure 9.16 shows realizations of four random fields that were simulated along a transect of length 50 with proc sim2d of The SAS® System. The degree of spatial structure increases from top to bottom. For a smooth process the gray shades vary only slowly from an observation to its neighbors. The (positive) spatial autocorrelations are strong on short distances. The random field with the nugget-only model shows the greatest degree of irregularity (least continuity), followed by the exponential model, which appears considerably smoother. In the nugget-only model a large (dark) observation can be followed by a small (light-colored) observation, whereas the exponential model maintains similar shading over short distances. The spherical model is smoother than the exponential, and the gaussian model exhibits the greatest degree of regularity. As mentioned before, a process with gaussian semivariogram is smoother than what one should reasonably expect in practice.

[Figure: four gray-shade transects over positions 0 to 50, labeled from top to bottom Nugget-only, Exponential, Spherical, and Gaussian.]

Figure 9.16. Simulated Gaussian random fields that differ in their isotropic semivariogram structure. In all cases the semivariogram has sill 2; the exponential, spherical, and gaussian semivariograms have range (practical range) 5.

Because the (practical) range of the semivariogram has a convenient interpretation as the distance beyond which observations are not spatially autocorrelated, it is often interpreted as a zone of influence, or a scale of variability of Z(s), or in terms of the degree of homogeneity of the process. It must be noted that Var[Z(s)], the variability (scale) of Z(s), is not a function of spatial location in a second-order stationary process. The variances of observations are the same everywhere and the process is homogeneous in this sense, regardless of the magnitude of the variability. The scale of variability the range refers to is the spatial range over which the variance of differences Z(s) − Z(s + h) changes. For distances exceeding the range, Var[Z(s) − Z(s + h)] is constant. From Figure 9.16 it is also seen that processes with the same range can be very different in their respective degrees of continuity. A process with a gaussian semivariogram implies short-range correlations of much greater magnitude than a process with an exponential semivariogram and the same range.
For second-order stationary processes different measures have been proposed to capture the degree of spatial structure. The Relative Structured Variability (RSV, [9.9]) measures that aspect of continuity which is influenced by the nugget effect. To incorporate the spatial range and the form of the semivariogram model, integral scales as defined by Russo and Bresler (1981) and Russo and Jury (1987a) are useful. If ρ(h) = C(h)/C(0) is the correlation function of the process, then

    J₁ = ∫₀^∞ ρ(h) dh        J₂ = {2 ∫₀^∞ ρ(h) h dh}^½

are the integral scale measures for one- and two-dimensional processes, respectively, where h denotes Euclidean distance (h = ||h||). Consider a two-dimensional process with an exponential semivariogram, no nugget, and practical range α. Then

    J₁ = ∫₀^∞ exp{−3h/α} dh = α/3        J₂ = {2 ∫₀^∞ h exp{−3h/α} dh}^½ = α√2/3.

Integral scales are used to define the distance over which observations are highly related, rather than relying on the (practical) range, which is the distance beyond which observations are not related at all. For a process with gaussian semivariogram and practical range α, by comparison, one obtains J₁ = 0.5·α√(π/3) ≈ 0.51α. The more continuous gaussian process has a longer integral scale; correlations wear off more slowly. An alternative measure to define distances over which observations are highly related is obtained by choosing a critical value of the autocorrelation and solving the correlation function for it. The distance h(α, c) at which an exponential semivariogram with range α (and no nugget) achieves correlation 0 < c ≤ 1 is h(α, c) = α(−ln{c}/3). For the gaussian semivariogram this distance is h(α, c) = α√(−ln{c}/3). The more continuous process maintains autocorrelations over longer distances.
Solie, Raun, and Stone (1999) argue that integral scales provide objective measures for the distance at which soil and plant variables are highly correlated and are useful when this distance cannot be determined based on subject matter alone. The integral scale these authors employ is a modification of J₁ where the autocorrelation function is integrated only to the (practical) range, J = ∫₀^α ρ(h) dh. For the exponential semivariogram with no nugget effect, this yields J ≈ 0.95·α/3.
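The closed forms above are easy to verify numerically. A small check (Python with scipy.integrate.quad; purely illustrative) for the exponential correlation function with practical range α = 25:

import numpy as np
from scipy.integrate import quad

alpha = 25.0
rho = lambda h: np.exp(-3.0 * h / alpha)          # exponential correlation

J1, _ = quad(rho, 0.0, np.inf)                    # alpha/3 = 8.333...
J2 = np.sqrt(2.0 * quad(lambda h: h * rho(h), 0.0, np.inf)[0])  # alpha*sqrt(2)/3

print(J1, alpha / 3.0)                            # 8.333...  8.333...
print(J2, alpha * np.sqrt(2.0) / 3.0)             # 11.785... 11.785...

# Truncating the integral at the practical range gives Solie et al.'s J:
J, _ = quad(rho, 0.0, alpha)
print(J / J1)                                     # approx. 0.95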

9.2.4 Semivariogram Estimation and Fitting

In this section we introduce the most important estimators of the semivariogram and methods of fitting theoretical semivariograms to data. Mathematical background material can be found in §A9.9.1 (on estimation) and §A9.9.2 (on fitting). An application of the estimators and fitting methods is discussed in our §9.8.2.

Empirical Semivariogram Estimators
Estimators in this class are based on summary statistics of functions of the paired differences Z(s_i) − Z(s_j). Recall that the semivariogram is defined as γ(h) = 0.5 Var[Z(s) − Z(s + h)] and that any kind of stationarity implies at least that the mean E[Z(s)] is constant. The squared difference (Z(s_i) − Z(s_j))² is then an unbiased estimator of Var[Z(s_i) − Z(s_j)], since E[Z(s_i) − Z(s_j)] = 0, and an unbiased estimator of γ(h) is obtained by calculating the average of one half of the squared differences of all pairs of observations that are exactly distance h apart. Mathematically, this estimator is expressed as

    γ̂(h) = (1 / (2|N(h)|)) Σ_{N(h)} (Z(s_i) − Z(s_j))²,                     [9.10]

where N(h) is the set of location pairs that are separated by the lag vector h and |N(h)| denotes the number of unique pairs in the set N(h). Notice that Z(s₁) − Z(s₂) and Z(s₂) − Z(s₁) are the same pair in this calculation and are not counted twice. If the semivariogram of the random field is isotropic, h is replaced by ||h||. [9.10] is known as the classical semivariogram estimator due to Matheron (1962) and is also called the Matheron estimator. Its properties are generally appealing. It is an unbiased estimator of γ(h) provided the mean of the random field is constant, and it behaves similarly to the semivariogram: it is an even function, γ̂(h) = γ̂(−h), and γ̂(0) = 0. A disadvantage of the Matheron estimator is its sensitivity to outliers. If Z(s_j) is an outlying observation, the difference Z(s_i) − Z(s_j) will be large, and squaring the difference amplifies the contribution to the empirical semivariogram estimate at lag s_i − s_j. In addition, outlying observations contribute to the estimation of γ(h) at various lags and exert their influence on more than one γ̂(h) value. Consider the following hypothetical data set, chosen small to demonstrate the effect. The data represent five locations on a (3 × 4) grid.

Table 9.1. A spatial data set containing an outlying observation Z([3, 4]) = 20

                        Column (y)
    Row (x)       1       2       3       4
       1          1       .       .       4
       2          .       2       .       .
       3          3       .       .      20
The observation in row $, column % is considerably larger than the remaining four obser-
vations. What is its effect on the Matheron semivariogram estimator? There are five lag dis-
tances in these data, at llhll œ È#, #, È&, $, and È"$ distance units. For each lag there are
exactly two data pairs. For example, the pairs contributing to the estimation of the semivario-
gram at llhll œ $ are e^ ac"ß "dbß ^ ac"ß %dbf and e^ ac$ß "dbß ^ ac$ß %dbf. The variogram
estimates are

© 2003 by CRC Press LLC


"
# ÐÈ#Ñ œ
#s ˜a"  #b# € a#  $b# ™ œ "
#
"
# Ð#Ñ œ
#s ˜a"  $b# € a%  #!b# ™ œ "$!
#
"
# ÐÈ&Ñ œ
#s ˜a%  #b# € a#!  #b# ™ œ "'%
#
"
# a$b œ
#s ˜a%  "b# € a#!  $b# ™ œ "%*
#
"
# È"$Ñ œ
2sÐ ˜a$  %b# € a#!  "b# ™ œ ")".
#
The outlying observation exerts negative influence at four of the five lags and dominates the
sum of squared differences. If the outlying observation is removed the variogram estimates
are #s # ÐÈ#Ñ œ ", #s
# Ð#Ñ œ %, #s # ÐÈ&Ñ œ %, #s # ÐÈ"$Ñ œ ".
# a$b œ *, and #s
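These hand calculations are easy to reproduce in code. The sketch below (Python/NumPy, with function names of our own choosing) implements the Matheron estimator [9.10] for an arbitrary set of locations and recovers the variogram values 2γ̂ for the five lags of Table 9.1.

import numpy as np
from collections import defaultdict

def matheron(coords, z):
    """Classical estimator [9.10]: returns {lag distance: gamma_hat}."""
    sums, counts = defaultdict(float), defaultdict(int)
    n = len(z)
    for i in range(n):
        for j in range(i + 1, n):                 # each unique pair once
            d = round(np.hypot(*(coords[i] - coords[j])), 6)
            sums[d] += (z[i] - z[j]) ** 2
            counts[d] += 1
    return {d: sums[d] / (2 * counts[d]) for d in sums}

coords = np.array([[1, 1], [1, 4], [2, 2], [3, 1], [3, 4]], dtype=float)
z = np.array([1.0, 4.0, 2.0, 3.0, 20.0])

for d, g in sorted(matheron(coords, z).items()):
    print(f"lag {d:6.3f}: 2*gamma_hat = {2 * g:6.1f}")
# lags sqrt(2), 2, sqrt(5), 3, sqrt(13) give 1, 130, 164, 149, 181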
Cressie and Hawkins (1980) derived an estimator of the semivariogram that is not as susceptible to outliers. Details of the derivation are found in their paper and are reiterated in our §A9.9.1. What has been termed the robust semivariogram estimator is based on absolute differences |Z(s_i) − Z(s_j)| rather than squared differences. The estimator has a slightly more complicated form than the Matheron estimator and is given by

    γ̄(h) = 0.5 ( (1/|N(h)|) Σ_{N(h)} |Z(s_i) − Z(s_j)|^½ )⁴ / (0.457 + 0.494/|N(h)|).   [9.11]

Square roots of absolute differences are averaged first and then raised to the fourth power. The influence of outlying observations is reduced because absolute differences are more stable than squared differences and averaging is carried out before converting into the units of a variance. Note that the attribute robust pertains to outlier contamination of the data; it should not imply that [9.11] is robust against other violations, such as nonconstancy of the mean. This estimator is not unbiased for the semivariogram, but the term 0.457 + 0.494/|N(h)| in the denominator reduces the bias considerably. Calculating the robust estimator for the spatial data set with an outlier, one obtains (where 0.457 + 0.494/2 = 0.704)
    2γ̄(√2)  = {½ (√|1 − 2| + √|2 − 3|)}⁴ / 0.704   = 1.42
    2γ̄(2)   = {½ (√|1 − 3| + √|4 − 20|)}⁴ / 0.704  = 76.3
    2γ̄(√5)  = {½ (√|4 − 2| + √|20 − 2|)}⁴ / 0.704  = 90.9
    2γ̄(3)   = {½ (√|4 − 1| + √|20 − 3|)}⁴ / 0.704  = 104.3
    2γ̄(√13) = {½ (√|3 − 4| + √|20 − 1|)}⁴ / 0.704  = 73.2.

The influence of the outlier is clearly subdued compared to the Matheron estimator.
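A direct implementation of [9.11] mirrors the Matheron sketch above; only the pair summary changes. The following minimal Python function (again with names of our own choosing) reproduces the five robust values just computed.

import numpy as np
from collections import defaultdict

def robust_sv(coords, z):
    """Cressie-Hawkins robust estimator [9.11]: {lag: gamma_bar}."""
    roots, counts = defaultdict(float), defaultdict(int)
    n = len(z)
    for i in range(n):
        for j in range(i + 1, n):
            d = round(np.hypot(*(coords[i] - coords[j])), 6)
            roots[d] += np.sqrt(abs(z[i] - z[j]))   # |diff|^(1/2), averaged first
            counts[d] += 1
    return {d: 0.5 * (roots[d] / counts[d]) ** 4
                 / (0.457 + 0.494 / counts[d]) for d in roots}

coords = np.array([[1, 1], [1, 4], [2, 2], [3, 1], [3, 4]], dtype=float)
z = np.array([1.0, 4.0, 2.0, 3.0, 20.0])
for d, g in sorted(robust_sv(coords, z).items()):
    print(f"lag {d:6.3f}: 2*gamma_bar = {2 * g:6.1f}")
# 1.4, 76.3, 90.9, 104.3, 73.2: the outlier's influence is subdued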


Other proposals to make semivariogram estimation less susceptible to outliers have been put forth. Considering the median of squared differences instead of the average squared differences is one approach. Armstrong and Delfiner (1980) generalize this idea to using any quantiles of the distribution of {Z(s_i) − Z(s_j)}². A median-based estimator is a special case thereof. It is given by

    γ̃₀.₅(h) = median{½ (Z(s_i) − Z(s_j))²} / 0.4549.                        [9.12]

The Matheron estimator [9.10] and the robust estimator [9.11] remain the most important estimators of the empirical semivariogram in practice, however.
The precision of an empirical estimator at a given lag depends on the number of pairs available at that lag that can be averaged or otherwise summarized. Recommendations that at least 50 (Chilès and Delfiner 1999, p. 38) or 30 (Journel and Huijbregts 1978, p. 194) unique pairs should be available for every lag vector h or distance ||h|| are common. Even with 50 pairs the empirical semivariogram can be quite erratic for larger lags, and simulation studies suggest that the requisite number of pairs can be considerably larger. Webster and Oliver (1992) conclude through simulation that at least 200 to 300 observations are required to estimate a semivariogram reliably. Cressie (1985) shows that the variance of the Matheron semivariogram estimator can be approximated as

    Var[γ̂(h)] ≈ 2γ²(h) / |N(h)|.                                            [9.13]

As the semivariogram increases, so does the variance of the estimator. When the semivariogram is intrinsic but not second-order stationary, the variability of γ̂(h) for large lags can make it difficult to recognize the underlying structure unless N(h) is large. We show in §A9.9.1 that [9.13] can be a poor approximation to Var[γ̂(h)], which also depends on the degree of spatial autocorrelation and the spatial arrangement of the sampling locations. This latter dependence has been employed to determine sample grids and layouts that lead to good properties of the empirical semivariogram estimator without requiring too many observations. For details see, for example, Russo (1984), Warrick and Myers (1987), and Zheng and Silliman (2000).
With irregularly spaced data the number of observations at a given lag may be small; some lags may even be unique. To collect a sufficient number of pairs, the set N(h) is then defined as the collection of pairs whose locations are separated by h ± ε or ||h|| ± ε, where ε is some lag tolerance. In other words, the empirical semivariogram is calculated only for discrete lag classes, and all observations within a lag class are considered to represent that particular lag. This introduces two potential problems. The term {Z(s_i) − Z(s_j)}² is an unbiased estimator of 2γ(s_i − s_j), but not of 2γ(s_i − s_j + ε), and grouping lags into lag classes introduces some bias. Furthermore, the empirical semivariogram depends on the width and number of lag classes, which introduces a subjective element into the analysis.
The goal of semivariogram estimation is not to estimate the empirical semivariogram given by [9.10], [9.11], or [9.12], but to estimate the unknown parameters of a theoretical semivariogram model γ(h; θ). The least squares and nonparametric approaches fit the semivariogram model to the empirical semivariogram. If [9.10] was calculated at lags h₁, h₂, …, h_k, then γ̂(h₁), γ̂(h₂), …, γ̂(h_k) serve as the data to which the semivariogram model is fit (Figure 9.17). We call this the indirect approach to semivariogram estimation, since an empirical estimate is obtained first, which then serves as the data. Note that by choosing more lag classes one can apparently increase the size of this data set. Of the direct approaches we consider likelihood methods (maximum likelihood and restricted maximum likelihood) as well as a likelihood-type method (composite likelihood).

[Figure: empirical semivariogram values γ̄(||h||) plotted against lag ||h|| from 0 to 90.]

Figure 9.17. A robust empirical estimate of the semivariogram. γ̄(h) was calculated at k = 13 lag classes of width 7. The semivariogram estimates are plotted at the average lag distance within each class. Connecting the dots does not guarantee that the resulting function is conditionally negative definite.

Fitting the Semivariogram by Least Squares

In least squares methods to estimate the semivariogram, the empirical semivariogram values γ̂(h₁), …, γ̂(h_k) or γ̄(h₁), …, γ̄(h_k) (or some other empirical estimate of γ(h)) serve as the responses; k denotes the number of lag classes. We discuss the least squares methods using the Matheron estimator; the robust estimator can be used instead. Ordinary least squares (OLS) estimates of θ are found by minimizing the sum of squared deviations between the empirical semivariogram and a theoretical semivariogram:

    Σ_{i=1}^{k} (γ̂(h_i) − γ(h_i; θ))².                                      [9.14]

OLS requires that the data points are uncorrelated and homoscedastic. Both assumptions are not met. For the Matheron estimator Cressie (1985) showed that its variance is approximately

    Var[γ̂(h)] ≈ 2γ²(h) / |N(h)|.                                            [9.15]

It depends on the true semivariogram value at lag h and the number of unique data pairs at that lag. The γ̂(h_i) are also not uncorrelated. The same data point Z(s_i) contributes to the estimation at different lags, and there is spatial autocorrelation among the data points. The same essential problems remain if γ̄(h) is used in place of γ̂(h). The robust estimator has an advantage, however; its values are less correlated than those of the Matheron estimator.
One should use a generalized least squares criterion rather than ordinary least squares. Write γ̂(h) = [γ̂(h₁), …, γ̂(h_k)]′ and γ(h; θ) = [γ(h₁; θ), …, γ(h_k; θ)]′, and denote the variance-covariance matrix of γ̂(h) by V. Then one should minimize

    (γ̂(h) − γ(h; θ))′ V⁻¹ (γ̂(h) − γ(h; θ)).                                 [9.16]

The problem with the generalized least squares approach lies in the determination of the variance-covariance matrix V. Cressie (1985, 1993 p. 96) gives expressions from which the off-diagonal entries of V can be calculated for a Gaussian random field. These are complicated expressions of the true semivariogram, and as a simplification one often resorts to weighted least squares (WLS) fitting. Here, V is replaced by a diagonal matrix W that contains the variances of the γ̂(h_i) on the diagonal, and the approximation [9.15] is used to calculate the diagonal entries. The weighted least squares estimates of θ are obtained by minimizing

    (γ̂(h) − γ(h; θ))′ W⁻¹ (γ̂(h) − γ(h; θ)),                                 [9.17]

where W = Diag{2γ²(h_i; θ)/|N(h_i)|}. We show in §A9.9.2 that this is equivalent to minimizing

    Σ_{i=1}^{k} |N(h_i)| { γ̂(h_i)/γ(h_i; θ) − 1 }²,                         [9.18]

which is (2.6.12) in Cressie (1993, p. 96). If the robust estimator is used instead of the Matheron estimator, γ̂(h_i) in [9.18] is replaced with γ̄(h_i). Note that semivariogram models are typically nonlinear, with the exception of the nugget-only and the linear models, and minimization of these objective functions requires nonlinear methods.
The weighted least squares method for fitting semivariogram models is very common in practice. One must keep in mind that minimizing [9.18] is an approximate method. First, [9.15] is an approximation for the variance of the empirical estimator. Second, W is a poor approximation for V. The weighted least squares method is a poor substitute for the generalized least squares method that should be used. Delfiner (1976) developed a different weighted least squares method that is implemented in the geostatistical package BLUEPACK (Delfiner, Renard, and Chilès 1978). Zimmerman and Zimmerman (1991) compared various semivariogram fitting methods in an extensive simulation study and concluded that there is little to choose between ordinary and weighted least squares. The Gaussian random fields simulated by Zimmerman and Zimmerman (1991) had a linear semivariogram with nugget effect and a no-nugget exponential structure. Neither the WLS nor the OLS estimates were uniformly superior in terms of bias for the linear semivariogram. The weighted least squares method due to Delfiner (1976) performed very poorly, however, and was uniformly inferior to all other methods (including the likelihood methods to be discussed next). In case of the exponential semivariogram the least squares estimators of the sill exhibited considerable positive bias, in particular when the spatial dependence was weak.
In WLS and OLS fitting of the semivariogram, care should be exercised in the interpretation of the standard errors for the parameter estimates reported by statistical packages. Neither method uses the correct variance-covariance matrix V. Instead, WLS uses a diagonal matrix where the diagonal entries of V are approximated, and OLS uses a scaled identity matrix. Also, the size of the data set to which the semivariogram model is fit depends on the number of lag classes k, which is chosen by the user. The number of lag classes to which the semivariogram is fit is also often smaller than the number of lag classes for which the empirical semivariogram estimator was calculated. In particular, values at large lags and lag classes for which the number of pairs does not exceed the rule-of-thumb value |N(h)| > 30 (or |N(h)| > 50) are removed before fitting the semivariogram by the least squares method. This invariably results in a data set that is slanted toward the chosen semivariogram model, because lag classes whose empirical semivariogram values are consistent with the modeled trend are usually retained and those lag classes are removed whose values appear erratic. Journel and Huijbregts (1978, p. 194) recommend using only lags (lag classes) less than half of the maximum lag in the data set.

Fitting the Semivariogram by Maximum Likelihood

The least squares methods did not require that the random field be a Gaussian random field, or knowledge of any distributional properties of the random field beyond the constancy of the mean and the correctness of the semivariogram model. Maximum likelihood estimation requires knowledge of the distribution of the data. Usually it is assumed that the data are Gaussian, and for spatial data this implies sampling from a Gaussian random field. Z(s) = [Z(s₁), …, Z(s_n)]′ is then an n-variate Gaussian random variable. Under the assumption of second-order stationarity its mean and covariance matrix can be written as E[Z(s)] = μ = [μ₁, …, μ_n]′ and Var[Z(s)] = Σ(θ),

    Σ(θ) = [ C(0; θ)          C(s₁ − s₂; θ)    C(s₁ − s₃; θ)    …    C(s₁ − s_n; θ)
             C(s₂ − s₁; θ)    C(0; θ)          C(s₂ − s₃; θ)    …    C(s₂ − s_n; θ)
             C(s₃ − s₁; θ)    C(s₃ − s₂; θ)    C(0; θ)          …    C(s₃ − s_n; θ)
             ⋮                                                  ⋱    ⋮
             C(s_n − s₁; θ)   C(s_n − s₂; θ)   …   C(s_n − s_{n−1}; θ)   C(0; θ) ].

Note that instead of the semivariogram we work with the covariance function here, but because the process is second-order stationary, the semivariogram and covariogram are related by

    γ(h; θ) = C(0; θ) − C(h; θ).

In short, Z(s) ~ G(μ, Σ(θ)), where θ is the vector containing the parameters of the covariogram. The negative log-likelihood of Z(s) is

    ℓ(θ, μ; z(s)) = (n/2) ln(2π) + ½ ln|Σ(θ)| + ½ (z(s) − μ)′ Σ(θ)⁻¹ (z(s) − μ),   [9.19]

and maximum likelihood (ML) estimates of θ (and μ) are obtained as minimizers of this expression. Compare this objective function to that for fitting a linear mixed model by maximum likelihood in §7.4.1 and §A7.7.3. There the objective was to estimate the parameters in the variance-covariance matrix and the unknown mean vector. The same idea applies here. If μ = Xβ, where β are unknown parameters of the mean, maximum likelihood estimation provides simultaneous estimates of the large-scale mean structure (called the drift in the geostatistical literature) and the spatial dependency. This is an advantage over the indirect least squares methods, where the assumption of mean stationarity (E[Z(s_i)] = μ) is critical to obtain the empirical semivariogram estimate.
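The objective [9.19] translates almost line for line into code. The sketch below (Python/NumPy; the covariance function and the data are placeholders of our own) evaluates the Gaussian negative log-likelihood for a constant-mean field with exponential covariogram; an optimizer such as scipy.optimize.minimize could then be wrapped around it to produce ML estimates of (μ, θ).

import numpy as np

def neg_loglik(params, coords, z):
    """Negative log-likelihood [9.19] for a constant mean mu and an
    exponential covariogram with nugget."""
    mu, psill, alpha, nugget = params
    d = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2))
    # C(h) = psill*exp(-3h/alpha) off the diagonal, nugget + psill on it
    Sigma = psill * np.exp(-3.0 * d / alpha) + nugget * np.eye(len(z))
    r = z - mu
    sign, logdet = np.linalg.slogdet(Sigma)
    quad = r @ np.linalg.solve(Sigma, r)
    return 0.5 * (len(z) * np.log(2 * np.pi) + logdet + quad)

# toy evaluation at trial parameter values (mu, partial sill, range, nugget)
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(30, 2))
z = rng.normal(5.0, 1.0, size=30)
print(neg_loglik([5.0, 0.8, 4.0, 0.2], coords, z))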
Unbiasedness is not an asset of the ML estimator of the spatial dependence parameters θ. It is well known that maximum likelihood estimators of covariance parameters are negatively biased. In §7 the restricted maximum likelihood (REML) method of Patterson and Thompson (1971) and Harville (1974, 1977) was advocated to reduce this bias. The same ideas apply here. Instead of the likelihood of Z(s) we consider that of KZ(s), where K is a matrix of error contrasts such that E[KZ(s)] = 0. Instead of [9.19] we minimize

    ℓ(θ; Kz(s)) = ½ {(n − 1) ln(2π) + ln|KΣ(θ)K′| + z(s)′K′[KΣ(θ)K′]⁻¹Kz(s)}.   [9.20]

Although REML estimation is well established in statistical theory and applications, in the geostatistical arena it appeared first in work by Kitanidis and coworkers in the mid-1980s (Kitanidis 1983, Kitanidis and Vomvoris 1983, Kitanidis and Lane 1985).
An advantage of the ML and REML approaches is their direct dependence on the data
Zasb. No grouping in lag classes is necessary. Because maximum likelihood estimation does
not require an empirical semivariogram estimate, Chilès and Delfiner (1999, p. 109) call it a
blind method that tends “to be used only when the presence of a strong drift causes the
sample variogram to be hopelessly biased.” We disagree with this stance, the ML estimators
have many appealing properties, e.g., asymptotic efficiency. In their simulation study
Zimmerman and Zimmerman (1991) found the ML estimators of the semivariogram sill to be
much less variable than any other estimators of that parameter. Also, likelihood-based estima-
tors outperformed the least squares estimators when the spatial dependence was weak. Fitting
of a semivariogram by likelihood methods is not a blind process. If the random field has
large-scale mean EcZasbd œ X", then one can obtain residuals from an initial ordinary least
squares fit of the mean X" and use the residuals to calculate an empirical semivariogram
which guides the user to the formulation of a theoretical semivariogram or covariogram
model. Then, both the mean parameters " and the covariance parameters ) are estimated
simultaneously by maximum likelihood or restricted maximum likelihood. To call these
methods blind suggests that the models are formulated without examination of the data and
without a thoughtful selection of the model.
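Because [9.19] is an ordinary optimization problem, it can be minimized with any general-purpose optimizer. The following minimal sketch does so in Python rather than in the SAS® procedures discussed below; the exponential covariance model, the coordinates, the placeholder data, and the starting values are all hypothetical.

import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
coords = rng.uniform(0, 100, size=(50, 2))    # hypothetical sample sites
z = rng.normal(10, 2, size=50)                # placeholder attribute values
D = squareform(pdist(coords))                 # n x n distance matrix

def neg_loglik(par, z, D):
    """Negative log-likelihood [9.19] for a constant mean mu and an
    exponential covariance C(h) = nugget*1(h=0) + psill*exp(-3h/range)."""
    mu, nugget, psill, prange = par
    if min(nugget, psill, prange) <= 0:
        return np.inf                         # crude positivity check
    n = len(z)
    Sigma = psill * np.exp(-3.0 * D / prange) + nugget * np.eye(n)
    _, logdet = np.linalg.slogdet(Sigma)
    r = z - mu
    return 0.5 * (n * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(Sigma, r))

fit = minimize(neg_loglik, x0=[np.mean(z), 0.5, 2.0, 30.0],
               args=(z, D), method="Nelder-Mead")
print(fit.x)   # [mu, nugget, partial sill, practical range] estimates

For a constant mean, the REML objective [9.20] can be evaluated with the same ingredients: up to an additive constant it amounts to adding $\frac{1}{2}\ln(1'\Sigma(\theta)^{-1}1)$ to [9.19] evaluated at the generalized least squares estimate of $\mu$, so the routine above is easily adapted.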

Fitting the Semivariogram by Composite Likelihood

The idea of composite likelihood estimation is quite simple and dates back to work by Lindsay (1988). Lele (1997) and Curriero and Lele (1999) applied it to semivariogram estimation and Heagerty and Lele (1998) to prediction of spatial binary data. Let $Y_1, \ldots, Y_n$ be random variables whose marginal distribution $f(y_i; \theta)$ is known up to a parameter vector $\theta = [\theta_1, \ldots, \theta_p]'$. Then $\ell(\theta; y_i) = \ln\{f(y_i; \theta)\}$ is a true log-likelihood. For maximum likelihood estimation the log-likelihood for the joint distribution of the $Y_i$ is needed. Often this joint distribution is known, as in the previous paragraphs. If the $Y_i$ are independent, then the full data log-likelihood is particularly simple: it is the sum of the individual terms $\ell(\theta; y_i)$. There are instances, however, when the complete data log-likelihood is not known or is intractable. A case in point is correlated observations. What is lost by using

$$\sum_{i=1}^n \ell(\theta; y_i) \qquad [9.21]$$

as the objective function for maximization? Obviously, [9.21] is not a log-likelihood, although the individual terms $\ell(\theta; y_i)$ are. The estimates obtained by maximizing [9.21] cannot be as efficient as ML estimates, which is easily established from key results in estimating function theory (see Godambe 1960, Heyde 1997, and our §A9.9.2). The function [9.21] is called a composite log-likelihood and its derivative,

$$CS(\theta_k; y) = \sum_{i=1}^n \frac{\partial \ell(\theta_k; y_i)}{\partial \theta_k}, \qquad [9.22]$$

is the composite score function for $\theta_k$. Setting the composite score functions for $\theta_1, \ldots, \theta_p$ to zero and solving the resulting system of equations yields the composite likelihood estimates. Applying this idea to the problem of estimating the semivariogram, we commence by considering the $n(n-1)/2$ unique pairwise differences $T_{ij} = Z(s_i) - Z(s_j)$. When the $Z(s_i)$ are Gaussian with the same mean and the random field is intrinsically stationary (this is a weaker assumption than assuming an intrinsically stationary Gaussian random field), then $T_{ij}$ is a Gaussian random variable with mean $0$ and variance $2\gamma(s_i - s_j; \theta)$. We show in §A9.9.2 that the composite score function for the $T_{ij}$ is

$$CS(\theta; t) = \sum_{i=1}^{n-1}\sum_{j>i} \frac{\partial\gamma(s_i - s_j;\theta)}{\partial\theta}\,\frac{1}{4\gamma^2(s_i - s_j;\theta)}\left(t_{ij}^2 - 2\gamma(s_i - s_j;\theta)\right). \qquad [9.23]$$

Although this is a complicated-looking expression, it is really the nonlinear weighted least squares objective function in the model

$$T_{ij}^2 = 2\gamma(s_i - s_j;\theta) + e_{ij},$$

where the $e_{ij}$ are independent random variables with mean $0$ and variance $8\gamma^2(s_i - s_j;\theta)$. Note the correspondence of $\text{Var}[T_{ij}^2]$ to Cressie's variance approximation for the Matheron estimator [9.15]. The expressions are the same considering that $2\hat\gamma(h)$ is an average of the $T_{ij}^2$.
The composite likelihood estimator can be calculated easily with a nonlinear regression package capable of weighted least squares fitting, such as proc nlin in The SAS® System. Obtaining the (restricted) maximum likelihood estimate requires a procedure that can minimize [9.19] or [9.20], such as proc mixed. The minimization problem in ML or REML estimation is numerically much more involved. One of the main problems there is that the matrix $\Sigma(\theta)$ must be inverted repeatedly. For clustered data as in §7, where the variance-covariance matrix is block-diagonal, this is not too cumbersome; the matrix can be inverted block by block. In the case of spatial data $\Sigma(\theta)$ does not have a block-diagonal structure and in general no shortcuts can be taken. Zimmerman (1989) derives some simplifications when the observations are collected on a rectangular or parallelogram grid. Composite likelihood (CL) estimation, on the other hand, replaces the inversion of one large matrix with many inversions of small matrices. The largest matrix to be inverted for a semivariogram model with 3 parameters (nugget, sill, range) is a $3 \times 3$ matrix. However, CL estimation processes many more data points. With $n = 100$ spatial observations there are $n(n-1)/2 = 4{,}950$ pairs. That many observations are hardly needed. It is quite reasonable to remove from estimation those pairs whose spatial distance is many times greater than the likely range, or even to randomly subsample the $n(n-1)/2$ distances. An advantage over the least squares type methods is the reliance on the data directly, without binning pairwise differences into lag classes.
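To make the weighted least squares connection concrete, the sketch below maximizes the composite log-likelihood formed from the Gaussian contributions of the pairwise differences, $T_{ij} \sim G(0, 2\gamma(s_i - s_j;\theta))$. It uses Python rather than proc nlin; the exponential semivariogram, the distance cutoff, and all numerical values are hypothetical.

import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
coords = rng.uniform(0, 100, size=(60, 2))    # hypothetical sample sites
z = rng.normal(10, 2, size=60)                # placeholder attribute values

h = pdist(coords)                             # n(n-1)/2 pairwise distances
t2 = pdist(z[:, None], metric="sqeuclidean")  # squared differences T_ij^2

keep = h < 50.0     # drop pairs far beyond the likely range (optional)
h, t2 = h[keep], t2[keep]

def neg_cl(par, h, t2):
    """Negative composite log-likelihood; each T_ij contributes the
    log-density of a N(0, 2*gamma) variate (constants dropped), with
    gamma(h) = nugget + psill*(1 - exp(-3h/range))."""
    nugget, psill, prange = par
    if min(nugget, psill, prange) <= 0:
        return np.inf
    gam = nugget + psill * (1.0 - np.exp(-3.0 * h / prange))
    return np.sum(0.5 * np.log(gam) + t2 / (4.0 * gam))

fit = minimize(neg_cl, x0=[0.5, 2.0, 30.0], args=(h, t2),
               method="Nelder-Mead")
print(fit.x)   # [nugget, partial sill, practical range] estimates

Setting the gradient of this objective to zero reproduces the composite score [9.23].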
Recently, generalized estimating equations (GEE) have received considerable attention.
In the mid-1980s they were mostly employed for the estimation of mean parameters follow-
ing work by Liang and Zeger (1986) and Zeger and Liang (1986). Later the GEE methodolo-
gy was extended to the estimation of association parameters, for example, the correlation
among repeated measurements (see, e.g., Prentice 1988, Zhao and Prentice 1990). McShane
et al. (1997) applied GEE techniques for the estimation of the dependence in spatial data. It
turns out that there is a direct connection between generalized estimating equations for depen-
dence parameters and the composite likelihood method. §A9.9.2 contains the details.

Adding Flexibility: Nested Models and Nonparametric Fitting


One of the appealing features of nonparametric methods of statistical modeling is the absence of a rigid mathematical model. In the parametric setting the user chooses a class of models and estimates the unknown parameters of the model based on data to select one member of the class, which is the fitted model. For example, when fitting an exponential semivariogram, we assume that the process has a semivariogram of form

$$\gamma(h;\theta) = \begin{cases} 0 & h = 0 \\ \theta_0 + \theta_s\left\{1 - \exp\left(-\dfrac{3\|h\|}{\alpha}\right)\right\} & h \neq 0 \end{cases}$$

and the nugget $\theta_0$, sill $\theta_s$, and practical range $\alpha$ are estimated based on data. Our list of isotropic semivariogram models in §9.2.2 is relatively short. Although many more semivariogram models are known, typically users resort to one of the models shown there. In applications one may find that none of these describes the empirical semivariogram well, for example, because the random field does not have constant mean, is anisotropic, or consists of different scales of variation. The latter reason is the idea behind what is termed the linear model of regionalization in the geostatistical literature (see, for example, Goovaerts 1997, Ch. 4.2.3). Statistically, it is based on the facts that (i) if $C_1(h)$ and $C_2(h)$ are valid covariance structures in $\mathbb{R}^2$, then $C_1(h) + C_2(h)$ is also a valid covariance structure in $\mathbb{R}^2$; and (ii) if $C(h)$ is a valid structure, so is $bC(h)$ provided $b > 0$. As a consequence, linear combinations of permissible covariance models lead to an overall permissible model. The coefficients in the linear combination must be positive, however. The same results hold for semivariograms.

The linear model of regionalization assumes that the random function $Z(s)$ is a linear combination of $p$ stationary zero-mean random functions. If $U_j(s)$ is a second-order stationary random function with $E[U_j(s)] = 0$, $\text{Cov}[U_j(s), U_j(s+h)] = C_j(h)$, and $a_1, \ldots, a_p$ are positive constants, then

$$Z(s) = \sum_{j=1}^p a_j U_j(s) + \mu \qquad [9.24]$$

is a second-order stationary random function with mean $\mu$, covariance function

$$C_Z(h) = \text{Cov}[Z(s+h), Z(s)] = \sum_{j,k} a_j a_k \text{Cov}[U_j(s), U_k(s+h)] = \sum_{j=1}^p a_j^2\, C_j(h), \qquad [9.25]$$

and semivariogram $\gamma_Z(h) = \sum_{j=1}^p a_j^2 \gamma_j(h)$, provided that the individual processes $U_1(s), \ldots, U_p(s)$ are not correlated. If the individual semivariograms $\gamma_j(h)$ have sill $1$, then $\sum_{j=1}^p a_j^2$ is the variance of an observation. Covariogram and semivariogram models derived from a regionalization such as [9.24] are called nested models. Every semivariogram model containing a nugget effect is thus a nested model.
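The additivity in [9.25] is easy to exploit in computations. Below is a minimal sketch of a nested semivariogram assembled from a nugget, an exponential, and a spherical component, each with unit sill; the coefficients $a_j^2$ and the ranges are hypothetical.

import numpy as np

def gamma_exp(h, prange):
    """Exponential semivariogram with unit sill and practical range."""
    return 1.0 - np.exp(-3.0 * h / prange)

def gamma_sph(h, prange):
    """Spherical semivariogram with unit sill and range."""
    r = np.minimum(h / prange, 1.0)
    return 1.5 * r - 0.5 * r**3

def gamma_nested(h, a2=(0.4, 1.1, 0.8)):
    """gamma_Z(h) = sum_j a_j^2 * gamma_j(h); with unit-sill components
    the total sill is sum_j a_j^2."""
    h = np.asarray(h, dtype=float)
    nugget = np.where(h > 0, 1.0, 0.0)        # nugget component
    return a2[0] * nugget + a2[1] * gamma_exp(h, 20.0) + a2[2] * gamma_sph(h, 60.0)

h = np.linspace(0.0, 100.0, 201)
print(gamma_nested(h)[-1])                    # approaches 0.4 + 1.1 + 0.8 = 2.3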
The variability of a soil property is related to many causes that have different spatial
scales, each scale integrating variability at all smaller scales (Russo and Jury 1987a). If the
total variability of an attribute varies with the spatial scale or resolution, nested models can
capture this dependency, if properly modeled. Nesting models is thus a convenient way to
construct theoretical semivariogram models that offer greater flexibility than the basic models
in §9.2.2. Nested models are not universally accepted, however. Stein (1999, p. 13) takes
exception to nested models where the individual components are spherical models. A danger
of nesting semivariograms is to model the effects of nonconstancy of the mean on the empiri-
cal semivariogram through a creative combination of second-order stationary and intrinsically
stationary semivariograms. Even if this combination fits the empirical semivariogram well, a
critical assumption of variogram analysis has been violated. Furthermore, the assumption of
mutual independence of the individual random functions $U_1(s), \ldots, U_p(s)$ must be evaluated
mutual independence of the individual random functions Y" asbß âß Y: asb must be evaluated
with great care. Nugget effects that are due to measurement errors are reasonably assumed to
be independent of the other components. But a component describing smaller scale variability
due to soil nutrients may not be independent of a larger scale component due to soil types or
geology.
To increase the flexibility in modeling the semivariogram of stationary isotropic processes without violating the condition of positive definiteness of the covariogram (conditional negative definiteness of the semivariogram), nonparametric methods can be employed. The rationale behind the nonparametric estimators (a special topic in §A9.9.3) is akin to the nesting of covariogram models in the linear model of regionalization. The covariogram is expressed as a weighted combination of functions, each of which is a valid covariance function. Instead of combining theoretical covariogram models, however, the nonparametric approach combines positive-definite functions that are derived from a spectral representation. These are termed the basis functions. For data on a transect the basis function is $\cos(h)$, for data in the plane it is the Bessel function of the first kind of order zero, and for data in three dimensions it is $\sin(h)/h$ (Figure 9.18).

The flexibility of the nonparametric approach is demonstrated in Figure 9.19, which shows semivariograms constructed with $m = 5$ equally spaced nodes and a maximum lag of $h = 10$. The functions shown as solid lines have equal weights $a_j^2 = 0.2$. The dashed lines are produced by setting $a_1^2 = a_5^2 = 0.5$ and all other weights to zero. The smoothness of the semivariogram decreases with the unevenness of the weights and the number of sign changes of the basis function.

[Figure 9.18 appears here.]

Figure 9.18. Basis functions for two-dimensional data (solid line, Bessel function of the first kind of order 0) and for three-dimensional data (dashed line, $\sin(x)/x$).

[Figure 9.19 appears here; five panels (0 through 4 sign changes of the basis function), vertical axis semivariogram $\gamma(h)$, horizontal axis lag distance $h$.]

Figure 9.19. Semivariogram models constructed as linear combinations of Bessel functions of the first kind (of order 0) as a function of the weight distribution and the number of sign changes of the basis function. Semivariograms with five nodes and equal weights are shown as solid lines, semivariograms with unequal weights as dashed lines.
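The following sketch mimics the planar construction: a semivariogram assembled as a nonnegative combination of Bessel $J_0$ terms. The node values $\kappa_j$ and the weights are assumptions made for illustration; the point is the construction itself, not the particular nodes.

import numpy as np
from scipy.special import j0

def gamma_np(h, kappas, a2):
    """gamma(h) = sum_j a_j^2 * {1 - J0(kappa_j * h)}; each term derives
    from a valid planar covariance, so the combination is valid whenever
    all a_j^2 >= 0."""
    h = np.asarray(h, dtype=float)[:, None]
    return np.sum(a2 * (1.0 - j0(h * kappas)), axis=1)

kappas = np.linspace(0.5, 2.5, 5)             # m = 5 equally spaced nodes
h = np.linspace(0.0, 10.0, 101)               # lags up to h = 10
equal = gamma_np(h, kappas, np.full(5, 0.2))  # analogue of the solid lines
uneven = gamma_np(h, kappas, np.array([0.5, 0.0, 0.0, 0.0, 0.5]))  # dashed
print(equal[-1], uneven[-1])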



9.3 The Spatial Model
Box 9.5 Spatial Model

• The Spatial Model is a statistical model decomposing the variability in a


random function into a deterministic mean structure and one or more
spatial random processes.

• The decomposition is not unique and components may be confounded in a


particular application.

• We distinguish between signal and mean models. In the former interest is


primarily in prediction of the signal, whereas in the latter, estimation of the
mean structure is more important.

• Reactive effects are modeled through the mean structure, interactive effects
are modeled through the random structure. For geostatistical data inter-
active effects are represented through stationary random processes, for lat-
tice data through autoregressive neighborhood structures.

So far we have been concerned with properties of random fields and the semivariogram or co-
variogram of a stationary process. Although the fitting of a semivariogram entails modeling,
this is only one aspect of representing the structure in spatial data in a manner conducive to a
statistical analysis. The constancy of the mean assumption implied by stationarity, for
example, is not reasonable in many applications. In a field experiment where treatments are
applied to experimental units the variation among units is not just due to spatial variation
about a constant mean but also due to the effects of the treatments. Our view of spatial data
must be extended to accommodate changes in the mean structure, stationarity, and measure-
ment error. One place to start is to decompose the variability in $Z(s)$ into various sources. Following Cressie (1993, Ch. 3.1), we write for geostatistical data

$$Z(s) = \mu(s) + W(s) + \eta(s) + e(s). \qquad [9.26]$$

This decomposition is akin to the breakdown into sources of variability in an analysis of variance model, but it is largely operational. It may be impossible in a particular application to separate the components. It is, however, an excellent starting point to delineate some of the approaches to spatial data analysis. The large-scale variation of $Z(s)$ is expressed through the deterministic mean $\mu(s)$. By implication all other components must have expectation $0$. The mean can depend on spatial location and other variables. $W(s)$ is called the smooth small-scale variation; it is a stationary process with semivariogram $\gamma_W(h)$ whose range is larger than the smallest lag distance in the sample. The variogram of the smooth small-scale variation should thus exhibit some spatial structure and can be modeled by the techniques in §9.2.4. $\eta(s)$ is a spatial process with variogram $\gamma_\eta(h)$ whose range is less than the smallest lag in the data set. Cressie (1993, p. 112) terms it microscale variation. The semivariogram of $\eta(s)$ cannot be modeled; no data are available at lags less than its range. The presence of microscale variation is reflected in the variogram of $Z(s) - \mu(s)$ as a nugget effect, which measures the sill $\sigma^2_\eta$ of $\gamma_\eta(h)$. $e(s)$, finally, is a white-noise process representing measurement error. The variance of $e(s)$, $\sigma^2_e$, also contributes to the nugget effect. There are three random components on the right-hand side of the mixed model [9.26]. These are usually assumed to be independent of each other.
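To fix ideas, the sketch below simulates one realization of the decomposition on a transect; the trend, the covariance parameters, and the ranges are hypothetical choices, with the range of $\eta(s)$ set below the smallest lag.

import numpy as np

rng = np.random.default_rng(3)
s = np.linspace(0.0, 100.0, 101)               # transect locations, lag 1
D = np.abs(s[:, None] - s[None, :])            # distance matrix

def grf(D, sill, prange):
    """One realization of a zero-mean Gaussian process with exponential
    covariance sill*exp(-3h/prange)."""
    C = sill * np.exp(-3.0 * D / prange)
    return rng.multivariate_normal(np.zeros(D.shape[0]), C)

mu = 10.0 + 0.05 * s                           # deterministic mean mu(s)
W = grf(D, sill=1.0, prange=30.0)              # smooth small-scale variation
eta = grf(D, sill=0.3, prange=0.5)             # microscale: range < min lag
e = rng.normal(0.0, np.sqrt(0.2), size=s.size) # measurement error
Z = mu + W + eta + e                           # observed process, cf. [9.26]

The sills of $\eta(s)$ and $e(s)$ in this sketch ($0.3$ and $0.2$) are exactly the two contributions to the nugget effect described above.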
With this decomposition in place, we define two basic types of models by combining one or more components.

1. Signal Model. Let $S(s) = \mu(s) + W(s) + \eta(s)$ denote the signal of the process. Then $Z(s) = S(s) + e(s)$. If the task is to predict the attribute of interest at unobserved locations, it is not reasonable to predict the noisy version $Z(s)$, which is affected by measurement error (unless $\sigma^2_e = 0$). We are interested in the value of a soil attribute at location $s_0$, not in the value we would measure in error if a sample at $s_0$ were taken. Spatial prediction (kriging) should then focus on predicting the signal $S(\cdot)$. Only if the data are measured without error is prediction of $Z(s)$ the appropriate course of action (in this case $Z(s) = S(s)$). The controversy whether the kriging predictor is a perfect interpolator or not is in large measure explained by considering prediction of $Z(s)$ vs. $S(s)$ in the presence of measurement error (see §A9.9.5), and by whether the nugget effect is recognized as originating from measurement error or as part of the signal through microscale variation ($\sigma^2_\eta > 0$, $\sigma^2_e = 0$).

2. Mean Model. Let $\delta(s) = W(s) + \eta(s) + e(s)$ denote the error process (the stochastic part of $Z(s)$). Then $Z(s) = \mu(s) + \delta(s)$. This model is the entry point for spatial regression and analysis of variance, where focus is on modeling the mean function $\mu(s)$ as a function of covariates and point location, and $\delta(s)$ is assumed to have a spatial autocorrelation structure (§9.5). In this formulation $\mu(s)$ is sometimes called the large-scale trend and $\delta(s)$ simply the small-scale trend. All types of spatial data have their specific tools to investigate large- and small-scale properties of a random field. The large-scale trend is captured by the mean function for geostatistical data, the mean vector for lattice data, and the intensity for a spatial point process. The small-scale structure is captured by the semivariogram and covariogram for geostatistical and lattice data, and the $K$-function for point patterns (§9.7).
The signal and mean models have different focal points. In the former we are primarily interested in the stochastic behavior of the random field and, if it is spatially structured, employ this fact to predict $Z(s)$ or $S(s)$ at observed and unobserved locations. The mean $E[Z(s)] = \mu(s)$ is somewhat ancillary in these investigations, apart from the fact that if $\mu(s)$ depends on location we must pay it special attention to model the stochastic structure properly, since $S(s)$ (and by extension $Z(s)$) will not be stationary. This is the realm of the geostatistical method (§9.4) that calls on kriging methods. It is in the assumptions about $\mu(s)$ that geostatistical methods of spatial prediction are differentiated into simple, ordinary, and universal kriging. In the mean model $Z(s) = \mu(s) + \delta(s)$ interest lies primarily in modeling the large-scale trend (mean structure $\mu(s)$) of the process, and in turn the stochastic structure that arises from $\delta(s)$ is somewhat ancillary. We must pay attention to the stochastic properties of $\delta(s)$, however, to ensure that the inferences drawn about the mean are appropriate. For example, if spatial autocorrelation exists, the semivariogram of $\delta(s)$ will not be a nugget-only model, and taking the correlations among observations into account is critical to obtain reliable inference about the mean $\mu(s)$. An example application where $\mu(s)$ is of primary importance is the execution of a large field experiment where the experimental units are arranged in such a fashion that the effects of spatial autocorrelation are not removed by the blocking scheme. Although randomization neutralizes the spatial effects by balancing them across the units under conceptual repetitions of the basic experiment, we execute the experiment only once and may obtain a layout where spatial dependency among experimental units increases the experimental error variance to a point where meaningful inferences about the treatment effects (information that is captured by $\mu(s)$) are no longer possible. Incorporating the spatial effects in the analysis, by modeling $\mu(s)$ as a function of treatment effects and $\delta(s)$ as a spatial random field, can assist in recovering vital information about treatment performance.
In particular for mean models, the analyst must decide which effects are part of the large-scale structure $\mu(s)$ and which are components of the error structure $\delta(s)$. There is no unanimity among researchers. One modeler's fixed effect is someone else's random effect. This contributes to the nonuniqueness of the decomposition [9.26]. Consider the special case where $\mu(s)$ is linear,

$$Z(s) = x'(s)\beta + \delta(s),$$

and $x(s)$ is a vector of regressor variables that can depend on spatial coordinates alone or on other explanatory variables and factors. Cliff and Ord (1981, Ch. 6) distinguish between reaction and interaction models. In a reaction model sites react to outside influences; e.g., plants will react to the availability of nutrients in the root zone. Since this availability varies spatially, plant size or biomass will exhibit a regression-like dependence on nutrient availability. It is then reasonable to include nutrient availability as a covariate in the regressor vector $x(s)$. In an interaction model, sites react not to outside influences but with each other. Neighboring plants compete with each other for resources, for example. In general, when the dominant spatial effects are caused by sites reacting to external forces, these effects should be part of the mean function $x'(s)\beta$. Interactive effects (reaction among sites) call for modeling spatial variability through the spatial autocorrelation structure of the error process.

The distinction between reactive and interactive models is useful, but not cut-and-dried. Significant autocorrelation in the data does not imply an interactive model over a reactive one, or vice versa. Spatial autocorrelation can be spurious if caused by large-scale trends, or real if caused by cumulative small-scale, spatially varying components. The error structure is thus often thought of as the local structure, and the mean is referred to as the global structure. With increasing complexity of the mean model $x'(s)\beta$, for example as higher-order terms are added to a response surface, the mean will be more spatially variable and more localized. In a two-way row-column layout (randomized block design) where rows and columns interact, one could model the data as

$$Z_{ij} = \mu + \alpha_i + \beta_j + \gamma s_1 s_2 + e_{ij}, \qquad e_{ij} \sim \text{iid}(0, \sigma^2),$$

where $\alpha_i$ denotes row effects, $\beta_j$ column effects, and $s_1$, $s_2$ are the cell coordinates. This model assumes that the term $\gamma s_1 s_2$ removes any residual spatial autocorrelation, hence the errors $e_{ij}$ are uncorrelated. Alternatively, one could invoke the model

$$Z_{ij} = \mu + \alpha_i + \beta_j + \delta_{ij},$$

where the $\delta_{ij}$ are autocorrelated. One modeler's reactive effect will be another modeler's interactive effect.
With geostatistical data the spatial dependency between $\delta(s_i)$ and $\delta(s_j)$ (the interaction) is modeled through the semivariogram or covariogram of the $\delta(\cdot)$ process. If the spatial domain is discrete (lattice data), modifications are necessary, since $W(s)$ and $\eta(s)$ in decomposition [9.26] are smooth small-scale stationary processes with a continuous domain. As before, reactive effects can be modeled as effects on the mean structure through regressor variables in $x(s)$. Interactions between sites can be incorporated into the model in the following manner. The response $Z(s_i)$ at location $s_i$ is decomposed into three components: (i) the mean $\mu(s_i)$, (ii) a contribution from the neighboring observations, and (iii) random error. Mathematically, we can express the decomposition as

$$Z(s_i) = \mu(s_i) + \delta^*(s_i) = \mu(s_i) + \sum_{j=1}^{n} b_{ij}\{Z(s_j) - \mu(s_j)\} + e(s_i). \qquad [9.27]$$

The contribution to $Z(s_i)$ made by other sites is a linear combination of residuals at other locations. By convention we put $b_{ii} = 0$ in [9.27]. The $e(s_i)$ are uncorrelated random errors with mean $0$ and variance $\sigma^2_i$. If all $b_{ij} = 0$ and $\mu(s_i) = x'(s_i)\beta$, the model reduces to $Z(s_i) = x'(s_i)\beta + e(s_i)$, a standard linear regression model. The interaction coefficients $b_{ij}$ contain information about the strength of the dependence between sites $Z(s_i)$ and $Z(s_j)$. Since $\sum_{j=1}^n b_{ij}\{Z(s_j) - \mu(s_j)\}$ is a function of random variables, it can be considered part of the error process. In a model for lattice data it can be thought of as replacing the smooth small-scale random function $W(s)$ in [9.26]. Model [9.27] is the spatial equivalent of an autoregressive time series model, where the current value in the series depends on previous values. In the spatial case we potentially let $Z(s_i)$ depend on all other sites, since space is not directed. More precisely, model [9.27] is the spatial equivalent of a simultaneous time series model, hence the denomination as a Simultaneous Spatial Autoregressive (SSAR) model. We discuss SSAR models and a further class of interaction models for lattice data, the Conditional Spatial Autoregressive (CSAR) models, in §9.6.
Depending on whether the spatial process has a continuous or discrete domain, we now have two types of mean models. Let $Z(s) = [Z(s_1), \ldots, Z(s_n)]'$ be the $(n \times 1)$ vector of the attribute $Z$ at all observed locations, $X(s)$ the $(n \times k)$ regressor matrix

$$X(s) = \begin{bmatrix} x'(s_1) \\ x'(s_2) \\ \vdots \\ x'(s_n) \end{bmatrix},$$

and $\delta(s) = [\delta(s_1), \ldots, \delta(s_n)]'$ the vector of errors in the mean model for geostatistical data. The model can be written as

$$Z(s) = X(s)\beta + \delta(s), \qquad [9.28]$$

where $E[\delta(s)] = 0$ and the variance-covariance matrix of $\delta(s)$ contains the covariance function of the $\delta(\cdot)$ process. If $\text{Cov}[\delta(s_i), \delta(s_j)] = C(s_i - s_j; \theta)$, then

$$\text{Var}[\delta(s)] = \Sigma(\theta) = \begin{bmatrix}
C(0;\theta) & C(s_1 - s_2;\theta) & \cdots & C(s_1 - s_n;\theta) \\
C(s_2 - s_1;\theta) & C(0;\theta) & \cdots & C(s_2 - s_n;\theta) \\
\vdots & & \ddots & \vdots \\
C(s_n - s_1;\theta) & \cdots & C(s_n - s_{n-1};\theta) & C(0;\theta)
\end{bmatrix}.$$

Note that $\Sigma(\theta)$ is also the variance-covariance matrix of $Z(s)$. Unknown quantities in this model are the vector of fixed effects $\beta$ in the mean function and the vector $\theta$ in the covariance function. Now consider the SSAR model for lattice data ([9.27]). Collecting the autoregressive coefficients $b_{ij}$ into a matrix $B$ and assuming $\mu(s) = X(s)\beta$, [9.27] can be written as

$$Z(s) = X(s)\beta + B\{Z(s) - X(s)\beta\} + e(s) \iff (I - B)\{Z(s) - X(s)\beta\} = e(s). \qquad [9.29]$$

It follows that $\text{Var}[Z(s)] = (I - B)^{-1}\text{Var}[e(s)](I - B')^{-1}$. In applications of the SSAR model it is often assumed that the errors are homoscedastic with variance $\sigma^2$. Then $\text{Var}[Z(s)] = \sigma^2 (I - B)^{-1}(I - B')^{-1}$. Parameters of the SSAR model are the vector of fixed effects $\beta$, the residual variance $\sigma^2$, and the entries of the matrix $B$. While the covariance matrix $\Sigma(\theta)$ depends on only a few parameters with geostatistical data (nugget, sill, range), the matrix $B$ can contain many unknowns. It is not even required that $B$ be symmetric, only that $I - B$ be invertible. For purposes of parameter estimation it is thus necessary to place some structure on $B$ to reduce the number of unknowns. For example, one can put $B = \rho W$, where $W$ is a matrix selected by the user that identifies which sites are spatially connected, and the parameter $\rho$ determines the strength of the spatial dependence. Table 9.2 summarizes the key differences between the mean models for geostatistical and lattice data. How to structure the matrix $B$ in the SSAR model and the corresponding matrix in the CSAR model is discussed in §9.6.

Table 9.2. Mean models for geostatistical and lattice data (with $\text{Var}[e(s)] = \sigma^2 I$)

                          Geostatistical Data               Lattice Data
Model                     $Z(s) = X(s)\beta + \delta(s)$    $Z(s) = X(s)\beta + B\{Z(s) - X(s)\beta\} + e(s)$
$E[Z(s)]$                 $X(s)\beta$                       $X(s)\beta$
$\text{Var}[Z(s)]$        $\Sigma(\theta)$                  $\sigma^2(I - B)^{-1}(I - B')^{-1}$
Mean parameters           $\beta$                           $\beta$
Dependency parameters     $\theta$                          $\sigma^2$, $B$
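With $B = \rho W$, the lattice-data column of Table 9.2 can be evaluated directly. The sketch below does so for a hypothetical $2 \times 2$ lattice with row-standardized rook weights; $\rho$ and $\sigma^2$ are placeholder values.

import numpy as np

# row-standardized rook weights for a 2 x 2 lattice,
# sites ordered (1,1), (1,2), (2,1), (2,2)
W = np.array([[0.0, 0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0, 0.5],
              [0.5, 0.0, 0.0, 0.5],
              [0.0, 0.5, 0.5, 0.0]])
rho, sigma2 = 0.6, 1.0                  # hypothetical parameters

B = rho * W
IBinv = np.linalg.inv(np.eye(4) - B)    # (I - B)^{-1}
V = sigma2 * IBinv @ IBinv.T            # Var[Z(s)] = sigma2 (I-B)^{-1}(I-B')^{-1}
print(np.round(V, 3))

Note that although only two parameters ($\rho$, $\sigma^2$) drive this model, the implied covariances depend on the neighborhood configuration of each site pair, unlike the stationary covariogram of the geostatistical model.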

9.4 Spatial Prediction and the Kriging Paradigm

9.4.1 Motivation of the Prediction Problem


Box 9.6 Prediction vs. Estimation

• Prediction is the determination of the value of a random variable. Estimation is the determination of the value of an unknown constant.

• If interest lies in the value of a random field at location s, then we should predict $Z(s)$ or the signal $S(s)$. If the average value at location s across all realizations of the random experiment is of interest, we should estimate $E[Z(s)]$.



• Kriging methods are solutions to the prediction problem where require-
ments for the predictor are combined with assumptions about the spatial
model.

A common goal in the analysis of geostatistical data is the mapping of the random function $Z(s)$ in some region of interest. The sampling process produces observations $Z(s_1), \ldots, Z(s_n)$, but $Z(s)$ varies continuously through the domain $D$. To produce a map of $Z(s)$ requires prediction of $Z(\cdot)$ at unobserved locations $s_0$. What is commonly referred to as the geostatistical method consists of the following steps (at least the first six).

1. Using exploratory techniques, prior knowledge, and/or anything else, posit a model of possibly nonstationary mean plus second-order or intrinsically stationary error for the $Z(s)$ process that generated the data.
2. Estimate the mean function by ordinary least squares, smoothing, or median polishing to detrend the data. If the mean is stationary this step is not necessary. The methods for detrending employed at this step usually do not take autocorrelation into account.
3. Using the residuals obtained in step 2 (or the original data if the mean is stationary), fit a semivariogram model $\gamma(h;\theta)$ by one of the methods in §9.2.4.
4. With statistical estimates of the spatial dependence in hand (from step 3), return to step 2 to re-estimate the parameters of the mean function, now taking the spatial autocorrelation into account.
5. Obtain new residuals from step 4 and iterate steps 2 through 4, if necessary.
6. Predict the attribute $Z(\cdot)$ at unobserved locations and calculate the corresponding mean square prediction errors.

If the mean is stationary, or if the steps of detrending the data and subsequent estimation of the semivariogram (or covariogram) are not iterated, the geostatistical method consists of only steps 1, 2, 3, and 6. This section is concerned with the last item in this process, the prediction of the attribute (mapping).
Understanding the difference between predicting $Z(s_0)$, which is a random variable, and estimating the mean of $Z(s_0)$, which is a constant, is essential to gain an appreciation for the geostatistical methods employed to that end. To motivate the problem of spatial prediction, focus first on a classical linear model $Y = X\beta + e$, $e \sim (0, \sigma^2 I)$. What do we mean by a predicted value at a regressor value $x_0$? We can think of a large number of possible outcomes that share the same set of regressors $x_0$ and average their response values. This average is an estimate of the mean of $Y$ at $x_0$, $E[Y|x_0]$. Once we have fitted the model to data and obtained estimates $\hat\beta$, the obvious estimate of this quantity is $x_0'\hat\beta$. What if a predicted value is interpreted as the response of the next observation that has regressors $x_0$? This definition appeals not to infinitely many observations at $x_0$, but to a single one. Rather than predicting the expected value of $Y$, $Y$ itself is then of interest. In the spatial context, imagine that $Z(s)$ is the soil loss potential at location $s$ in a particular field. If an agronomist is interested in the soil loss potential of a large number of fields with properties similar to the sampled one, the important quantity would be $E[Z(s)]$. An agronomist interested in the soil loss potential of the sampled field at a location not contained in the sample would want to predict $Z(s_0)$, the actual soil loss potential at location $s_0$. Returning to the classical linear model example, it turns out that, regardless of whether interest is in a single value or an average, the predictor of $Y|x_0$ and the estimator of $E[Y|x_0]$ are the same:

$$\hat{E}[Y|x_0] = x_0'\hat\beta, \qquad \hat{Y}|x_0 = x_0'\hat\beta.$$

The difference between the two predictions does not lie in the predicted value, but in their precision (standard errors). Standard linear model theory, where $\beta$ is estimated by ordinary least squares, instructs that

$$\text{Var}\left[\hat{E}[Y|x_0]\right] = \text{Var}\left[x_0'\hat\beta\right] = \sigma^2 x_0'(X'X)^{-1}x_0 \qquad [9.30]$$

is the variance of the predicted mean at $x_0$. When predicting random variables one considers the variance of the prediction error $Y|x_0 - \hat{Y}|x_0$, to take into account the variability of the new observation. If the new observation $Y|x_0$ is uncorrelated with $Y$, then

$$\text{Var}\left[Y|x_0 - \hat{Y}|x_0\right] = \text{Var}[Y|x_0] + \sigma^2 x_0'(X'X)^{-1}x_0. \qquad [9.31]$$

Although the same formula ($x_0'\hat\beta$) is used for both predictions, the uncertainty associated with predicting a random variable ([9.31]) exceeds the uncertainty in predicting the mean ([9.30]). To consider the variance of the prediction error in one case and the variance of the predictor in the other is not arbitrary. Suppose some quantity $U$ is to be predicted and we use some function $f(Y)$ of the data as the predictor. If $E[U] = E[f(Y)]$, the mean square prediction error is

$$MSE[U, f(Y)] = E\left[(U - f(Y))^2\right] = \text{Var}[U - f(Y)] = \text{Var}[U] + \text{Var}[f(Y)] - 2\text{Cov}[U, f(Y)]. \qquad [9.32]$$

In the standard linear model, estimating $E[Y|x_0]$ and predicting $Y|x_0$ correspond to the following:

Target $U$   | $f(Y)$          | $\text{Var}[U]$     | $\text{Var}[f(Y)]$           | $\text{Cov}[U, f(Y)]$ | $MSE[U, f(Y)]$
$E[Y|x_0]$   | $x_0'\hat\beta$ | $0$                 | $\sigma^2 x_0'(X'X)^{-1}x_0$ | $0$                   | $\sigma^2 x_0'(X'X)^{-1}x_0$
$Y|x_0$      | $x_0'\hat\beta$ | $\text{Var}[Y|x_0]$ | $\sigma^2 x_0'(X'X)^{-1}x_0$ | $0^\dagger$           | $\text{Var}[Y|x_0] + \sigma^2 x_0'(X'X)^{-1}x_0$

$^\dagger$ provided $Y|x_0$ is independent of the observed vector $Y$ on which the estimate $\hat\beta$ is based.

The variance formulas [9.30] and [9.31] are mean square errors, well known from linear model theory for uncorrelated data. Under these conditions it turns out that $\hat{Y}|x_0 = x_0'\hat\beta$ is the best linear unbiased predictor of $Y|x_0$ and that $\hat{E}[Y|x_0] = x_0'\hat\beta$ is the best linear unbiased estimator of $E[Y|x_0]$. Expression [9.31] applies only, however, if $Y|x_0$ and $Y$ are not correlated.
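A small numerical illustration of the difference between [9.30] and [9.31] for an ordinary least squares fit follows; the design, $\sigma^2$, and $x_0$ are hypothetical.

import numpy as np

x = np.linspace(0.0, 10.0, 21)
X = np.column_stack([np.ones_like(x), x])   # simple linear regression
sigma2 = 4.0                                # hypothetical error variance
x0 = np.array([1.0, 5.0])                   # intercept and x = 5

XtX_inv = np.linalg.inv(X.T @ X)
var_mean = sigma2 * x0 @ XtX_inv @ x0       # [9.30]: variance of estimated mean
var_pred = sigma2 + var_mean                # [9.31]: here Var[Y|x0] = sigma2
print(var_mean, var_pred)

The gap between the two numbers is exactly $\text{Var}[Y|x_0] = \sigma^2$, the variability of the new observation.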
Spatial data exhibit spatial autocorrelation that is a function of the proximity of observations. Denote by $Z(s_1), \ldots, Z(s_n)$ the attribute at the observed locations $s_1, \ldots, s_n$ and by $s_0$ the target location where prediction is desired. If the observations are spatially correlated, then $Z(s_0)$ is also correlated with the observations, unless the target location $s_0$ is further removed from the observed locations than the spatial range (Figure 9.20). We must then ask which function of the data best predicts $Z(s_0)$ and how to measure the mean square prediction error. In order to solve this problem we need to define what best means. Kriging methods are solutions to the prediction problem where a predictor is best if it (i) minimizes the mean square prediction error, (ii) is linear in the observed values $Z(s_1), \ldots, Z(s_n)$, and (iii) is unbiased in the sense that the mean of the predicted value at $s_0$ equals the mean of $Z(s_0)$. There are many variants of the kriging method (Table 9.3), and their combinations create a stunning array of techniques (e.g., universal block kriging, lognormal cokriging, …).

[Figure 9.20 appears here; axes are the X- and Y-coordinates of the sampled region.]

Figure 9.20. Observed sample locations (dots). Crosses denote target locations for prediction. Strength of correlation between observations at target locations and at observed locations depends on the distance between dots and crosses.

Table 9.3. Different kriging methods and commonly encountered names

Distinguished by                                  Method known as
Size of target
  Points                                          (Point) Kriging
  Areas                                           Block Kriging
What is known
  $\mu$ known                                     Simple Kriging
  $\mu$ unknown but constant                      Ordinary Kriging
  $\mu = E[Z(s)] = X\beta$, $\beta$ unknown       Universal Kriging
Distribution
  $Z(s)$ Gaussian                                 Kriging
  $\ln Z(s)$ Gaussian                             Lognormal Kriging
  $\varphi(Z(s))$ Gaussian                        Transgaussian Kriging
  $Z(s)$ an indicator variable                    Indicator Kriging
  Gaussian with isolated outliers                 Robust Kriging
  Unknown                                         Median Polish Kriging
  $\mu(s)$ itself a random process                Bayesian Kriging
Number of attributes
  Single attribute $Z(s)$                         Kriging
  Multiple attributes $Z_1(s), \ldots, Z_k(s)$    Cokriging
Linearity
  Predictor linear in $Z(s)$                      Kriging
  Predictor linear in functions of $Z(s)$         Disjunctive Kriging



The term kriging was coined by Matheron (1963), who named the method after the South African mining engineer D.G. Krige. Cressie (1993, p. 119) points out that kriging is used both as a noun and a verb. As a noun, kriging implies optimal prediction of $Z(s_0)$; as a verb it implies optimally predicting $Z(s_0)$. Ordinary and universal kriging are the most elementary variants and those most frequently applied in agricultural practice. They are discussed in §9.4.3. In the literature kriging is often referred to as an optimal prediction method, implying that a kriging predictor beats any other predictor. This is not necessarily true. If, for example, the discrepancy between the target $Z(s_0)$ and the predictor $p(Z; s_0)$ is measured as $|Z(s_0) - p(Z; s_0)|$, kriging predictors are not best in the sense of minimizing the average discrepancy. If $Z(s)$ is not a Gaussian random field, the kriging predictor is not optimal unless further restrictions are imposed. It is thus important to understand the conditions under which kriging methods are best, to avoid overstating their faculties.

9.4.2 The Concept of Optimal Prediction


Box 9.7 The Optimal Predictor

• To find the optimal predictor $p(Z; s_0)$ for $Z(s_0)$ requires a measure for the loss incurred by using $p(Z; s_0)$ for prediction at $s_0$. Different loss functions result in different best predictors.

• The loss function of greatest importance in statistics is squared-error loss, $\{Z(s_0) - p(Z; s_0)\}^2$. Its expected value is the mean square prediction error (MSPE).

• The predictor that minimizes the mean square prediction error is the conditional mean $E[Z(s_0)|Z(s)]$. If the random field is Gaussian, the conditional mean is linear in the observed values.

When a statistical method is labeled as optimal or best, we need to inquire under which conditions optimality holds; there are few methods that are uniformly best. The famous pooled $t$-test, for example, is a uniformly most powerful test, but only if uniformly means among all tests for comparing the means of two Gaussian populations with common variance. If $Z(s_0)$ is the target quantity to be predicted at location $s_0$, a measure of the loss incurred by using some predictor $p(Z; s_0)$ for $Z(s_0)$ is required (we use $p(Z; s_0)$ as a shortcut for $p(Z(s); s_0)$). The most common loss function in statistics is squared error loss

$$\{Z(s_0) - p(Z; s_0)\}^2 \qquad [9.33]$$

because of its mathematical tractability and simple interpretation. But [9.33] is not directly useful, since it is a random quantity that depends on unknowns. Instead, we consider its average, $E[\{Z(s_0) - p(Z; s_0)\}^2]$, the mean square error of using $p(Z; s_0)$ to predict $Z(s_0)$. This expected value is also called the Bayes risk under squared error loss. If squared error loss is accepted as the suitable loss function, then among all possible predictors one should choose the one that minimizes the Bayes risk. This turns out to be the conditional expectation $p_0(Z; s_0) = E[Z(s_0) \,|\, Z(s)]$. The minimized mean square prediction error (MSPE) then takes on the following, surprising form:



$$E\left[\{Z(s_0) - p_0(Z; s_0)\}^2\right] = \text{Var}[Z(s_0)] - \text{Var}[p_0(Z; s_0)]. \qquad [9.34]$$

This is a stunning result, since variances are usually added, not subtracted. From [9.34] it is immediately obvious that the conditional mean must be less variable than the random field at $s_0$, because the mean square error is a positive quantity. Perhaps even more surprising is the fact that the MSPE is small if the variance of the predictor is large. Consider a time series where the value of the series at time $20$ is to be predicted (Figure 9.21). Three different types of predictors are used: the sample average $\bar{y}$ and two nonparametric fits that differ in their smoothness. The sample mean $\bar{y}$ is the smoothest of the three predictors, since it does not change with time. The loess fit with large bandwidth (dashed line) is less smooth than $\bar{y}$ and smoother than the loess fit with small bandwidth (solid line). The less smooth the predictor, the greater its variability and the more closely it follows the data. The chance that a smooth predictor is close to the unknown observation at $t = 20$ is smaller than for one of the more variable (jagged) predictors. The most variable predictor is one that interpolates the data points (connects the dots). Such a predictor is said to honor the data or to be a perfect interpolator. In the absence of measurement error the classical kriging predictors have precisely this property of interpolating the observed data points (see §A9.9.5).

[Figure 9.21 appears here; vertical axis $Z(t)$, horizontal axis time $t$.]

Figure 9.21. Prediction of a target point at $t = 20$ (circle) in a time series of length $40$. The predictors are loess smooths with small (irregular solid line) and large bandwidth (dashed line) as well as the arithmetic average $\bar{y}$ (horizontal line).

Although the conditional mean $E[Z(s_0) \,|\, Z(s)]$ is the optimal predictor of $Z(s_0)$ under squared error loss, it is not the predictor usually applied. $E[Z(s_0) \,|\, Z(s)]$ can be a complicated nonlinear function of the observations. A notable exception occurs when $Z(s)$ is a Gaussian random field and $[Z(s_0), Z(s)]$ are jointly multivariate Gaussian. Define the following quantities

$$E[Z(s)] = \mu(s), \qquad E[Z(s_0)] = \mu(s_0),$$
$$\text{Var}[Z(s)] = \Sigma, \qquad \text{Var}[Z(s_0)] = \sigma^2, \qquad [9.35]$$
$$\text{Cov}[Z(s_0), Z(s)] = c$$

for a Gaussian random field. The joint distribution of $Z(s_0)$ and $Z(s)$ then can be written as

$$\begin{bmatrix} Z(s_0) \\ Z(s) \end{bmatrix} \sim G\left( \begin{bmatrix} \mu(s_0) \\ \mu(s) \end{bmatrix}, \begin{bmatrix} \sigma^2 & c' \\ c & \Sigma \end{bmatrix} \right).$$

Recalling results from §3.7, the conditional distribution of $Z(s_0)$ given $Z(s)$ is univariate Gaussian with mean $\mu(s_0) + c'\Sigma^{-1}\{Z(s) - \mu(s)\}$ and variance $\sigma^2 - c'\Sigma^{-1}c$. The optimal predictor under squared error loss is thus

$$p_0(Z; s_0) = E[Z(s_0) \,|\, Z(s)] = \mu(s_0) + c'\Sigma^{-1}\{Z(s) - \mu(s)\}. \qquad [9.36]$$

This important expression is worthy of some comments. First, the conditional mean is a linear function of the observed data $Z(s)$. Evaluating the statistical properties of the predictor, such as its mean and variance, is thus simple:

$$E[p_0(Z; s_0)] = \mu(s_0), \qquad \text{Var}[p_0(Z; s_0)] = c'\Sigma^{-1}c.$$

Second, the predictor is a perfect interpolator. Assume you wish to predict at all observed locations. To this end, replace in [9.36] $\mu(s_0)$ with $\mu(s)$ and $c'$ with $\Sigma$. The predictor becomes

$$p_0(Z; s) = \mu(s) + \Sigma\Sigma^{-1}\{Z(s) - \mu(s)\} = Z(s).$$

Third, imagine that $Z(s_0)$ and $Z(s)$ are not correlated. Then $c = 0$ and [9.36] reduces to $p_0(Z; s_0) = \mu(s_0)$, the (unconditional) mean at the unsampled location. One interpretation of the optimal predictor is to consider $c'\Sigma^{-1}\{Z(s) - \mu(s)\}$ as the adjustment to the unconditional mean that draws on the spatial autocorrelation between attributes at the unsampled and sampled locations. If $Z(s_0)$ is correlated with observations nearby, then using the information from other locations strengthens our ability to make predictions about the value at the new location (since the MSPE if $c = 0$ is $\sigma^2$). Fourth, the variance of the conditional distribution equals the mean square prediction error.

Two important questions arise. Do the simple form and appealing properties of the best predictor prevail if the random field is not Gaussian? If the means $\mu(s)$ and $\mu(s_0)$ are unknown, are predictors of the form

$$\hat\mu(s_0) + c'\Sigma^{-1}\{Z(s) - \hat\mu(s)\}$$

still best in some sense? To answer these questions we now relate the decision-theoretic setup in this subsection to the basic kriging methods.
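The computations behind [9.34] and [9.36] reduce to a few linear solves. The sketch below evaluates the conditional-mean predictor and its MSPE for a known constant mean; the exponential covariance and all parameter values are hypothetical.

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(4)
coords = rng.uniform(0, 100, size=(30, 2))   # observed locations
s0 = np.array([[50.0, 50.0]])                # prediction location

def cov_exp(h, sill=2.0, prange=40.0):
    """Exponential covariogram C(h) = sill * exp(-3h/prange)."""
    return sill * np.exp(-3.0 * h / prange)

Sigma = cov_exp(cdist(coords, coords))       # Var[Z(s)]
c = cov_exp(cdist(coords, s0)).ravel()       # Cov[Z(s0), Z(s)]
sigma2 = 2.0                                 # Var[Z(s0)] = sill

mu = 10.0                                    # known constant mean
z = rng.multivariate_normal(np.full(30, mu), Sigma)  # one realization

w = np.linalg.solve(Sigma, c)                # Sigma^{-1} c
p0 = mu + w @ (z - mu)                       # conditional mean [9.36]
mspe = sigma2 - c @ w                        # sigma^2 - c' Sigma^{-1} c, cf. [9.34]
print(p0, mspe)

Moving $s_0$ far from all observed sites drives $c$ toward $0$, the predictor toward $\mu$, and the MSPE toward $\sigma^2$, as noted above.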

9.4.3 Ordinary and Universal Kriging


Box 9.8 Kriging and Best Prediction

• Kriging predictors are the best linear unbiased predictors under squared
error loss.

• Simple, ordinary, and universal kriging differ in their assumptions about


the mean structure $\mu(s)$ of the spatial model.



The classical kriging techniques are methods for predicting $Z(s_0)$ that combine assumptions about the spatial model with requirements for the predictor $p(Z; s_0)$. The usual set of requirements is:

(i) $p(Z; s_0)$ is a linear combination of the observed values $Z(s_1), \ldots, Z(s_n)$;
(ii) $p(Z; s_0)$ is unbiased in the sense that $E[p(Z; s_0)] = E[Z(s_0)]$;
(iii) $p(Z; s_0)$ minimizes the mean square prediction error.

Requirement (i) states that the predictors have the general form

$$p(Z; s_0) = \sum_{i=1}^n \lambda(s_i) Z(s_i), \qquad [9.37]$$

where $\lambda(s_i)$ is a weight associated with the observation at location $s_i$. Relative to the other weights, $\lambda(s_i)$ determines how much the observation $Z(s_i)$ contributes to the predicted value at location $s_0$. To satisfy requirements (ii) and (iii) the weights are chosen to minimize

$$E\left[\left\{Z(s_0) - \sum_{i=1}^n \lambda(s_i)Z(s_i)\right\}^2\right]$$

subject to certain constraints that guarantee unbiasedness. These constraints depend on the model assumptions. The three basic kriging methods, simple, ordinary, and universal kriging, are distinguished according to the mean structure of the spatial model $Z(s) = \mu(s) + \delta(s)$ (Table 9.4).

Table 9.4. Simple, ordinary, and universal kriging model assumptions

Method              Assumption about $\mu(s)$                 $\delta(s)$
Simple Kriging      $\mu(s)$ is known                         second-order or intrinsically stationary
Ordinary Kriging    $\mu(s) = \mu$, $\mu$ unknown             second-order or intrinsically stationary
Universal Kriging   $\mu(s) = x'(s)\beta$, $\beta$ unknown    second-order or intrinsically stationary

Simple Kriging

The solution to this minimization problem when $\mu(s)$ (and thus $\mu(s_0)$) is known is called the simple kriging predictor (Matheron 1971),

$$p_{SK}(Z; s_0) = \mu(s_0) + c'\Sigma^{-1}\{Z(s) - \mu(s)\}. \qquad [9.38]$$

The details of the derivation can be found in Cressie (1993, p. 109) and in our §A9.9.4. Note that $\mu(s_0)$ in [9.38] is a scalar, while $Z(s)$ and $\mu(s)$ are vectors (see [9.35] for definitions). The simple kriging predictor is unbiased, since $E[p_{SK}(Z; s_0)] = \mu(s_0) = E[Z(s_0)]$, and bears a striking resemblance to the conditional mean under Gaussianity ([9.36]). Simple kriging is thus the optimal method of spatial prediction (under squared error loss) in a Gaussian random field, since $p_{SK}(Z; s_0)$ equals the conditional mean. No other predictor then has a smaller mean square prediction error, not even when nonlinear functions of the data are considered. If the random field is not Gaussian, $p_{SK}(Z; s_0)$ is no longer best in that sense; it is best only among all predictors that are linear in the data and unbiased. It is the best linear unbiased predictor (BLUP). The minimized mean square prediction error of an unbiased kriging predictor is often called the kriging variance or the kriging error. It is easy to establish that the kriging variance for the simple kriging predictor is

$$\sigma^2_{SK}(s_0) = \sigma^2 - c'\Sigma^{-1}c, \qquad [9.39]$$

where $\sigma^2$ is the variance of the random field at $s_0$. We assume here that the random field is second-order stationary so that $\text{Var}[Z(s)] = \text{Var}[Z(s_0)] = \sigma^2$ and the covariance function exists (otherwise $C(h)$ is a nonexistent parameter and [9.38], [9.39] should be expressed in terms of the semivariogram).

Simple kriging is useful in that it determines the benchmark for other kriging methods. The assumption that the mean is known everywhere is not tenable in most applications. An exception is the kriging of residuals from a fit of the mean function. If the mean model is correct, the residuals have a known, zero mean. How much is lost by estimating an unknown mean can be inferred by comparing the simple kriging variance [9.39] with similar expressions for the methods that follow.

Universal and Ordinary Kriging

Universal and ordinary kriging have in common that the mean of the random field is not known and can be expressed by a linear model. The more general case is $\mu(s) = x'(s)\beta$, where the mean is a linear regression on some regressor variables $x(s)$. We keep the argument $(s)$ to underline that the regressor variables will often be the spatial coordinates themselves or functions thereof. In many cases $x(s)$ consists only of spatial coordinates (apart from an intercept). As a special case we can assume that the mean of the random field does not change with spatial location but is unknown. Then replace $x'(s)$ with $1$ and $\beta$ with $\mu$, the unknown mean. The latter simplification gives rise to the ordinary kriging predictor. It is the predictor

$$p_{OK}(Z; s_0) = \sum_{i=1}^n \lambda_{OK}(s_i) Z(s_i)$$

that minimizes the mean square prediction error subject to an unbiasedness constraint. This constraint can be found by noticing that $E[p_{OK}(Z; s_0)] = E[\sum_{i=1}^n \lambda_{OK}(s_i)Z(s_i)] = \sum_{i=1}^n \lambda_{OK}(s_i)\mu$, which must equal $\mu$ for $p_{OK}(Z; s_0)$ to be unbiased. As a consequence the weights must sum to one. This does not imply, by the way, that kriging weights are positive.

If the mean of the random field is $\mu(s) = x'(s)\beta$, it is not sufficient to require that the kriging weights sum to one. Instead we need

$$E\left[\sum_{i=1}^n \lambda_{UK}(s_i)Z(s_i)\right] = \sum_{i=1}^n \lambda_{UK}(s_i)\,x'(s_i)\beta = x'(s_0)\beta.$$

Using matrix/vector notation this constraint can be expressed more elegantly. Write the universal kriging model as $Z(s) = X(s)\beta + \delta(s)$ and the predictor as

$$p_{UK}(Z; s_0) = \sum_{i=1}^n \lambda_{UK}(s_i)Z(s_i) = \lambda'Z(s),$$

where $\lambda$ is the vector of (universal) kriging weights. For $p_{UK}(Z; s_0)$ to be unbiased we need $\lambda'X = x'(s_0)$.

Minimization of

$$E\left[\left\{Z(s_0) - \sum_{i=1}^n \lambda_{OK}(s_i)Z(s_i)\right\}^2\right] \quad \text{subject to} \quad \lambda'1 = 1$$

to find the ordinary kriging weights, and of

$$E\left[\left\{Z(s_0) - \sum_{i=1}^n \lambda_{UK}(s_i)Z(s_i)\right\}^2\right] \quad \text{subject to} \quad \lambda'X = x'(s_0)$$

to derive the universal kriging weights, is a constrained optimization problem. It can be solved as an unconstrained minimization problem using one (OK) or more (UK) Lagrange multipliers (see §A9.9.4 for details and derivations). The resulting predictors can be expressed in numerous ways. We prefer

$$p_{UK}(Z; s_0) = x'(s_0)\hat\beta_{GLS} + c'\Sigma^{-1}\left\{Z(s) - X\hat\beta_{GLS}\right\}, \qquad [9.40]$$

where $\hat\beta_{GLS}$ is the generalized least squares estimator

$$\hat\beta_{GLS} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}Z(s). \qquad [9.41]$$

As a special case of [9.40], where $x = 1$, the ordinary kriging predictor is obtained:

$$p_{OK}(Z; s_0) = \hat\mu + c'\Sigma^{-1}\{Z(s) - 1\hat\mu\}. \qquad [9.42]$$

Here, $\hat\mu$ is the generalized least squares estimator of the mean,

$$\hat\mu = (1'\Sigma^{-1}1)^{-1}1'\Sigma^{-1}Z(s) = \frac{1'\Sigma^{-1}Z(s)}{1'\Sigma^{-1}1}. \qquad [9.43]$$

Comparing [9.40] with [9.36], the optimal predictor in a Gaussian random field, we again notice a striking resemblance. The question raised at the end of the previous subsection, about the effect of substituting an estimate $\hat\mu(s_0)$ for the unknown mean, can now be answered. Substituting an estimate retains certain best properties of the linear predictor. It remains unbiased, provided the estimate for the mean is unbiased and the model for the mean is correct. It remains an exact interpolator, and the predictor has the form of an estimate of the mean $(x'(s_0)\hat\beta)$ adjusted by surrounding values, with adjustments depending on the strength of the spatial correlation ($c$, $\Sigma$). Kriging predictors are obtained only if the generalized least squares estimates [9.41] (or [9.43] for OK) are substituted, however.
In the formulations [9.40] and [9.42] it is not immediately obvious what the kriging weights are. Some algebra leads to

$$p_{UK}(Z; s_0) = \lambda'_{UK}Z(s) = \left[c + X(X'\Sigma^{-1}X)^{-1}\{x(s_0) - X'\Sigma^{-1}c\}\right]'\Sigma^{-1}Z(s)$$
$$p_{OK}(Z; s_0) = \lambda'_{OK}Z(s) = \left[c + 1(1'\Sigma^{-1}1)^{-1}\{1 - 1'\Sigma^{-1}c\}\right]'\Sigma^{-1}Z(s). \qquad [9.44]$$

The kriging variances are calculated as

$$\sigma^2_{UK} = \sigma^2 - c'\Sigma^{-1}c + \{x(s_0) - X'\Sigma^{-1}c\}'(X'\Sigma^{-1}X)^{-1}\{x(s_0) - X'\Sigma^{-1}c\}$$
$$\sigma^2_{OK} = \sigma^2 - c'\Sigma^{-1}c + \{1 - 1'\Sigma^{-1}c\}^2 / (1'\Sigma^{-1}1). \qquad [9.45]$$

Compare these expressions to the kriging variance for simple kriging [9.39]. Since

$$\{x(s_0) - X'\Sigma^{-1}c\}'(X'\Sigma^{-1}X)^{-1}\{x(s_0) - X'\Sigma^{-1}c\}$$

is a quadratic form in the positive definite matrix $(X'\Sigma^{-1}X)^{-1}$, it is nonnegative and $\sigma^2_{UK} \geq \sigma^2_{SK}$. The mean square prediction error increases when the unknown mean is estimated from the data. Expressions in terms of semivariances are given in §A9.9.4.
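As a concrete illustration, the sketch below evaluates the ordinary kriging predictor and kriging variance via [9.42], [9.43], and [9.45]; the covariance model (including a nugget) and all data are hypothetical.

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(5)
coords = rng.uniform(0, 100, size=(30, 2))   # observed locations
z = rng.normal(10.0, 1.5, size=30)           # placeholder attribute values
s0 = np.array([[42.0, 58.0]])                # prediction location

def cov_exp(h, nugget=0.3, psill=2.0, prange=35.0):
    """Covariogram of an exponential model with nugget effect."""
    return np.where(h == 0.0, nugget + psill,
                    psill * np.exp(-3.0 * h / prange))

Sigma = cov_exp(cdist(coords, coords))
c = cov_exp(cdist(coords, s0)).ravel()
sigma2 = 0.3 + 2.0                           # Var[Z(s0)] = nugget + psill

one = np.ones(len(z))
Si_c = np.linalg.solve(Sigma, c)             # Sigma^{-1} c
Si_1 = np.linalg.solve(Sigma, one)           # Sigma^{-1} 1

mu_hat = (one @ np.linalg.solve(Sigma, z)) / (one @ Si_1)          # [9.43]
p_ok = mu_hat + c @ np.linalg.solve(Sigma, z - mu_hat * one)       # [9.42]
ok_var = sigma2 - c @ Si_c + (1.0 - one @ Si_c)**2 / (one @ Si_1)  # [9.45]
print(mu_hat, p_ok, ok_var)

The last term of ok_var is the penalty for estimating the unknown mean; dropping it recovers the simple kriging variance [9.39].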

9.4.4 Some Notes on Kriging


Is Kriging Perfect Interpolation?
Consider the decomposition $Z(s) = S(s) + e(s)$, where $S(s)$ is the signal of the process. If $e(s)$ is pure measurement error, then one should predict the signal $S(\cdot)$ rather than the error-contaminated $Z(\cdot)$ process. If the data are affected by measurement error, one is not interested in predicting the erroneous observation, but the amount that is actually there. The controversy whether kriging is a perfect interpolation method (honors the data) is concerned with the nature of the nugget effect as microscale variability or measurement error, and with whether predictions focus on the $Z(\cdot)$ or the $S(\cdot)$ process. In §A9.9.5 kriging of the $Z(\cdot)$ and $S(\cdot)$ processes is compared for semivariograms with and without nugget effect. It is assumed there that $e(s)$ does not contain a microscale variability component, i.e., $\text{Var}[e(s)]$ is made up of measurement error in its entirety. The main findings of the comparison in §A9.9.5 are as follows. In the absence of a nugget effect, predictions of $Z(s)$ and $S(s)$ agree in value and precision at observed and unobserved locations. This is obvious, since then $Z(s) = S(s)$. In a model where the nugget effect is measurement error, predictions of $Z(s)$ and $S(s)$ agree in value at unobserved locations but not in precision; predictions of $Z(s)$ are less precise. At observed locations, predictions of $Z(s)$ honor the data even in the presence of a nugget effect. Predictions of $S(s)$ at observed locations honor the data only in the absence of a nugget effect. So, when is kriging not a perfect interpolator? When predicting the signal at observed locations and the nugget effect contains a measurement error component.

The Cat and Mouse Game of Universal Kriging


In §9.4.3 the universal kriging predictor in the spatial model

$$Z(s) = X(s)\beta + \delta(s)$$

was given in two equivalent ways:

$$p_{UK}(Z; s_0) = \lambda'_{UK}Z(s) = \left[c + X(X'\Sigma^{-1}X)^{-1}\{x(s_0) - X'\Sigma^{-1}c\}\right]'\Sigma^{-1}Z(s)$$
$$= x'(s_0)\hat\beta_{GLS} + c'\Sigma^{-1}\left\{Z(s) - X\hat\beta_{GLS}\right\}, \qquad [9.46]$$

where $\hat\beta_{GLS}$ is the generalized least squares estimator $\hat\beta_{GLS} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}Z(s)$. The covariance matrix $\Sigma$ is usually constructed from a model of the semivariogram, utilizing the simple relationship between covariances and semivariances in second-order stationary random fields. The modeling of the semivariogram requires, however, that the random field be mean stationary, i.e., the absence of large-scale structure. If $E[Z(s)] = x'(s)\beta$, the large-scale trend must be removed before the semivariogram can be modeled. Failure to do so can severely distort the semivariogram. Figure 9.22 shows empirical semivariograms (dots) calculated from two sets of deterministic data. The left panel is the semivariogram of $Z(x) = 1 + 0.5x$, where $x$ is a point on a transect. The right-hand panel is the semivariogram of $Z(x) = 1 + 0.22x + 0.022x^2 - 0.0013x^3$. The power model fits the empirical semivariogram in the left panel very well, and the gaussian model provides a decent fit to the semivariogram of the cubic polynomial. The shapes of the semivariograms are due to trend only, however. There is nothing stochastic about the data. One must not conclude based on these graphs that the process on the left is intrinsically stationary and that the process on the right is second-order stationary.

[Figure 9.22 appears here; two panels of semivariogram versus lag h.]

Figure 9.22. Empirical semivariograms (dots) for evaluations of deterministic (trend-only) functions. A power semivariogram was fitted to the semivariogram on the left and a gaussian model to the one on the right.

Certain types of mean nonstationarity (drift) can be inferred from the semivariogram
(Neuman and Jacobson 1984, Russo and Jury 1987b). A linear drift, for example, causes the
semivariogram to increase as in the left-hand panel of Figure 9.22. Using the semivariogram
as a diagnostic procedure for detecting trends in the mean function is dangerous as the drift-
contaminated semivariogram may suggest a valid theoretical semivariogram model. Kriging a

© 2003 by CRC Press LLC


process with linear drift and exponential semivariogram by ignoring the drift and using a
power semivariogram model is not the appropriate course of action. This issue must not be
confused with the question whether the semivariogram should be defined as
" "
Varc^ asb  ^ as € hbd or Ee^ asb  ^ as € hbf# ‘.
# #
The two definitions are equivalent for stationary processes. Intrinsic or second-order station-
ary require first that the mean is constant. Only if that is the case can the semivariogram (or
covariogram) be examined for signs of spatial stochastic structure in the process.
Trend removal is thus critical prior to the estimation of the semivariogram. But efficient
estimation of the large-scale trend requires knowledge of the variance-covariance matrix Σ, as
can be seen from the formula of the generalized least squares estimator. In short, we need to
know β before estimation of Σ is possible, but efficient estimation of β requires knowledge
of Σ.
One approach out of this quandary is to detrend the data by a method that does not require
Σ, for example, ordinary least squares or median polishing. The residuals from this fit
are then used to estimate the semivariogram, from which Σ̂ is constructed. These steps
should be iterated. Once an estimate of Σ has been obtained, a more efficient estimate of the
trend is garnered by estimated generalized least squares (EGLS) as

$$ \hat\beta_{EGLS} = \left( X'\hat\Sigma^{-1}X \right)^{-1} X'\hat\Sigma^{-1} Z(s). $$

New residuals are obtained and the estimate of the semivariogram is updated. The only downside
of this approach is that the residuals obtained from detrending the data do not behave
exactly as the random component δ(s) of the model does. Assume that the model is initially detrended
by ordinary least squares,

$$ \hat\beta_{OLS} = (X'X)^{-1}X'Z(s), $$

and the fitted residuals δ̂_OLS = Z(s) − Xβ̂_OLS are formed. The error process δ(s) of the
model has mean 0, variance-covariance matrix Σ, and semivariogram γ_δ(h) =
½Var[δ(s) − δ(s+h)]. The vector of fitted residuals also has mean 0, provided the model for
the large-scale structure was correct, but it does not have the proper semivariogram or
variance-covariance matrix. Since δ̂(s) = (I − H)Z(s), where H is the hat matrix H =
X(X'X)^{-1}X', it is established that

$$ \mathrm{Var}\left[\hat\delta(s)\right] = (I - H)\,\Sigma\,(I - H) \ne \Sigma. $$
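The intermediate step compressed in this display is worth making explicit (our added line): because (I − H)X = 0, the residuals are a linear filter of the errors alone,

$$ \hat\delta(s) = (I - H)Z(s) = (I - H)\{X\beta + \delta(s)\} = (I - H)\,\delta(s), $$

so Var[δ̂(s)] = (I − H)Σ(I − H)' follows as the variance of a linear transformation, and (I − H) is symmetric and idempotent.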

The fitted residuals exhibit more negative correlations than the error process, and the estimate
of the semivariogram based on the residuals will be biased. Furthermore, the residuals do not
have the same variance as the δ(s) process. It should be noted that if Σ were known and the
semivariogram were estimated based on GLS residuals δ̂_GLS = Z(s) − Xβ̂_GLS, the semivariogram
estimator would still be biased (see Cressie 1993, p. 166). The bias comes about
because residuals satisfy constraints that are not properties of the error process. For example,
the fitted OLS residuals will sum to zero. The degree to which a semivariogram estimate
derived from fitted residuals is biased depends on the method used for detrending as well as
the method of semivariogram estimation. Since the bias is typically more substantial at large
lags, Cressie (1993, p. 168) reasons that weighted least squares fitting of the semivariogram
model is to be preferred over ordinary least squares fitting because the former places more
weight on small lags (see §9.2.4). In conjunction with choosing the kriging neighborhood
(see below) so that the semivariogram must be evaluated only for small lags, the impact of the
bias on the kriging predictions can be reduced. If the large-scale trend occurs in only one direction,
then the problem of detrending the data can be circumvented by using only pairs in
the direction perpendicular to the trend to model the semivariogram.
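To make the detrend-then-estimate cycle concrete, the following SAS sketch performs one iteration: an OLS fit of a hypothetical planar trend surface, followed by the empirical semivariogram of the residuals. All data set and variable names (field, z, x, y) and the lag settings are placeholders, not taken from the text; a parametric model fitted to the output data set then supplies Σ̂ for the EGLS step.

   /* Step 1: detrend by OLS; a planar trend surface is assumed here */
   proc glm data=field;
      model z = x y x*y;           /* hypothetical large-scale trend */
      output out=resids r=ehat;    /* fitted OLS residuals           */
   run; quit;

   /* Step 2: empirical semivariogram of the OLS residuals */
   proc variogram data=resids outv=gamma;
      compute lagd=5 maxlag=15;    /* lag spacing and number are data-dependent */
      coordinates xc=x yc=y;
      var ehat;
   run;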
The following method is termed "universal kriging" on occasion. Fit the large-scale mean
by ordinary least squares and obtain the residuals δ̂_OLS = Z(s) − Xβ̂_OLS. Estimate the
semivariogram from these residuals and then perform simple kriging (because the residuals
are known to have mean zero) on the residuals. To obtain a prediction of Z(s_0), the OLS estimate
of the mean is added to the kriging prediction of the residual. Formally this approach
can be expressed as

$$ \tilde{p}(Z; s_0) = x_0'\hat\beta_{OLS} + p_{SK}(\hat\delta;\, s_0), \qquad [9.47] $$

where p_SK(δ̂; s_0) is the simple kriging predictor of the residual at location s_0, which uses the
variance-covariance matrix Σ̂* formed from the semivariogram fitted to the residuals. Hence,
p_SK(δ̂; s_0) = c'Σ̂*^{-1}δ̂(s), making use of the fact that E[δ̂(s)] = 0. Comparing the naïve predictor
[9.47] with the universal kriging predictor [9.46], it is clear that the naïve approach is
not equivalent to universal kriging: Σ is not known and the trend is not estimated by GLS.
The predictor [9.47] does have some nice properties, however. It is an unbiased predictor in
the sense that E[p̃(Z; s_0)] = E[Z(s_0)], and it remains a perfect interpolator. In fact, all methods
that are perfect interpolators of the residuals are perfect interpolators of Z(s), even if the trend
model is incorrectly specified. Furthermore, this approach does not require iterations. The
naïve predictor is not a best linear unbiased predictor, however, and the mean square prediction
error should not be calculated by the usual formulas for kriging variances (e.g., [9.45]).
The naïve approach can be generalized in the sense that one might use any suitable method
for trend removal to obtain residuals, krige those, and add the kriged residuals to the estimated
trend. This is the rationale behind median polish kriging, recommended by Cressie
(1986) for random fields with drift to avoid the operational difficulties of universal kriging.
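A sketch of the residual-kriging step in [9.47], continuing the detrending sketch above; note that proc krige2d performs ordinary kriging, which for mean-zero residuals serves as a close operational stand-in for the simple kriging step described here. Data set names, grid, and semivariogram parameters are hypothetical, and the final data step assumes the OLS trend has been evaluated at the same grid points in the same order.

   /* Krige the OLS residuals on a prediction grid */
   proc krige2d data=resids outest=krigres;
      coordinates xcoord=x ycoord=y;
      grid griddata=predlocs xcoord=x ycoord=y;
      predict var=ehat radius=10;
      model form=exponential scale=8 range=4;  /* fit to the residual semivariogram */
   run;

   /* Naive predictor [9.47]: add the OLS trend back to the kriged residual */
   data naive;
      merge krigres trendgrid;      /* trendgrid holds x0'b_OLS at the grid points */
      p_naive = estimate + trend;   /* estimate = kriged residual */
   run;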
It must be noted that these difficulties do not arise in maximum likelihood (ML) or restricted
maximum likelihood (REML) estimation. In §9.2.4, ML and REML for estimating the
covariance parameters of a second-order stationary spatial process were discussed. Now consider
the spatial model

$$ Z(s) = X(s)\beta + \delta(s), $$

where δ(s) is a second-order stationary random field with mean 0 and variance-covariance
matrix Σ(θ). Under Gaussianity, Z(s) ∼ G(X(s)β, Σ(θ)), and estimates of the mean parameters β
and the spatial dependency parameters θ can be obtained simultaneously by maximizing the
likelihood or restricted likelihood of Z(s). In practice one must choose a parametric covariance
model for Σ(θ), which seems to open the same Pandora's box as in the cat and mouse
game of universal kriging. The operational difficulties are minor in the case of ML or REML
estimation, however. Initially, one should estimate β by ordinary least squares and calculate
the empirical semivariogram of the residuals. From a graph of the empirical semivariogram,
possible parametric models for the semivariogram (covariogram) can be determined. In contrast
to the least squares based methods discussed above, one does not estimate θ based on the
empirical semivariogram of the residuals. This is left to the likelihood procedure. The advantage
of ML or REML estimation is to provide estimates of the spatial autocorrelation structure
and the mean parameters simultaneously, where β̂ is adjusted for the spatial correlation and θ̂
is adjusted for the nonconstant mean. Furthermore, such models can be easily fit with the
mixed procedure of The SAS® System. We present applications in §9.8.4 and §9.8.5.

Local Kriging and the Kriging Neighborhood

The kriging predictors

$$ p_{UK}(Z; s_0) = x'(s_0)\hat\beta + c'\Sigma^{-1}\left(Z(s) - X\hat\beta\right) $$
$$ p_{OK}(Z; s_0) = \hat\mu + c'\Sigma^{-1}\left(Z(s) - \mathbf{1}\hat\mu\right) $$
must be calculated for each location at which predictions are desired. Although only the
vector c of covariances between Z(s_0) and Z(s) must be recalculated every time s_0 changes,
even for moderately sized spatial data sets the inversion (and storage) of the matrix Σ is a
formidable problem. A solution to this problem is to consider for prediction of Z(s_0) only observed
data points within a neighborhood of s_0, called the kriging neighborhood. As s_0
changes, this is akin to sliding a window across the domain and excluding all points outside
the window in calculating the kriging predictor. If n(s_0) = 25 points are in the neighborhood
at s_0, then only a (25 × 25) matrix must be inverted. Using a kriging neighborhood rather
than all the data is sometimes referred to as local kriging. It has its advantages and disadvantages.
Among the advantages of local kriging is not only computational efficiency; it might also
be reasonable to assume that the mean is at least locally stationary, even if the mean is globally
nonstationary. This justification is akin to the reasoning behind using local linear regressions
in the nonparametric estimation of complicated trends (see §4.7). Ordinary kriging
performed locally is another approach to avoid the operational difficulties of universal kriging
performed globally. Local kriging essentially assigns kriging weight λ(s_i) = 0 to all
points s_i outside the kriging neighborhood. Since the best linear unbiased predictor is obtained
by allowing all data points to contribute to the prediction of Z(s_0), local kriging predictors
are no longer best. In addition, the user needs to decide on the size and shape of the
kriging neighborhood. This is no trivial task. The optimal kriging neighborhood depends in a
complex fashion on the parameters of the semivariogram, the large-scale trend, and the spatial
configuration of the sampling points.
At first glance it may seem reasonable to define the kriging neighborhood as a circle
around s_0 with radius equal to the range of the semivariogram. This is not a good solution because,
although points farther removed from s_0 than the range are not spatially autocorrelated
with Z(s_0), they are autocorrelated with points that lie within the range of s_0. Chilès and
Delfiner (1999, p. 205) refer to this as the relay effect. A practical solution in our opinion is
to select the radius of the kriging neighborhood as the lag distance up to which the empirical
semivariogram was modeled. One half of the maximum lag distance in the data is a frequent
recommendation. The shape of the kriging neighborhood also deserves consideration. Rules
that define neighborhoods as the n* closest points will lead to elongated shapes if the sampling
intensity along a transect is higher than perpendicular to the transect. We prefer circular
kriging neighborhoods in general that can be suitably expanded based on some criterion about
the minimum number of points in the neighborhood. The krige2d procedure in The SAS®
System provides flexibility in determining the kriging neighborhood. The statements
proc krige2d data=ThatsYourData outest=krige;
coordinates xcoord=x ycoord=y;
grid griddata=predlocs xcoord=x ycoord=y;
predict var=Z radius=10 minpoints=15 maxpoints=35;
model form=exponential scale=10 range=7;
run;

for example, use a circular kriging neighborhood with a 10-unit radius. If the number of
points within this radius is less than 15, the radius is suitably increased to honor the minpoints=
option. If the neighborhood with radius 10 contains more than 35 observations, the radius is
similarly decreased. If the neighborhood is defined as the nearest n* observations, the predict
statement is replaced by (for n* = 20) predict var=Z radius=10 numpoints=20;.

Positivity Constraints

In ordinary kriging, the only constraint placed on the kriging weights λ(s_i) is

$$ \sum_{i=1}^{n} \lambda(s_i) = 1, $$

which guarantees unbiasedness. This does not rule out that individual kriging weights may be
negative. For attributes that take on positive values only (yields, weights, probabilities, etc.) a
potential problem lurks here, since the prediction of a negative quantity is not meaningful. But
just because some kriging weights are negative does not imply that the resulting predictor

$$ \sum_{i=1}^{n} \lambda(s_i) Z(s_i) $$

is negative, and in many applications the predicted values will honor the positivity requirement.
To exclude the possibility of negative predicted values, additional constraints can be imposed
on the kriging weights. For example, rather than minimizing

$$ \mathrm{E}\left[ \left\{ Z(s_0) - \sum_{i=1}^{n} \lambda(s_i) Z(s_i) \right\}^2 \right] $$

subject to Σ_{i=1}^n λ(s_i) = 1, one can minimize the mean square error subject to Σ_{i=1}^n λ(s_i) = 1
and λ(s_i) ≥ 0 for all s_i. Barnes and Johnson (1984) solve this minimization problem through
quadratic programming and find that a solution can always be obtained in the case of an unknown
but constant mean, thereby providing an extension of ordinary kriging. The positivity
and the sum-to-one constraint together also ensure that predicted values lie between the
smallest and largest value at observed locations. In our opinion this is actually a drawback.
Unless there is a compelling reason to the contrary, one should allow the predictions to extend
outside the range of observed values. A case where it is meaningful to restrict the range
of the predicted values is indicator kriging (§A9.9.6), where the attribute being predicted is a
binary (0,1) variable. Since the mean of a binary random variable is a probability, predictions
outside of the (0,1) interval are difficult to justify. Cressie (1993, p. 143) calls the extra constraint
of positive kriging weights “heavy-handed.”
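In quadratic programming form (our restatement of the problem Barnes and Johnson solve), the constrained weights minimize

$$ \lambda'\Sigma\lambda - 2\lambda' c + \mathrm{Var}[Z(s_0)] \quad \text{subject to} \quad \mathbf{1}'\lambda = 1, \;\; \lambda \ge 0, $$

since under the sum-to-one constraint the mean square prediction error of the linear predictor λ'Z(s) expands to exactly this quadratic function of λ.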



Kriging Variance Overstates Precision
The formulas for the kriging predictors and the corresponding kriging variances ([9.45]) contain
the variance-covariance matrix Σ, which is usually unknown since the semivariogram
γ(h) is unknown. An estimate Σ̂ of Σ is substituted in the relevant expressions. The uncertainty
associated with the estimation of the semivariances or covariances should be accounted
for in the determination of the mean square prediction error. This is typically not done. The
kriging predictions are obtained as if Σ̂ were the correct variance-covariance matrix of Z(s). As
a consequence, the kriging variance obtained by substituting Σ̂ for Σ is an underestimate of
the mean square prediction error.

9.4.5 Extensions to Multiple Attributes

Box 9.9 Cokriging and Spatial Regression

• If a spatial data set consists of more than one attribute and stochastic relationships exist among them, these relationships can be exploited to improve predictive ability.

• Commonly one attribute, Z_1(s), say, is designated the primary attribute and Z_2(s), …, Z_k(s) are termed the secondary attributes.

• Cokriging is a multivariate spatial prediction method that relies on the spatial autocorrelation of the primary and secondary attributes as well as the cross-covariances among the primary and the secondary attributes.

• Spatial regression is a multiple spatial prediction method where the mean of the primary attribute is modeled as a function of secondary attributes.

The spatial prediction methods discussed thus far predict a single attribute Z(s) at unobserved
locations s_0. In most applications, data collection is not restricted to a single attribute. Other
variables are collected at the same or different spatial locations, or the same attribute is
observed at different time points. Consider the case of two spatially varying attributes Z_1 and
Z_2 for the time being. To be general, it is not required that Z_1 and Z_2 are observed at the same
locations, although this will often be the case. The vectors of observations on Z_1 and Z_2 are
denoted

$$ Z_1(s_1) = [Z_1(s_{11}), \ldots, Z_1(s_{1n_1})]' $$
$$ Z_2(s_2) = [Z_2(s_{21}), \ldots, Z_2(s_{2n_2})]'. $$

If s_{1j} = s_{2j}, then Z_1 and Z_2 are said to be colocated; otherwise the attributes are termed non-colocated.
Figure 9.23 shows the sampling locations at which soil samples were obtained in a
chisel-plowed field and the relationship between soil carbon and total soil nitrogen at the
sampled locations. Figure 9.24 shows the relationship between total organic carbon percentage
and sand percentage in sediment samples from the Chesapeake Bay collected through the
Environmental Monitoring and Assessment Program (EMAP) of the U.S.-EPA. In both cases
two colocated attributes, (C, N) and (C, Sand %), have been observed. The relationship between
C and N is very strong in the field sample and reasonably strong in the aquatic sediment
samples.


Figure 9.23. Sample locations in a field where carbon and nitrogen were measured in soil
samples (left panel). Relationship between soil carbon and total soil nitrogen (right panel).
Data kindly provided by Dr. Thomas G. Mueller, Department of Agronomy, University of
Kentucky. Used with permission.


Figure 9.24. Total organic carbon percentage as a function of sand percentage collected at 47 base stations in Chesapeake Bay in 1993 through the US Environmental Protection Agency's Environmental Monitoring and Assessment Program (EMAP).

The attributes are usually not symmetric in that one attribute is designated the primary
variable of interest and the other attributes are secondary or auxiliary variables. Without loss
of generality we designate Z_1(s) as the primary attribute and Z_2(s), …, Z_k(s) as the secondary
attributes. For example, Z_1(s) may be the in-situ measurements of the plant canopy temperature
on a grassland site and Z_2(s) the temperature obtained from remotely sensed thermal
infrared radiation (see Harris and Johnson, 1996 for an application). The interest lies in
predicting the in-situ canopy temperature Z_1(s) while utilizing the information collected on
the secondary attribute and its relationship with Z_1(s). If primary and secondary attributes are
spatially structured and/or the secondary variables are related to the primary variable, predictive
ability should be enhanced by utilizing the secondary information.
Focusing on one primary and one secondary attribute in what follows, two basic
approaches can be distinguished (extensions to more than one secondary attribute are
straightforward).

• Z_1(s) and Z_2(s) are stationary random fields with covariance functions C_1(h) and C_2(h), respectively. They also covary spatially, giving rise to a cross-attribute covariance function Cov[Z_1(s), Z_2(s+h)] = C_12(h). By stationarity it follows that E[Z_1(s)] = μ_1 and E[Z_2(s)] = μ_2, and the dependence of Z_1(s) on Z_2(s) is stochastic in nature, captured by C_12(h). This approach leads to the cokriging methods.

• δ_1(s) is a stationary random field and Z_1(s) is related to Z_2(s) and δ_1(s) through a spatial regression model Z_1(s) = f(z_2(s), β) + δ_1(s), where the semivariogram of δ_1(s) is that of the detrended process Z_1(s) − f(z_2(s), β). No assumptions about the stochastic properties of Z_2(s) are made; the observed values z_2(s_1), …, z_2(s_n) are considered fixed in the analysis. The relationship between E[Z_1(s)] and Z_2(s) is deterministic, captured by the mean model f(z_2(s), β). This is a special case of a spatial regression model (§9.5).

Ordinary Cokriging
The goal of cokriging is to find a best linear unbiased predictor of Z_1(s_0), the primary attribute
at a new location s_0, based on Z_1(s) and Z_2(s). Extending the notation from ordinary
kriging, the predictor can be written as

$$ p_1(Z_1, Z_2; s_0) = \sum_{i=1}^{n_1} \lambda_{1i} Z_1(s_i) + \sum_{j=1}^{n_2} \lambda_{2j} Z_2(s_j) = \lambda_1' Z_1(s) + \lambda_2' Z_2(s). \qquad [9.48] $$

It is not required that Z_1(s) and Z_2(s) are colocated. Certain assumptions must be made, however.
It is assumed that the means of Z_1(s) and Z_2(s) are constant across the domain D, that
Z_1(s) has covariogram C_1(h), and that Z_2(s) has covariogram C_2(h). Furthermore, there is a
cross-covariance function that expresses the spatial dependency between Z_1(s) and
Z_2(s+h),

$$ C_{12}(h) = \mathrm{Cov}[Z_1(s), Z_2(s+h)]. \qquad [9.49] $$

The unbiasedness requirement implies that E[p_1(Z_1, Z_2; s_0)] = E[Z_1(s_0)] = μ_1, which implies
in turn that 1'λ_1 = 1 and 1'λ_2 = 0. Minimizing the mean square prediction error

$$ \mathrm{E}\left[ \{ Z_1(s_0) - p_1(Z_1, Z_2; s_0) \}^2 \right] $$

subject to these constraints gives rise to the system of cokriging equations



Ô D"" D"# 1 0 ×Ô -" × Ô c"! ×
Ö D#" D## 0 1 ÙÖ -# Ù Ö c#! Ù
Ö w ÙÖ ÙœÖ Ù, [9.50]
1 0 0 0 7" "
Õ 0 1w 0 0 ØÕ 7# Ø Õ ! Ø

where D"" œ VarcZ" asbd, D## œ VarcZ# asbd, D"# œ CovcZ" asbß Z# asbd, 7" and 7# are La-
grange multipliers and c"! œ CovcZ" asbß ^" as! bd, c#! œ CovcZ# asb,^" as! bd. These equations
are solved for -" , -# , 7" and 7# and the minimized mean square prediction error, the co-
kriging variance, is calculated as
#
5GO as! b œ Varc^" as! bd  -"w c"!  -#w c#! € 7" .
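Since [9.50] is simply a bordered linear system, it can be solved directly once the covariance blocks are available. The following SAS/IML sketch is ours: the blocks s11, s22, s12 and the vectors c10, c20 are toy values standing in for quantities that would come from fitted covariogram and cross-covariogram models.

   proc iml;
      /* hypothetical covariance blocks for n1 = n2 = 2 observed sites */
      s11 = {4 1, 1 4};   s22 = {3 1, 1 3};   s12 = {2 1, 1 2};
      c10 = {2, 1};       c20 = {1, 0.5};
      n1  = nrow(s11);    n2  = nrow(s22);
      one1 = j(n1,1,1);   one2 = j(n2,1,1);

      /* assemble the coefficient matrix and right-hand side of [9.50] */
      A = ( s11       || s12       || one1      || j(n1,1,0) ) //
          ( t(s12)    || s22       || j(n2,1,0) || one2      ) //
          ( t(one1)   || j(1,n2,0) || {0}       || {0}       ) //
          ( j(1,n1,0) || t(one2)   || {0}       || {0}       );
      rhs = c10 // c20 // {1} // {0};

      sol = solve(A, rhs);   /* lambda_1, lambda_2, and the multipliers */
      print sol;
   quit;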

Cokriging utilizes two types of correlations: spatial autocorrelation due to spatial proximity
(Σ_11 and Σ_22) and correlation among the attributes (Σ_12). To get a better understanding of
the cokriging system of equations, we first consider the special case where the secondary
variable is not correlated with the primary variable. Then Σ_12 = 0 and c_20 = 0, and the cokriging
equations reduce to

$$ ①\;\; \Sigma_{11}\lambda_1 - \mathbf{1}m_1 = c_{10}, \quad \mathbf{1}'\lambda_1 = 1 $$
$$ ②\;\; \Sigma_{22}\lambda_2 - \mathbf{1}m_2 = c_{20} \equiv \mathbf{0}, \quad \mathbf{1}'\lambda_2 = 0. $$

Equations ① are the ordinary kriging equations (see §A9.9.4), and λ_1 will be identical to the
ordinary kriging weights. From ② one obtains λ_2 = Σ_22^{-1}(c_{20} − 1·1'Σ_22^{-1}c_{20}/(1'Σ_22^{-1}1)) = 0.
The cokriging predictor reduces to the ordinary kriging predictor,

$$ p_1(Z_1, Z_2; s_0) = p_{OK}(Z_1; s_0) = \lambda_{1,OK}' Z_1(s), $$

and σ²_CK(s_0) reduces to σ²_OK(s_0). There is no benefit in using a secondary attribute unless it is
correlated with the primary attribute. There is also no harm in doing so.
Now consider the special case where Z_i(s) is the observation of Z(s) at time t_i, so that
the variance-covariance matrix Σ_11 describes spatial dependencies at time t_1 and Σ_22 spatial
dependencies at time t_2. Then Σ_12 contains the covariances of the single attribute across
space and time, and the kriging system can be used to produce maps of the attribute at future
points in time.
Extensions of ordinary cokriging to universal cokriging are relatively straightforward in
principle. Chilès and Delfiner (1999, Ch. 5.4) consider several cases. If the mean functions
E[Z_1(s)] = X_1β_1 and E[Z_2(s)] = X_2β_2 are unrelated, that is, each variable has a mean function
of its own and the coefficients are not related, the cokriging system [9.50] is extended to

$$
\begin{bmatrix}
\Sigma_{11} & \Sigma_{12} & X_1 & \mathbf{0} \\
\Sigma_{21} & \Sigma_{22} & \mathbf{0} & X_2 \\
X_1' & \mathbf{0} & \mathbf{0} & \mathbf{0} \\
\mathbf{0} & X_2' & \mathbf{0} & \mathbf{0}
\end{bmatrix}
\begin{bmatrix} \lambda_1 \\ \lambda_2 \\ m_1 \\ m_2 \end{bmatrix}
=
\begin{bmatrix} c_{10} \\ c_{20} \\ x_{10} \\ \mathbf{0} \end{bmatrix}. \qquad [9.51]
$$

Universal cokriging has not received much application, since it is hard to imagine a situation
in which the mean functions of the attributes are unrelated with Σ_12 not being a zero matrix.
Chilès and Delfiner (1999, p. 301) argue that cokriging is typically performed as ordinary
cokriging for this reason.



The two sets of constraints, namely that the kriging weights for the primary variable sum to one
and those for the secondary attributes sum to zero, were made to ensure unbiasedness.
Goovaerts (1998) points out that most of the secondary weights in λ_2 tend to be small and, because
of the sum-to-zero constraint, many of them will be negative, which increases the risk of
obtaining negative predicted values for the primary attribute. To enhance the contribution of the
secondary data and to limit the possibility of negative predictions, Isaaks and Srivastava
(1989, p. 416) proposed replacing the two constraints with a single sum-to-one constraint for
all weights, primary and secondary. This requires that the secondary attributes be rescaled to
have the same mean as the primary variable. An additional advantage of a single constraint
comes to bear in what has been termed collocated cokriging.
The cokriging predictor [9.48] does not require that Z_2(s) is observed at the prediction
location s_0. If Z_1(s_0) were observed, there would of course be no point in predicting it unless there is
measurement error and prediction of the signal S_1(s_0) matters. Collocated cokriging (Xu et al.
1992) is a simplification of cokriging where the secondary attribute is available at the prediction
location, while the primary attribute is not. The collocated cokriging system thus uses
only n_1 + 1 instead of n_1 + n_2 observations and, if separate constraints for the primary and
secondary attribute are entertained, must be performed as simple cokriging, since λ_2 would be
zero in an ordinary cokriging algorithm. Replacing the two constraints with a single sum-to-one
constraint avoids this problem and permits collocated cokriging to be performed as an ordinary
cokriging method.
Finally, using only a single constraint for the combined weights of all attributes has advantages
in the case of colocated data. Part of the problem in implementing cokriging is
having to estimate the cross-covariances Σ_12. A model commonly employed is the proportional
covariance model, where Σ_12 ∝ Σ_11. If Z_1(s) and Z_2(s) are colocated, the secondary
weights will be zero under the proportional covariance model when the constraints 1'λ_1 = 1,
1'λ_2 = 0 are used. A single constraint avoids this problem and allows the secondary variable
to contribute to the prediction of Z_1(s_0).

Multiple Spatial Regression

The second approach to utilizing information on secondary attributes is to consider the mean
of Z_1(s) to be a (linear) function of the secondary attributes, which are considered fixed,

$$ \mathrm{E}[Z_1(s)] = \beta_0 + \beta_1 Z_2(s) + \beta_2 Z_3(s) + \cdots + \beta_{k-1} Z_k(s) = x(s)'\beta. \qquad [9.52] $$

The error term of this model is a (second-order) stationary spatial process δ_1(s) with mean 0
and covariance function C_1(h) (semivariogram γ_1(h)). The complete model can then be
written as

$$ Z_1(s) = x'(s)\beta + \delta_1(s) $$
$$ \mathrm{E}[\delta_1(s)] = 0 \qquad [9.53] $$
$$ \mathrm{Cov}[\delta_1(s), \delta_1(s+h)] = C_1(h). $$

This model resembles the universal kriging model, but while there x(s) is a polynomial
response surface in the spatial coordinates, here x(s) is a function of secondary attributes observed
at locations s. It is for this reason that the multiple spatial regression model is particularly
meaningful in our opinion. In many applications colorful maps of an attribute are


attractive but not the most important result of the analysis. That crop yield varies across a
field is one thing. The ability to associate this variation with variables that capture soil
properties, agricultural management strategies, etc., is more meaningful, since it allows the
user to interpolate and extrapolate (within reasonable limits) crop yield to other situations. It
is a much overlooked fact that geostatistical kriging methods produce predicted surfaces for
the primary attribute at hand that are difficult to transfer to other environments and
experimental situations. If spatial predictions of soil carbon (Z_1(s)) rely on the stochastic
spatial variability of soil carbon and its relationship to soil nitrogen (Z_2(s)), for example, a
map of soil carbon can be produced from samples of soil nitrogen collected on a different
field, provided the relationship between Z_1(s) and Z_2(s) can be transferred to the new location
and the stochastic properties (the covariance function C_1(h)) are similar in the two instances.
One advantage of the spatial regression model [9.53] over the cokriging system is that it
only requires the spatial covariance function of the primary attribute. The covariance function
of the secondary attribute and the cross-covariances are not required. Spatial predictions of
Z_1(s_0) can then be performed as modified universal kriging predictions where the polynomial
response surface in the spatial coordinates is replaced by the mean function [9.52]. A second
advantage is that simultaneous estimation of β and the parameters determining the covariogram
C_1(h) is possible, based on maximum likelihood or restricted maximum likelihood
techniques. If one is willing to assume that δ_1(s) is a Gaussian random field, the
mixed procedure in The SAS® System can be used for this purpose. Imagine that the data in
Figure 9.23 are supplemented by similar samples of soil carbon and soil nitrogen on no-till
plots (which they actually were; see §9.8.4 for the application). The interest of the researcher
then shifts from producing maps of soil carbon under no-till and chisel-plow strategies to tests of
the hypothesis that soil carbon mean or variability is identical under the two management
regimes. This is easily accomplished with spatial regression models by incorporating into the
mean function [9.52] not only secondary attributes such as soil nitrogen levels, but also classification
variables that identify treatments. A sketch follows below.
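A minimal sketch of such a model in proc mixed, patterned on the code shown in §9.5.3 below. The data set and variable names (cn_data, carbon, nitrogen, tillage, x, y) are hypothetical, and the covariance parameters are left to REML rather than fixed:

   proc mixed data=cn_data;
      class tillage;
      /* mean function [9.52] with a classification variable for treatment */
      model carbon = tillage nitrogen / s;
      /* second-order stationary exponential covariance with nugget for delta(s) */
      repeated / subject=intercept local type=sp(exp)(x y);
   run; quit;

A test of the tillage effect then addresses the hypothesis of equal soil carbon means under the two regimes while accounting for the spatial autocorrelation.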
The multiple spatial regression approach has some drawbacks compared to cokriging,
however. Only colocated samples of Z_1(s) and all secondary attributes can be used in estimation.
If only one of the secondary attributes has not been observed at a particular location,
the information collected on any attribute at that location will be lost. In the chisel-plow vs.
no-till example this does not imply that all locations have to be managed under chisel-plowed
and no-till regimes. It implies that for any location where the primary attribute (soil carbon)
has been observed, it can be identified whether it belonged to the chisel-plow or no-till
treatment. If, in addition, soil carbon is modeled as a function of tillage and soil nitrogen,
only those records will be retained in the analysis where soil carbon and soil nitrogen have
been measured. If the secondary attribute stems from a much denser sampling scheme,
e.g., Z_2(s) has been remotely sensed and Z_1(s) is a sparser in-situ sample, there is no
problem, since each primary attribute can be matched with a secondary attribute. If the
secondary attribute has been sampled sparsely compared to the primary attribute, e.g., Z_1(s)
stems from yield monitoring and Z_2(s) from grid soil sampling, the applicability of multiple
spatial regression is limited. The attribute sampled most coarsely determines the spatial
resolution of the analysis.
Another "drawback" relates to the prediction of the primary attribute at a new location s_0.
It is required that the secondary attributes have been observed at the prediction location s_0. If
soil carbon is modeled as a function of tillage regime and soil nitrogen, prediction at a new
location cannot be performed unless the tillage regime and soil nitrogen at s_0 are known. This
is a feature of all regression methods and should not be construed as a shortcoming. If the
modeler determines that Y depends on X and the value of X is unknown, the model cannot be
used to produce a prediction of Y.

9.5 Spatial Regression and Classification Models

9.5.1 Random Field Linear Models

A spatial regression or classification model is a model for geostatistical data where interest
lies primarily in statistical inference about the mean function E[Z(s)] = μ(s). The most
important application of these models in the crop and soil sciences is the analysis of field
experiments where the experimental units exhibit spatial autocorrelation. The agricultural
variety trial is probably the most important type of experiment to which the models in this
section can be applied, but any situation in which E[Z(s)] is modeled as a function of other
variables in addition to a spatially autocorrelated error process falls under this heading.
Variety trials are particularly important here because of their size. Randomization
of treatments to experimental units neutralizes the effects of spatial correlation among experimental
units and provides the framework for statistical inference in which cause-and-effect
relationships can be examined. These trials are often conducted as randomized block designs
and, because of the large number of varieties involved, the blocks can be substantial in size.
Combining adjacent experimental units into blocks in agricultural variety trials can be at
variance with an assumption of homogeneity within blocks. Stroup, Baenziger, and Mulitze
(1994) note that if more than eight to twelve experimental units are grouped, spatial trends
will be removed only incompletely. Although randomization continues to neutralize these
effects, it does not eliminate them as a source of experimental error.
Figure 9.25 shows the layout of a randomized complete block design conducted as a field
experiment in Alliance, Nebraska. The experiment consisted of 56 wheat cultivars arranged in
four blocks and is discussed in Stroup et al. (1994) and Littell et al. (1996).
Analysis of the plot yields in this RCBD reveals a p-value for the hypothesis of no
varietal differences of p = 0.7119, along with a coefficient of variation of CV = 27.584. A
p-value that large should give the experimenter pause. That there are no yield differences
among 56 varieties is very unlikely. The large coefficient of variation conveys the considerable
magnitude of the experimental error variance. Blocking as shown in Figure 9.25 did
not eliminate the spatial dependencies among experimental units and left any spatial trends to
randomization, which increased the experimental error. The large p-value is not evidence of
an absence of varietal differences, but of an experimental design lacking power to detect
these differences.
Instead of the classical RCBD analysis one can adopt a modeling philosophy where the
variability in the data from the experiment is decomposed into large-scale trends and smooth-scale
spatial variation. Contributing to the large-scale trends are treatment effects, deterministic
effects of spatial location, and other explanatory variables. The smooth-scale variation
consists of a spatial random field that captures, for example, smooth fertility trends.


[Figure 9.25 appears here: a map of the trial with variety numbers printed at the plot positions; the horizontal axis is Longitude, the vertical axis Latitude.]

Figure 9.25. Layout of wheat variety trial at Alliance, Nebraska. Lines show block boundaries, numbers identify the placement of varieties within blocks. There are four blocks and 56 varieties. Drawn from data in Littell et al. (1996).

In the notation of §9.4 we are concerned with the spatial mean model

$$ Z(s) = \mu(s) + \delta(s), $$

where δ(s) is assumed to be a second-order stationary spatial process with semivariogram
γ(h) and covariogram C(h). The mean model μ(s) is assumed to be linear in the large-scale
effects, so that we can write

$$ Z(s) = X(s)\beta + \delta(s). \qquad [9.54] $$

We maintain the dependency of the design/regressor matrix X(s) on the spatial location, since
X(s) may contain, apart from design (e.g., block) and treatment effects, other variables that
depend on the spatial location of the experimental units, or the coordinates of the observations
themselves, although that is not necessarily so. Zimmerman and Harville (1991) refer to [9.54]
as a random field linear model. Since the spatial autocorrelation structure of δ(s) is modeled
through a semivariogram or covariogram, we take a direct approach to modeling spatial
dependence rather than an autoregressive approach (in the vernacular of §9.3). This can be
rectified with the earlier observation that data from field experiments are typically lattice data,
where autoregressive methods are more appropriate, by considering each observation as
concentrated at the centroid of the experimental unit (see Ripley 1981, p. 94, for a contrasting
view that utilizes block averages instead of point observations).
The model for the semivariogram/covariogram is critically important for the quality of
spatial predictions in kriging methods. In spatial random field models, where the mean function
is of primary importance, it turns out to be important to do a reasonable job of modeling
the second-order structure of δ(s); but, as Zimmerman and Harville (1991) note, treatment
comparisons are relatively insensitive to the choice of covariance functions (provided
the set of functions considered is a reasonable one and the mean function is properly
specified). Besag and Kempton (1986) found that inclusion of a nugget effect also appears to
be unnecessary in many field-plot experiments.
Before proceeding further with random field linear models we need to remind the reader
of the adage that one modeler's random effect is another modeler's fixed effect. Statistical
models that incorporate spatial trends in the analysis of field experiments have a long history.
In contrast to the random field models, previous attempts at incorporating the spatial structure
focused on the mean function μ(s) rather than the stochastic component of the model. The
term trend analysis has been used in the literature to describe methods that incorporate covariate
terms that are functions of the spatial coordinates. In a standard RCBD analysis where Y_ij
denotes the observation on treatment i in block j, the statistical model for the analysis of variance
is

$$ Y_{ij} = \mu + \rho_j + \tau_i + e_{ij}, $$

where the experimental errors e_ij are uncorrelated random variables with mean 0 and variance
σ². A trend analysis changes this model to

$$ Y_{ij} = \mu + \tau_i + \theta_{kl} + e_{ij}, \qquad [9.55] $$

where θ_kl is a polynomial in the row and column indices of the experimental units (Brownie,
Bowman, and Burton 1993). If r_k is the kth row and c_l the lth column of the field layout, then
one may choose θ_kl = β_1 r_k + β_2 c_l + β_3 r_k² + β_4 c_l² + β_5 r_k c_l, for example, a second-order response
surface in the row and column indices. The difference from a random field linear model
is that the deterministic term θ_kl is assumed to account for the spatial dependency between
experimental units. It is a fixed effect. It does, however, appeal to the notion of smooth-scale
variation in the sense that the spatial trends move smoothly across block boundaries.
The block effects have disappeared from model [9.55]. Applications of these trend analysis
models can be found in Federer and Schlottfeldt (1954), Kirk, Haynes, and Monroe (1980),
and Bowman (1990). Because it is assumed that the error terms e_ij remain uncorrelated, they
are not spatial random field models in our sense and will not be discussed further. For a comparison
of trend and random field analyses see Brownie et al. (1993).
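As a concrete illustration (ours, with placeholder names), model [9.55] with the second-order response surface above can be fit with independent errors in proc glm, where variety is the treatment factor and r and c are the row and column indices of each plot:

   proc glm data=trial;
      class variety;
      /* treatment effects plus a second-order surface in the row/column indices */
      model yield = variety r c r*r c*c r*c / solution;
   run; quit;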
A second type of model that maintains independence of the errors is the nearest-neighbor
model, which is based on differencing observations with each other or on taking differences
between plot yields and cultivar averages. The Papadakis nearest-neighbor analysis
(Papadakis 1937), for example, calculates residuals between plot yields and arithmetic treatment
averages in the East-West and North-South directions and uses these residuals as covariates
in the mean model (the θ_kl part of the trend analysis model). The Schwarzbach
analysis relies on adjusted cultivar means, which are arithmetic means corrected for average
responses in neighboring plots (Schwarzbach 1984).
In practical applications it may be difficult to choose between these various approaches
to modeling spatial dependencies and to discriminate between different models. For example,
changing the fixed effects trend by including or eliminating terms in a trend analysis will
change the autocorrelation of the model residuals. Brownie and Gumpertz (1997) conclude
that it is necessary to account for major spatial trends as fixed effects in the model, but also
that random field analyses are surprisingly robust to moderate misspecification of the fixed
trend and retain a high degree of validity of tests and estimates of precision. The reason, in
our opinion, is that a model which simultaneously models large- and small-scale stochastic
trends is able, within limits, to capture omitted trends in the mean model through the spatial
dependency structure in the error process. Zimmerman and Harville (1991) refer to this
effect as the covariance function "soaking up" spatial heterogeneity that would otherwise be
fitted through fixed effects in the mean function. A trend analysis model or nearest-neighbor
model that assumes that the mean function is correctly specified and the errors are uncorrelated
will be invalid if the mean function is not modeled properly. There is nothing in the error
structure that can "soak up" the misspecification of the fixed effects.

9.5.2 Some Philosophical Considerations

Modeling data from a field experiment with random field methods seems like a win-win
situation. The modeler can add or delete terms in the fixed effects part of the model that capture
large-scale trends and let the covariance function of the error process δ(s) pick up any
smooth-scale spatial variation of the omitted effects. As always, there is no free lunch, and the
analyst must be aware of the differences between modeling the data from a designed experiment
vs. relying on randomization theory. The classical analysis of an experimental design
stems from its underlying linear model, which in turn is generated by the particular error-control,
treatment, and observational designs. The ability to perform cause-and-effect inferences
rests on these design components. Randomization ensures that the unaccounted effects,
such as systematic spatial trends among the experimental units, are balanced out. This
implies that expectations are reckoned over the randomization distribution of the design. In
the Alliance, Nebraska wheat yield variety trial this distribution is formed by all possible
arrangements of the 56 treatments on the 56 × 4 = 224 experimental units. The observed outcomes
are considered fixed in the randomization approach. Assume, for a moment, that the
three rightmost columns of experimental units in Figure 9.25 are systematically different from
the other units. Should we take this into account in specifying the statistical model for the
analysis, or appeal to the fact that under randomization such effects are washed (balanced)
out? There are three schools of thought:
1. Appeal to the randomization distribution because it allows causal inference. In effect, stick with the randomized complete block analysis. If it does not work out because blocking was carried out incorrectly, learn from the mistake and fix the problem the next time a variety trial with fifty-six treatments is conducted.
2. Do not appeal to the randomization distribution and model the variability and effects for this particular set of data. This is a modeling exercise determining which effects are modeled as part of the mean structure X(s)β and which effects are "soaked up" by the error structure.
3. Appeal to randomization but also to the fact that stochastic elements beyond the randomization of treatments to units are at work. In developing the analysis, appeal to a model where the errors are no longer independent and take expectations with respect to the joint distribution of randomization and the spatial process.

The three approaches differ in what is considered the correct model for analysis and how
it is used. In (1) the correct model stems from the error-control, treatment, and observational
design components. Treatment comparisons will always be unbiased under this approach, but
can be inefficient if the design was not chosen carefully (as in the Alliance, Nebraska trial).
In (2) the analyst is charged with developing a suitable model. Statistical inference
proceeds assuming that the selected model is correct. If a wrong model is used, treatment
comparisons will be biased. Since there is never unshakable evidence that the final model is
correct, one can no longer make causal statements about the effect of treatments on the outcome.
Statistical inference is associative rather than causal. The third approach is a mixture
technique. It recognizes dependencies among the experimental units and the fact that treatments
are randomly assigned to the units. Expectations of mean squares are calculated first
over the randomization distribution conditional on the spatial process and then over the
spatial process (see, for example, Grondona and Cressie 1991).
In spatial analyses the observed data are considered a realization of a random field, and
modeling the mean and dispersion structure proceeds in an observational manner. Whether a
spatial model will provide a more efficient analysis depends on the extent to which large-scale and
small-scale trends are conducive to modeling. Besag and Kempton (1986) conclude that many
agronomic experiments are not carried out in a sophisticated manner. The reasons may be
convenience, unfamiliarity of the experimenter with more complex design choices, or tradition.
We agree that it is hardly reasonable to conduct a field experiment with 56 treatments in
a randomized complete block design. An incomplete block design or a resolvable, cyclic design
may have been more appropriate. Nevertheless, many experiments are still conducted in
this fashion. Bartlett (1938, 1978a) views analyses that emphasize the spatial context over the
design context as ancillary devices to salvage efficiency in experiments that could have been
designed more appropriately. Spatial random field models are more than salvage tools. They
are statistical models that describe the variation in data, whether the data stem from a designed
experiment or an observational study. By switching from a design-based analysis to
one based on modeling, however, the ability to draw causal inferences is sacrificed.

9.5.3 Parameter Estimation

In matrix-vector notation, model [9.54] can be written as

$$ Z(s) = X(s)\beta + \delta(s), \qquad \delta(s) \sim (0, \Sigma(\theta)), \qquad [9.56] $$

and the parameters of the model to be estimated are φ = [β', θ']'. θ relates to the spatial
dependency structure and β to the large-scale trend. As models for Σ(θ) we usually consider
covariograms that are derived from the isotropic semivariogram models in §9.2.2, keeping the
number of parameters in θ small. Because we work with covariances, it is assumed that the
process is second-order stationary, so that its covariogram is well-defined. Two general
approaches to parameter estimation can be distinguished: likelihood and likelihood-type
methods, which estimate θ and β simultaneously, and least squares methods, which estimate β
given an externally obtained estimate of the spatial dependency.

Least Squares Methods

If Σ(θ) were known, parameter estimates for β could be obtained by generalized least squares
(GLS):

$$ \hat\beta_{GLS} = \left( X'\Sigma(\theta)^{-1}X \right)^{-1} X'\Sigma(\theta)^{-1} Z(s). $$

Since Σ(θ) is usually unknown, we are faced with a quandary similar to that in universal kriging.



Estimating θ through semivariogram analysis requires detrending of the data, that is, an estimate
of β. Efficient estimation of β requires knowledge of θ. The usual approach is to
1. Assume Σ(θ) = σ²I and fit the model by ordinary least squares to obtain β̂_OLS.
2. Obtain the OLS residuals ê(s) = Z(s) − X(s)β̂_OLS.
3. Fit a parametric, second-order stationary semivariogram based on the ê(s) to obtain θ̂.
4. Use the estimates from the semivariogram fit to construct the Σ(θ̂) matrix.

These steps can (and should) be iterated, replacing OLS residuals in step 2 with GLS
residuals after the first iteration. The final estimates of the mean parameters are estimated
generalized least squares estimates

$$ \hat\beta_{EGLS} = \left( X'\Sigma(\hat\theta)^{-1}X \right)^{-1} X'\Sigma(\hat\theta)^{-1} Z(s). \qquad [9.57] $$

The same issues as in §9.4.4 must be raised here. The residuals lead to a biased estimate
of the semivariogram of δ(s), and β̂_OLS is an inefficient estimator of the large-scale trend
parameters. Since the emphasis in spatial random field linear models is often not on prediction
but on estimation and hypothesis testing about β, these issues are not quite as critical as in
the case of universal kriging. If the results of a random field linear model analysis are used to
predict Z(s_0) as a function of covariates and the spatial autocorrelation structure, the issues
regain importance.

Likelihood Methods
Likelihood methods circumvent these problems because the mean and covariance parameters
are estimated simultaneously. On the other hand, they require distributional assumptions about
Z(s) or δ(s). If δ(s) is a Gaussian random field, then twice the negative log-likelihood of
Z(s) is

$$ -2\,l(\beta, \theta; z(s)) = n\ln\{2\pi\} + \ln|\Sigma(\theta)| + (z(s) - X(s)\beta)'\Sigma(\theta)^{-1}(z(s) - X(s)\beta), $$

and the maximum likelihood estimates β̂_M, θ̂_M minimize this expression. This process is
generally iterative and can be simplified by profiling the likelihood. This numerically efficient
method can be applied if some parameters have a closed-form solution given the others.
First consider θ fixed and known. Minimizing −2l(β, θ; z(s)) is then equivalent to minimizing
(z − X(s)β)'Σ(θ)^{-1}(z − X(s)β). Since this is a generalized residual sum of squares, the
maximum likelihood estimate of β (given θ) is

$$ \hat\beta_{GLS} = \left( X'\Sigma(\theta)^{-1}X \right)^{-1} X'\Sigma(\theta)^{-1} Z(s). $$

The profiled (negative) log likelihood is obtained by substituting this expression back into
−2l(β, θ; z(s)), which is then only a function of θ and is minimized with respect to θ. The
resulting estimate θ̂_M is the maximum likelihood estimate of θ, and the MLE of β is

$$ \hat\beta_M = \left( X'\Sigma(\hat\theta_M)^{-1}X \right)^{-1} X'\Sigma(\hat\theta_M)^{-1} Z(s). \qquad [9.58] $$

© 2003 by CRC Press LLC


The maximum likelihood ([9.58]) and estimated generalized least squares ([9.57]) estimates
are very similar. They differ only in the covariance parameter estimate that is being
substituted. To reduce the bias in maximum likelihood estimates of the covariance parameters,
it is again recommended to perform restricted maximum likelihood estimation. The REML
estimates of the large-scale trend parameters are obtained as

$$ \hat\beta_R = \left( X'\Sigma(\hat\theta_R)^{-1}X \right)^{-1} X'\Sigma(\hat\theta_R)^{-1} Z(s). \qquad [9.59] $$

Software Implementation
The three methods, GLS, ML, and REML, lead to very similar formulas for the β estimates.
The mixed procedure in The SAS® System can be used to obtain any one of the three. The
spatial covariance structure of δ(s) is specified through the repeated statement of the
procedure. In contrast to the clustered data models in §7, all data points are potentially autocorrelated,
which calls for the subject=intercept option of the repeated statement.
Assume that an analysis of OLS residuals leads to an exponential semivariogram with
practical range 4.5, partial sill 10.5, and nugget 2.0. The spatial coordinates of the data points
are stored in variables xloc and yloc of the SAS data set. The mean model consists of treatment
effects and a linear response surface in the coordinates. The following statements obtain
the EGLS estimates [9.57], preventing proc mixed from iteratively updating the covariance
parameters (noiter option of the parms statement). The noprofile option prevents the profiling
of an extra scale parameter from Σ(θ). The Table of Covariance Parameter Estimates will
contain three rows entitled Variance, SP(EXP), and Residual. These correspond to the partial
sill, the range, and the nugget effect, respectively. Notice that the parameterization of the
exponential covariogram in proc mixed considers the range parameter to be one third of the
practical range.

/* ----------------------------------------------------- */
/* Fit the model by EGLS for fixed covariogram estimates */
/* ----------------------------------------------------- */
proc mixed data=RFLMExample noprofile ;
class treatment;
model Z = treatment xloc yloc xloc*yloc / s;
parms /* sill */ ( 10.5 )
/* range */ ( 1.5 )
/* nugget */ ( 2.0 ) / noiter;
/* The local option of the repeated statement adds the */
/* nugget effect */
repeated /subject=intercept local type=sp(exp)(xloc yloc);
run; quit;

Restricted maximum likelihood estimates are obtained in proc mixed with the statements
proc mixed data=RFLMExample noprofile ;
class treatment;
model Z = treatment xloc yloc xloc*yloc / s;
parms /* sill */ ( 6 to 12 by 2 )
/* range */ ( 0.5 to 3 by 1.5 )
/* nugget */ ( 1 to 4 by 1.0 );
repeated /subject=intercept local type=sp(exp)(xloc yloc);
run; quit;



The noiter option was removed from the parms statement, which prompts the procedure
to iteratively update the covariance parameter estimate θ̂. For each element of θ a range of
starting values is given. This can considerably speed up estimation, which can require formidable
resources for large data sets. If the grid of starting values is too fine, this is somewhat
counterproductive, as the procedure then has to evaluate many combinations of possible starting
values before settling on the best set. The default estimation procedure for covariance
parameter estimation is restricted maximum likelihood, and the code example above yields β̂_R
as in [9.59]. To obtain maximum likelihood estimates, add the method=ml option to the proc
mixed statement.

9.6 Autoregressive Models for Lattice Data

Box 9.10 Lattice Models

• Models for spatial lattice data are close relatives of time series models.

• A lattice model commences with the user's definition of spatial connectivity among sites. This choice is then combined with an appropriate model for the marginal or conditional distribution of the data that is consistent with the neighborhood structure.

• Depending on whether the joint or conditional distribution of Z(s) is being modeled, SSAR and CSAR models for lattice data are distinguished.

9.6.1 The Neighborhood Structure

Lattice data are spatial data where the index set D is a fixed, discrete subset of ℝ² of countable
points, and Z(s) is a random variable at location s ∈ D. Examples of lattice data are observations
made by census tract, county, or city block, data from field trials, and remotely
sensed images. Keeping with the literature on lattice data, we call the locations s ∈ D the sites
of the lattice. It is common to enumerate the countable set of sites in a lattice; for example,
counties or census tracts can be numbered from 1 to n. Since the numbering in itself does not
convey any spatial information, it is necessary to define a location feature of each site, such as
the county center or the seat of the county government. On rectangular lattices (field experiments)
the center of the unit is often used, or experimental units can be identified by row and
column number.
Modeling the spatial dependence among observations via the semivariogram or covariogram
requires a smooth-scale spatial structure and a continuous spatial process. With lattice
data, other means of capturing the spatial dependence are needed. The notion of stationarity is
of somewhat questionable value for processes operating on irregularly shaped area units or
partitions (census tracts, counties, landscapes, regions, states, etc.). Even if there exists an
underlying stationary continuous-space process, variances and covariances will not be the
same for all areas if the observations arise from different area integrations. Stationary covariance
or semivariogram functions are then not useful for describing the stochastic interarea
relationships. For lattice data a different system is needed. This starts with a definition of
what is considered the neighborhood of site s_i. By choosing the neighborhood structure, the
modeler determines which sites are spatially connected and the degree of their connectedness.
In a regular lattice with nearest neighbor dependence, for example, a site s_i can be declared as
being connected only to its immediate neighbors. Cliff and Ord (1981) and Upton and Fingleton
(1985) distinguish the rook, bishop, and queen definitions of spatial contiguity, drawing on
the respective moves on the chess board (Figure 9.26). The rook definition is sometimes
identified as the nearest-neighbor definition.
[Figure 9.26 appears here: three 5 × 5 lattice panels, (a), (b), and (c), illustrating the three contiguity definitions described in the caption.]

Figure 9.26. Definitions of spatial contiguity (neighborhood, connectedness) on a regular lattice. (a) rook's definition: edges abut; (b) bishop's definition: touching corners; (c) queen's definition: touching corners and edges.

On an irregular lattice, neighborhoods are defined differently. Consider a lattice consisting
of area units such as counties (Figure 9.27). A neighborhood definition akin to the queen
definition on a regular lattice is to collect into the neighborhood of a site all sites that share a
common boundary with the county. Alternatively, one can identify a single point s_i with each
county, for example, the seat of the county government. The neighborhood then can be defined
as all counties within a given distance from the county seat, or as all counties whose
county seat is within a given distance of s_i.

Figure 9.27. Counties of North Carolina and possible neighborhood definitions for two counties: counties with adjoining borders (left part of map) or counties within 30 miles of the county seat (center part of map). County seats are shown as dots.


Once the neighbors of all sites are identified, weights w_ij are assigned. If sites s_i and s_j are not connected, then w_ij = 0, but what weights are to be assigned to sites that are neighbors? The simplest solution is through binary variables:

    w_ij = 1 if s_j is a neighbor of s_i; w_ij = 0 if s_j is not a neighbor of s_i; w_ij = 0 if i = j.

A binary weighting scheme is reasonable when sites are spaced regularly at uniform distances (Figure 9.26). If sites are arranged irregularly or represent areal units of different size and shape (Figure 9.27), weights should be chosen more carefully. Some possibilities (Haining 1990) are
• w_ij = ||s_i − s_j||^(−δ), δ ≥ 0
• w_ij = exp{−||s_i − s_j||²}
• w_ij = (l_ij/l_i)^τ, where l_ij is the length of the common border between areas i and j and l_i is the perimeter of the border of area i
• w_ij = (l_ij/l_i)^τ / ||s_i − s_j||^δ.

The weights are then collected in an (n × n) weight matrix W = [w_ij], and the statistical model is parameterized to incorporate the large-scale mean structure as well as the interactive spatial correlation structure. The two approaches that are common lead to the simultaneous and the conditional spatial autoregressive models. These combine the user's choice of neighbors with an appropriate model for the marginal or conditional distribution of the data that is consistent with the neighborhood structure.
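
To make the construction of W concrete, the following PROC IML sketch builds the binary rook contiguity matrix for a small rectangular lattice; the 4 × 5 dimensions are illustrative, and the sites are assumed to be numbered row-wise (the %ContWght macro used in §9.8.1 provides this functionality in general).

proc iml;
  r = 4; c = 5; n = r*c;          /* lattice dimensions and number of sites */
  W = j(n,n,0);                   /* start with all weights equal to zero   */
  do i = 1 to n;
    do k = 1 to n;
      /* site i occupies row ceil(i/c) and column mod(i-1,c)+1;
         rook definition: edges abut, i.e., sites one step apart
         horizontally or vertically                              */
      if abs(ceil(i/c)-ceil(k/c)) + abs(mod(i-1,c)-mod(k-1,c)) = 1
         then W[i,k] = 1;
    end;
  end;
  print (W[,+]`)[label="Number of neighbors per site"];
quit;

Interior sites have four neighbors under this definition, edge sites three, and corner sites two, which the printed row sums confirm.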
The choice of the neighborhood structure is largely subjective. The fact that two sites have nonzero connectivity weights does not imply a causal relationship between the responses at the sites. It is a representation of the local variation due to extraneous conditions. Besag (1975) calls this "third-party" dependence. Imagine that a locally varying regressor variable on which Z(s) depends has not been observed. The localized neighborhood structure supplants the missing information by defining groups of sites which would have been affected similarly by the unobserved variable because they are in spatial proximity of each other.

9.6.2 First-Order Simultaneous and Conditional Models


In the simplest simultaneous spatial autoregressive (SSAR) model the response at site s_i is expressed as an adjustment to the mean at s_i. The adjustment consists of random error e(s_i) and the weighted influences of sites in the neighborhood. If E[Z(s_i)] = μ(s_i), an SSAR model can be expressed formally as

    Z(s_i) = μ(s_i) + ρ_s Σ_{j=1}^{n} w_ij (Z(s_j) − μ(s_j)) + e(s_i).    [9.60]

In contrast to a spatial regression model, where secondary attributes are used as regressors and their values are considered fixed, the SSAR model regresses Z(s_i) onto neighboring values of the same attribute, and these remain random variables. The parameter ρ_s measures the strength of the spatial autocorrelation but, in contrast to autoregressive time series models, cannot be interpreted as a correlation parameter. For example, the range of ρ_s depends on the structure of W. Using matrix and vector notation and assuming that μ(s_i) = x′(s_i)β, [9.60] can be expressed more concisely as

    Z(s) = X(s)β + ρ_s W(Z(s) − X(s)β) + e(s).

It follows that E[Z(s)] = X(s)β and, if the e(s_i) are homoscedastic with variance σ_s², that

    Var[Z(s)] = σ_s² (I − ρ_s W)^(−1) (I − ρ_s W′)^(−1).

Instead of choosing a semivariogram or covariogram model for the smooth-scale variation as in a spatial regression model, this spatial autoregressive model for lattice data requires estimation of only one parameter associated with the spatial autocorrelation, ρ_s. The structure and degree of the autocorrelation are determined jointly by the structure of W and the magnitude of ρ_s.
The second class of autoregressive models for lattice data, the conditional spatial autoregressive (CSAR) models, commences with the conditional mean and variance of a site's response given the observed values at all other sites. Denote by z(s)_−i the vector of observed values at all sites except the ith one. A CSAR model is then defined through

    E[Z(s_i)|z(s)_−i] = μ(s_i) + ρ_c Σ_{j=1}^{n} w_ij (z(s_j) − μ(s_j)),   Var[Z(s_i)|z(s)_−i] = σ_i².    [9.61]

For spatial data the conditional and simultaneous formulations lead to different models. Assume that the conditional variances are identical, σ_i² ≡ σ_c². The marginal variance in the CSAR model is then given by Var[Z(s)] = σ_c²(I − ρ_c W)^(−1), which is to be compared against Var[Z(s)] = σ_s²(I − ρ_s W)^(−1)(I − ρ_s W′)^(−1) in the simultaneous scheme. Even if σ_c² = σ_s² and ρ_c = ρ_s, the variance-covariance matrices will differ. Furthermore, since the variance-covariance matrix of Z(s) is symmetric, it is necessary in the CSAR model with constant conditional variance that the weight matrix W be symmetric. The SSAR model imposes no such restriction on the weights. If a lattice consists of irregularly shaped areal units, asymmetric weights are often reasonable. Consider a study of urban sprawl with an irregular lattice of counties where a large county containing a metropolitan area is surrounded by smaller rural counties. It is reasonable to assume that what happens in a small county is very much determined by developments in the metropolitan area, while the development of the major city will be much less influenced by a rural county. In a regular lattice asymmetric dependency parameters may also be possible. A site located on the edge of the lattice will depend on an interior site differently from how an interior site depends on an edge site. In these cases an asymmetric neighborhood structure is called for, which rules out the CSAR model unless the conditional variances are adjusted.
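
The difference between the simultaneous and conditional covariance structures is easy to verify numerically. The following sketch computes both matrices for a three-site chain with binary weights; ρ = 0.2 and σ² = 1 are illustrative values, not estimates.

proc iml;
  W   = {0 1 0,
         1 0 1,
         0 1 0};              /* site 2 is a neighbor of sites 1 and 3      */
  rho = 0.2; sigma2 = 1;
  A   = inv(I(3) - rho*W);
  Vssar = sigma2 * A * A`;    /* SSAR: sigma2*(I-rho*W)^(-1)(I-rho*W')^(-1) */
  Vcsar = sigma2 * A;         /* CSAR: sigma2*(I-rho*W)^(-1)                */
  print Vssar, Vcsar;
quit;

Although W, ρ, and σ² are identical in the two calculations, the printed variance-covariance matrices differ.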
SSAR models have disadvantages, too. The model disturbances e(s_i) and the responses Z(s_j) are not uncorrelated in these models, which is in contrast to autoregressive time series models. This causes ordinary least squares estimators to be inconsistent. In matrix-vector notation the CSAR model can be written as

    Z(s) − μ(s) = ρ_c W(Z(s) − μ(s)) + e(s),

where e(s) = (I − ρ_c W)(Z(s) − μ(s)) is a vector of pseudo-errors that is uncorrelated with Z(s), and hence the CSAR model retains this feature of the related time series model. This has advantages in parameter estimation. Cressie (1993) reasons that when the process has achieved stability, symmetric dependencies among sites are in general more natural than asymmetric dependencies, and concludes that CSAR models are more natural than SSAR models.
The parameter ρ is called an interaction parameter since autoregressive models are interactive models. In a first-order autoregressive time series this parameter measures the correlation between neighboring time points, but not so with spatial data. The matrix (I − ρW) must be invertible and hence its determinant |I − ρW| must be nonzero. This places restrictions on the possible values of ρ. If {λ_i} denotes the set of eigenvalues of W, then, if the smallest eigenvalue is negative and the largest eigenvalue is positive,

    1/min{λ_i} < ρ < 1/max{λ_i}.

For square lattices ρ is restricted to −0.25 < ρ < 0.25 as the size of the lattice increases. The range of the interaction parameter can be affected by standardization. If the rows of W are standardized to sum to one, then |ρ| < 1 (Haining 1990, p. 82). For regular lattices and the rook neighborhood definition without row standardization the permissible ranges for ρ are shown in Table 9.5; a sketch that verifies the table entries numerically follows the table.

Table 9.5. Limits on interaction parameter in first-order spatial autoregressive
models as a function of the size of a regular lattice

    Lattice size    1/min{λ_i}    1/max{λ_i}
    3 × 3             −0.354         0.354
    4 × 4             −0.309         0.309
    5 × 5             −0.289         0.289
    6 × 6             −0.277         0.277
    7 × 7             −0.271         0.271
    8 × 8             −0.266         0.266
    9 × 9             −0.263         0.263
    10 × 10           −0.261         0.261
    20 × 20           −0.253         0.253
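
As promised above, the table entries can be reproduced from the eigenvalues of the rook contiguity matrix. A sketch for the 3 × 3 lattice, reusing the construction from §9.6.1:

proc iml;
  r = 3; c = 3; n = r*c;
  W = j(n,n,0);                       /* rook contiguity matrix as before  */
  do i = 1 to n;
    do k = 1 to n;
      if abs(ceil(i/c)-ceil(k/c)) + abs(mod(i-1,c)-mod(k-1,c)) = 1
         then W[i,k] = 1;
    end;
  end;
  lambda = eigval(W);                 /* real eigenvalues; W is symmetric  */
  lower  = 1/min(lambda);             /* prints -0.354 for the 3x3 lattice */
  upper  = 1/max(lambda);             /* prints  0.354                     */
  print lower upper;
quit;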

Models [9.60] and [9.61] are termed first-order models since they involve only one set of neighborhood weights and a single interaction parameter. To make Z(s) a function of two interaction parameters that measure different distance effects, the SSAR model can be modified to a second-order model as

    Z(s) = X(s)β + (ρ_1 W_1 + ρ_2 W_2)(Z(s) − X(s)β) + e(s).

For example, W_1 can be a rook neighborhood structure and W_2 a bishop neighborhood structure. The CSAR model can be similarly extended to a higher-order scheme (see Whittle 1954, Besag 1974, Haining 1990).



9.6.3 Parameter Estimation
The marginal mean and variance in the (homoscedastic) first-order SSAR and CSAR models are

    SSAR: E[Z(s)] = X(s)β,   Var[Z(s)] = σ_s²(I − ρ_s W)^(−1)(I − ρ_s W′)^(−1)
    CSAR: E[Z(s)] = X(s)β,   Var[Z(s)] = σ_c²(I − ρ_c W)^(−1),

and one could estimate the mean parameters by least squares, minimizing

    SSAR: σ_s^(−2) (Z(s) − X(s)β)′(I − ρ_s W′)(I − ρ_s W)(Z(s) − X(s)β)
    CSAR: σ_c^(−2) (Z(s) − X(s)β)′(I − ρ_c W)(Z(s) − X(s)β).

Unfortunately, because the errors e(s_i) and data Z(s_j) in the SSAR model are not uncorrelated, the least squares estimates in the simultaneous scheme are not consistent (Whittle 1954, Mead 1967, Ord 1975). Ord (1975) devised a modified least squares procedure for the case E[Z(s)] = 0 that yields consistent estimates but comments on its low efficiency. The CSAR model does not suffer from this shortcoming, and least squares estimation is possible there. When the spatial autoregressive model contains reactive effects (β) in addition to an autoregressive structure, it is desirable to obtain estimates of the large-scale mean structure and the interaction parameters simultaneously. The maximum likelihood method seems to be the method of choice. Unless the distribution of Z(s) is Gaussian, however, maximum likelihood estimation is numerically cumbersome. Ord (1975) adapted an iterative procedure developed by Cochrane and Orcutt (1949) for estimation in simultaneous time series models to obtain maximum likelihood estimates in the SSAR model. Haining (1990, p. 128) notes that no proof exists that this adapted algorithm converges to a local minimum in the spatial case. If the Gaussian assumption is reasonable we prefer maximum likelihood estimation of the parameters in a simultaneous scheme with a profiling algorithm as outlined in §A9.9.7.
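
To illustrate the profiling idea in the Gaussian case, consider the CSAR model: for fixed ρ_c, β has a closed-form generalized least squares estimate and σ_c² a closed-form maximum likelihood estimate, so the profile log-likelihood can be maximized over a one-dimensional grid of ρ_c values. The following is a bare-bones sketch of this device, not the §A9.9.7 algorithm itself; the lattice, the intercept-only mean structure, and the simulated data are purely illustrative.

proc iml;
  call randseed(1234);
  r = 5; c = 5; n = r*c;
  W = j(n,n,0);                              /* 5x5 rook lattice as before */
  do i = 1 to n;
    do k = 1 to n;
      if abs(ceil(i/c)-ceil(k/c)) + abs(mod(i-1,c)-mod(k-1,c)) = 1
         then W[i,k] = 1;
    end;
  end;
  X = j(n,1,1);                              /* intercept-only mean structure */
  /* simulate one CSAR realization with beta=2, rho=0.2, sigma2=1 */
  V = inv(I(n) - 0.2*W); V = 0.5*(V + V`);
  e = j(n,1,0); call randgen(e,'Normal');
  z = 2*X + root(V)`*e;
  /* profile the Gaussian log-likelihood over a grid of rho values */
  best = -1e15;
  do rho = -0.28 to 0.28 by 0.005;           /* within the Table 9.5 limits */
    Ai   = I(n) - rho*W;
    b    = inv(X`*Ai*X) * X`*Ai*z;           /* GLS estimate of beta given rho  */
    s2   = (z-X*b)`*Ai*(z-X*b) / n;          /* ML estimate of sigma2 given rho */
    logL = -n/2*log(s2) + 0.5*log(det(Ai));  /* profile logL, constants dropped */
    if logL > best then do;
      best = logL; rhohat = rho; betahat = b;
    end;
  end;
  print rhohat betahat;
quit;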

9.6.4 Choosing the Neighborhood Structure


Choosing the neighborhood structure, and thereby the weight matrix W, is important in lattice models but often carried out in an ad hoc manner without clear guidelines. The specification of W represents a priori knowledge of the range and intensity of a spatial effect for a set of areal units constituting a geographical system. Forms used in the specification of W include binary contiguity matrices (rook, queen, and bishop definitions), row-standardized forms, length of common boundary, and intercentroid distances, among others. Of interest are the following three questions:
• Does the choice of W make any practical difference in the statistical analysis of spatial lattice data?
• In what ways does the misspecification of a geographic weight matrix influence statistical analysis?
• Are there rule-of-thumb directions to guide the specification of W for a given spatial landscape?



Griffith (1996) has addressed these questions and developed some guidelines and rules of thumb. We repeat the main results. Assume that the true model is Z(s) = Xβ + e(s) with e(s) ~ G(0, σ²V), but the model Z(s) = Xβ + e*(s) with e*(s) ~ G(0, σ²(I − ρW)^(−1)) is fit. That is, instead of V we are using the matrix A = (I − ρW)^(−1).

It is easy to show that under this misspecification the generalized least squares estimator

    β̂_A = (X′A^(−1)X)^(−1) X′A^(−1)Z(s)

is unbiased, but the residual-based estimator

    σ̂² = (Z(s) − Xβ̂)′ A^(−1) (Z(s) − Xβ̂) / (n − r(X))

is a biased estimator of σ².

Misspecification of the geographic weight matrix tends to suppress statistical efficiency. If β̂_A are the estimates obtained under misspecification (using A instead of V), then Var[c′β̂_A] ≥ Var[c′β̂_V]. Moderate to strong positive autocorrelation that is ignored nearly destroys the efficiency of the ordinary least squares estimator.
Griffith (1996) recommends the following:
1. It is better to posit some reasonable geographic weight matrix than to assume independence.
2. A relatively large number of areal units should be employed, at least 60.
3. Lower-order spatial models should be given preference over those of higher order.
4. It is better to employ a somewhat underspecified than a somewhat overspecified geographic weight matrix, as long as W ≠ 0.

The upshot is that first-order neighbor connectivity definitions are usually sufficient on regular lattices, and little is gained by extending to second-order definitions. On irregular lattices complex neighborhood definitions are often not supported, and large data sets are necessary to distinguish between competing specifications of the W matrix. Less is more.

9.7 Analyzing Mapped Spatial Point Patterns

9.7.1 Introduction
The preceding discussion considered the random field {Z(s): s ∈ D ⊂ R^2} where D was a fixed set. D was discrete in the case of lattice data and continuous in the case of geostatistical data. A spatial point pattern (SPP) is a random field where D is a random set of locations at which certain events of interest occurred. Unless Z(s) is itself a random variable — a situation we exclude from the discussion here (see Cressie 1993, Ch. 8.7, on marked point processes) — the focal point of statistical inquiry is the random set D itself. What kinds of statistical questions may be associated with studying the set of locations D? For example, one may ask whether the distribution of events in space is completely random or whether events appear more clustered or more regular than is expected under completely random placement. If events have a tendency to group together in space, we may wish to examine what kind of stochastic model can adequately describe the process; that is, we seek a model which can serve as the data-generating mechanism for the observed point pattern. Figure 9.28 shows the locations of 514 maple trees in the Lansing Woods of Clinton County, Michigan. Do these points appear to be placed by a mechanism that arranges tree locations independently and uniformly throughout the study region?
[Figure: scatter plot of tree locations on the unit square]

Figure 9.28. Location of 514 maple trees in Lansing Woods, Clinton County, Michigan. Data described by Gerrard (1969), appear in Diggle (1983), and are included in S+SpatialStats®.

The events (locations of trees) graphed in Figure 9.28 represent a mapped point pattern, in which all events within the study region have been located. A sampled point pattern, on the other hand, is one in which a finite number of sample points is selected. At each point one collects either an area sample, by counting the number of events in a sampling area around the point, or a distance sample, by recording the distances between the sample point and nearby events. This chapter is concerned only with the analysis of mapped patterns. Diggle (1983) is an excellent reference for the analysis of sampled patterns.

A data set containing the results of mapping a spatial point pattern is deceivingly simple. It may contain only the longitude and latitude of the recorded events. Answering even such a simple question as whether the points are distributed at random requires tools, however, that are quite different from what the reader has been exposed to so far. For example, little rigorous theory is available to derive the distribution of even simple test statistics, and testing hypotheses in spatial point patterns relies heavily on Monte Carlo (computer simulation) methods. It is our opinion that point pattern data are collected quite frequently in agronomic studies but rarely recognized and analyzed as such. This chapter is a brief introduction to spatial point pattern analysis. The interested reader is encouraged to further the limited discussion we provide with resources such as Ripley (1981), Diggle (1983), Ripley (1988), and our §A9.9.10 to §A9.9.13.



9.7.2 Random, Aggregated, and Regular Patterns — the Notion of Complete Spatial Randomness
When comparing a set of treatments in an analysis of variance, the global hypothesis addressed first is that there are no differences in mean response among the treatments. In many applications rejection of this global hypothesis is not a big surprise, and the analyst proceeds with post-ANOVA procedures (contrasts, multiple comparison procedures, etc.) to shed light on how exactly treatments differ in eliciting a response. If the global hypothesis cannot be rejected, however, there is little (or no) incentive to proceed further. The global hypothesis for spatial point patterns akin to this initial inquiry in the analysis of variance is whether the events are distributed completely at random. If this hypothesis cannot be rejected, there likewise is little incentive for further inquiry.
Complete spatial randomness (CSR) of events implies that events are uniformly distributed and that events are distributed independently of each other. Uniformity means that the expected number of events per unit area is the same throughout the region. Events exhibit no tendency to occupy particular regions in space. Formally, uniformity of events — or the lack thereof — is expressed through the (first-order) intensity function of the SPP. Let N(A) denote the number of events that are observed in a region A. The first-order intensity function λ(s) is defined as

    λ(s) = lim_{|ds| → 0} E[N(ds)] / |ds|.    [9.62]

In [9.62] ds is an infinitesimal region (a small disk centered at location s) and |ds| is its area (volume). As the radius of the disk shrinks toward zero, the expected number of events in this area goes to zero, but so does the area |ds|. The function λ(s) obtained in the limit is the first-order intensity function of the spatial point process. Once λ(s) is known, the expected number of events in a region A can be determined by integrating the first-order intensity,

    E[N(A)] = μ(A) = ∫_A λ(s) ds.    [9.63]

Uniformity of events, one of the conditions of complete spatial randomness, implies that λ(s) = λ. The average number of events per unit area (λ(s)) does not depend on spatial location; it is the same everywhere. A point process with this property is termed homogeneous or first-order stationary. The expected number of events in A is then simply λ|A|, the (constant) expected number of events per unit area times the area. It is now seen that the assumption of a constant first-order intensity is the SPP equivalent of the assumption for geostatistical and lattice processes that the mean E[Z(s)] is constant. For the latter data types constancy of the mean does not imply the absence of spatial autocorrelation. By the same token, spatial point processes where λ(s) is independent of location are not necessarily CSR processes.
The first-order intensity conveys no information about the possible interaction of events, just as the means of two random variables tell us nothing about their covariance. The CSR hypothesis requires that, beyond uniformity, the numbers of events in disjoint regions are independent: Cov[N(A), N(B)] = 0 if A ∩ B = ∅. The spatial point process that embodies CSR is the homogeneous Poisson process (HPP). Testing the CSR hypothesis is equivalent to asking whether the observed point pattern could be the realization of an HPP, defined through the following postulates:
(i) the counts in any finite region A have a Poisson distribution with mean λ|A|, where |A| is the area of A and λ is some positive constant;
(ii) counts in disjoint regions are independent;
(iii) given n events in A, the locations s_1, s_2, ..., s_n are a random sample from a uniform distribution on A.

A deviation from CSR implies that points are either not independent or not uniformly distributed. A point pattern is called aggregated or clustered if events separated by short distances occur more frequently than is expected under CSR, and regular if they occur less frequently than in a homogeneous Poisson process (Figure 9.29).
[Figure: three point-pattern panels on the unit square, labeled A: CSR, B: SSI, C: CLU]

Figure 9.29. Realizations of a CSR (A), regular (B), and clustered (C) process. Each pattern has 100 events on the unit square. SSI is the simple sequential inhibition process (Diggle et al. 1976, Diggle 1983), which does not permit events within a minimum distance of other events (see §A9.9.11).

Aggregated patterns are common, and several theories have been developed to explain the formation of clusters in biological applications. One explanation for clustering is through a contagious process in which the presence of one or more organisms increases the probability of other organisms occurring in the same sample. In an aggregated process, contagion is positive, resulting in an excess of events at small distances and fewer events separated by large distances compared to a CSR process. Aggregation has also been explained in terms of reproductive and dispersal mechanisms. Offspring of plants reproducing vegetatively tend to cluster around parent plants. Plants that reproduce by seed may show a degree of aggregation if seeds do not disperse easily (heavy seeds). Site conditions influence the adaptability and relative abundance of species on a given site. If the success of a species varies with respect to site conditions, the spatial distribution of the species will vary from site to site. Plant density will tend to be higher on preferable sites. A regular spatial distribution is one in which events are more evenly distributed than expected under CSR. It has been speculated that regular spatial patterns can occur in nature when there is a high degree of competition for space between individuals (negative contagion). Light sensitivity of trees in older hardwood stands results in a regular distribution of trees that maintains a minimal growing space. The distribution of cell nuclei (or cell centers) in tissue exhibits regularity, since a cell occupies space and cannot be deformed arbitrarily.

The distinction between random, aggregated, and regular patterns is made for convenience; there exists a continuum among the three types of spatial patterns. Also, one should keep in mind that spatial patterns evolve over time and may undergo an evolution that takes the process through clustered, random, and regular stages. The initial distribution of a regenerated oak stand appears clustered due to the limited radius of dispersion of the heavy seeds. Over the years increasing light sensitivity of the species and intertree competition tend to create a regular pattern. Human intervention can alter this evolution.

9.7.3 Testing the CSR Hypothesis in Mapped Point Patterns

The CSR hypothesis asserts that the number of events in any region A with area |A| is a Poisson random variable with mean λ|A| and that, given n events s_i in A, the s_i are an independent random sample from the uniform distribution on A. The implications are that
(a) the intensity does not vary over the region A (homogeneity of the process);
(b) Pr(N(A) = k) = (1/k!) (λ|A|)^k exp{−λ|A|};
(c) there are no interactions among events.

A goodness-of-fit approach to testing for complete spatial randomness is to count events in nonoverlapping subregions and to compare the observed counts against a Poisson distribution with a standard chi-square test. While counting events is simple, this approach has some ambiguity because the user must decide how to divide the total area into nonoverlapping subregions. A second approach to testing for CSR is to measure various types of distances in the observed point pattern. For example, the average distance between an event and its nearest neighbor in a clustered pattern is smaller than the same average distance in a random pattern. If the distribution of the average nearest-neighbor distance under CSR (the null hypothesis) can be determined, a statistical test is possible. Since the sampling distributions of distance-based measures are difficult to ascertain, tests of CSR based on distances often rely on Monte Carlo (simulation) methods.



Quadrat counts
Denote by n the number of events in the observed point pattern, which occupies region A. The study region A is partitioned into m nonoverlapping subregions (quadrats) of equal area, and the number of events is determined in each quadrat. Commonly, A is taken to be square and divided into a k × k regular grid, so that m = k². Let n_i be the event count in grid cell i = 1, ..., m, and notice that n = Σ_{i=1}^{m} n_i. Under CSR the process is homogeneous, the expected number of events in each cell is estimated as n̄ = n/m, and the statistic

    X² = Σ_{i=1}^{m} (n_i − n̄)² / n̄    [9.64]

has an asymptotic χ² distribution with m − 1 degrees of freedom. Significantly small values of X² indicate regularity, and significantly large values of X² indicate aggregation. For the χ² approximation to hold, counts should exceed 4 (that is, be at least 5) in 80% of the quadrats and should be greater than 1 everywhere. This rule can be used to find a reasonable grid size for partitioning A.
The quadrat count statistic is closely related to the index of dispersion, which is a ratio of two variance estimates obtained with and without making any distributional assumptions. Under CSR the number of events in any one of the m subregions is a Poisson random variable whose mean and variance are estimated by n̄. Regardless of the spatial point process that generated the observed data, S² = (m − 1)^(−1) Σ_{i=1}^{m} (n_i − n̄)² estimates the variance of the quadrat counts. The ratio I = S²/n̄ is called the index of dispersion and is related to [9.64] through

    I = S²/n̄ = X²/(m − 1).    [9.65]

If the process is clustered, the quadrat counts will vary more than what is expected under CSR, and I will be large. If the process is regular, the counts will vary less, since all n_i will be similar and similar to the mean count n̄; the index of dispersion will be small. For a CSR process the index will be about one on average.
To test the CSR hypothesis with quadrat counts for the point patterns shown in Figure 9.29, k = 4 bin classes were used to partition the unit square into 4 × 4 = 16 = m quadrats. The resulting quadrat counts follow.

Table 9.6. Quadrat counts for point patterns of Figure 9.29

          A: CSR              B: SSI              C: CLU
        1   2   3   4       1   2   3   4       1   2   3   4
    1   4   5   8   7       7   5   6   6      10   9   9   7
    2   3   7   8   3       6   7   5   8       1  10   1   0
    3  10   5   7   6       7   6   4   6       6   7  13  10
    4   6   6   6   9       7   6   7   7       2  12   3   0

Notice that Σ_{i=1}^{16} n_i = n = 100 for all three patterns and n̄ = 6.25. The even distribution of the counts for the sequential inhibition process (B) and the uneven distribution of the counts for the clustered process (C) are reflected in the sample variances of the quadrat counts: s²_csr = 3.93, s²_ssi = 1.0, and s²_clu = 19.93. The indices of dispersion are I_csr = 0.628, I_ssi = 0.16, and I_clu = 3.19. The CSR hypothesis cannot be rejected for process A and is rejected against the regular alternative for process B and against the clustered alternative for process C (Table 9.7).

Table 9.7. X² statistics for quadrat counts of CSR, SSI, and CLU processes

    Process      X²      Pr(χ²_15 ≤ X²)    Pr(χ²_15 ≥ X²)
    A: CSR       9.44        0.1466            0.8534
    B: SSI       2.40        0.0001            0.9999
    C: CLU      47.84        0.9998            0.0002

Using quadrat counts and the chi-square test is simple, but the method is sensitive to the choice of the subregions (grid size). Statistical tests based on distances between events or between sampling points and events avoid this problem.
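
The computations behind Tables 9.6 and 9.7 are easily verified. A sketch that reproduces X², the index of dispersion, and the tail probabilities for pattern A from its quadrat counts:

proc iml;
  counts = { 4  5  8  7,
             3  7  8  3,
            10  5  7  6,
             6  6  6  9};               /* Table 9.6, pattern A (CSR)  */
  n    = counts[+,+];                   /* 100 events                  */
  m    = nrow(counts)*ncol(counts);     /* 16 quadrats                 */
  nbar = n/m;                           /* 6.25 events per quadrat     */
  X2   = sum((counts-nbar)##2) / nbar;  /* [9.64]: prints 9.44         */
  disp = X2/(m-1);                      /* index of dispersion, [9.65] */
  pLeft  = probchi(X2, m-1);            /* Pr(chi-square_15 <= X2)     */
  pRight = 1 - pLeft;
  print X2 disp pLeft pRight;
quit;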

CSR Tests Based on Distances

Most CSR tests for mapped point patterns utilize the distances between events and between sample points and events rather than quadrat counts. This results in tests for CSR which are slightly more computationally involved but eliminate a subjective element, how to partition the region, from the analysis. Figure 9.30 shows some of the distances that are commonly employed. In mapped patterns we prefer nearest-neighbor distances because of their greater computational efficiency. In a point pattern with n events there are n(n − 1)/2 interevent distances but only n nearest-neighbor distances.

[Figure: point map with event locations and sample locations connected by line segments]

Figure 9.30. Distance measurements used in CSR tests. Solid lines: sample point to nearest event distances. Dashed lines: interevent distances (also called event-to-event distances). Dotted lines: nearest-neighbor distances (also called event-to-nearest-event distances).



To fix ideas, let y_i denote the distance from the event at s_i to the nearest other event and t_ij the Euclidean distance between events at s_i and s_j. The empirical distribution function of the nearest-neighbor distances is calculated as

    Ĝ_1(y) = #(y_i ≤ y) / n    [9.66]

and that of the interevent distances as

    Ĥ_1(t) = 2 #(t_ij ≤ t) / (n(n − 1)).    [9.67]

As a test statistic we may choose the average nearest-neighbor distance ȳ, the average interevent distance t̄, the estimate Ĝ_1(y_0) of the probability that the nearest-neighbor distance is at most y_0, or Ĥ_1(t_0). It is the user's choice which test statistic to use and, in the case of the empirical distribution functions, how to select y_0 and/or t_0. It is important that the test statistic constructed from event distances can be interpreted in the context of testing the CSR hypothesis. Compared to the average nearest-neighbor distance expected under CSR, ȳ will be smaller in a clustered and larger in a regular pattern. If y_0 is chosen small, Ĝ_1(y_0) will be larger than expected under CSR in a clustered pattern and smaller in a regular pattern.
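
As a small illustration of [9.66], the following sketch computes the nearest-neighbor distances, their average ȳ, and Ĝ_1(y_0) for a handful of made-up event locations; y_0 = 0.4 is arbitrary.

proc iml;
  s = {0.10 0.20,
       0.40 0.30,
       0.50 0.90,
       0.80 0.10,
       0.90 0.70};              /* event coordinates (made up)             */
  n = nrow(s);
  D = distance(s);              /* n x n matrix of interevent distances    */
  D = D + diag(j(n,1,1E10));    /* mask the zero self-distances            */
  y = D[,><];                   /* row minima = nearest-neighbor distances */
  y0   = 0.4;
  Ghat = sum(y <= y0)/n;        /* [9.66] evaluated at y0                  */
  ybar = y[:];                  /* average nearest-neighbor distance       */
  print (y`)[label="NN distances"], ybar Ghat;
quit;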
The sampling distributions of distance-based test statistics are usually complicated and elusive, even under the assumption of complete spatial randomness. An exception is the quick test proposed by Ripley and Silverman (1978), which is based on one of the first ordered interevent distances. If t_1 = min{t_ij}, then t_1² has an exponential distribution under CSR and an exact test is possible. Because of possible inaccuracies in determining locations, Ripley and Silverman recommend using the third-smallest interevent distance. The asymptotic chi-squared distribution of these order statistics of interevent distances under CSR is given in their paper. Advances in computing power have made it possible to conduct tests of the CSR hypothesis by Monte Carlo (MC) procedures. Among their many advantages, they yield exact p-values (exact within simulation variability) and accommodate irregularly shaped regions.

Recall that the p-value of a statistical test is the probability of obtaining a more extreme outcome than what was observed if the null hypothesis is true (§1.6). If the p-value is smaller than the user-selected Type I error level, the null hypothesis is rejected. An MC test is based on simulating s − 1 independent sets of data under the null hypothesis and calculating the test statistic for each set. Then the test statistic is obtained from the observed data and combined with the values of the test statistics from simulation to form a set of s values. If the observed value of the test statistic is sufficiently extreme among the s values, the null hypothesis is rejected.
Formally, let u_1 be the value of a statistic U calculated from the observed data and let u_2, ..., u_s be the values of the test statistic generated by independent sampling from the distribution of U under the null hypothesis (H_0). If the null hypothesis is true, we have

    Pr(u_1 = max{u_i, i = 1, ..., s}) = 1/s.

Notice that we consider u_1, the value obtained from the actual (nonsimulated) data, to be part of the sequence of all s values. If we reject H_0 when u_1 ranks kth largest or higher, this is a one-sided test of size k/s. When values of the u_i are tied, one can either randomly sort the u_i's within groups of ties or choose the least extreme rank for u_1. We prefer the latter method because it is more conservative. Studies have shown that for a 5% level test, s = 100 is adequate, whereas s = 500 should be used for 1% level tests (Diggle 1983).
To test a point pattern for CSR with a Monte Carlo test we simulate s − 1 homogeneous point processes with the same number of events as the observed pattern and calculate the test statistic for each simulated as well as for the observed pattern. We prefer to use nearest-neighbor distances and the test statistic ȳ. The SAS® macro %ghatenv() contained in file \SASMacros\NearestNeighbor.sas accomplishes that for point patterns that are bounded by a rectangular region. The statements

%include 'DriveLetterofCDROM:\SASMacros\NearestNeighbor.sas';
%ghatenv(data=maples,xco=x,yco=y,alldist=1,graph=1,sims=20);
proc print data=_ybars; var ybar sim rank; run;

perform a nearest-neighbor analysis for the maple data in Figure 9.28. For exposition we use s − 1 = 20, which should be increased in real applications. The observed pattern has the smallest average nearest-neighbor distance with ȳ_1 = 0.017828 (Output 9.1, sim=0). This yields a p-value for the hypothesis of complete spatial randomness of p = 0.0476 against the clustered alternative and p = 0.9524 against the regular alternative (Output 9.2).

Output 9.1.
Obs ybar sim rank

1 0.017828 0 1
2 0.021152 1 2
3 0.021242 1 3
4 0.021348 1 4
5 0.021498 1 5
6 0.021554 1 6
7 0.021596 1 7
8 0.021608 1 8
9 0.021630 1 9
10 0.021674 1 10
11 0.021860 1 11
12 0.022005 1 12
13 0.022013 1 13
14 0.022046 1 14
15 0.022048 1 15
16 0.022070 1 16
17 0.022153 1 17
18 0.022166 1 18
19 0.022172 1 19
20 0.022245 1 20
21 0.023603 1 21

Output 9.2.
         Test      # of MC            One Sided    One Sided
    Statistic      runs     rank         Left P      Right P

     0.017828        20        1        0.95238     0.047619

Along with the ranking of the observed test statistic it is useful to prepare a graph of the simulation envelopes for the empirical distribution function of the nearest-neighbor distances (called a Ĝ plot). The upper and lower simulation envelopes are defined as

    U(y) = max_{i=2,...,s} {Ĝ_i(y)}  and  L(y) = min_{i=2,...,s} {Ĝ_i(y)}

and are plotted against Ḡ(y), the average empirical distribution function at y from the simulations,

    Ḡ(y) = (1/(s − 1)) Σ_{j=2}^{s} Ĝ_j(y).

This plot is overlaid with the observed Ĝ function (Ĝ_1(y)) as shown in Figure 9.31. Clustering is evidenced by the Ĝ function rising above the 45-degree line that corresponds to the CSR process (dashed line), regularity by a Ĝ function below the dashed line. When the observed Ĝ function crosses the upper or lower simulation envelope, the CSR hypothesis is rejected. For the maple data there is very strong evidence that the distribution of maple trees in the particular area is clustered.

[Figure: step function with simulation envelopes plotted against Gbar(y)]

Figure 9.31. Upper [U(y)] and lower [L(y)] simulation envelopes for 20 simulations and observed empirical distribution function (step function) for the maple data (Figure 9.28). The dashed line represents the Ĝ function for a CSR process.

MC tests require a procedure to simulate the process under the null distribution. Testing a point pattern for CSR requires an efficient method for generating a homogeneous Poisson process. The following algorithm simulates this process on the rectangle with corners (0, 0) and (a, b); a data step version is sketched after the algorithm.
1. Generate a random number n from a Poisson(λab) distribution.
2. Order n independent U(0, a) random variables → X_1 < X_2 < ⋯ < X_n.
3. Generate n independent U(0, b) random variables → Y_1, ..., Y_n.
4. Return (X_1, Y_1), ..., (X_n, Y_n) as the coordinates of the two-dimensional Poisson process on the rectangle.
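
The sketch below casts the algorithm as a data step; λ = 100 on the unit square is illustrative. The ordering in step 2 affects only the order in which events are generated, not the point set itself, and is therefore omitted.

data hpp;
  call streaminit(542);
  lambda = 100; a = 1; b = 1;
  n = rand('Poisson', lambda*a*b);   /* step 1: random number of events    */
  do i = 1 to n;
    x = a*rand('Uniform');           /* steps 2 and 3: uniform coordinates */
    y = b*rand('Uniform');
    output;
  end;
  keep x y;
run;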



This algorithm generates a random number of events in step 1. Typically, simulations are conditioned such that the simulated patterns have the same number of events as the observed pattern (this is called a binomial process). In this case n is determined by counting the events in the observed pattern, and the simulation algorithm consists of steps 2 to 4. If the study region A is not a rectangle but of irregular shape, create a homogeneous Poisson process on a rectangle which encloses the shape of the study region and generate events until n events fall within the shape of interest.

If it is clear that events are not uniformly distributed, testing the observed pattern against CSR is an important first step, but rejection of the CSR hypothesis is not surprising. To test whether the events follow a nonuniform process in which events remain independent, one can test the observed pattern against an inhomogeneous Poisson process where λ(s) is not constant but follows a model specified by the user (note that λ(s) must be bounded and nonnegative). The following algorithm by Lewis and Shedler (1979) simulates an inhomogeneous Poisson process; a data step version is sketched after the algorithm.
1. Simulate a homogeneous Poisson process on A with intensity λ_0 ≥ max{λ(s)} according to the algorithm above → (X_1, Y_1), ..., (X_n, Y_n).
2. Generate uniform U(0, 1) random variables for each event in A that was generated in step 1 → U_1, ..., U_n.
3. If U_i ≤ λ(s_i)/λ_0, retain the event; otherwise discard the event.
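
A sketch of the thinning algorithm, assuming the intensity λ(s) = 200x on the unit square so that λ_0 = 200 bounds λ(s) from above; the intensity model is purely illustrative.

data ipp;
  call streaminit(542);
  lambda0 = 200;
  n = rand('Poisson', lambda0);         /* step 1: HPP with intensity lambda0         */
  do i = 1 to n;
    x = rand('Uniform'); y = rand('Uniform');
    u = rand('Uniform');                /* step 2                                     */
    if u <= 200*x/lambda0 then output;  /* step 3: retain with prob lambda(s)/lambda0 */
  end;
  keep x y;
run;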

9.7.4 Second-Order Properties of Point Patterns


To determine whether events interact (attract or repel each other) it is not sufficient to study the first-order properties of the point pattern. Rejection of the CSR hypothesis may be due to lack of uniformity or lack of independence. With geostatistical data the first-order properties are expressed through the mean function E[Z(s)] = μ(s) and the second-order properties through the semivariogram or covariogram of Z(s). For point patterns the first-order intensity λ(s) takes the place of the mean function, and the second-order properties are expressed through the second-order intensity function

    λ_2(s_1, s_2) = lim_{|ds_1| → 0, |ds_2| → 0} E[N(ds_1)N(ds_2)] / (|ds_1| |ds_2|).    [9.68]

If the point process is second-order stationary, then λ(s) = λ and λ_2(s_1, s_2) = λ_2(s_1 − s_2); the first-order intensity is constant and the second-order intensity depends only on the spatial separation between points. If the process is furthermore isotropic, the second-order intensity does not depend on the direction, only on the distance between pairs of points: λ_2(s_1, s_2) = λ_2(||s_1 − s_2||) = λ_2(h). Notice that any process for which the intensity depends on locations cannot be second-order stationary. The second-order intensity function depends on the expected value of the cross-product of counts in two regions, similar to the covariance between two random variables X and Y, which is a function of their expected cross-product, Cov[X, Y] = E[XY] − E[X]E[Y]. A downside of λ_2(s_1 − s_2) is its lack of physical interpretability. The remedy is to use interpretable measures that are functions of the second-order intensity or to perform the interaction analysis in the spectral domain (§A9.9.12).



Ripley (1976, 1977) proposed studying the interaction among events in a second-order stationary, isotropic point process through a reduced moment function called the K-function. The K-function at distance h is defined through

    λK(h) = E[# of extra events within distance h of an arbitrary event],  h > 0.

K-function analysis for point patterns takes the place of semivariogram analysis for geostatistical data. The assumption of a constant mean there is replaced by the assumption of a constant intensity function here. The K-function has several advantages over the second-order intensity function [9.68]; a small estimation sketch follows the list below.
• Its definition suggests a method of estimating K(h) from the average number of events less than distance h apart (see §A9.9.10).

• K(h) is easy to interpret. In a clustered pattern a given event is likely to be surrounded by events from the same cluster. The number of extra events within small distances h of an event will be large, and so will be K(h). In a regular pattern the number of extra events (and therefore K(h)) will be small for small h.

• K(h) is known for important point process models (§A9.9.11). For the homogeneous Poisson process the expected number of events per unit area is λ, the expected number of extra events within distance h is λπh², and K(h) = πh². If a process is first-order stationary, comparing a data-based estimate K̂(h) against πh² allows testing for interaction among the events. If K(h) > πh² for small distances h, the process is clustered, whereas K(h) < πh² (for small h) indicates regularity.

• The K-function can be obtained from the second-order intensity of a stationary, isotropic process if λ_2(h) is known:

    K(h) = 2πλ^(−2) ∫_0^h x λ_2(x) dx.

Similarly, the second-order intensity can be derived from the K-function.

• If not all events have been mapped and the incompleteness of the data is spatially neutral (events are missing completely at random (MCAR)), the K-function remains an appropriate measure of the second-order properties of the complete process. This is known as the invariance of K(h) to random thinning. If the missing data process is MCAR, the observed pattern is a realization of a point process whose events are a subset of the complete process, generated by retaining or deleting events in a series of mutually independent Bernoulli trials. Random thinning reduces the intensity λ and the expected number of additional events within distance h of s by the same factor. Their ratio, which is K(h), remains unchanged.
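
As promised above, here is a naive estimation sketch (edge corrections, which matter in practice, are deferred to §A9.9.10): count for each event the other events within distance h, average these counts, and divide by λ̂ = n/|A|. For a CSR pattern on the unit square the result should be close to πh².

proc iml;
  call randseed(99);
  n = 100;
  s = j(n,2,0); call randgen(s,'Uniform');  /* CSR pattern on the unit square */
  D = distance(s);                          /* interevent distances           */
  h = 0.1;
  extra  = (sum(D <= h) - n)/n;    /* average # of other events within h   */
  lambda = n/1;                    /* lambda-hat = n/|A| with |A| = 1 here */
  Khat   = extra/lambda;           /* naive K-hat(h), no edge correction   */
  csr    = constant('pi')*h**2;    /* K(h) = pi*h^2 under CSR              */
  print Khat csr;
quit;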

Ripley's K-function is a useful tool for studying the second-order properties of stationary, isotropic spatial processes. Just as the semivariogram does not uniquely describe the stochastic properties of a geostatistical random field, the K-function is not a unique descriptor of a point process. Different point processes can have identical K-functions. Baddeley and Silverman (1984) present interesting examples of this phenomenon. Study of second-order properties without requiring isotropy of the point process is possible through spectral tools and periodogram analysis. Because of its relative complexity we discuss spectral analysis of point patterns in §A9.9.12 for the interested reader, with a supplementary application in §A9.9.13. Details on the estimation of first- and second-order properties from an observed mapped point pattern can be found in §A9.9.10.

9.8 Applications
The preceding sections hopefully have given the reader an appreciation of the unique stature of statistics for spatial data in the research worker's toolbox and a glimpse of the many types of spatial models and spatial analyses. In the preface to his landmark text, Cressie (1993) notes that "this may be the last time spatial Statistics will be squeezed between two covers." History proved him right. Since the publication of the revised edition of Cressie's Statistics for Spatial Data, numerous texts have appeared that deal with primarily one of the three types of spatial data (geostatistical, lattice, point patterns) at a length comparable to this entire volume. The many aspects and methods of this rapidly growing discipline that we have failed to address are not countable. The applications that follow are chosen to expose the reader to some of these topics by way of example.
§9.8.1 reiterates the importance of maintaining the spatial context in data that are georeferenced and the ensuing perils to modeling and data interpretation if this context is overlooked. Global and local versions of Moran's I statistic as a measure of spatial autocorrelation are discussed there. In the analysis of geostatistical data the semivariogram or covariogram plays a central role. Kriging equations depend critically on it, and spatial regression/ANOVA models require information about the spatial dependency to estimate coefficients and treatment effects efficiently. §9.8.2 estimates empirical semivariograms and fits theoretical semivariogram models by least squares, (restricted) maximum likelihood, and composite likelihood. Point and block kriging are illustrated in §9.8.3 with an interesting application concerning the amount of lead and its spatial distribution on a shotgun range. Treatment comparisons in random field models and spatial regression models are examined in §9.8.4 and §9.8.5. Most methods for spatial data analysis we have presented assume that the response variable is continuous. Many applications with georeferenced data involve discrete or non-Gaussian data. Spatial random field models can be viewed as special cases of mixed models. Extensions of generalized linear mixed models (§8) to the spatial context are discussed in §9.8.6, where the Hessian fly data are tackled with a spatially explicit model. Upon closer inspection many spatial data sets belong in the category of lattice data but are often modeled as if they were geostatistical data. Lattice models can be extremely efficient in explaining spatial variation. In §9.8.7 the spatial structure of wheat yields from a uniformity trial is examined with geostatistical and lattice models. It turns out that a simple lattice model explains the spatial structure more efficiently than the geostatistical approaches. The final application, §9.8.8, demonstrates the basic steps in analyzing a mapped point pattern: estimating its first-order intensity and Ripley's K-function, and using Monte Carlo inference based on distances to test the hypothesis of complete spatial randomness. A supplementary application concerning the spectral analysis of point patterns can be found in Appendix A1 (§A9.9.13).
While The SAS® System is our computing environment of choice for statistical analyses, its capabilities for spatial data analysis at the time of this writing did not extend to point patterns and lattice data. The variogram procedure is a powerful tool for estimating empirical semivariograms with the classical (Matheron) and robust (Cressie-Hawkins) estimators. The krige2d procedure performs ordinary kriging (globally and locally). Random field models can be fit with the mixed procedure, provided the spatial correlation structure can be modeled through the procedure's random and/or repeated statements. This procedure can also be used to compute kriging predictions. The kde procedure efficiently estimates univariate and bivariate densities and can be used to estimate the (first-order) intensity of point patterns. The sim2d procedure simulates spatial data sets. For some applications one must either rely on macros and programs tailored to a particular application or draw on specialized packages. In cases not handled by SAS® procedures we used the S+SpatialStats® module of the S-PLUS® program and SAS® macros we developed with similar functionality in mind. The use of these macros is outlined in the applications that follow and on the companion CD-ROM.

9.8.1 Exploratory Tools for Spatial Data — Diagnosing Spatial Autocorrelation with Moran's I
The tools for modeling and analyzing spatial data discussed in this chapter have in common the concept that space matters. The relationships among georeferenced observations are taken into account, be it through estimating stochastic dependencies (semivariogram modeling for geostatistical data and K-function analysis for point patterns), deterministic structure in the mean function (spatial regression), or expressing observations as functions of neighboring values (lattice models). Just as many of the modeling tools for spatial data are quite different from the techniques applied to model independent data, exploratory spatial data analysis requires additional methods beyond those used in the exploratory analysis of independent data. To steer the subsequent analysis in the right direction, exploratory tools for spatial data must allow insight into the spatial structure of the data. Graphical summaries, e.g., stem-and-leaf plots or sample histograms, are pictures of the data, not indications of spatial structure.
[Figure: two 10 × 10 lattice panels, A and B, with dots of varying size]

Figure 9.32. Two simulated lattice arrangements with identical frequency distribution of points. Area of dots is proportional to the magnitude of the values.

The importance of retaining spatial information can be demonstrated with the following example. A rectangular 10 × 10 lattice was filled with 100 observations drawn at random from a G(0, 1) distribution. Lattice A is a completely random assignment of observations to lattice positions. Lattice B is an assignment to positions such that each value is surrounded by values similar in magnitude (Figure 9.32).

Histograms of the 100 observed values that do not take into account spatial position are identical for the two lattices (Figure 9.33). When observed values are plotted against the average value of their nearest neighbors, the differences in the spatial distribution between the two lattices emerge (Figure 9.34). The data in lattice A are not spatially correlated; the data in lattice B are very strongly autocorrelated. We note further that the "density" estimate drawn in Figure 9.33 is not an estimate of the probability distribution of the data. The probability distribution of a random function is given through [9.3]. Even if the histogram calculated by lumping data across spatial locations appears Gaussian, this does not imply that the data are a realization of a Gaussian random field.

[Figure: histogram with overlaid kernel density curve]

Figure 9.33. Histogram of the 100 realizations in lattices A and B along with kernel density estimate. Both lattices produce identical sample frequencies.

[Figure: scatter plot of Z(s) against the average of Z(s+h), ||h|| = 1]

Figure 9.34. Lag-1 plots for lattices A (full circles) and B (open circles) of Figure 9.32. There is no trend between a value and the average value of its immediate neighbors in lattice A but a very strong trend in lattice B.



Distinguishing between the spatial and nonspatial context is also important for outlier detection. An observation that appears unusual in a stem-and-leaf or box-plot is a distributional outlier. A spatial outlier, on the other hand, is an observation that is unusual compared to its surrounding values. A data set can have many more spatial than distributional outliers. One method of diagnosing spatial outliers is to median-polish the data (or to remove the large-scale trends in the data by some other outlier-resistant method) and to look for outlying observations in a box-plot of the median-polished residuals. Lag plots such as Figure 9.34, in which observations are plotted against averages of surrounding values, are also good graphical diagnostics for observations that are unusual spatially.

A first step toward incorporating spatial context in describing a set of spatial data is to calculate descriptive statistics and graphical displays separately for sets of spatial coordinates. This is simple if the data are observed on a rectangular lattice, such as the Mercer and Hall grain yield data (Figure 9.5, p. 569), where calculations can be performed by rows and columns. Row and column box-plots for these data show a cubic (or even higher-order) trend in the column medians but no trend in the row medians (Figure 9.35). This finding can be put to use to detrend the data with a parametric model. Without detrending, semivariogram estimation is possible by considering pairs of data within columns only. The row and column box-plots were calculated in S+SpatialStats® with the statements

bwplot(y~grain,data=wheat,ylab="Row",xlab="Grain Yield")
bwplot(x~grain,data=wheat,ylab="Column",xlab="Grain Yield").

[Figure: side-by-side box-plots of grain yield by row (left, rows 1-20) and by column (right, columns 1-25)]

Figure 9.35. Row (left) and column (right) box-plots for Mercer and Hall grain yield data.

Other graphical displays and numerical summary measures have been developed specifically for spatial data, for example, to describe, diagnose, and test the degree of spatial autocorrelation. With geostatistical data the empirical semivariogram provides an estimate of the spatial structure. With lattice data join-count statistics have been developed for binary and nominal data (see, for example, Moran 1948, Cliff and Ord 1973, and Cliff and Ord 1981). Moran (1950) and Geary (1954) developed autocorrelation coefficients for continuous attributes observed on lattices. These coefficients are known as Moran's I and Geary's c and, like many other autocorrelation measures, compare an estimate of the covariation among the Z(s) to an estimate of their variation. Since the distributions of I and c tend to a Gaussian distribution with increasing sample size, these summary autocorrelation measures can also be used in confirmatory fashion to test the hypothesis of no (global) spatial autocorrelation in the data. In the remainder of this application we introduce Moran's I, estimation and inference based on this statistic, and a localized form of I that Anselin (1995) termed a LISA (local indicator of spatial association).
Let Z(s_i), i = 1, ..., n, denote the attribute Z observed at site s_i and U_i = Z(s_i) − Z̄ its centered version. Since the data are on a lattice, let w_ij denote the neighborhood connectivity weight between sites s_i and s_j, with w_ii = 0. These weights are determined in the same fashion as those for the lattice models in §9.6.1. Moran's I is then defined as

    I = (n / Σ_{i,j} w_ij) Σ_{i=1}^{n} Σ_{j=1}^{n} w_ij u_i u_j / Σ_{i=1}^{n} u_i².    [9.69]

A more compact expression can be obtained in matrix-vector form by putting u = [u_1, ..., u_n]′, W = [w_ij], and 1 = [1, ..., 1]′. Then

    I = (n / (1′W1)) (u′Wu / u′u).    [9.70]

In the absence of spatial autocorrelation I has expected value E[I] = −1/(n − 1); values I > E[I] indicate positive, values I < E[I] negative autocorrelation. It should be noted that the Moran test statistic bears a great resemblance to the Durbin-Watson (DW) test statistic used in linear regression analysis to test for serial dependence among residuals (Durbin and Watson 1950, 1951, 1971). The DW test replaces u with a vector of least squares residuals and considers squared lag-1 serial differences in place of u′Wu. To determine whether a deviation of I from its expectation is statistically significant, one relies on the asymptotic distribution of I, which is Gaussian with mean −1/(n − 1) and variance σ_I². The hypothesis of no spatial autocorrelation is rejected at the α × 100% significance level if

    |Z_obs| = |I − E[I]| / σ_I

is more extreme than the z_{α/2} cutoff of a standard Gaussian distribution. Right-tailed (left-tailed) tests for positive (negative) autocorrelation compare Z_obs to the z_α (z_{1−α}) cutoff.
Two approaches are common for deriving the variance σ_I². One can assume that the Z(s_i) are Gaussian or adopt a randomization framework. In the Gaussian approach the Z(s_i) are assumed G(μ, σ²), so that U_i ~ (0, σ²(1 − 1/n)) under the null hypothesis. In the randomization approach the Z(s_i) are considered fixed and are randomly permuted among the n lattice sites. There are n! equally likely random permutations, and σ_I² is the variance of the n! Moran I values. A detailed derivation of and formulas for the variances under the two assumptions can be found in Cliff and Ord (1981, Ch. 2.3). If one adopts the randomization framework, an empirical p-value for the test of no spatial autocorrelation can be calculated by ranking the observed value of I among the n! − 1 possible remaining permutations. For even medium-sized lattices this is a computationally expensive procedure. The alternatives are to rely on the asymptotic Gaussian distribution to calculate p-values or to compare the observed I against only a random sample of the possible permutations.
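
The last alternative is easy to sketch. The following PROC IML code computes I via [9.70] for a small lattice and ranks it among I values from s − 1 = 199 random permutations of the data; the data are simulated placeholders, so the sketch only illustrates the mechanics (the %MoranI macro below uses the analytical variances instead).

proc iml;
  call randseed(17);
  r = 4; c = 4; n = r*c;
  W = j(n,n,0);                        /* rook contiguity matrix as in §9.6.1 */
  do i = 1 to n;
    do k = 1 to n;
      if abs(ceil(i/c)-ceil(k/c)) + abs(mod(i-1,c)-mod(k-1,c)) = 1
         then W[i,k] = 1;
    end;
  end;
  z = j(n,1,0); call randgen(z,'Normal');   /* placeholder data               */
  u  = z - z[:];                            /* centered attribute             */
  I0 = n/sum(W) * (u`*W*u)/(u`*u);          /* observed I, [9.70]             */
  s = 200; ge = 1;                          /* the observed I counts as well  */
  do iter = 1 to s-1;
    v = j(n,1,0); call randgen(v,'Uniform');
    up = u[rank(v)];                        /* random permutation of the data */
    if n/sum(W)*(up`*W*up)/(up`*up) >= I0 then ge = ge + 1;
  end;
  pval = ge/s;       /* one-sided p-value against positive autocorrelation */
  print I0 pval;
quit;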
The SAS® macro %MoranI (contained on the CD-ROM) calculates the Z_obs statistics and p-values under the Gaussian and randomization assumptions. A data set containing the W matrix is passed to the macro through the w_data option. For rectangular lattices the macro %ContWght (in file \SASMacros\ContiguityWeights.sas) calculates the W matrices for classical neighborhood definitions (see Figure 9.26, p. 633). For the Mercer and Hall grain yield data the statements
%include 'DriveLetterofCDROM:\Data\SAS\MercerWheatYieldData.sas';
%include 'DriveLetterofCDROM:\SASMacros\ContiguityWeights.sas';
%include 'DriveLetterofCDROM:\SASMacros\MoranI.sas';

title1 "Moran's I for Mercer and Hall Wheat Yield Data, Rook's Move";
%ContWght(rows=20,cols=25,move=rook,out=rook);
%MoranI(data=mercer,y=grain,row=row,col=col,w_data=rook);

produce Output 9.3. The observed I value of 0.4066 is clearly greater than the expected value. The standard errors σ_I based on randomization and Gaussianity differ only in the fourth decimal place in this application. In other instances the difference will be more substantial. There is overwhelming evidence that the data exhibit positive autocorrelation. Moran's I is somewhat sensitive to the choice of the neighborhood matrix W. If the rook definition (edges abut) is replaced by the bishop's move (touching corners),
title1 "Moran's I for Mercer and Hall Wheat Yield Data, Bishop's Move";
%ContWght(rows=20,cols=25,move=bishop,out=bishop);
%MoranI(data=mercer,y=grain,row=row,col=col,w_data=bishop);

the autocorrelation remains significant but the value of the test statistic is reduced by about
50% (Output 9.4). Moran's I is even more sensitive to large-scale trends in the data. For a
significant test result based on Moran's I to indicate spatial autocorrelation it is necessary that
the mean of Z(s) is stationary. Otherwise subtracting Z̄ is not the appropriate shifting of the
data that produces zero-mean random variables U_i. In fact, the I test may indicate significant
"autocorrelation" when data are independent but have not been properly detrended.

Output 9.3.
          Moran's I for Mercer and Hall Wheat Yield Data, Rook's Move

    _Type_               I        E[I]     SE[I]      Zobs   Pr(Z > Zobs)
    Randomization   0.4066    -.002004    0.0323   12.6508              0
    Gaussianity     0.4066    -.002004    0.0322   12.6755              0

Output 9.4.
         Moran's I for Mercer and Hall Wheat Yield Data, Bishop's Move

    _Type_                I           E[I]      SE[I]     Zobs   Pr(Z > Zobs)
    Randomization   0.20827    -.002004008   0.032989  6.37388     9.2152E-11
    Gaussianity     0.20827    -.002004008   0.032981  6.37546     9.1209E-11

This spurious autocorrelation effect can be demonstrated by generating independent
observations with a mean structure. On a 10 × 10 lattice we construct data according to the
linear model

\[
Z = 1.4 + 0.1x + 0.2y + 0.002x^{2} + e, \qquad e \sim \text{iid } G(0,1),
\]

where x and y are the lattice coordinates:


data simulate;
do x = 1 to 10; do y = 1 to 10;
z = 1.4 + 0.1*x + 0.2*y + 0.002*x*x + rannor(2334);
output;
end; end;
run;
title1 "Moran's I for independent data with large-scale trend";
%ContWght(rows=10,cols=10,move=rook,out=rook);
%MoranI(data=simulate,y=z,row=x,col=y,w_data=rook);

The test indicates strong positive "autocorrelation," which is an artifact of the changes in
E[Z] rather than stochastic spatial dependency among the sites.

Output 9.5.
          Moran's I for independent data with large-scale trend

    _Type_                I        E[I]      SE[I]     Zobs   Pr(Z > Zobs)
    Randomization   0.39559   -0.010101   0.073681  5.50604       1.835E-8
    Gaussianity     0.39559   -0.010101   0.073104  5.54948      1.4326E-8

If trend contamination distorts inferences about the spatial autocorrelation coefficient,
then it seems reasonable to remove the trend and calculate the autocorrelation coefficient
from the residuals. If ê = Z − Xβ̂_OLS denotes the vector of ordinary least squares residuals,
where Z = [Z(s_1), …, Z(s_n)]′, the I statistic [9.70] is modified to

\[
I^{*} = \frac{n}{\mathbf{1}'\mathbf{W}\mathbf{1}} \;
\frac{\hat{\mathbf{e}}'\mathbf{W}\hat{\mathbf{e}}}{\hat{\mathbf{e}}'\hat{\mathbf{e}}} .
\qquad [9.71]
\]

The mean and variance of [9.71] are not the same as those of [9.70]. For example, the mean
E[I*] now depends on the weights W and on the X matrix. Expressions for E[I*] and Var[I*] are
found in Cliff and Ord (1981, Ch. 8.3). These were coded in the SAS® macro %RegressI(),
contained in file \SASMacros\MoranResiduals.sas. Recall that for the Mercer and Hall grain
yield data the exploratory row and column box-plots indicated possible cubic trends in the
column medians. To check whether there is autocorrelation in these data or whether the
significant I statistic in Output 9.3 was spurious, the following SAS® code is executed.

title1 "Moran's I for Mercer and Hall Wheat Yield Data";


title2 "Calculated for Regression Residuals";
%Include 'DriveLetterofCDROM:\SASMacros\MoranResiduals.sas';
data xmat; set mercer; x1 = col; x2 = col**2; x3 = col**3;
keep x1 x2 x3;
run;
%RegressI(xmat=xmat,data=mercer,z=grain,weight=rook,local=1);

The data set xmat contains the regressor variables excluding the intercept; it should not
contain any additional variables. This code fits a large-scale mean model with cubic column
effects and no row effects (adding higher-order terms for column effects leaves the results
essentially unchanged). The ordinary least squares estimates are calculated and shown by the
macro (Output 9.6). If there is significant autocorrelation in the residuals, however, the
standard errors, t-statistics, and p-values for the parameter estimates are not reliable and
should be disregarded. The value of Z_obs is slightly reduced from 12.67 (Output 9.3) to
10.27, indicating that the column trends did add some spurious autocorrelation. The highly
significant p-value for I* shows that further analysis of these data by classical methods for
independent data is treacherous. Spatial models and techniques must be used in further
inquiry.

Output 9.6.
          Moran's I for Mercer and Hall Wheat Yield Data
               Calculated for Regression Residuals

          Ordinary Least Squares Regression Results

                    OLS Analysis of Variance

                      SS     df        MS        F      Pr>F      R2
    Model        13.8261      3    4.6087  25.1656   355E-17  0.1321
    Error         90.835    496   0.18314        .         .       .
    C.Total      104.661    499         .        .         .       .

                         OLS Estimates
              Estimate     StdErr      Tobs     Pr>|T|
    Intcpt     3.90872    0.08964   43.6042          0
    x1         0.10664    0.02927   3.64256     0.0003
    x2         -0.0121    0.00259   -4.6896    3.54E-6
    x3         0.00032    0.00007   4.82676    1.85E-6

                       Global Moran's I
         I*        E[I*]     SE[I*]      Zobs   Pr > Zobs
    0.32156      -0.0075    0.03202   10.2773           0

The %RegressI() and %MoranI() macros have an optional parameter local=. When set
to 1 (default is local=0) the macros calculate not only the global I (or I*) statistic but also
local versions thereof. The idea of a local indicator of spatial association (LISA) is due to
Anselin (1995). His notion was that although there may be no spatial autocorrelation globally,
there may exist local pockets of positive or negative spatial autocorrelation in the data, so-
called hot-spots. This is only one possible definition of what constitutes a hot-spot. One could
also label as hot-spots sites that exceed (or fall short of) a certain threshold level. Hot-spot
definitions based on autocorrelation measures designate sites as unusual if the spatial depen-
dency is locally much different from that at other sites. The LISA version of Moran's I is

\[
I_i = \frac{n}{\sum_{i=1}^{n} U_i^{2}} \; U_i \sum_{j=1}^{n} w_{ij} U_j ,
\qquad [9.72]
\]

where i indexes the sites in the data set. That is, for each site s_i we calculate an I statistic
based on information from neighboring sites. For a 10 × 10 lattice there are thus a total of
101 I statistics: the global I according to [9.69] or [9.71] and 100 local I statistics according
to [9.72]. The expected value of I_i is E[I_i] = −w_{i+}/(n−1) with w_{i+} = Σ_{j=1}^n w_ij. The interpre-
tation is that if I_i < E[I_i], then sites connected to s_i have attribute values dissimilar from
Z(s_i): a high (low) value at s_i is surrounded by low (high) values. If I_i > E[I_i], sites
connected to s_i show similar values: a high (low) value at Z(s_i) is surrounded by high (low)
values at connected sites.
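The local statistics are equally simple to compute in matrix form. The following SAS/IML®
sketch again assumes data sets uvals and wmat as in the program above (hypothetical names):

proc iml;
   use uvals;  read all var {u} into u;  close uvals;
   use wmat;   read all into W;          close wmat;
   n   = nrow(u);
   Ii  = (n / (u`*u)) # u # (W*u);   /* local Moran statistics, [9.72]           */
   Ei  = -W[ ,+] / (n-1);            /* expected values E[I_i] = -w_i+/(n-1)     */
   hot = (Ii > Ei);                  /* sites with positive local autocorrelation */
   create lisa var {Ii Ei hot};  append;  close lisa;
quit;

The data set lisa can then be merged with the site coordinates to produce a map such as
Figure 9.36.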
The asymptotic Gaussian distribution of I (and I*) makes it tempting to test hypotheses
based on the LISAs to detect local pockets where spatial autocorrelation is significant. We
discourage formal testing procedures based on LISAs. First, there is a serious multiplicity
problem: in a data set with n sites one would be testing n hypotheses, with grave consequenc-
es for Type-I error inflation. Second, the n LISAs are not independent because of the spatial
dependence and the shared data points among the LISAs. We prefer a graphical examination
of the local I statistics over formal significance tests. The map of LISAs indicates which
regions of the domain behave differently from the rest and will hopefully identify a spatially
variable explanatory variable that can be used in the analysis to adjust for large-scale trends.
For the detrended Mercer and Hall grain yield data Figure 9.36 shows sites with positive
LISAs. Hot-spots where autocorrelation is locally much greater than for the remainder of the
lattice are clearly recognizable (e.g., row 18, col 16).

[Figure 9.36 appears here: map of the 20 × 25 field (Row 1–20, Column 1–25) marking
lattice sites with positive LISAs.]

Figure 9.36. Local indicators of positive spatial autocorrelation (I_i > E[I_i]) calculated from
regression residuals after removing column trends.

9.8.2 Modeling the Semivariogram of Soil Carbon


In this application we demonstrate the basic steps in modeling the semivariogram of geosta-
tistical data. The data for this application were kindly provided by Dr. Thomas G. Mueller,
Department of Agronomy, University of Kentucky. An agricultural field had been in no-till
production for more than ten years in a corn-soybean rotation. Along strips, total soil carbon
percentage was determined at two hundred sampling sites (Figure 9.37). Eventually the inter-
mediate strips were chisel-plowed, but in this application we concentrate on the no-tillage
areas only. A comparison of C/N ratios for chisel-plowed versus no-till treatments can be
found in §9.8.4 and a spatial regression application in §9.8.5.
[Figure 9.37 appears here: map of the sampling sites; X-Coordinate (ft) 0–500,
Y-Coordinate (ft) 0–300.]

Figure 9.37. Total soil carbon data. Size of dots is proportional to soil carbon percentage.
Data kindly provided by Dr. Thomas G. Mueller, Department of Agronomy, University of
Kentucky. Used with permission.
[Figure 9.38 appears here: three panels plotted against lag distance (ft); the semivariogram
cloud (top), box-plots of the halved squared differences (middle), and box-plots of the
square-root differences (bottom).]

Figure 9.38. Semivariogram cloud (upper panel) and box-plots of halved squared differences
(middle panel) for median-polished residuals of total soil carbon percentage. The bottom
panel shows box-plots of √|Z(s_i) − Z(s_j)|. Lag distance in feet.
One of the exploratory tools for examining the second-order properties of geostatistical data
is the semivariogram cloud of the squared differences {Z(s_i) − Z(s_j)}² (Chauvet 1982;
Figure 9.38, upper panel). Because a large number of points share a certain lag distance, this
plot is often too busy. A further reduction is possible by calculating summaries such as box-
plots of the halved squared differences 0.5{Z(s_i) − Z(s_j)}² or box-plots of the square-root
differences √|Z(s_i) − Z(s_j)|. These summaries are shown in the lower panels of Figure 9.38,
and it is obvious that the spatial structure can be discerned more easily from the graph of
√|Z(s_i) − Z(s_j)| than from that of 0.5{Z(s_i) − Z(s_j)}². The former is more robust to extreme
observations, which create large deviations that appear as outliers in the box-plots of the
0.5{Z(s_i) − Z(s_j)}². Figure 9.38 was produced in S+SpatialStats® with the statements
par(col=1,mfrow=c(3,1))
vcloud1 <- variogram.cloud(TC ~ loc(x,y),data=notill)
plot(vcloud1,xlab="lag distance",ylab="Halved Squared Diff.",col=1,cex=0.3)
boxplot(vcloud1,mean=T,pch.mean="o",xlab="lag distance",
ylab="Halved Squared Diff.")
vcloud2 <- variogram.cloud(TCN ~ loc(x,y),data=notill,
fun=function(zi,zj) sqrt(abs(zi-zj)))
boxplot(vcloud2,mean=T,pch.mean="o",xlab="lag distance",ylab="Sq. Root Diff.")

There appears to be spatial structure in these data; the medians increase for small lag dis-
tances. The assumption of second-order stationarity is not unreasonable, as the medians re-
main relatively constant for larger lag distances. The square-root difference plot is not an esti-
mate of the semivariogram. The square-root differences are, however, the basic ingredients of
the robust Cressie and Hawkins semivariogram estimator, and the halved squared differences
are the basic elements of the Matheron estimator.
proc variogram data=NoTillData outvar=svar1;
compute lagdistance=10 maxlags=40 robust;
coordinates xcoord=x ycoord=y;
var TC;
run;
proc print data=svar1; run;

proc variogram data=NoTillData outvar=svar2;


compute lagdistance=7 maxlags=29 robust;
coordinates xcoord=x ycoord=y;
var TC;
run;

The first call to proc variogram calculates the estimator for forty lags of width 10; the
second call calculates the semivariogram for twenty-nine lags of width 7. Thus, the two semi-
variograms extend to 400 and 200 feet, respectively (Figure 9.39). The number of pairs
in a particular lag class is stored as variable count in the output data set of the variogram pro-
cedure (Output 9.7). The first observation, corresponding to LAG=-1, lists the number of obser-
vations, their sample mean (AVERAGE=0.83672), and their sample variance (COVAR=0.025998).
The average distance among data pairs in the first lag class was 8.416 feet; the classical semi-
variogram estimate at that distance was 0.006533 and the robust semivariogram estimate was
0.005266. The estimate of the covariance function at this lag is 0.023649. Recall that (i) the
estimate of the covariogram is biased and (ii) SAS® reports the semivariogram estimates
although the columns are labeled VARIOG and RVARIO. The number of pairs at each lag distance
is sufficient to produce reliable estimates. The usual recommendation is to have at least
30 (50) observations in each lag class; it is our experience that occasionally not even 100
pairs provide reliable estimates of the semivariogram.

Output 9.7.
Obs LAG COUNT DISTANCE AVERAGE VARIOG COVAR RVARIO

1 -1 200 . 0.83672 . 0.025998 .


2 0 0 . . . . .
3 1 157 8.416 0.81832 0.006533 0.023649 0.005266
4 2 201 17.836 0.82296 0.006012 0.012518 0.004956
5 3 224 28.631 0.83011 0.009020 0.011840 0.008111
6 4 196 38.941 0.83845 0.012929 0.012672 0.010591
7 5 337 49.221 0.83842 0.020552 0.007283 0.017461
8 6 422 59.246 0.85073 0.022993 0.004784 0.022362
9 7 417 69.361 0.84645 0.018293 0.006607 0.016627
10 8 397 78.999 0.84154 0.019384 0.002179 0.017076
11 9 474 89.184 0.83732 0.019550 0.000790 0.017698
12 10 550 99.689 0.84283 0.020342 0.000327 0.016309
13 11 610 109.367 0.84089 0.021594 -0.001003 0.018939
14 12 550 119.269 0.83355 0.021640 -0.000607 0.019228
15 13 506 129.474 0.82409 0.020955 0.001440 0.019234
16 14 583 139.566 0.83470 0.021357 -0.000521 0.019269
17 15 735 149.374 0.83627 0.021699 0.001650 0.020352
18 16 786 159.507 0.82803 0.019115 0.001175 0.018162
19 17 802 169.781 0.82467 0.019607 0.001251 0.019202
20 18 626 179.443 0.82070 0.018929 0.001839 0.018636
21 19 586 189.454 0.82877 0.017413 0.000579 0.017180
22 20 683 199.384 0.84440 0.021483 0.001346 0.021308
23 21 721 209.227 0.84837 0.027410 -0.001529 0.029724
24 22 733 219.454 0.84424 0.027304 -0.002679 0.029259
25 23 690 229.483 0.83681 0.027045 -0.002245 0.029288
26 24 579 239.428 0.84117 0.026661 -0.001388 0.026804
27 25 583 249.804 0.83451 0.027932 -0.001344 0.025045
28 26 544 259.420 0.83707 0.045123 -0.009930 0.036596
29 27 543 269.434 0.83346 0.035870 -0.004092 0.030603
30 28 479 279.479 0.84181 0.028869 -0.002236 0.025001
31 29 463 289.598 0.83144 0.030523 -0.002682 0.025000
32 30 485 299.997 0.83932 0.028723 0.001142 0.027325
33 31 441 309.639 0.84391 0.031584 -0.001255 0.026534
34 32 405 319.382 0.82694 0.031652 -0.004144 0.028547
35 33 349 329.702 0.83655 0.036750 -0.003053 0.032942
36 34 368 339.701 0.84825 0.035579 -0.002544 0.038313
37 35 230 349.578 0.84224 0.030564 -0.001738 0.030180
38 36 222 359.770 0.84618 0.039159 -0.006564 0.038729
39 37 230 370.194 0.83104 0.033529 -0.007664 0.032587
40 38 162 379.691 0.83849 0.027808 -0.002152 0.022954
41 39 219 389.352 0.84015 0.035187 -0.005041 0.031494
42 40 256 400.117 0.83659 0.032328 -0.003964 0.025213

The empirical semivariogram in the upper panel of Figure 9.39 shows an interesting rise
at lag 200 ft. The number of pairs in lag classes 18–22 is sufficient to obtain a reliable esti-
mate, so sparseness of observations cannot be the explanation. The semivariogram
appears to have a sill around 0.02 for lags less than 200 feet and a sill of 0.03 for lags greater
than 200 feet. Possible explanations are nonstationarity in the mean and/or a nested sto-
chastic process; the spatial (stochastic) variability may consist of two smooth-scale processes
that differ in their range. Whether this feature of the semivariogram is important depends on
the intended use of the semivariogram. If the purpose is spatial prediction of soil car-
bon at unobserved locations and kriging is performed within a local neighborhood of 150
feet, say, it is important to capture the spatial structure over that range, since data points more
distant than 150 feet from the prediction location are assigned zero weight. The long-range
features of the process are then of lesser importance. If the purpose of semivariogram mod-
eling is to partition the spatial process into sub-processes whose sill and range can be linked
to physical or biological features, or if one performs kriging with a larger kriging radius, then
a nested semivariogram model or a nonparametric fit is advised. Large-scale changes in the
mean carbon percentages will be revisited in §9.8.5.

[Figure 9.39 appears here: empirical semivariances (Matheron estimator and Cressie and
Hawkins robust estimator) plotted against lag distance (ft); upper panel max lag = 400,
lower panel max lag = 200.]

Figure 9.39. Classical and robust empirical semivariogram estimates for soil carbon percen-
tage. Upper panel shows semivariances up to ||h|| = 400 ft, lower panel up to ||h|| = 200 ft.

Another important feature of the semivariogram in the upper panel of Figure 9.39 is its in-
creasingly erratic nature for lags in excess of 350 feet. There are fewer pairs in the large
distance classes and the (approximate) variance of the empirical semivariogram increases
with the square of γ(h) (see [9.13], p. 590). Finally, we note that the robust and classical
semivariograms differ little. The robust semivariogram is slightly downward-biased but traces
the profile of the classical estimator closely. This suggests that these data are not afflicted
with outlying observations. In the case of outliers, the classical estimator often appears
shifted upward from the robust estimator by a considerable amount.
In the remainder of this application we fit theoretical semivariogram models to the two
semivariograms in Figure 9.39 by ordinary and weighted nonlinear least squares, by
(restricted) maximum likelihood, and by composite likelihood. Recall that maximum and re-
stricted maximum likelihood estimation operate on the raw data, not pairwise squared differ-
ences, so that one cannot restrict the lag distance. With composite likelihood estimation this
is possible (see §9.2.4 for a comparison of the estimation methods). The semivariogram
models investigated are the exponential, spherical, and gaussian models (§9.2.2). We need to
decide which semivariogram model best describes the stochastic dependency in the data and
whether the model includes a nugget effect or not. The ostensibly simple question about the
presence of a nugget effect is not as simple to answer in practice. We illustrate with the fit of
the exponential semivariogram to the data in the lower panel of Figure 9.39. The SAS® state-
ments
proc nlin data=svar2 noitprint nohalve;
parameters nugget=0.01 sill=0.02 range=80;
bounds nugget > 0;
semivariogram = nugget + (sill-nugget)*(1-exp(-3*distance/range));
_weight_ = 0.5*count/(semivariogram**2);
model variog = semivariogram;
run;

fit the semivariogram by weighted nonlinear least squares. The bounds statement ensures that
the estimate of the nugget parameter is positive. Notice that the term (sill-nugget) is the
partial sill. The asymptotic confidence interval for the nugget parameter includes zero, and
based on this fact one might be inclined to conclude that a nugget is not needed in this particu-
lar model (Output 9.8). The confidence intervals are based on the asymptotic estimated stan-
dard errors, which are suspect. First, the data points are correlated and the weighted least
squares fit takes into account only the heteroscedasticity among the empirical semivariogram
values (and that only approximately), not their correlation. Second, the standard errors de-
pend on the number of data points, which is the result of a user-driven grouping into lag
classes. Because the standard errors are not reliable, the confidence interval should not be
trusted. A sum of squares reduction test comparing the fit of a full and a reduced model is
also not a viable alternative: the weights depend on the semivariogram being fitted, and
changing the model (i.e., dropping the nugget) changes the weights. The no-nugget model can
be fit by simply removing the nugget from the parameters statement in the previous code and
fixing its value at zero:
proc nlin data=svar2 noitprint nohalve;
parameters sill=0.02 range=80;
nugget = 0;
semivariogram = nugget + (sill-nugget)*(1-exp(-3*distance/range));
_weight_ = 0.5*count/(semivariogram**2);
model variog = semivariogram;
run;
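For orientation, the _weight_ assignment in these programs implements (approximately) the
usual weighted least squares criterion for semivariogram fitting: the parameters θ of the
model γ(h; θ) are chosen to minimize

\[
\sum_{j} \frac{N(h_j)}{2\,\gamma(h_j;\boldsymbol{\theta})^{2}}\,
\bigl\{\hat{\gamma}(h_j) - \gamma(h_j;\boldsymbol{\theta})\bigr\}^{2},
\]

where N(h_j) is the number of pairs in the jth lag class (the count variable). Because the
weights involve the fitted model itself, they change whenever the model changes.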

Sums of squares from two fits with different weights are therefore not comparable (compare
Output 9.8 and Output 9.9). For example, the corrected total sums of squares are 734.3 and
772.7, respectively.

Output 9.8. (abridged)


The NLIN Procedure

Sum of Mean Approx


Source DF Squares Square F Value Pr > F
Regression 3 5012.0 1670.7 117.61 <.0001
Residual 26 73.0856 2.8110
Uncorrected Total 29 5085.1
Corrected Total 28 734.3

Approx Approximate 95% Confidence


Parameter Estimate Std Error Limits
nugget 0.000563 0.00169 -0.00291 0.00403
sill 0.0210 0.000680 0.0196 0.0224
range 96.2686 19.7888 55.5926 136.9



Output 9.9. (abridged)
The NLIN Procedure

NOTE: Convergence criterion met.

Sum of Mean Approx


Source DF Squares Square F Value Pr > F
Regression 2 5012.0 2506.0 921.04 <.0001
Residual 27 73.4622 2.7208
Uncorrected Total 29 5085.5
Corrected Total 28 772.7

Approx Approximate 95% Confidence


Parameter Estimate Std Error Limits
sill 0.0209 0.000607 0.0197 0.0221
range 91.4617 11.4371 67.9950 114.9

To settle the issue of whether to include a nugget effect with a formal test, we can call
upon the likelihood ratio principle. Provided we are willing to assume a
Gaussian random field, the nugget and no-nugget models can be fit with proc mixed of The
SAS® System.
/* nugget model */
proc mixed data=NoTillData method=ml noprofile;
model tc = ;
repeated / subject=intercept type=sp(exp)(x y) local;
parms /* partial sill */ ( 0.025 )
/* range */ ( 32 )
/* nugget */ ( 0.005 );
run;

/* no-nugget model */
proc mixed data=NoTillData method=ml;
model tc = ;
repeated / subject=intercept type=sp(exp)(x y);
parms /* range */ ( 32 )
/* sill */ ( 0.025 ) ;
run;

To fit a model with nugget effect, the local option is added to the repeated statement
and the noprofile option to the proc mixed statement. The latter is necessary to
prevent proc mixed from estimating an extra scale parameter that it would profile out of the
likelihood. The parms statement lists starting values for the covariance parameters. In the no-
nugget model the local and noprofile options are removed. Also notice that the order of the
covariance parameters in the parms statement changes between the nugget and no-nugget
models. The correct order in which to enter starting values in the parms statement can be
gleaned from the Covariance Parameter Estimates table of the proc mixed output. The
subject= option of the repeated statement informs the procedure which observations are
considered correlated in the data; observations with different values of the subject variable
are considered independent. By specifying subject=intercept the variable identifying the
clusters in the data is a column of ones, and the spatial data are treated as a single
cluster of size n (see §2.6 on the progression of clustering from independent to spatial data).

The converged parameter estimates in the full model (containing a nugget effect) are par-
tial sill = 0.02299, nugget = 0.003557, and 75.0588 for the range parameter. Since proc
mixed parameterizes the exponential model so that the correlation at lag distance ||h|| is
exp{−||h||/α}, the estimated practical range is 3α̂ = 3 × 75.0588 ≈ 225.18 feet, considerably
larger than the estimates from the nonlinear least squares fit. This is not too surprising: the
maximum likelihood fit is based on all the data, whereas the least squares fit used only data
pairs with lags up to 200 feet. If the exponential semivariogram is fit to the data in the upper
panel of Figure 9.39, the estimate of the practical range is 840.1 feet.
Minus twice the log likelihood for the full model (containing a nugget effect) and the re-
duced model without nugget effect are −331.9 and −315.6, respectively (Outputs 9.10 and
9.11). The likelihood ratio statistic Λ = 331.9 − 315.6 = 16.3 has p-value p = 0.00005.
Inclusion of the nugget effect provides a model improvement.
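The two models differ by a single covariance parameter (the nugget), so Λ is referred to a
chi-squared distribution with one degree of freedom. The p-value is quickly verified in a data
step:

data _null_;
   lambda = 331.9 - 315.6;        /* likelihood ratio statistic        */
   p = 1 - probchi(lambda, 1);    /* chi-squared tail area with 1 df   */
   put lambda= p=;
run;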

Output 9.10. (abridged) The Mixed Procedure

Model Information
Data Set WORK.NOTILLDATA
Dependent Variable TC
Covariance Structures Spatial Exponential, Local Exponential
Subject Effect Intercept
Estimation Method ML
Residual Variance Method None
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Between-Within

Covariance Parameter Estimates

Cov Parm Subject Estimate


Variance Intercept 0.02299
SP(EXP) Intercept 75.0588
Residual 0.003557

Fit Statistics
-2 Log Likelihood -331.9
AIC (smaller is better) -323.9
AICC (smaller is better) -323.7
BIC (smaller is better) -310.7

Output 9.11. (abridged) The Mixed Procedure

Model Information
Data Set WORK.NOTILLDATA
Dependent Variable TC
Covariance Structure Spatial Exponential
Subject Effect Intercept
Estimation Method ML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Between-Within

Covariance Parameter Estimates

Cov Parm Subject Estimate


SP(EXP) Intercept 29.5734
Residual 0.02402

Fit Statistics
-2 Log Likelihood -315.6
AIC (smaller is better) -309.6
AICC (smaller is better) -309.5
BIC (smaller is better) -299.7



Results of the ordinary and weighted least squares fits for the various semivariogram
models are summarized in Table 9.8 and the results for restricted/maximum likelihood in
Table 9.9. Missing entries in these tables indicate that the particular model did not converge
or that a boundary constraint was violated (nugget estimate less than zero, for example).

Table 9.8. Results of least squares fits. θ₀, σ², and α denote the nugget, sill, and (practical)
range, respectively. SSR is the residual sum of squares in the ordinary least squares fit.
Data = 200 refers to the lower panel of Figure 9.39 with lags restricted to ≤ 200 feet

                                     WLS                        OLS
 Data  Model        Nugget    θ₀      σ²       α       θ₀      σ²       α      SSR†
 200   Exponential  No               0.021    91.46            0.021    94.99    149
                    Yes     0.0005   0.021    96.27     —       —        —        —
       Gaussian     No               0.020    29.64            0.020    58.09    123
                    Yes     0.004    0.021    69.07   0.003    0.020    65.04    102
       Spherical    No               0.020    66.09            0.020    72.23    108
                    Yes     0.002    0.020    79.11     —       —        —        —
 400   Exponential  No               0.028    202.3            0.034    353.9    907
                    Yes     0.009    0.042    840.1   0.008    0.044    867.8    737
       Gaussian     No               0.025    36.88            0.028    100.5   1570
                    Yes     0.006    0.027    118.8   0.013    0.035    377.0    799
       Spherical    No               0.026    97.25            0.032    290.0   1210
                    Yes     0.011    0.038    534.7   0.009    0.034    411.2    735

 † ×10⁴

Based on the ordinary least squares fit one would select a gaussian model with nugget
effect for the data in the lower panel of Figure 9.39 and a spherical semivariogram with
nugget effect for the data in the upper panel. The Pseudo-R² measures for these models are
0.87 and 0.75, respectively. While the nugget and sill estimates are fairly stable, it is note-
worthy that the range estimates can vary widely among different models. Since the range
essentially determines the strength of the spatial dependency, kriging predictions from spatial
processes that differ greatly in their range can be very different. For the maxlag 400 data, for
example, the exponential and spherical models with nugget effects have very similar residual
sums of squares (737 and 735), but the range of the exponential model is more than twice that
of the spherical model (OLS results). The lesser spatial continuity of the exponential model
can be more than offset by a doubling of the range. The weighted least squares estimates of
the range appear particularly unstable. An indication of a well-fitting model is good agree-
ment between the ordinary (unweighted) and weighted least squares estimates. On these
grounds the gaussian no-nugget models and the spherical no-nugget model for the second data
set can be ruled out. The fitted semivariograms for the gaussian and spherical nugget models
are shown in Figure 9.40.

Based on the (restricted) maximum likelihood fits, the spherical nugget model emerges as
the best-fitting model (Table 9.9). Nugget and no-nugget models can be compared via likeli-
hood ratio tests, which indicate for the exponential and gaussian models that a nugget effect
is needed. To compare across semivariogram models the AIC criterion is used, since the expo-
nential, gaussian, and spherical models are not nested. This leads to the selection of the
spherical nugget model as the best-fitting model in this group. Its range is considerably less
than that of the corresponding model fitted by least squares. Notice that the REML estimates
are uniformly larger than the ML estimates; REML estimation reduces the negative bias of
maximum likelihood estimators of covariance parameters. Also notice that the estimates of
the range are less erratic than the least squares estimates. We could overlay the fitted
semivariograms in Figure 9.40 with the semivariogram models implied by the maximum
likelihood estimates. This comparison is not fair, however. The least squares methods fit the
model to the scatter of points shown in Figure 9.40, whereas the likelihood methods fit a
covariance function model to the original data. By virtue of least squares no other fitting
method will be closer to the empirical semivariogram. This does not imply that the least
squares estimates are the best estimates of the spatial dependency structure.

[Figure 9.40 appears here: fitted semivariograms overlaid on the empirical semivariances;
upper panel max lag = 400, lower panel max lag = 200; lag distance in ft.]

Figure 9.40. Semivariograms fitted by least squares: spherical nugget semivariogram for max
lag = 400 and gaussian nugget semivariogram for max lag = 200. Ordinary least squares fits
shown.

Table 9.9. Results of maximum and restricted maximum likelihood estimation. θ₀, σ², and α
denote the nugget, sill, and (practical) range, respectively

                               ML                        REML
 Model        Nugget    θ₀      σ²       α       θ₀      σ²       α    −2logL†   AIC‡
 Exponential  No               0.024    88.71           0.025    92.52    −310    −306
              Yes     0.003    0.026    225.3   0.003   0.031    293.2    −328    −322
 Gaussian     No        —       —        —        —       —        —        —       —
              Yes     0.006    0.024    108.0   0.007   0.026    111.5    −323    −317
 Spherical    No               0.050    115.6           0.050    115.9    −296    −292
              Yes     0.004    0.025    133.2   0.004   0.025    134.6    −330    −324

 † negative of twice the restricted maximum log likelihood
 ‡ Akaike's information criterion (smaller-is-better variety)

Finally, we conclude this application by fitting the basic semivariogram models by the
composite likelihood (CL) principle. This principle is situated between genuine maximum
likelihood and least squares and has features in common with both. Like the empirical
Matheron semivariogram estimator, the principle is based on pairwise squared differences. It
does not average the squared differences, however, but fits the semivariogram to a data set of
all n(n−1)/2 pairwise differences. Composite likelihood estimation has in common with the
likelihood principles that it stipulates a distribution for the (Z(s_i) − Z(s_j))², thereby indirectly
specifying a distribution of the raw data Z(s_i). It has in common with least squares estimation
that restricting estimation of the semivariogram parameters to certain lag distances is simple;
the negative impact of erratic squared differences at large lags can thereby be reduced. Fur-
thermore, estimation can be carried out in weighted and unweighted form. The unweighted
composite likelihood estimator is in fact a generalized estimating equation estimate, as we
establish in §A9.9.2. CL estimators are easily obtained with the SAS® macro %cl() contained
in \SASMacros\CL.sas.
%include 'DriveLetterofCDROM\SASMacros\CL.sas';
/* Create a data set with the squared differences */
proc variogram data=NoTillData outpair=pairs;
compute novariogram;
coordinates xcoord=x ycoord=y;
var TC;
run;
/* Call the macro */
%cl(nugget=1,covmod=E,maxrng=200,
nuggetstart=0.005,sillstart=0.03,rangestart=80);

Table 9.10. Results of composite likelihood fits. θ₀, σ², and α denote the nugget, sill, and
(practical) range, respectively. Data = 200 refers to the lower panel of Figure 9.39
with lags restricted to ≤ 200 feet

 Data  Model        Nugget     θ₀       σ²        α      SSRw†
 200   Exponential  No                0.021     84.0    12,010
                    Yes      0.002    0.021     99.3    11,980
       Gaussian     No                0.020     22.1    12,088
                    Yes      0.005    0.020     69.4    11,914
       Spherical    No                0.026    131.3    14,160
                    Yes        —        —        —         —
 400   Exponential  No                0.028    196.8    24,334
                    Yes      0.009    0.048   1078.1    23,361
       Gaussian     No                0.025     29.3    24,793
                    Yes      0.016    0.042    528.5    23,193
       Spherical    No                0.036    275.0    31,401
                    Yes      0.011    0.048    756.0    23,288

 † weighted residual sum of squares
The CL algorithm converged for all but one setting. The estimates are in general close to
the least squares estimates in Table 9.8. In situations where the least squares fit to the empiri-
cal semivariogram did poorly (e.g., the gaussian no-nugget model), so did the composite likeli-
hood fit. The weighted residual sums of squares are not directly comparable among the
models, as the weights depend on the model. Based on their magnitude alone, however, the
spherical model is ruled out for the maxlag 200 data and the no-nugget models are ruled out
for the maxlag 400 data.



9.8.3 Spatial Prediction — Kriging of Lead Concentrations
Recreational shooting ranges are popular as sites to develop and practice sporting firearms
activities and, at the same time, are becoming sites of potential concern because of the
accumulation of metals, in particular, lead. The data analyzed in this application was collected
on a shooting range a few miles west of Blacksburg, Montgomery County, Virginia, operated
by the United States Forest Service in the George Washington-Jefferson National Forest. For
a more detailed account of this study including analysis of a slightly larger data set see Craig
et al. (2002). The range consists of two shooting areas, a rifle range and a shotgun range; our
data pertain to the shotgun range. The range lies in a second-growth mixed hardwood forest
on the Devonian Brallier Formation, composed primarily of deeply weathered black shale.
The range has been in continuous use since 1993, operated for approximately 350 days a
year, and closed periodically for maintenance and general cleaning. The shooting range
consists of an open, gently sloping surface, approximately 62 meters in length by about 65
meters in width, that was cleared in the forest. The shooting box is located near the center of
the range and is apparently used by most shooters. A clay pigeon launching site is situated
approximately seven meters to the right of the shooting box.
Of interest to the investigators were the determination of the area and impact of the shot,
the spatial distribution of lead, and an estimate of the total amount of lead on the shooting
range. The area of interest was determined to be a rectangle 240 meters wide and 300 meters
long with the shooting box located at x = 100, y = 0 (Figure 9.3, p. 568, left panel). It was
believed that much of the shot would occur on the approximately 60 × 60 meter surface that
had been cleared in front of the shooting box. Accordingly, the initial sampling was carried
out at 5-meter intervals along a line extending from x = 100 toward the center of the far edge
of the cleared area. Additional samples were collected on transects emanating from the
central transect at 10- or 20-meter intervals. At each sampling point all material within a
50 × 50 centimeter square was extracted and sieved through a 6-millimeter metal sieve. After
stirring and agitating individually in tap water the samples were transferred to a 14-inch
Garrett Gravity trap gold pan. The recovered metal materials were examined under a
binocular microscope and all extraneous material was removed. Calibration showed that this
procedure had a lead recovery rate of 99.5%. It is recognized that some material could be lost
during recovery efforts at any site and that lead lodged in trees is not recovered. The estimates
of total lead derived in what follows are thus conservative.
This sampling design does not provide even coverage of the shotgun range; most samples
were collected in areas where high lead concentrations were anticipated. An estimate of total
lead based on the sample average is thus positively biased, possibly severely so. If Z(s_i) de-
notes lead in g/m² at sampling site s_i, then Z̄ = 495.21 g/m² and the total load on the range
would be estimated as 35.655 tons. The spatial analysis commences by calculating the
empirical semivariogram of the lead concentrations (Figure 9.41, top panel). There appears to
be little spatial structure in the lead concentrations. Closer examination of the data shows,
however, that the lead concentrations in the cleared area of the shotgun range are
considerably higher than in other areas, creating a right-skewed distribution of lead. Most of
the extreme values in the right tail come from the area close to the shooting box. To achieve
greater symmetry in the data a log transformation is applied. The empirical semivariogram of
the ln{lead} values shows considerably more spatial structure (Figure 9.41, bottom panel).
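The naive total is simply the sample mean scaled by the area of the region of interest:

\[
495.21\ \mathrm{g/m^2} \times (240\ \mathrm{m} \times 300\ \mathrm{m})
= 35{,}655{,}120\ \mathrm{g} \approx 35.655\ \mathrm{tons}.
\]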

[Figure 9.41 appears here: two empirical semivariograms (gamma against distance, 0–150+);
lead concentrations (top) and ln{lead} concentrations (bottom).]

Figure 9.41. Semivariograms of lead (top panel) and ln{lead} (bottom panel) concentrations
before detrending.

Fitting a semivariogram by weighted least squares to the log-lead values in


S+SpatialStats® with the statements
sg.varlglead <- variogram(sg$lglead ~ loc(x,y),data=sg,method="robust",
lag=7,nlag=22)
SvarFit(data=sg.varlglead,type="spherical",weighted=T,
start=list(sill=3.5,range=90))

yields a sill estimate of 4.31 and a range of 128.8 meters. The SvarFit() function was
developed by the authors and is contained on the CD-ROM.
It is likely that even after the transformation the mean of ln{Z(s)} is not stationary.
Removing a response surface in the coordinates by ordinary least squares and fitting a
spherical semivariogram by weighted least squares to the residuals leads to Output 9.12 and
Figure 9.42. The S+SpatialStats® statements producing the output and figure are
sg.lm <- lm(lglead ~ x + y + x*y + x^2,data=sg)
sg.lmres <- sg$lglead - predict(sg.lm)
sg.varlmres <- variogram(sg.lmres ~ loc(x,y),data=sg,method="robust",
lag=7,nlag=22)
SvarFit(data=sg.varlmres,type="spherical",weighted=T,
start=list(sill=1.5,range=60))

Compared to the nondetrended data, both the sill and the range of the semivariogram are
drastically reduced. The reduction in sill reflects the smaller variability of the model
residuals; the reduction in range shows that spurious autocorrelation caused by the large-scale
trend has been removed.



Output 9.12.
Formula: ~ spher.wfunnonug(gamma, distance, range, sill, np)

Parameters:
Value Std. Error t value
sill 2.0647 0.124854 16.53680
range 94.9438 11.400800 8.32783

Residual standard error: 2.45191 on 20 degrees of freedom


Residual sum of squares : 120.2369
[Figure 9.42 appears here: empirical semivariances and fitted spherical semivariogram
against distance lag, 0–150.]

Figure 9.42. Spherical semivariogram fit by weighted least squares to ordinary least squares
residuals.

To incorporate the nonconstant mean in spatial predictions of the lead concentrations we
perform universal kriging with the same trend model as used in modeling the semivariogram
and predict the log-lead concentration on a 240 × 300 grid. To obtain predictions and 95%
prediction intervals on the original scale, the predictions and intervals on the logarithmic scale
are back-transformed. In this process some transformation bias is incurred. On the upside, this
procedure guarantees that all predictions are positive without imposing additional constraints
on the kriging weights. The S+SpatialStats® statements that solve the universal kriging equa-
tions, calculate predictions of log-lead on the grid, and back-transform the results are:
sg.ukrige <- krige(lglead ~ loc(x,y) + x + y + x*y +x^2,
data=sg,covfun=spher.cov,range=94.94377,
sill=2.0469,nugget=0)

grid <- list(x=seq(0,240,4),y=seq(0,300,4))


grid <- expand.grid(grid)
sg.ukp <- predict.krige(sg.ukrige,newdata=grid)

# Confidence intervals and predictions on original scale


sg.ukp$l95 <- sg.ukp$fit - 1.96*sg.ukp$se.fit
sg.ukp$u95 <- sg.ukp$fit + 1.96*sg.ukp$se.fit
sg.ukp$l95 <- exp(sg.ukp$l95)
sg.ukp$u95 <- exp(sg.ukp$u95)
sg.ukp$efit <- exp(sg.ukp$fit)/1000



The surface of predicted values shows two large spikes in front of the shooting box
(Figure 9.43). The anomaly, a departure from a homogeneous distribution, at 25 to 30 meters
results from users mounting targets at this distance. Common targets include clay pigeons,
golf balls, plastic jugs, glass bottles, fruits, and vegetables. Shooting at sofas has been
observed, reports of shooting at computers and toilet bowls are anecdotal, and confirmation
exists in the discovery of numerous damaged keyboard keys.

The anomaly of lead at approximately 80 meters apparently results from the accumula-
tion of lead fired at elevated trajectories in attempts to hit clay pigeons that have been
launched or thrown. This anomaly is wider than the closer one because of the spread of the
shot at a greater distance and because shooters are tracking a moving target across the range
as they are firing. The peak at approximately 80 meters results from the combined effect of
the normal trajectory of much of the shot and from the slowing of some of the shot by leaves
and branches. The smaller anomalies at approximately 180 meters are more difficult to
explain. They may result from a higher trajectory that arcs up over the first line of trees.

Figure 9.43. Exponentiated ln{lead} predictions obtained by universal kriging on the
logarithmic scale.

An estimate of the total lead concentration can be obtained by integrating the surface in
Figure 9.43. A quick estimate is calculated as the average ordinate of the surface, which
yields 161.229 g/m² and a total load of 11.608 tons. Although we believe this estimate of the
total lead concentration to be fairly accurate, two problems remain to be resolved. In estimat-
ing the total on the logarithmic scale and exponentiating the result some transformation bias
is incurred; the total amount of lead will be underestimated. Also, no standard error is avail-
able for this estimate. In order to predict the total amount of lead on the original scale without
bias we need to consider the block total

\[
Z(A) = \int_{A} Z(\mathbf{u})\, d\mathbf{u},
\]

where A is the rectangle (0, 240) × (0, 300). Some details on block-kriging appear in
§A9.9.6. Recall, however, that there appears to be little spatial structure in the raw lead
concentrations (Figure 9.41, top panel). The solution is to model the large-scale trend,
allowing the mean of Z(s) to capture the two large spikes in Figure 9.43; the semivariogram
can then be developed from the residuals of this fit. Alternatively, one can exclude the two
spikes in calculating the empirical semivariogram. With the former procedure it was
determined that a spherical semivariogram with a range of 88.754 m fits the empirical semi-
variogram well. With a properly crafted mean model the block total Z(A) can then be
obtained by universal block-kriging. The estimate of the total amount of lead so obtained is
13.958 tons with a prediction standard error of 1.68 tons. A 95% prediction interval for the
total is thus [10.66 tons, 17.26 tons]. The surface of the universal kriging predictions on the
original scale (Figure 9.44) differs little from the back-transformed predictions on the
logarithmic scale (Figure 9.43).
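The interval is the usual Gaussian prediction interval,

\[
13.958 \pm 1.96 \times 1.68 = 13.958 \pm 3.29
\;\Longrightarrow\; [10.66,\ 17.26]\ \text{tons}.
\]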

Figure 9.44. Predicted surface of lead in kg/m² obtained by universal kriging on the original
scale.

9.8.4 Spatial Random Field Models — Comparing C/N Ratios among Tillage Treatments
When data are collected under different conditions, such as treatments, an obvious question is
whether the conditions differ from each other, and if so, how the differences manifest
themselves. In a classical field experiment, contrasts among the treatment means are
estimated and tested to formulate statements about the differences among and effects of the
experimental conditions. If the data collected under the various conditions are autocorrelated,
then one needs to rethink what precisely is meant by differences among the conditions. We
now return to the soil carbon data first introduced in §9.8.2. After ten years of a corn-soybean
rotation without tillage, intermediate strips of the field were chisel-plowed. Two months after
the soils were first chisel-plowed in the spring, samples from 0 to 2 inch depths were
collected and total N percentage (TN) and total carbon percentage (TC) were determined. The
sampling locations and the strips are shown in Figure 9.45.

Since sampling occurred very soon after tillage, we do not anticipate fundamental changes
in the TC and TN values or the C/N ratio between the two treatments. Because of the spatial
sampling context and the presence of two conditions on the field, however, the data are
perfectly suited to demonstrate the basic manipulations and computations involved in a ran-
dom field analysis that involves treatment structure. We furthermore note from Figure 9.45
that the strips were not randomized; an analysis as a randomized experiment with subsam-
pling of six replications of two treatments is therefore tenuous. Instead, we analyze the data as
a spatial random field with a mean structure given by the two treatment conditions and pos-
sible spatial autocorrelation among the sampling sites.

[Figure 9.45 appears here: map of sampling locations on chisel-plow and no-tillage strips;
X-Coordinate (ft) 0–500, Y-Coordinate (ft) 0–300.]

Figure 9.45. Sampling locations at which total soil N (%) and total soil C (%) were observed
for two tillage treatments. Treatment strips are oriented in the East-West direction.

The target attribute for this application is the C/N ratio. A simplistic pooled t-test
comparing the two tillage treatments leads to a p-value of 0.809, from which one would con-
clude that there are no differences in the average C/N ratios. This test does not account for
spatial autocorrelation, treating the 195 samples on chisel-plow strips and the 200 samples on
no-till strips as independent. Furthermore, it does not convey whether there are differences in
the spatial structure of the treatments. Even if the means are the same, the spatial dependency
might develop differently; this, too, would be a difference between the treatments that should
be recognized by the analyst. Omnidirectional semivariograms were calculated with the
variogram procedure in The SAS® System and spherical semivariogram models were fit to
the empirical semivariograms (Figure 9.46) with proc nlin by weighted least squares:
proc sort data=CNRatio; by tillage; run;
proc variogram data=CNRatio outvar=svar;
   compute lagdistance=13.6 maxlags=19 robust;
   coordinates xcoord=x ycoord=y;
   var cn;
   by tillage;
run;
proc nlin data=fitthis nohalve method=newton noitprint;
parameters sillC=0.093 sillN=0.1414 rangeC=116.6 rangeN=197.2
nugget=0.1982;
if tillage='ChiselPlow' then
sphermodel = nugget + (distance <= rangeC)*sillC*(1.5*(distance/rangeC) -
0.5*((distance/rangeC)**3)) + (distance > rangeC)*sillC;
else
sphermodel = nugget + (distance <= rangeN)*sillN*(1.5*(distance/rangeN) -
0.5*((distance/rangeN)**3)) + (distance > rangeN)*sillN;
model rvario = sphermodel;
_weight_ = 0.5*count/(sphermodel**2);
run;



In anticipation of obtaining generalized least squares and restricted maximum likelihood
inferences in proc mixed, a common nugget effect was fit for both tillage treatments while the
sills and ranges of the semivariograms were allowed to vary. The sill and range estimates for
the chisel-plow treatment were 0.092 and 127.0, respectively; the corresponding estimates for
the no-till treatment were 0.1397 and 199.2 (Output 9.13). Notice that a considerable part of
the variability in the C/N ratios is due to the nugget effect: the relative structured variability
is 31% for the chisel-plow and 41% for the no-till treatment. The C/N ratio of the undisturbed
no-till sites is the more spatially structured, however, as can be seen from the larger range.
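The relative structured variability is the ratio of the (partial) sill to the total sill, that is, sill
plus nugget:

\[
\frac{0.092}{0.092 + 0.200} = 0.31, \qquad \frac{0.1397}{0.1397 + 0.200} = 0.41 .
\]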

[Figure 9.46 appears here: empirical semivariances (0.20–0.35) plotted against distance
(0–250 ft) with fitted spherical semivariograms for the two treatments.]

Figure 9.46. Omnidirectional empirical semivariograms for the C/N ratio under chisel-plow
(open circles) and no-till (full circles) treatments. Weighted least squares fits of spherical
semivariograms are shown.

Output 9.13. (abridged)


The NLIN Procedure

Sum of Mean Approx


Source DF Squares Square F Value Pr > F
Regression 5 13117.5 2623.5 23.92 <.0001
Residual 33 53.2243 1.6129
Uncorrected Total 38 13170.7
Corrected Total 37 207.6

Approx Approximate 95% Confidence


Parameter Estimate Std Error Limits
sillC 0.0920 0.0151 0.0612 0.1228
sillN 0.1397 0.0152 0.1089 0.1706
rangeC 127.0 19.4203 87.4950 166.5
rangeN 199.2 29.7131 138.8 259.7
nugget 0.2000 0.0139 0.1717 0.2284

Next we obtain generalized least squares estimates of the treatment effect as well as
predictions of the C/N ratio over the entire field with proc mixed of The SAS® System. A
data set containing the prediction locations for both treatments (data set filler) is created
and appended to the data set containing the observations. The response variable of the filler
data set is set to missing. This prevents proc mixed from using the information in
the prediction data set for estimation; in calculating predicted values these observations can
be used, however, since they contain all information apart from the response.
data filler;
do tillage='ChiselPlow','NoTillage';
do x = 0 to 500 by 10; do y = 0 to 300 by 10; cn=.; output; end; end;
end;
run;
data fitthis; set filler cnratio; run;

proc mixed data=fitthis noprofile;


class tillage;
model CN = tillage /ddfm=contain outp=p;
repeated / subject=intercept type=sp(sph)(x y) local group=tillage;
parms /* sill ChiselPlow */ 0.0920
/* range ChiselPlow */ 127.0
/* sill NoTillage */ 0.1397
/* range NoTillage */ 199.2
/* nugget (common) */ 0.2000 / noiter;
run;

The call to proc mixed has several important features. The model statement describes the
mean structure of the model; C/N ratios are assumed to depend on the tillage treatments. The
outp=p option of the model statement produces a data set (named p) containing the predicted
values. The repeated statement identifies the spatial covariance structure as spherical
(type=sp(sph)(x y)). The subject=intercept option indicates that the data set comprises a
single subject; all observations are assumed to be correlated. The group=tillage option re-
quests that the spatial covariance parameters be varied by the values of the tillage variable.
This allows modeling separate covariance structures for the chisel-plow and no-till treatments
to reflect the differences in spatial structure evident in Figure 9.46. Finally, the local option
adds a nugget effect. Since proc mixed adds only a single nugget effect, it was important in
fitting the semivariograms to ensure that the nugget effect was held the same for the two
treatments. The parms statement provides starting values for the covariance parameters. The
order in which the values are listed equals the order in which they appear in the
Covariance Parameter Estimates table of the proc mixed output; a trial run is sometimes
necessary to determine the correct order. The starting values are set at the converged iterates
from the weighted least squares fit of the theoretical semivariogram (Output 9.13). The
noiter option of the parms statement prevents iterations of the covariance parameters and
holds them fixed at the starting values provided. To produce restricted maximum likelihood
estimates of the covariance parameters, simply remove the noiter option. The noprofile
option of the proc mixed statement prevents profiling of the nugget variance. Without this
option proc mixed would make slight adjustments to the sill and nugget even if the /noiter
option is specified.

The Dimensions table indicates that 395 observations were used in model fitting and
3,162 observations were not used (Output 9.14). The latter comprise the filler data set of
prediction locations for which the CN variable was assigned a missing value. The -2 Res Log
Likelihood of 570.3 in the table of Fit Statistics equals minus twice the residual log
likelihood in the Parameter Search table. The latter table gives the likelihood for all sets of
starting values; here only one set of starting values was used, and the equality of the -2 Res
Log Likelihood values shows that no iterative updates of the covariance parameters took
place. The estimates shown in the Covariance Parameter Estimates table are identical to the
starting values provided in the parms statement. Finally, the Type 3 Tests of Fixed Effects
table shows that there is no significant difference between the mean C/N ratios of the two
tillage treatments (p = 0.799).

Output 9.14.
The Mixed Procedure

Model Information

Data Set WORK.FITTHIS


Dependent Variable cn
Covariance Structures Spatial Spherical,
Local Exponential
Subject Effect Intercept
Group Effect tillage
Estimation Method REML
Residual Variance Method None
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Containment

Class Level Information

Class Levels Values


tillage 2 ChiselPlow NoTillage

Dimensions

Covariance Parameters 5
Columns in X 3
Columns in Z 0
Subjects 1
Max Obs Per Subject 395
Observations Used 395
Observations Not Used 3162
Total Observations 3557

Parameter Search

CovP1 CovP2 CovP3 CovP4 CovP5 -2 Res Log Like


0.09200 127.00 0.1397 199.20 0.2000 570.2618

Covariance Parameter Estimates

Cov Parm Subject Group Estimate


Variance Intercept tillage ChiselPlow 0.09200
SP(SPH) Intercept tillage ChiselPlow 127.00
Variance Intercept tillage NoTillage 0.1397
SP(SPH) Intercept tillage NoTillage 199.20
Residual 0.2000

Fit Statistics

-2 Res Log Likelihood 570.3


AIC (smaller is better) 570.3
AICC (smaller is better) 570.3
BIC (smaller is better) 570.3

Type 3 Tests of Fixed Effects

Num Den
Effect DF DF F Value Pr > F
tillage 1 393 0.06 0.7990



The predicted C/N surfaces for the two tillage treatments are shown in Figure 9.47.
Both surfaces vary about the same mean, but the greater spatial continuity (larger range) of the
no-till sites is evident in a smoother, less variable surface. Positive autocorrelations are
stronger over the same distance under this treatment than under the chisel-plow treat-
ment. At this point it is worthwhile to revisit the question raised early in this application:
what do we mean by differences among experimental conditions if the observations collected
at each site have a spatial context? There is no difference in the average C/N values in
this study, as can be expected when sampling only two months after installment of the treat-
ments. There appear to be differences in the spatial structure of the treatments, however.
Fitting a single spherical semivariogram to the two empirical semivariograms shown in Figure
9.46 yields a residual sum of squares of 93.09 on 35 degrees of freedom. A sum of squares
reduction test leads to

\[
F_{obs} = \frac{(93.09 - 53.2)/2}{53.2/33} = 12.37
\]

with a p-value of 0.00009. If the semivariogram is estimated by ordinary (instead of
weighted) least squares the statistics are F_obs = 11.85 and p = 0.0001. There are significant
differences among the treatments in the autocorrelation structure, albeit not in the average
C/N ratio. One can argue that after ten years of continuous no-till management there is
greater continuity in the C/N ratios compared to what can be observed shortly after a distur-
bance through plowing.
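As a quick check, the F statistic and its p-value can be reproduced in a data step (degrees of
freedom as given above):

data _null_;
   /* sum of squares reduction test: common versus separate semivariograms */
   Fobs = ((93.09 - 53.2)/2) / (53.2/33);
   p    = 1 - probf(Fobs, 2, 33);       /* F tail area with 2 and 33 df    */
   put Fobs= p=;
run;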

[Figure 9.47 appears here: two surface panels, Chisel-Plow (left) and No Tillage (right).]

Figure 9.47. Predicted C/N surfaces under the chisel-plow and no-till treatments.

The predicted surfaces in Figure 9.47 were obtained from the generalized least squares fit,
which assumed that the supplied starting values of the covariance parameters are the true
values. This is akin to the assumption in kriging methods that the semivariogram values used
in solving the kriging equations are known. Upon removing the noiter option of the parms
statement in proc mixed, the spatial covariance parameters are updated iteratively by the
method of restricted maximum likelihood. Twice the negative residual log likelihood at
convergence can be compared to the same statistic calculated from the starting values. This
likelihood ratio test indicates whether the REML estimates are a significant improvement
over the starting values. The mixed procedure displays the result of this test in the PARMS
Model Likelihood Ratio Test table (Output 9.15). In this application convergence was
achieved after twelve time-consuming iterations with no significant improvement over the
starting values (p = 0.1981).

Output 9.15. (abridged)


The Mixed Procedure

Fit Statistics
-2 Res Log Likelihood 562.9
AIC (smaller is better) 572.9
AICC (smaller is better) 573.1
BIC (smaller is better) 592.8

PARMS Model Likelihood Ratio Test


DF Chi-Square Pr > ChiSq
5 7.32 0.1981
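Concretely, the Chi-Square value in Output 9.15 is the difference between -2 Res Log
Likelihood at the starting values (570.2618, Output 9.14) and at convergence (about 562.94,
displayed as 562.9 in the Fit Statistics table), referred to a chi-square distribution with five
degrees of freedom, one for each covariance parameter. A minimal check of the p-value:

data _null_;
   p = 1 - probchi(7.32, 5);  /* 7.32 = 570.2618 - 562.94 */
   put p=;
run;

This returns 0.198, matching the table up to rounding.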

9.8.5 Spatial Random Field Models — Spatial Regression of Soil Carbon on Soil N
In the previous application the C/N ratio was modeled directly and compared between the
two tillage treatments. In many applications one attribute emerges as the primary variable of
interest and other variables are secondary attributes which are to be linked to the primary
attribute. This approach is particularly meaningful if the secondary attributes are easy to
measure or available in dense coverage (e.g., sensed images, GIS) and the primary attribute
is more difficult to determine. If the relationship between primary and secondary attributes
can be modeled for a particular data set where both variables have been measured, the model
can then be applied to situations where only the secondary attributes are available. Consider
that we are interested in predicting soil carbon as a function of soil nitrogen. From Figure
9.48 it is clearly seen that the relationship between TC and TN is very strong (R² = 0.916),
close to linear, and does not differ between the two tillage treatments.


Figure 9.48. Relationship between total C (%) and total N (%) of chisel-plow and no-till
areas.

For the time being we disregard the fact that the data are collected under two different
tillage regimes. Incorporating additional tillage treatment effects in the models developed
subsequently is straightforward. Figure 9.48 belies the fact that both variables are spatially
heterogeneous. A side-by-side graph of the TC and TN contours shows more clearly how
areas of high (low) TC are associated with areas of high (low) TN (Figure 9.49). Given a
sample of TN and TC values, we want to model the relationship between TC and TN, taking
into account that TC observations are spatially autocorrelated; given a sample or map of TN,
we then want to use the modeled relationship to predict total carbon percentage at arbitrary
locations.

Figure 9.49. Contour plots of ordinary kriging predictions of total C (%) and total N (%)
irrespective of tillage treatment.

Based on the relationship in Figure 9.48 an obvious place to start is

    TC(s_i) = β₀ + β₁ TN(s_i) + e(s_i),                                [9.73]

where the errors e(s_i) are spatially autocorrelated. We emphasize again that such a spatial
regression model differs conceptually from a cokriging model, where primary and secondary
attributes are spatially autocorrelated and models for the semivariograms (covariograms) of
TC(s_i) and TN(s_i) and the cross-covariogram of TC(s_i) and TN(s_i) must be derived.
TN(s_i) is considered fixed in [9.73], and the only semivariogram that needs to be modeled
is that of TC(s_i) after adjusting its mean for the dependency on TN(s_i). The spatial
regression model expresses the relationship between TC(s_i) and TN(s_i) not through a
cross-covariogram but as a deterministic dependency of E[TC(s_i)] on TN(s_i); such models
are simpler to fit than cokriging models, and standard statistical procedures such as proc
mixed of The SAS® System can be employed.
The semivariogram of e(s_i) is modeled in two steps. First, the model is fit by ordinary
least squares and the empirical semivariogram of the OLS residuals is computed to suggest a
theoretical semivariogram model. This theoretical model is fit to produce starting values for
the semivariogram parameters. Next, the mean and the autocorrelation structure (for the
suggested theoretical semivariogram model) are estimated simultaneously by (restricted)
maximum likelihood. A simple likelihood ratio test can be employed to examine whether the
MLEs produce a significantly better fit of the model than the semivariogram parameters
derived initially from the OLS residuals. Finally, predictions and their standard errors are
calculated for the primary attribute at locations where values of the secondary attribute are
available. If mapping of the primary attribute is desired, one can first obtain a surface of the
secondary attribute (e.g., the TN surface shown in Figure 9.49) and predict at these locations.
This approach is an obvious generalization of the universal kriging method: the large-scale
trend is modeled not only as a function of the spatial coordinates, but also as a function of
other, spatially heterogeneous variables. We illustrate the implementation for model [9.73]
with The SAS® System. The procedure driving estimation of both the mean function and the
autocorrelation structure as well as prediction at unobserved locations is proc mixed.
The OLS residuals and their empirical semivariogram are obtained with the statements
proc mixed data=CNRatio;
model TC = TN / outp=OLSresid s;
run;
proc variogram data=OLSResid outvar=svar;
compute lagdistance=13.5 maxlag=21 robust;
coordinates xcoord=x ycoord=y;
var resid;
run;
proc nlin data=svar nohalve noitprint;
parameters sill=0.0009 Range=40.4 nugget=0.0009;
expomodel = nugget+sill*(1-exp(-distance/range));
model variog = expomodel;
_weight_ = 0.5*count/(expomodel**2);
run;

Output 9.16. (abridged)


The NLIN Procedure

Sum of Mean Approx


Source DF Squares Square F Value Pr > F
Regression 3 28877.0 9625.7 119.34 <.0001
Residual 19 28.2724 1.4880
Uncorrected Total 22 28905.3
Corrected Total 21 383.4

Approx Approximate 95% Confidence


Parameter Estimate Std Error Limits
sill 0.000864 0.000064 0.000730 0.000999
Range 72.3276 11.9674 47.2796 97.3755
nugget 0.000946 0.000074 0.000790 0.00110

The exponential semivariogram fits the empirical semivariogram well, with pseudo-R² =
1 − 28.27/383.4 = 0.92 (Output 9.16). We adopt it as the semivariogram model for e(s_i) in
[9.73]. Next, we krige a surface of TN with proc krige2d and add the prediction locations to
the original data set. In this combined data set (fitthis below) the response variable TC will
have missing values for all prediction locations. This will prevent proc mixed from using the
prediction observations in estimating the regression coefficients and the spatial dependency
parameters. It will, however, produce predicted values for all observations in the data set that
have complete regressor information.

data predgrid;
do x = 0 to 500 by 25; do y = 0 to 300 by 10; TN=.; output; end; end;
run;
data obsdata; set CNRatio(keep=x y tn); run;
proc krige2d data=obsdata outest=krigeEst;
coordinates xcoord=x ycoord=y;
grid griddata=predgrid xcoord=x ycoord=y;
predict var=TN radius=50 maxpoints=50 minpoints=30;
model form=exponential range=43.063 scale=0.0001775;
run;
data KrigeEst; set KrigeEst;
if stderr ne .;
rename gxc=x gyc=y estimate=TN;
run;
data fitthis; set KrigeEst CNRatio; run;

The final step is to submit the data set fitthis to proc mixed to fit the spatial regression
model [9.73] by (restricted) maximum likelihood:
proc mixed data=fitthis noprofile;
model TC = TN / ddfm=contain s outp=p;
repeated / subject=intercept type=sp(exp)(x y) local;
parms /* sill */ 0.000864
/* range */ 72.3276
/* nugget */ 0.000946 ;
run;

The starting values for the autocorrelation parameters are chosen as the converged
iterates in Output 9.16. Because the parms statement does not have a /noiter option, proc
mixed will estimate these parameters iteratively, commencing at the starting values. The
outp=p option of the model statement creates a data set containing the predicted values.
The Dimensions table of the output shows that the data represent a single subject and
that 531 of the 926 observations in the data set have not been used in estimation (Output
9.17). These are the observations at the prediction locations for which the response variable
TC was set to missing. Minus twice the (residual) log likelihood was −1526.0 at the starting
values and was subsequently improved upon during five iterations. At convergence, -2 Res
Log Like = −1527.8. The difference between the initial and converged -2 Res Log Like can
be used to test whether the iterations significantly improved the model fit. The difference is
not statistically significant (p = 0.6253, see PARMS Model Likelihood Ratio Test). The
iterated REML estimates of sill, range, and nugget are shown in the Covariance Parameter
Estimates table.

Output 9.17. (abridged) The Mixed Procedure

Dimensions
Covariance Parameters 3
Columns in X 2
Columns in Z 0
Subjects 1
Max Obs Per Subject 395
Observations Used 395
Observations Not Used 531
Total Observations 926

Parameter Search
CovP1 CovP2 CovP3 Res Log Like -2 Res Log Like
0.000946 49.7380 0.000784 763.0047 -1526.0093
Convergence criteria met.

Output 9.17 (continued).
Covariance Parameter Estimates
Cov Parm Subject Estimate
Variance Intercept 0.001193
SP(EXP) Intercept 87.7099
Residual 0.000801

Fit Statistics
-2 Res Log Likelihood -1527.8
AIC (smaller is better) -1521.8
AICC (smaller is better) -1521.7
BIC (smaller is better) -1509.8

PARMS Model Likelihood Ratio Test


DF Chi-Square Pr > ChiSq
3 1.75 0.6253

Solution for Fixed Effects


Standard
Effect Estimate Error DF t Value Pr > |t|
Intercept -0.01542 0.02005 393 -0.77 0.4423
TN 11.1184 0.2066 393 53.83 <.0001

The estimates of the regression coefficients are β̂₀ = −0.01542 and β̂₁ = 11.1184,
respectively (Solution for Fixed Effects table). With every additional percent of total N
the total C percentage increases by 11.11 units. It is interesting to compare these estimates
to the ordinary least squares estimates in the model

    TC_i = α₀ + α₁ TN_i + e_i,    e_i ~ iid (0, σ²),

which does not incorporate spatial autocorrelation (Output 9.18). The estimates are slightly
different and their standard errors are very optimistic (too small). Furthermore, for a given
value of TN, the prediction of TC is the same under the classical regression model,
regardless of where the TN observation is located. In the spatial regression model with
autocorrelated errors, the best linear unbiased predictions of TC(s_i) take the spatial
correlation structure into account. At two sites with identical TN values the predicted values
in the spatial model will differ depending on where the sites are located. Compare, for
example, the two observations in Output 9.19. Estimates of E[TC(s_i)] would be calculated as

    Ê[TC(s_i)] = β̂₀ + β̂₁ TN(s_i) = −0.01542 + 11.1184 × 0.057623 = 0.62523

regardless of the location s_i. The values computed by proc mixed are predictions of TC(s_i)
and vary by location. If estimates of the mean are desired, one can add the option outpm=pm
to the model statement in the code above.
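For example, the earlier proc mixed call becomes (only the outpm= option is added; the data
set pm then holds the estimated means):

proc mixed data=fitthis noprofile;
   model TC = TN / ddfm=contain s outp=p outpm=pm;
   repeated / subject=intercept type=sp(exp)(x y) local;
   parms /* sill   */ 0.000864
         /* range  */ 72.3276
         /* nugget */ 0.000946 ;
run;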

Output 9.18.
Solution for Fixed Effects

Standard
Effect Estimate Error DF t Value Pr > |t|

Intercept -0.03138 0.01323 393 -2.37 0.0181


TN 11.2213 0.1710 393 65.61 <.0001

Output 9.19.
x y TN Pred

420 80 0.057623 0.61429


420 20 0.057623 0.65172

The spatial predictions of the total carbon percentage are shown in the left panel of
Figure 9.50 and estimates of the mean in the right panel. Because the underlying TN surface
is spatially variable (right panel of Figure 9.49), so are the estimates of E[TC(s_i)], which
follow the pattern of TN very closely. The predictions of TC(s_i) follow the same pattern as
E[TC(s_i)] but exhibit more variability; the left-hand panel of Figure 9.50 is less smooth.


Figure 9.50. Spatial regression predictions and estimates of the mean of TC (%).

9.8.6 Spatial Generalized Linear Models — Spatial Trends in the Hessian Fly Experiment
In §6.7.2 we analyzed data from a variety field trial in which sixteen varieties of wheat were
compared with respect to their infestation with the Hessian fly. The entries were arranged in
a randomized complete block design with four blocks and the outcome of interest was the
sample proportion Y_ij = Z_ij/n_ij, where Z_ij is the number of plants infested with the
Hessian fly for entry (= variety) i in block j and n_ij is the total number of plants on the
experimental unit. Because the data are proportions out of a given total, the Binomial
distribution is a natural model and a generalized linear model for Binomial data with a logit
link was fit. It was noticed, however, that the data appear overdispersed relative to the
Binomial model. One possible reason for overdispersion is positive autocorrelation between
the experimental units. If an experimental unit shows a high degree of infestation, it is likely
that a nearby unit also is highly infested. This autocorrelation can be linked to spatially
varying environmental conditions that inhibit or enhance infestation beyond the varietal
differences. One approach to account for overdispersion is to add an additional scale
parameter to the variance of the outcome. The Binomial law states that
Var[Y_ij] = π_ij(1 − π_ij)/n_ij, where π_ij is the probability that a plant of variety i in block
j is infested. In an overdispersed model one can put Var[Y_ij] = φπ_ij(1 − π_ij)/n_ij and
allow the parameter φ to adjust the dispersion. By adding the scale parameter φ this model is
no longer a Binomial one, but statistical inference can proceed nevertheless along similar
lines, using quasi-likelihood ideas and estimating φ by the method of moments. This was the
solution chosen for these data in §6.7.2. If the overdispersion arises from a spatially varying
characteristic, it is more appropriate to include the spatial variability directly in the model
rather than to rely on one multiplicative scale parameter to patch things up. The scale
parameter φ adjusts only the standard errors of the mean parameter estimates, not the
estimates themselves. The estimated probability of infection for a particular variety does not
depend on whether φ is in the model or not. The models that account for the spatial
dependencies among experimental units will adjust the estimates of treatment effects as well
as their standard errors.
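To make the adjustment concrete: the dscale option of proc genmod (used in §6.7.2 and in
the code below) estimates the scale from the deviance D and its degrees of freedom, and
inflates the standard errors by the square root of the estimate. A sketch of the relationship,
assuming the deviance-based estimator:

    φ̂ = D/(n − p),    se_adj(β̂) = √φ̂ × se(β̂),

where n − p is the residual degrees of freedom. In this application the inflation factor turns
out to be 1.66, corresponding to φ̂ = 1.66² = 2.75 (Output 9.20 below).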
To motivate statistical models that incorporate spatial autocorrelation and non-normal
responses, first recall the case of independent observations. A generalized linear model can
be written as

    Y = g⁻¹(η) + e,

where g⁻¹(η) = μ is the inverse link function, η is the linear predictor of the form x′β, and e
is a random error with mean 0 and variance Var[e] = h(μ)φ. In the randomized block
Hessian fly experiment, if one assumes that Y_ij is a Binomial proportion, one obtains
η_ij = α + ρ_j + τ_i, Var[e_ij] = h(μ) = g⁻¹(η_ij)(1 − g⁻¹(η_ij))/n_ij, and φ = 1. Choosing
g(·) as the logit transform is common for such data. In vector/matrix notation the model for
the complete data is written as

    Y = g⁻¹(η) + e,    e ~ (0, Diag{h(μ)}) = (0, H(μ)).

A spatially varying process that induces autocorrelations can be accommodated in two ways.
The marginal formulation replaces the error vector e with the vector d such that

    Var[d] = H(μ)^{1/2} R H(μ)^{1/2}.

The matrix R is a spatial correlation matrix and the diagonal matrices H(μ)^{1/2} adjust the
correlations to yield the correct variances and covariances. The matrix R typically
corresponds to the correlation model derived from one of the basic isotropic semivariogram
models (§9.2.2). For example, if the spatial dependency between experimental units can be
described by an exponential semivariogram, elements of R are calculated as
exp{−3||s_k − s_l||/α}, where s_k and s_l are the spatial coordinates representing two units.
The marginal formulation was chosen by Gotway and Stroup (1997) in modeling the Hessian
fly data.
The conditional formulation of a spatial generalized linear model assumes that,
conditionally on the realization of the spatial process, the observations are uncorrelated. This
formulation is akin to the generalized linear mixed models of §8. In vector notation we can
put

    Y = g⁻¹(η + U(s)) + e,    e ~ (0, H(μ)).                          [9.74]

U(s) is a second-order stationary mean zero random field with covariogram
Cov[U(s), U(s + h)] = C(h) and semivariogram ½Var[U(s) − U(s + h)] = γ(h).
To make estimation practical, one chooses a correlation matrix R in the marginal model
or a semivariogram (covariogram) for U(s) in the conditional model from the models in
§9.2.2 and replaces R with R(h; θ) and γ(h) with γ(h; θ). The estimation process is then
doubly iterative. Given an estimate η̂ of η we estimate θ and, given θ̂, we update the
estimate of η. Each step is itself iterative and the entire procedure is continued until some
overall convergence criterion is met (e.g., track the largest relative change in elements of θ̂
between iterations). The interested reader can find details of this process in §A9.9.7. It is
important to point out, however, that some of the approaches put forth in the literature
require the repeated inversion of large matrices, which is computationally expensive. The
pseudo-likelihood approach of Wolfinger and O'Connell (1993), for example, requires the
inversion of an (n × n) matrix at every iteration that updates θ. Originally designed for
clustered data, this is not a big issue there, since Var[Y] is block-diagonal and can be
inverted in blocks. For spatial data, Var[Y] is not block-diagonal and few computational
shortcuts are available. Zimmerman (1989) described inversion procedures that exploit the
structure of Var[Y] when data are collected on a rectangular or parallelogram lattice. For the
models considered here, these methods do not apply.
To overcome the possible numerical problems, we consider the following approach.
Since the parameter vector η is of main interest and the covariance parameter vector θ is a
vector of nuisance parameters, our chief concern is to estimate θ consistently. A consistent
estimate of θ can be calculated quickly by applying the composite likelihood principle (see
§9.2.4). The procedure is as follows. Start by fitting a generalized linear model (§6) for
independent data (i.e., assume U(s) ≡ 0). Transform residuals r(s_i) from the fit in such a
way that their mean is (approximately) zero and Var[r(s_i) − r(s_j)] = 2γ(s_i − s_j; θ).
Apply the composite likelihood principle to the transformed residuals to obtain an estimate
of θ. Construct an estimate of the marginal variance-covariance matrix Var[Y] with θ̂ and
re-estimate the linear predictor η. Formulate new residuals and continue the process until
some convergence criterion is met; for example, continue until the largest relative change in
one of the model parameters is less than some critical number ε (see §A9.9.7 for details).
The estimation process has been coded in a SAS® macro contained on the CD-ROM
(macro %GlmSpat() in file \SASMacros\GLMCompLike.sas). We demonstrate its usage here for
the Hessian fly data. Before fitting the spatial generalized linear model with composite
likelihood estimation of the spatial dependence parameters, we compare the estimates of
block and treatment effects obtained from a regular generalized linear model with Binomial
errors and an overdispersed Binomial model. These models have linear predictor

    η_ij = μ + τ_i + ρ_j,

where τ_i, i = 1, …, 16, denotes the entries, and ρ_j, j = 1, …, 4, the block effects. The link
function was chosen as the logit, and consequently, log{π_ij/(1 − π_ij)} = η_ij.
proc genmod data=HessianFly;
class block entry;
model z/n = block entry / link=logit dist=binomial type3;
ods output ParameterEstimates=GLMEst;
ods output type3=GLMType3;
run;

proc genmod data=HessianFly;
class block entry;
model z/n = block entry / link=logit dist=binomial type3 dscale;
ods output ParameterEstimates=ODEst;
ods output Type3=ODType3;
run;

The two proc genmod calls fit the regular GLM and the overdispersed GLM (by adding
the dscale option to the model statement) and save the estimates as well as the tests for
treatment effects in data sets. After processing the output data sets (see code on CD-ROM),
we obtain Output 9.20. The estimates of the intercept, block, and treatment effects are the
same in both models. The standard errors of the overdispersed model are uniformly 1.66
times larger than the standard errors in the regular GLM. This is also reflected in the test of
entry effects. The Chi-square statistic in the overdispersed model is 1.66² = 2.75 times
smaller than the corresponding statistic in the GLM. Not accounting for overdispersion
overstates the precision of parameter estimates. Test statistics are too large and p-values
too small.

Output 9.20.
GLM GLM Overd. Overd.
Parameter Level Estimate StdErr Estimate StdErr
Intercept -1.2936 0.3908 -1.2936 0.6487
block 1 -0.0578 0.2332 -0.0578 0.3870
block 2 -0.1838 0.2303 -0.1838 0.3822
block 3 -0.4420 0.2328 -0.4420 0.3863
entry 1 2.9509 0.5397 2.9509 0.8958
entry 2 2.8098 0.5158 2.8098 0.8561
entry 3 2.4608 0.4956 2.4608 0.8225
entry 4 1.5404 0.4564 1.5404 0.7575
entry 5 2.7784 0.5293 2.7784 0.8785
entry 6 2.0403 0.4889 2.0403 0.8115
entry 7 2.3253 0.4966 2.3253 0.8242
entry 8 1.3006 0.4754 1.3006 0.7890
entry 9 1.5605 0.4569 1.5605 0.7582
entry 10 2.3058 0.5203 2.3058 0.8635
entry 11 1.4957 0.4710 1.4957 0.7818
entry 12 1.5068 0.4767 1.5068 0.7911
entry 13 -0.6296 0.6488 -0.6296 1.0768
entry 14 0.4460 0.5126 0.4460 0.8507
entry 15 0.8342 0.4698 0.8342 0.7798

GLM GLM Overd. Overd.


Source DF ChiSq Pvalue Chisq Pvalue
block 3 4.27 0.2337 1.55 0.6707
entry 15 132.62 <.0001 48.15 <.0001

The previous analysis maintains that observations from different experimental units are
independent; it simply allows the variance of the observations to exceed the variability
dictated by the Binomial law. If the data are overdispersed relative to the Binomial model
because of positive spatial autocorrelation among the experimental units, the spatial process
can be modeled directly. The following code analyzes the Hessian fly experiment with model
[9.74], where U(s) has an exponential semivariogram without nugget effect (options
CovMod=E, nugget=0). The sx= and sy= parameters denote the variables of the data set
containing longitude and latitude information, and the margin= parameter specifies the
marginal variance function h(μ). Starting values for the sill and range are set at 1.5 and 5,
respectively, and the range parameter is constrained to be at least 2. Setting a minimum
value for the range is recommended if numerical problems prevent the macro from
converging. This value should not be set, however, without evidence that the range is
definitely going to exceed this value.
%include 'CDRomDriveLetter:\SASMacros\GLMCompLike.sas';
%glmspat(data=HessianFly,
procopt=order=data,
stmts=%str(class block entry;
model z/n = block entry / s;
lsmeans entry /diff;
),
sx = sx, sy = sy,
link=logit, margin=binomial,
minrange=2, CovMod=E,
nugget=0, sillstart=1.5, rangestart=5,
title=Hessian Fly Data - GLM-CL,
options=);

The stmts=%str( ) block of the macro call assembles statements akin to proc mixed
syntax. The s option of the model statement requests a printout of the fixed effects estimates
(solutions). For predicted values add the p option to the model statement. The algorithm
converged after fourteen iterations; that is, the parameters of the exponential semivariogram
were updated fourteen times, each update following an update of the block and entry effects.
The sill and range parameters of the exponential semivariogram (covariogram) are
estimated as 0.375 and 9.694 m, respectively. Notice that these estimates differ from those
of Gotway and Stroup (1997), who estimated the range at 11.6 m and the sill at 3.38. Their
model uses a marginal formulation, whereas the model fitted here is a conditional one that
incorporates a latent random field inside the link function. Furthermore, in their marginal
formulation Gotway and Stroup (1997) settled on a spherical semivariogram. For the
conditional model we found an exponential model to fit the semivariogram of the
transformed residuals better. Finally, their method does not iterate between updates of the
fixed effects and updates of the semivariogram parameters.
The estimates of the fixed effects in the spatial model (Output 9.21) differ from the
corresponding estimates in the generalized linear model (compare to Output 9.20), as
expected. The overall impact of the different estimates is difficult to judge since the
treatment effects, which are averaged across the blocks, are of interest. If η̂_ij is the
estimated linear predictor for entry i in block j, we need to determine η̂_i., the effect of the
ith entry after averaging over the block effects. The estimates η̂_i. are shown in the Least
Squares Means table (Output 9.22). The least squares mean for entry 1, for example, is
calculated from the parameter estimates in Output 9.21 as

    η̂_1. = −1.4026 + ¼(−0.1444 − 0.1269 − 0.4431 + 0) + 3.3228 = 1.7416.

The probability that a plant of entry 1 is infested with the Hessian fly is then obtained by
applying the inverse link function,

    π̂_1 = 1/(1 + exp{−1.7416}) = 0.85.

The table titled Differences of Least Squares Means in Output 9.22 can be used to assess
differences in the infestation probabilities among pairs of entries. Only part of the lengthy
table of least squares mean differences is shown.

Output 9.21.
Hessian Fly Data - GLM-CL
Class Level Information (WORK._CLASS)

Class Levels Values


block 4 1 2 3 4
entry 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Covariance Parameter Estimates (WORK._SOLR)


Parameter Estimate
Sill 0.37516
Range 9.69449

Parameter Estimates and Standard Errors (WORK._SOLF)

Effect Parameter Std. Wald Pr >


name block entry Estimate Error Chi-Sq. Chi-Sq.
Intercept -1.4026 0.5568 6.3453 0.0118
block 1 -0.1444 0.4283 0.1137 0.7360
block 2 -0.1269 0.4085 0.0965 0.7560
block 3 -0.4431 0.4093 1.1721 0.2790
entry 1 3.3228 0.7004 22.5093 <.0001
entry 2 3.1181 0.6759 21.2838 <.0001
entry 3 2.6294 0.6472 16.5047 <.0001
entry 4 1.8789 0.6083 9.5408 0.0020
entry 5 2.8513 0.6545 18.9795 <.0001
entry 6 2.1405 0.6345 11.3826 0.0007
entry 7 2.4266 0.6428 14.2524 0.0002
entry 8 1.4999 0.6283 5.6988 0.0170
entry 9 1.7197 0.6179 7.7469 0.0054
entry 10 2.3922 0.6601 13.1345 0.0003
entry 11 1.4721 0.6260 5.5298 0.0187
entry 12 1.7885 0.6265 8.1487 0.0043
entry 13 -0.5651 0.7778 0.5278 0.4676
entry 14 0.6812 0.6494 1.1003 0.2942
entry 15 0.8458 0.6233 1.8418 0.1747

Tests of Fixed Effects (WORK._TESTS)


Chi- Deg. of Pr >
Effect Square freed. Chi-Square
BLOCK 1.2556 3 0.73971
ENTRY 69.2083 15 0.00000

Output 9.22.
Least Squares Means (WORK._LSM)

Std.Err Std.Err of
of Predicted Predicted
Effect Level LS Mean LSMean Mean Mean

ENTRY 1 1.74153 0.52693 0.85088 0.06686


ENTRY 2 1.53690 0.49153 0.82301 0.07160
ENTRY 3 1.04815 0.45201 0.74042 0.08688
ENTRY 4 0.29763 0.40583 0.57386 0.09924
ENTRY 5 1.27001 0.48065 0.78075 0.08228
ENTRY 6 0.55929 0.44482 0.63629 0.10294
ENTRY 7 0.84537 0.44968 0.69960 0.09451
ENTRY 8 -0.08132 0.42416 0.47968 0.10586
ENTRY 9 0.13847 0.40264 0.53456 0.10018
ENTRY 10 0.81100 0.47430 0.69232 0.10103
ENTRY 11 -0.10911 0.42283 0.47275 0.10539

Output 9.22 (continued).
ENTRY 12 0.20723 0.42782 0.55162 0.10581
ENTRY 13 -2.14631 0.63084 0.10468 0.05912
ENTRY 14 -0.90006 0.45888 0.28904 0.09430
ENTRY 15 -0.73541 0.42675 0.32401 0.09347
ENTRY 16 -1.58124 0.49182 0.17062 0.06960

Differences of Least Squares Means (WORK._DIFFS)

First Sec. LSMeans SE of Chi- Deg. of Pr >


Effect Level Level Difference Difference Square freed. Chi-Squ.

ENTRY 1 2 0.2046 0.68852 0.0883 1 0.76631


ENTRY 1 3 0.6934 0.66650 1.0823 1 0.29818
ENTRY 1 4 1.4439 0.63530 5.1656 1 0.02304
ENTRY 1 5 0.4715 0.68719 0.4708 1 0.49262
ENTRY 1 6 1.1822 0.66175 3.1917 1 0.07401
ENTRY 1 7 0.8962 0.66734 1.8034 1 0.17931
ENTRY 1 8 1.8229 0.65175 7.8224 1 0.00516
ENTRY 1 9 1.6031 0.63310 6.4114 1 0.01134
ENTRY 1 10 0.9305 0.68361 1.8528 1 0.17345
ENTRY 1 11 1.8506 0.65423 8.0018 1 0.00467
ENTRY 1 12 1.5343 0.64620 5.6374 1 0.01758
ENTRY 1 13 3.8878 0.80369 23.4015 1 0.00000
ENTRY 1 14 2.6416 0.68019 15.0825 1 0.00010
ENTRY 1 15 2.4769 0.65568 14.2706 1 0.00016
ENTRY 1 16 3.3228 0.70036 22.5093 1 0.00000
ENTRY 2 3 0.4888 0.64143 0.5806 1 0.44608
ENTRY 2 4 1.2393 0.59844 4.2883 1 0.03838
ENTRY 2 5 0.2669 0.65625 0.1654 1 0.68424
ENTRY 2 6 0.9776 0.63584 2.3639 1 0.12417

and so forth ...

To compare the results of the GLM and spatial analyses in terms of infection
probabilities, predicted probabilities of infestation with the Hessian fly for the 16 entries in
the study were graphed in Figure 9.51. Four methods were employed to calculate these
probabilities. The upper left-hand panel shows the probabilities calculated from the entry
least squares means in the overdispersed GLM analysis. The predictions in the other three
panels are obtained through different techniques of accounting for spatial correlations among
experimental units. Gotway and Stroup refers to the noniterative technique of Gotway and
Stroup (1997), Pseudo-Likelihood to the technique of Wolfinger and O'Connell (1993) that
is coded in the %glimmix() macro (www.sas.com). It is noteworthy that the three spatial
analyses produce very similar predicted probabilities, and that the (overdispersed) GLM
yields predictions quite similar to those of the spatial analyses. The standard errors of the
predicted probabilities are very homogeneous across entries in the GLM analysis; the dots
are of similar size. The spatial analyses show much greater heterogeneity in the standard
errors of the predicted infestation probabilities. There is little difference in the standard
errors among the three spatial analyses, however.


Figure 9.51. Predicted probability of infection by entry for four different methods of
incorporating overdispersion or spatial correlations. The size of the dots is proportional to
the standard error of the predicted probability.


Figure 9.52. Results of pairwise comparisons among entries. A dot indicates a significant
difference in infestation probabilities among a pair of entries (at the 5% level).

Differences in predicted probabilities and their standard errors are reflected in multiple
comparisons of entries (Figure 9.52). The three spatial analyses produce very similar results.
The overdispersed GLM yields fewer significant differences in this application, which is due
to the large and homogeneous standard errors of the predicted probabilities (Figure 9.51).

Adding an overdispersion parameter to a generalized linear model is simple. It is not the
appropriate course of action if overdispersion is due to positive autocorrelation among the
observations. The overdispersed GLM assumes that Var[Y] remains a diagonal matrix and
increases the size of the diagonal values compared to a regular GLM. In the presence of
spatial correlations Var[Y] is no longer a diagonal matrix and the covariances must be taken
into account. Three approaches to incorporate spatial autocorrelation in a model for
non-normal data were compared in this analysis and it appears that the results are quite
similar. In our experience, this is a rather common finding. The spatial dependency
parameters are nuisance parameters that must be estimated to obtain more efficient estimates
of the fixed effects parameters and more accurate and precise estimates of their dispersion.
The fixed effects are the quantities of primary interest here, and most reasonable methods of
estimating the covariance parameters will lead to similar results.
An issue we have not addressed yet is the selection of a semivariogram or covariogram
model. Fortunately, this choice is less critical in situations where the mean parameters are of
primary interest than in the case of spatial prediction. However, one should still choose the
model carefully. We base the initial selection of a semivariogram model on the Pearson
residuals from a standard generalized linear model fit. The empirical semivariogram of these
residuals will suggest a semivariogram model and starting values for its parameters. Then
the composite likelihood fit is carried out. At convergence we obtain the transformed
residuals and calculate their empirical semivariogram (Figure 9.53). The composite
likelihood estimate of the semivariogram should fit the empirical semivariogram reasonably
well. We do not expect the composite likelihood estimate to fit "too" well, because the
composite likelihood estimates are obtained by fitting a semivariogram model to the
pseudo-data (Z(s_i) − Z(s_j))², not by fitting a model to the empirical semivariogram.


Figure 9.53. Empirical semivariogram of transformed residuals at convergence for the
Hessian fly data. The solid line represents the composite likelihood estimate of the
exponential semivariogram.

9.8.7 Simultaneous Spatial Autoregression — Modeling Wiebe's Wheat Yield Data
Correlations among the yields of neighboring field plots can have a damaging effect on the
experimental error variance of field experiments. Blocking was advocated by R.A. Fisher as
a means to eliminate the effects of spatial variability, and randomization to neutralize those
effects unaccounted for by the blocking scheme. If blocking does not eliminate the spatial
heterogeneity because the shape and size of the blocks do not coincide with the pattern of
spatial variability, the experimental error will be increased. In the Alliance, Nebraska, wheat
variety trial (see §9.5) this was the reason for not finding any significant differences among
the 56 varieties. Because of the recognized effect of unaccounted spatial variability among
experimental units on the analysis, uniformity trials, in which a single treatment is applied
throughout an experimental area, have received much attention in the past. Limitations in
space, time, and the economics of operating agricultural experiment stations have almost
eliminated uniformity trials as a vehicle to study experimental conditions, although there is
much to be learned from past uniformity trials.
Wiebe (1935) discusses the yields of wheat on 1,500 nursery plots grown in the summer
of 1927 on the west end of series 100 on the Aberdeen Substation, Aberdeen, Idaho. Each
plot consisted of fifteen-foot rows of wheat spaced twelve inches apart. The plots are
arranged in a regular rectangular lattice with 125 rows and 12 columns. The data are too
voluminous to reproduce here but are contained on the companion CD-ROM and printed in
Table 6.2 of Andrews and Herzberg (1985). Griffith and Layne (1999) conduct an analysis
of these data with spatial autoregressive lattice models. These authors recommend analyzing
a transformation of the grain yields rather than the raw data to achieve greater symmetry in
the data and to stabilize the variance. Following their example we analyze the square root of
the yields. In our examination of Wiebe's wheat yield data we contrast three different
methods to examine, measure, and account for spatial dependencies among field plots.
Figure 9.54 shows the sample medians of the square root yields by row and column (series).
There are obvious trends in both directions, which could be modeled as large-scale trends. A
study of the spatial dependency of grain yields should remove these trends, and we consider
ordinary least squares, median polishing, and a simultaneous spatial autoregressive model to
accomplish this task. The trends in the column and row medians appear quadratic or cubic,
but we choose to remove only linear trends in rows and columns. This may appear a foolish
decision at first. What we are interested in, however, is which of the three methods can best
accommodate the unaccounted large-scale effects and possible spatial autocorrelations
among plot yields in this setting. Recall that a lattice model with autocorrelated spatial
errors can be thought of as supplanting the missing information on important covariates with
the neighborhood connectivity of the lattice sites. Our expectation is thus that the SSAR
model will provide a better fit to the data than the OLS model in light of an ill-specified
large-scale trend model.
The subsequent analyses were performed with the S+SpatialStats® module. To convert
the data from Microsoft® Excel format (CD-ROM) into an S-PLUS® object named wwy and
to calculate the square root of the wheat yields, the statements
import.data(FileName = "D:\\...Path to File...\\WiebeWheatYield.xls",
FileType = "Excel", TargetStartCol = "1",
DataFrame = "wwy", StartCol = "1",
EndCol = "END", StartRow = "1", EndRow = "END")

wwy$ryield <- sqrt(wwy$yield)

are executed. A calculation of Moran's I statistic with the statements


wwy.snhbr <- neighbor.grid(nrow=125,ncol=12,neighbor.type="first.order")
spatial.cor(wwy$ryield,neighbor=wwy.snhbr,statistic="moran",
sampling="free",npermutes=0)

shows significant spatial autocorrelation among plot yields (Output 9.23), which may be
caused by a nonstationary mean.

Figure 9.54. Row and column (series) sample medians for Wiebe's wheat yield data.

Output 9.23.
Spatial Correlation Estimate

Statistic = "moran" Sampling = "free"

Correlation = 0.2311
Variance = 3.484e-4
Std. Error = 0.01866

Normal statistic = 12.42


Normal p-value (2-sided) = 2.055e-35

Null Hypothesis: No spatial autocorrelation
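For reference, Moran's I in its standard form is (z_i = y_i − ȳ are the centered observations,
w_ij the neighbor weights, and S₀ = Σ_i Σ_j w_ij):

    I = (n/S₀) · Σ_i Σ_j w_ij z_i z_j / Σ_i z_i².

Values in excess of the expectation under no autocorrelation, E[I] = −1/(n − 1), indicate
positive spatial autocorrelation.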

Before the SSAR model can be fit, the neighborhood structure must be defined. We
choose a rook definition and standardize the weights to sum to one to mirror the SSAR
analysis in Griffith and Layne (1999):
wwy.snhbr <- neighbor.grid(nrow=12,ncol=125,neighbor.type="first.order")
n <- wwy.snhbr[length(wwy.snhbr[,1]),1]
for (i in 1:n) {
wwy.snhbr$weights[wwy.snhbr$row.id==i] <- 1/sum(wwy.snhbr$row.id == i)

}
wwy.SAR <- slm(ryield ~ row + series ,data=wwy,cov.family=SAR,
spatial.arglist=list(neighbor=wwy.snhbr),start=0.3)
summary(wwy.SAR)
lrt(wwy.SAR,parameters=c(0))

The abbreviated output shows a large estimate of the spatial interaction parameter
(ρ̂ = 0.8469, Output 9.24), and the likelihood ratio test of H₀: ρ = 0 is soundly rejected
(p = 0). There is significant spatial autocorrelation in these data beyond row and column
effects.
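As a reminder of the model being fit, the simultaneous autoregressive error model can be
written in standard form (W is the row-standardized neighbor weight matrix constructed
above):

    Y = Xβ + e,    e = ρWe + ν,    ν ~ (0, σ²I),

or equivalently (I − ρW)(Y − Xβ) = ν, so that ρ = 0 recovers the ordinary least squares
model.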

Output 9.24.
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 24.1736 0.4704 51.3909 0.0000
row 0.0129 0.0050 2.5732 0.0102
series -0.1243 0.0456 -2.7261 0.0065

Residual standard error: 1.0779 on 1496 degrees of freedom

rho = 0.8469

Likelihood Ratio Test

Chisquare statistic = 1182.836, df =1, p.value = 0

Note: The estimates for intercept, row and column effects and their standard errors differ from those in Griffith and
Layne (1999). These authors standardize the square root yield to have sample mean 0 and sample variance 1 and
standardize the row and series effects to have mean 0.

The ordinary least squares analysis assuming independence of the plot yields with the
statements
wwy.OLS <- lm(ryield ~ row + series, data=wwy)
summary(wwy.OLS)

yields parameter estimates that are not too different from the SSAR estimates (Output 9.25),
but their standard errors are too optimistic.

Output 9.25.
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 24.7626 0.1292 191.6848 0.0000
row 0.0138 0.0013 10.5847 0.0000
series -0.2264 0.0136 -16.6716 0.0000

Residual standard error: 1.816 on 1497 degrees of freedom


Multiple R-Squared: 0.2067
F-statistic: 195 on 2 and 1497 degrees of freedom, the p-value is 0
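The optimism of the OLS standard errors can be made precise. If Var[Y] = Σ is not of the
form σ²I, a standard result gives

    Var[β̂_OLS] = (X′X)⁻¹X′ΣX(X′X)⁻¹,

whereas the standard errors reported by lm() are based on σ̂²(X′X)⁻¹; under positive spatial
autocorrelation the latter typically understates the former.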

Finally, median polishing in S-PLUS® is accomplished with the statements


wwy.mp <- twoway(ryield~row+series,data=wwy)
wwy.mp$signal <- wwy$ryield - wwy.mp$residual


Figure 9.55. SSAR, OLS, and median polished residuals plotted against fitted values for
Wiebe's wheat yield data. Only linear row and column trends were removed as large-scale
trends.

The quality of the fit of the three models is assessed by plotting residuals against the
predicted (fitted) yields (Figure 9.55). The spread of the fitted values indicates the variability
in the predictions under the respective models. The SSAR model yields the least dispersed
fitted values, followed by the OLS fit and the median polishing. The OLS residuals exhibit a
definite trend, which shows the incomplete removal of the large-scale trend. No such trend is
apparent in the SSAR residuals. The spatial neighborhood structure has supplanted the
missing quadratic and cubic trends in the large-scale model well. Maybe surprisingly,
median polishing performs admirably in removing the trend in the data. The residuals
exhibit almost no trend. Compared to the SSAR fit, the median polished residuals are
considerably more dispersed, however. If one compares the sample variances of the various
residuals, the incomplete trend removal in the OLS fit and the superior quality of the SSAR
fit are evident: s²_OLS = 3.29, s²_MP = 1.84, s²_SSAR = 1.16.
How well the methods accounted for the spatial variability in the plot yields can be
studied by calculating the empirical semivariograms of the respective residuals. If spatial
variability, both large-scale and small-scale, is accounted for, the empirical semivariogram
should resemble a nugget-only model. An assumption of second-order stationarity can be
made for the three residual semivariograms but not for the raw data (Figure 9.56). The
empirical semivariogram of the SAR residuals shows the small variability (low sill) of these
residuals and the complete removal of spatial autocorrelation. Residual spatial dependency
remains in the median polished and OLS residuals. The sills of the residual semivariograms
agree well with the sample variances. The lattice model clearly outperforms the other two
methods of trend removal. It is left to the reader to examine the relative performance of the
three approaches when not only linear but also quadratic or cubic row and column trends are
removed.


Figure 9.56. Empirical semivariograms of raw data, SSAR, OLS, and median polished
residuals for Wiebe's wheat yield data. All semivariograms are scaled identically to
highlight the differences in the sill.

9.8.8 Point Patterns — First- and Second-Order Properties of a Mapped Pattern
In this final application we demonstrate the important steps in analyzing a mapped spatial
point pattern. We examine a pattern's first- and second-order properties through estimation
of the intensity and through K-function analysis. The hypothesis of complete spatial
randomness is tested by means of Monte Carlo tests based on nearest-neighbor distances.
The analysis is carried out with functions of the S+SpatialStats® module and S-PLUS®
functions developed by the authors. Nearest-neighbor Monte Carlo tests can also be
performed in SAS® with the macro %Ghatenv() contained in file
\SASMacros\NearestNeighbor.sas. The point pattern under consideration is shown in Figure
9.57. It contains n = 180 events located on a rectangle with boundary (0, 400) × (0, 200).
This is a simulated point pattern, but we shall not reveal the point pattern model that
generated it. Rather, we ask the reader to consider Figure 9.57 and query his/her intuition
whether the process that produced the pattern is completely random, clustered, or regular.
If the observed pattern is the realization of a CSR process, then the numbers of events in
nonoverlapping regions are independent and, furthermore, events are uniformly distributed.
The question of uniformity can be answered by estimating the intensity function

    λ(s) = lim_{|ds| → 0} E[N(ds)]/|ds|,

which represents the number of events per unit area. If the intensity does not vary with
spatial location, the process is first-order stationary (= homogeneous).
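Under complete spatial randomness this intensity is a constant λ and the event counts are
Poisson distributed: N(A) ~ Poisson(λ|A|) for any region A and, given N(A) = n, the n
events are distributed independently and uniformly over A.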

Figure 9.57. Mapped point pattern of n = 180 events on the (0, 400) × (0, 200) rectangle.

The common intensity estimators are discussed in §A9.9.10. The naïve estimator is
simply λ̂ = n/|A|, where |A| is the area of the domain considered. Here, n = 180 and
|A| = 400 × 200; hence λ̂ = 0.00225. This estimator does not vary with spatial location; it is
appropriate only if the process is homogeneous. Location-dependent estimators can be
obtained in a variety of ways. One can grid the domain and count the number of events in
each grid cell. This process is usually followed by some type of smoothing of the raw
counts. S+SpatialStats® terms this the binning estimator. One can also apply nonparametric
smoothing techniques such as kernel estimation (§A4.8.7) directly. The smoothness
(= spatial resolution) of binning estimators depends on the number of grid cells, and that of
kernel estimators on the choice of bandwidth (§4.7.2). The statements below calculate the
binning estimator on a 20 × 10 grid and kernel estimators with a gaussian weight function
for three different bandwidths (Figure 9.58).
par(mfrow=c(2,2))
image(intensity(sppattern,method="binning",nx=20,ny=10))
title(main="20*10 Binning w/ LOESS")
image(intensity(sppattern,method="gauss2d",bw=25))
title(main="Kernel, Bandwidth=25")
image(intensity(sppattern,method="gauss2d",bw=50))
title(main="Kernel, Bandwidth=50")
image(intensity(sppattern,method="gauss2d",bw=100))
title(main="Kernel, Bandwidth=100")

With increasing bandwidth the kernel smoother approaches the naïve estimator and the
location-dependent features of the intensity can no longer be discerned. The binning
estimator as well as the kernel smoothers with bandwidths 25 and 50 show a concentration
of events in the southeast and northwest corners of the area.


Figure 9.58. Spatially explicit intensity estimators for the point pattern in Figure 9.57.
Lighter colors correspond to higher intensity.

The first three panels of Figure 9.58 suggest that events tend to group in certain areas,
and that the process appears to be clustered. At this point there are three possible
explanations, and further progress depends on which is trusted.
1. The numbers of events in nonoverlapping areas are independent. There is no repulsion
   or attraction of events. Instead, the first-order intensity (the mean function) λ(s) is
   simply a function of spatial location. An inhomogeneous Poisson process is a
   reasonable model and it remains to estimate the intensity function λ(s).
2. The first-order intensity does not depend on the spatial location, i.e., λ(s) = λ. The
   grouping of events is due only to spatial interaction of events. The second-order
   properties of the point pattern (the spatial dependency) suffice to explain the
   nonhomogeneous distribution of events.
3. In addition to interactions among events, the first-order intensity is not constant.
3. In addition to interactions among events the first-order intensity is not constant.

The three conditions are roughly equivalent to the following scenarios for geostatistical
data: independent observations with large-scale variation in the mean (1), a constant mean
with spatial autocorrelation (2), and large-scale variation combined with spatial
autocorrelation (3). While random field models for geostatistical and lattice data allow the
separation of large-scale and small-scale spatial variation, less constructive theory is
available for point pattern analysis. Second-order methods for point pattern analysis require
stationarity of the intensity just as semivariogram analysis for geostatistical data requires
stationarity of the mean function. There we can either detrend the data or rely on methods
that simultaneously estimate the mean and second-order properties (e.g., maximum
likelihood). With point patterns, this separation is not straightforward. If we consider
explanation 2 we can proceed with examining the second-order properties (K-function) of
the process. Following explanations 1 or 3 this is not possible.

We developed an S-PLUS® function that provides a comprehensive analysis of a spatial
point pattern (function GAnalysis() on CD-ROM). The function performs four specific tasks.
1. It plots the realization of the spatial point pattern (upper left panel of Figure 9.59).
2. It calculates the empirical distribution function of nearest-neighbor distances and
   compares it to the theoretical distribution function of a CSR process (upper right
   panel of Figure 9.59).
3. It performs Monte Carlo simulations under the CSR hypothesis and graphs the upper
   and lower simulation envelopes of nearest-neighbor distances against the observed
   pattern (lower left panel of Figure 9.59). The empirical p-values of the tests against
   the clustered and regular alternatives are calculated; the test statistic is the average
   nearest-neighbor distance.
4. It estimates the K-function from the observed point pattern and plots
   L̂(h) − h = (K̂(h)/π)^{1/2} − h against distance h (lower right panel of Figure 9.59).
   We prefer this graph because a graph of K̂(h) against the CSR benchmark πh² often
   fails to reveal the subtle deviations from complete spatial randomness. The L̂(h) − h
   versus h plot amplifies the deviation from CSR visually. The CSR process is
   represented by a horizontal line at 0 in this plot. Clustering is indicated when
   L̂(h) − h rises above the zero line. Because the variance of K̂(h) increases sharply
   with h, interpretation of these graphs should be restricted to distances no greater than
   one half of the length of the shorter side of the bounding rectangle (here h = 100).
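Recall the definitions underlying item 4: for a stationary process with intensity λ,

    K(h) = λ⁻¹ E[number of extra events within distance h of an arbitrary event],

and under CSR K(h) = πh², so that L(h) − h = (K(h)/π)^{1/2} − h = 0 for all h. Values
above zero therefore point toward clustering, values below zero toward regularity.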

After making the GAnalysis() function available to S+SpatialStats®, all of these tasks
are accomplished by the function call GAnalysis(sppattern, n=180, sims=100, cluster=" ").
A descriptive string can be assigned to the cluster= argument, which will be shown on the
output (Figure 9.59; the argument was omitted here). Based on the Monte Carlo test of
nearest-neighbor distances with 100 simulations we conclude that the observed pattern
exhibits clustering. Among all 101 point patterns (one observed, one hundred simulated), the
observed pattern had the smallest average nearest-neighbor distance (rank = 1), leading to a
p-value of 0.0099 against the clustered alternative. The observed Ĝ function is close to the
upper simulation envelope and crosses it repeatedly. The L̂(h) − h plot shows the elevation
above the zero line that corresponds to a CSR process. The expected number of extra events
within distance h from an arbitrary event is larger than under CSR. To see a drop of the
L̂(h) − h plot below zero for larger distances is common in clustered processes. This occurs
when distances are larger than the cluster diameters and cover a lot of white space. Recall
the recommendation that L̂(h) − h not be interpreted for distances in excess of one half of
the length of the smaller side of the bounding rectangle. Up to h = 100 clustering of the
process is implied.
Having concluded that this is a clustered point pattern, based on the Ĝ analysis and the
L̂ function, we would like to know whether the conclusion is correct. It is indeed. The point
pattern in Figure 9.57 was simulated with S+SpatialStats® with the statements

set.seed(24)
sppattern <- make.pattern(n=180,process="cluster",radius=35,cpar=25,
boundary=bbox(x=c(0,400),y=c(0,200)))


Figure 9.59. Results of analyzing the point pattern in Figure 9.57 with GAnalysis().

The set.seed() statement fixes the seed of the random number generator at a given
value. Subsequent runs of the program with the same seed will produce identical point
patterns. The make.pattern() function simulates the realization of a particular point process.
Here, a cluster process is chosen with parameters radius=35 and cpar=25. Twenty-five
parent events are placed according to a homogeneous Poisson process. Around each parent,
offspring events are placed independently of each other within radius 35 of the parent
location. Finally, the parent events are deleted and only the offspring locations are retained.
This is known as a Poisson cluster process, special cases of which are the Neyman-Scott
processes (see §A9.9.11 and Neyman and Scott 1972). Although this is difficult to discern
from Figure 9.57, the process consists of 25 clusters. Furthermore, following explanation 2
above was the correct course of action: this Neyman-Scott process is a stationary process.
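A minimal sketch of simulating such a Poisson cluster process with a SAS data step follows.
This is our construction for illustration only: uniform placement of offspring in a disk is an
assumption, the exact offspring distribution of make.pattern() may differ, and events falling
outside the boundary are not handled.

data cluster;
   array px{25} _temporary_;
   array py{25} _temporary_;
   seed = 24;
   /* place 25 parent events uniformly on the (0,400) x (0,200) rectangle */
   do i = 1 to 25;
      px{i} = 400*ranuni(seed);
      py{i} = 200*ranuni(seed);
   end;
   /* each of the 180 offspring picks a parent at random and is placed
      uniformly in a disk of radius 35 around that parent */
   do event = 1 to 180;
      i = ceil(25*ranuni(seed));
      r = 35*sqrt(ranuni(seed));
      theta = 2*constant('pi')*ranuni(seed);
      x = px{i} + r*cos(theta);
      y = py{i} + r*sin(theta);
      output;
   end;
   keep x y;
run;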

Bibliography

Agresti, A. (1990) Categorical Data Analysis. John Wiley & Sons, New York
Akaike, H. (1974) A new look at the statistical model identification. IEEE Transactions on Automatic
Control, AC-19:716-723
Allen, D.M. (1974) The relationship between variable selection and data augmentation and a method of
prediction. Technometrics, 16:125-127
Allender, W.J. (1997) Effect of trifluoperazine and verapamil on herbicide stimulated growth of cotton.
Journal of Plant Nutrition, 20(1):69-80
Allender, W.J., Cresswell, G.C., Kaldor, J., and Kennedy, I.R. (1997) Effect of lithium and lanthanum
on herbicide induced hormesis in hydroponically-grown cotton and corn. Journal of Plant
Nutrition, 20:81-95
Amateis, R.L. and Burkhart, H.E. (1987) Cubic-foot volume equations for loblolly pine trees in cutover,
site-prepared plantations. Southern Journal of Applied Forestry, 11:190-192
Amemiya, T. (1973) Regression analysis when the variance of the dependent variable is proportional to
the square of its expectation. Journal of the American Statistical Association, 68:928-934
Anderson, J.A. (1984) Regression and ordered categorical variables. Journal of the Royal Statistical
Society (B), 46(1):1-30
Anderson, R.L. and Nelson, L.A. (1975) A family of models involving intersecting straight lines and
concomitant experimental designs useful in evaluating response to fertilizer nutrients. Biometrics,
31:303-318
Anderson, T.W. and Darling, D.A. (1954) A test of goodness of fit. Journal of the American Statistical
Association, 49:765-769
Andrews, D.F., Bickel, P.J., Hampel, F.R., Huber, P.J., Rogers, W.H., and Tukey, J.W. (1972) Robust
Estimates of Location: Survey and Advances. Princeton University Press, Princeton, NJ
Andrews, D.F. and Herzberg, A.M. (1985) Data. A Collection of Problems from Many Fields for the
Student and Research Worker. Springer-Verlag, New York.
Anscombe, F.J. (1948) The transformation of Poisson, binomial, and negative-binomial data.
Biometrika, 35:246-254
Anscombe, F.J. (1960) Rejection of outliers. Technometrics, 2:123-147
Anselin, L. (1995) Local indicators of spatial association — LISA. Geographical Analysis, 27:93-115
Armstrong, M. and Delfiner, P. (1980) Towards a more robust variogram: A case study on coal.
Technical Report N-671. Centre de Géostatistique, Fontainebleau, France
Baddeley, A.J. and Silverman, B.W. (1984) A cautionary example on the use of second-order methods
for analyzing point patterns. Biometrics, 40:1089-1093
Bailey, R.L. (1994) A compatible volume-taper model based on the Schumacher and Hall generalized
form factor volume equation. Forest Science, 40:303-313
Barnes, R.J. and Johnson, T.B. (1984) Positive kriging. In: Geostatistics for Natural Resource
Characterization Part 1 (Verly, G., David, M., Journel, A.G., and Maréchal, A., eds.), Reidel,
Dordrecht, The Netherlands, pp. 231-244
Barnett, V. and Lewis, T. (1994) Outliers in Statistical Data, 3rd ed. John Wiley & Sons, New York
Bartlett, M.S. (1937a) Properties of sufficiency and statistical tests. Proceedings of the Royal Statistical
Society, Series A, 160:268-282
Bartlett, M.S. (1937b) Some examples of statistical methods of research in agriculture and applied
biology. Journal of the Royal Statistical Society, Suppl., 4:137-183
Bartlett, M.S. (1938) The approximate recovery of information from field experiments with large
blocks. Journal of Agricultural Science, 28:418-427
Bartlett, M.S. (1978a) Nearest-neighbour models in the analysis of field experiments (with discussion).
Journal of the Royal Statistical Society (B), 40:147-174
Bartlett, M.S. (1978b) Stochastic Processes. Methods and Applications. Cambridge University Press,
London
Bates, D.M. and Watts, D.G. (1980) Relative curvature measures of nonlinearity. Journal of the Royal
Statistical Society (B), 42:1-25
Bates, D. M., and Watts, D.G. (1981) A relative offset orthogonality convergence criterion for nonlinear
least squares. Technometrics, 123:179-183.
Beale, E.M.L. (1960) Confidence regions in non-linear estimation. Journal of the Royal Statistical
Society (B), 22:41-88
Beaton, A.E. and Tukey, J.W. (1974) The fitting of power series, meaning polynomials, illustrated on
band-spectroscopic data. Technometrics, 16:147-185
Beck, D.E. (1963) Cubic-foot volume tables for yellow poplar in the southern Appalachians. USDA
Forest Service, Research Note SE-16.
Becker, M.P. (1989) Square contingency tables having ordered categories and GLIM. GLIM Newsletter
No. 19. Royal Statistical Society, NAG Group
Becker, M.P. (1990a) Quasisymmetric models for the analysis of square contingency tables. Journal of
the Royal Statistical Society (B), 52:369-378
Becker, M.P. (1990b) Algorithm AS 253; Maximum likelihood estimation of the RC(M) association
model. Applied Statistics, 39:152-167
Beltrami, E. (1998) Mathematics for Dynamic Modeling. 2nd ed. Academic Press, San Diego, CA
Berkson, J. (1950) Are there two regressions? Journal of the American Statistical Association, 45:164-
180
Besag, J.E. (1974) Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal
Statistical Society (B), 36:192-236
Besag, J.E. (1975) Statistical analysis of non-lattice data. The Statistician, 24:179-195.
Besag, J. and Kempton, R. (1986) Statistical analysis of field experiments using neighboring plots.
Biometrics, 42(2):231-251
Biging, G.S. (1985) Improved estimates of site index curves using a varying parameter model. Forest
Science, 31:248-259
Binford, G.D., Blackmer, A.M., and Cerrato, M.E. (1992) Relationship between corn yield and soil
nitrate in late spring. Agronomy Journal, 84:53-59
Birch, J.B. and Agard, D.B. (1993) Robust inference in regression: a comparative study.
Communications in Statistics  Simulation, 22(1):217-244
Black, C.A. (1993) Soil Fertility Evaluation and Control. Lewis Publishers, Boca Raton, FL
Blackmer, A.M., Pottker, D., Cerrato, M.E., and Webb, J. (1989) Correlations between soil nitrate
concentrations in late spring and corn yields in Iowa. Journal of Production Agriculture, 2:103-
109
Bleasdale, J.K.A. and Nelder, J.A. (1960) Plant population and crop yield. Nature, 188:342
Bleasdale, J.K.A. and Thompson, B. (1966) The effects of plant density and the pattern of plant
arrangement on the yield of parsnips. Journal of Horticultural Science 41:145-153
Bose, R.C. and Nair, K.R. (1939) Partially balanced incomplete block designs. Sankhya, 4:337-372
Bowman, D.T. (1990) Trend analysis to improve efficiency of agronomic trials in flue-cured tobacco.
Agronomy Journal, 82:499-501
Box, G.E.P. (1954a) Some theorems on quadratic forms applied in the study of analysis of variance
problems, I. Effects of inequality of variance in the one-way classification. Annals of
Mathematical Statistics, 25:290-302
Box, G.E.P. (1954b) Some theorems on quadratic forms applied in the study of analysis of variance
problems, II. Effects of inequality of variance and of correlations between errors in the two-way
classification. Annals of Mathematical Statistics, 25:484-498
Box, G.E.P. and Andersen, S.L. (1955) Permutation theory in the derivation of robust criteria and the
study of departures from assumption. Journal of the Royal Statistical Society (B), 17:1-26
Box, G.E.P. and Cox, D.R. (1964) The analysis of transformations. Journal of the Royal Statistical
Society (B), 26:211-252
Box, G.E.P. Jenkins, G.M., and Reinsel, G.C. (1994) Time Series Analysis: Forecasting and Control.
Prentice Hall, Englewood Cliffs, NJ
Bozdogan, H. (1987) Model selection and Akaike's information criterion (AIC): the general theory and
its analytical extensions. Psychometrika, 52:345-370
Brain, P. and Cousens, R. (1989) An equation to describe dose responses where there is stimulation of
growth at low doses. Weed Research, 29: 93-96
Breslow, N.E. and Clayton, D.G. (1993) Approximate inference in generalized linear mixed models.
Journal of the American Statistical Association, 88:9-25
Brown, M.B. and Forsythe, A.B. (1974) Robust tests for the equality of variances. Journal of the
American Statistical Association, 69:364-367
Brown, R.L., Durbin, J., and Evans, J.M. (1975) Techniques for testing the constancy of regression
relationships over time. Journal of the Royal Statistical Society (B), 37:149-192
Brownie, C., Bowman, D.T., and Burton, J.W. (1993) Estimating spatial variation in analysis of data
from yield trials: a comparison of methods. Agronomy Journal, 85:1244-1253
Brownie, C. and Gumpertz, M.L. (1997) Validity of spatial analysis for large field trials. Journal of
Agricultural, Biological, and Environmental Statistics, 2(1):1-23
Bunke, H. and Bunke, O. (1989) Nonlinear Regression, Functional Relationships and Robust Methods.
John Wiley & Sons, New York
Burkhart, H.E. (1977) Cubic-foot volume of loblolly pine to any merchantable top limit. Southern
Journal of Applied Forestry, 1:7-9
Carroll, R.J. and Ruppert, D. (1984) Power transformations when fitting theoretical models to data.
Journal of the American Statistical Association, 79:321-328
Carroll, R.J. Ruppert, D., and Stefanski, L.A. (1995) Measurement Error in Nonlinear Models.
Chapman and Hall, New York
Cerrato, M.E. and Blackmer, A.M. (1990) Comparison of models for describing corn yield response to
nitrogen fertilizer. Agronomy Journal, 82:138-143
Chapman, D.G. (1961) Statistical problems in population dynamics. In: Proceedings of the Fourth
Berkeley Symposium on Mathematical Statistics and Probability. University of California Press,
Berkeley
Chauvet, P. (1982) The variogram cloud. In: Proceedings of the 17th APCOM International Symposium.
Golden, CO, 757-764
Chilès, J.-P. and Delfiner, P. (1999) Geostatistics. John Wiley & Sons, New York
Cleveland, W.S. (1979) Robust locally weighted regression and smoothing scatterplots. Journal of the
American Statistical Association, 74:829-836
Cleveland, W.S., Devlin, S.J., and Grosse, E. (1988) Regression by local fitting. Journal of
Econometrics, 37:87-114
Cliff, A.D. and Ord, J.K. (1973) Spatial Autocorrelation. Pion, London
Cliff, A.D. and Ord, J.K. (1981) Spatial Processes; Models and Applications, Pion, London
Clutter, J.L., Fortson, J.C., Pienaar, L.V. Brister, G.H., and Bailey, R.L. (1992) Timber Management.
Krieger Publishing, Malabar, FL
Cochran, W.G. (1941) The distribution of the largest of a set of estimated variances as a fraction of their
total. Annals of Eugenics, 11:47-52
Cochran, W.G. (1954) Some methods for strengthening the common χ² tests. Biometrics, 10:417-451
Cochran, W.G. and Cox, G.M. (1957) Experimental Designs, 2nd ed. John Wiley & Sons, New York
Cochrane, D. and Orcutt, G.H. (1949) Applications of least square regression to relationships containing
autocorrelated error terms. Journal of the American Statistical Association, 44:32-61
Cole, J.W.L. and Grizzle, J.E. (1966) Applications of multivariate analysis of variance to repeated
measures experiments. Biometrics, 22:810-828
Cole, T.J. (1975) Linear and proportional regression models in the prediction of ventilatory function.
Journal of the Royal Statistics Society (A), 138:297-333
Coleman, D., Holland, P., Kaden, N., Klema, V., and Peters, S. C. (1980) A system of subroutines for
iteratively re-weighted least-squares computations. ACM Transactions on Mathematical
Software, 6:327-336.
Colwell, J.D., Suhet, A.R., and Van Raij, B. (1988) Statistical procedures for developing general soil
fertility models for variable regions. Report No. 93, CSIRO Division of Soils (Australia),
Cook, R.D. (1977) Detection of influential observations in linear regression. Technometrics, 19:15-18
Cook, R.D. and Tsai, C.-L. (1985) Residuals in nonlinear regression. Biometrika, 72:23-29
Corbeil, R.R. and Searle, S.R. (1976) A comparison of variance component estimators, Biometrics,
32:779-791
Courtis, S.A. (1937) What is a growth cycle? Growth, 1:247-254
Cousens, R. (1985) A simple model relating yield loss to weed density. Annals of Applied Biology,
107:239-252
Cox, C. (1988) Multinomial regression models based on continuation ratios. Statistics in Medicine,
7:435-441.
Cox, D.R. and Snell, E.J. (1989) The Analysis of Binary Data, 2nd ed. Chapman and Hall, London
Craig, J.R., Edwards, D., Rimstidt, J.D., Scanlon, P.F., Collins, T.K., Schabenberger, O., and Birch, J.B.
(2002) Lead distribution on a public shotgun range. Environmental Geology, 41:873-882
Craven, P. and Wahba, G. (1979) Smoothing noisy data with spline functions. Numerical Mathematics,
31:377-403
Cressie, N. (1985) Fitting variogram models by weighted least squares. Journal of the International
Association for Mathematical Geology, 17:563-586
Cressie, N.A.C. (1986) Kriging nonstationary data. Journal of the American Statistical Association,
81:625-634
Cressie, N.A.C. (1993) Statistics for Spatial Data. Revised Ed. John Wiley & Sons, New York
Cressie, N.A.C. and Hawkins, D.M. (1980) Robust estimation of the variogram, I. Journal of the
International Association for Mathematical Geology, 12:115-125
Crowder, M.J. and Hand, D.J. (1990) Analysis of Repeated Measures. Chapman and Hall, New York
Curriero, F.C. and Lele, S. (1999) A composite likelihood approach to semivariogram estimation.
Journal of Agricultural, Biological, and Environmental Statistics, 4(1):9-28
Davidian, M. and Giltinan, D.M. (1993) Some general estimation methods for nonlinear mixed-effects
models. Journal of Biopharmaceutical Statistics, 3(1):23-55
Davidian, M. and Giltinan, D.M. (1995) Nonlinear Models for Repeated Measurement Data. Chapman
and Hall, New York
Delfiner, P. (1976) Linear estimation of nonstationary spatial phenomena. In: Advanced Geostatistics in
the Mining Industry (M. Guarascio, M. David, C. Huijbregts, eds.), Reidel, Dordrecht, The
Netherlands, pp. 49-68
Delfiner, P., Renard D., and Chilès, J.P. (1978) Bluepack-3D Manual, Centre de Geostatistique,
Fontainebleau, France
Diggle, P. (1983) Statistical Analysis of Spatial Point Patterns. Academic Press, London
Diggle, P.J. (1988) An approach to the analysis of repeated measurements. Biometrics, 44:959-971
Diggle, P.J. (1990) Time Series: A Biostatistical Introduction. Clarendon Press, Oxford, UK
Diggle, P., Besag, J.E. and Gleaves, J.T. (1976) Statistical analysis of spatial patterns by means of
distance methods. Biometrics, 32:659-667
Diggle, P.J., Liang, K.-Y., and Zeger, S.L. (1994) Analysis of Longitudinal Data. Clarendon Press,
Oxford, UK
Draper, N.R. and Smith, H. (1981) Applied Regression Analysis. 2nd ed. John Wiley & Sons, New York
Dunkl, C.F. and Ramirez, D.E. (2001) Computation of the generalized F distribution. The Australian
and New Zealand Journal of Statistics, 43:21-31
Durbin, J. and Watson, G.S. (1950) Testing for serial correlation in least squares regression. I.
Biometrika, 37:409-428
Durbin, J. and Watson, G.S. (1951) Testing for serial correlation in least squares regression. II.
Biometrika, 38:159-178
Durbin, J. and Watson, G.S. (1971) Testing for serial correlation in least squares regression. III.
Biometrika, 58:1-19
Eisenhart, C. (1947) The assumptions underlying the analysis of variance. Biometrics, 3:1-21
Engel, J. (1988) Polytomous logistic regression. Statistica Neerlandica, 42(4):233-252.
Emerson, J.D. and Hoaglin, D.C. (1983) Analysis of two-way tables by medians. In: Understanding
Robust and Exploratory Data Analysis (Hoaglin D.C., Mosteller, F., and Tukey, J.W., eds.), John
Wiley & Sons, New York, pp. 166-207
Emerson, J.D. and Wong, G.Y. (1985) Resistant nonadditive fits for two-way tables. In: Exploring
Data Tables, Trends, and Shapes (Hoaglin, D.C., Mosteller, F., and Tukey, J.W., eds.), John
Wiley & Sons, New York, pp. 67-124
Engelstad, O.P. and Parks, W.L. (1971) Variability in optimum N rates for corn. Agronomy Journal,
63:21-23
Epanechnikov, V. (1969) Nonparametric estimates of a multivariate probability density. Theory of
Probability and its Applications, 14:153-158
Eubank, R.L. (1988) Spline Smoothing and Nonparametric Regression. Marcel Dekker, New York
Fahrmeir, L. and Tutz, G. (1994) Multivariate Statistical Modelling Based on Generalized Linear
Models. Springer-Verlag, New York
Federer, W.T. and Schlottfeldt, C.S. (1954) The use of covariance to control gradients in experiments.
Biometrics, 10:282-290
Fedorov, V.V. (1974) Regression problems with controllable variables subject to error. Biometrika,
61:49-56
Fieller, E.C. (1940) The biological standardization of insulin. Journal of the Royal Statistical Society
(Suppl.), 7:1-64
Fienberg, S.E. (1980) The Analysis of Cross-classified Categorical Data. MIT Press, Cambridge, MA
Finney, D.J. (1978) Statistical Methods in Biological Assay, 3rd ed. Macmillan, New York
Firth, D. (1988) Multiplicative errors: log-normal or gamma. Journal of the Royal Statistical Society
(B), 50:266-268
Fisher, R.A. (1935) The Design of Experiments. Oliver and Boyd, Edinburgh
Fisher, R.A. (1947) The Design of Experiments, 4th ed. Oliver and Boyd, Edinburgh
Folks, J.L. and Chhikara, R.S. (1978) The inverse Gaussian distribution and its statistical application: a
review. Journal of the Royal Statistical Society (B), 40:263-275
Freney, J.R. (1965) Increased growth and uptake of nutrients by corn plants treated with low levels of
simazine. Australian Journal of Agricultural Research, 16:257-263
Gabriel, K.R. (1962) Ante-dependence analysis of an ordered set of variables. Annals of Mathematical
Statistics, 33:201-212
Gallant, A.R. (1975) Nonlinear regression. The American Statistician, 29:73-81
Gallant, A.R. (1987) Nonlinear Statistical Models. John Wiley & Sons, New York
Gallant, A.R. and Fuller, W.A. (1973) Fitting segmented poynomial regression models whose join
points have to be estimated. Journal of the American Statistical Association, 68:144-147
Galpin, J.S.and Hawkins, D.M. (1984) The use of recursive residuals in checking model fit in linear
regression. The American Statistician, 38(2):94-105
Galton, F. (1886) Regression towards mediocrity in hereditary stature. Journal of the Anthropological
Institute, 15:246-263
Gayen, A.K. (1950) The distribution of the variance ratio in random samples of any size drawn from
non-normal universes. Biometrika, 37:236-255
Geary, R.C. (1947) Testing for normality. Biometrika, 34:209-242
Geary, R.C. (1954) The contiguity ratio and statistical mapping. The Incorporated Statistician, 5:115-
145
Geisser, S. and Greenhouse, S.W. (1958) An extension of Box's results on the use of the F-distribution
in multivariate analysis. Annals of Mathematical Statistics, 29:885-891
Gerrard, D.J. (1969) Competition quotient: a new measure of the competition affecting individual forest
trees. Research Bulletin No. 20, Michigan Agricultural Experiment Station, Michigan State
University
Gillis, P.R. and Ratkowsky, D.A. (1978) The behaviour of estimators of the parameters of various yield-
density relationships. Biometrics, 34:191-198
Gilmour, A.R., Cullis, B.R., and Verbyla, A.P. (1997) Accounting for natural and extraneous variation
in the analysis of field experiments. Journal of Agricultural, Biological, and Environmental
Statistics, 2(3):269-293
Godambe, V.P. (1960) An optimum property of regular maximum likelihood estimation. Annals of
Mathematical Statistics, 31:1208-1211
Golden, M.S., Knowe, S.A., and Tuttle, C.L. (1982) Cubic-foot volume for yellow-poplar in the hilly
coastal plain of Alabama. Southern Journal of Applied Forestry, 6:167-171
Goldberg, R.R. (1961) Fourier Transforms. Cambridge University Press, Cambridge
Goldberger, A.S. (1962) Best linear unbiased prediction in the generalized linear regression model,
Journal of the American Statistical Association, 57:369-375
Gompertz, B. (1825) On the nature of the function expressive of the law of human mortality, and on a
new method of determining the value of life contingencies. Philosophical Transactions of the
Royal Society of London, 115:513-585
Goodman, L.A. (1979a) Simple models for the analysis of association in cross-classifications having
ordered categories. Journal of the American Statistical Association, 74:537-552
Goodman, L.A. (1979b) Multiplicative models for square contingency tables with ordered categories.
Biometrika, 66:413-418
Goodman, L.A. (1985) The analysis of cross-classified data having ordered and/or unordered categories:
association models, correlation models, and asymmetry models for contingency tables with or
without missing entries. Annals of Statistics, 13:10-69
Goovaerts, P. (1997) Geostatistics for Natural Resources Evaluation. Oxford University Press, New
York
Goovaerts, P. (1998) Ordinary cokriging revisited. Journal of the International Association of
Mathematical Geology, 30:21-42
Gotway, C.A. and Stroup, W.W. (1997) A generalized linear model approach to spatial data analysis
and prediction. Journal of Agricultural, Biological, and Environmental Statistics, 2(2):157-178.
Graybill, F.A. (1969) Matrices with Applications in Statistics. 2nd ed. Wadsworth International,
Belmont, CA.
Greenhouse, S.W. and Geisser, S. (1959) On methods in the analysis of profile data. Psychometrika,
32:95-112
Greenwood, C. and Farewell, V. (1988) A comparison of regression models for ordinal data in an
analysis of transplanted-kidney function. Canadian Journal of Statistics, 16(4):325-335.
Gregoire, T.G. (1985) Generalized error structure for yield models fitted with permanent plot data.
Ph.D. dissertation, Yale University, New Haven, CT
Gregoire, T.G. (1987) Generalized error structure for forestry yield models. Forest Science, 33:423-444
Gregoire, T.G., Brillinger, D.R., Diggle, P.J., Russek-Cohen, E., Warren, W.G., and Wolfinger, R.D.
(eds). (1997) Modelling Longitudinal and Spatially Correlated Data. Springer-Verlag, New
York, 402 pp.
Gregoire, T.G., Schabenberger, O., and Barrett, J.P. (1995) Linear modelling of irregularly spaced,
unbalanced, longitudinal data from permanent plot measurements. Canadian Journal of Forest
Research, 25(1):137-156
Gregoire, T.G. and Schabenberger, O. (1996a) Nonlinear mixed-effects modeling of cumulative bole
volume with spatially correlated within-tree data. Journal of Agricultural, Biological, and
Environmental Statistics, 1(1):107-119
Gregoire, T.G. and Schabenberger, O. (1996b) A non-linear mixed-effects model to predict cumulative
bole volume of standing trees. Journal of Applied Statistics, 23(2&3):257-271
Griffith, D.A. (1996) Some guidelines for specifying the geographic weights matrix contained in Spatial
statistical models. In: Practical Handbook of Spatial Statistics (S.L. Arlinghaus, ed.), CRC Press,
Boca Raton, FL, pp. 65-82
Griffith, D.A. and Layne, L.J. (1999) A Casebook for Spatial Statistical Data Analysis. A Compilation
of Analyses of Different Thematic Data Sets. Oxford University Press, New York
Grondona, M.O. and Cressie, N.A. (1991) Using spatial considerations in the analysis of experiments.
Technometrics, 33:381-392
Härdle, W. (1990) Applied Nonparametric Regression. Cambridge University Press, Cambridge
Haining, R. (1990) Spatial Data Analysis in the Social and Environmental Sciences. Cambridge
University Press, Cambridge
Hampel, F.R. (1974) The influence curve and its role in robust estimation. Journal of the American
Statistical Association, 69:383-393
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., and Stahel, W.A. (1986) Robust Statistics, The
Approach Based on Influence Functions. John Wiley & Sons, New York
Hanks, R.J., Sisson, D.V., Hurst, R.L., and Hubbard, K.G. (1980) Statistical analysis of results from
irrigation experiments using the line source sprinkler system. Journal of the American Soil
Science Society, 44:886-888
Harris, T.R. and Johnson, D.E. (1996) A regression model with spatially correlated errors for comparing
remote sensing and in-situ measurements of a grassland site. Journal of Agricultural, Biological,
and Environmental Statistics, 1:190-204
Hart, L.P. and Schabenberger, O. (1998) Variability of vomitoxin in a wheat scab epidemic. Plant
Disease, 82:625-630.
Hartley, H.O. (1950) The maximum F-ratio as a short-cut test for heterogeneity of variance.
Biometrika, 31:249-255
Hartley, H.O. (1961) The modified Gauss-Newton method for the fitting of nonlinear regression
functions by least squares. Technometrics, 3:269-280
Hartley, H.O. (1964) Exact confidence regions for the parameters in nonlinear regression laws.
Biometrika, 51:347-353
Hartley, H.O. and Booker, A. (1965) Nonlinear least square estimation. Annals of Mathematical
Statistics, 36(2):638-650
Harville, D.A. (1974) Bayesian inference for variance components using only error contrasts.
Biometrika, 61:383-385
Harville, D.A. (1976a) Extension of the Gauss-Markov theorem to include the estimation of random
effects. The Annals of Statistics, 4:384-395
Harville, D.A. (1976b) Confidence intervals and sets for linear combinations of fixed and random
effects. Biometrics, 32:320-395
Harville, D.A. (1977) Maximum-likelihood approaches to variance component estimation and to related
problems. Journal of the American Statistical Association, 72:320-340
Harville, D.A. and Jeske, D.R. (1992) Mean squared error of estimation or prediction under a general
linear model. Journal of the American Statistical Association, 87:724-731
Hastie, T.J. and Tibshirani, R.J. (1990) Generalized Additive Models. Chapman and Hall, New York
Haseman, J.K. and Kupper, L.L. (1979) Analysis of dichotomous response data from certain toxicolo-
gical experiments. Biometrics, 35:281-293
Hayes, W.L. (1973) Statistics for the Social Sciences. Holt, Rinehart and Winston, New York
Heagerty, P.J. and Lele, S.R. (1998) A composite likelihood approach to binary spatial data. Journal of
the American Statistical Association, 93:1099-1111
Healy, M.J.R. (1986) Matrices for Statistics. Clarendon Press, Oxford, UK
Hearn, A.B. (1972) Cotton spacing experiments in Uganda. Journal of Agricultural Science, 48:19-28
Hedeker, D. and Gibbons, R.D. (1994) A random effects ordinal regression model for multilevel analy-
sis. Biometrics, 50:933-944
Henderson, C.R. (1950) The estimation of genetic parameters. The Annals of Mathematical Statistics,
21:309-310
Henderson, C.R. (1963) Selection index and expected genetic advance. In: Statistical Genetics and
Plant Breeding (NRC Publication 982), Washington, D.C. National Academy of Sciences, pp.
141-163
Henderson, C.R. (1973) Sire evaluation and genetic trends. In: Proceedings of the Animal Breeding and
Genetics Symposium in Honor of Dr. J.L. Lush, Champaign, IL: ASAS and ADSA, pp. 10-41
Heyde, C.C. (1997) Quasi-likelihood and Its Application: A General Approach to Optimal Parameter
Estimation. Springer-Verlag, New York
Himmelblau, D.M. (1972) A uniform evaluation of unconstrained optimization techniques. In:
Numerical Methods for Nonlinear Optimization (F.A. Lootsma, ed.), Academic Press, London
Hinkelmann, K. and Kempthorne, O. (1994) Design and Analysis of Experiments. Volume I.
Introduction to Experimental Design. John Wiley & Sons, New York
Hoerl, A.E. and Kennard, R.W. (1970a) Ridge regression: biased estimation for nonorthogonal
problems. Technometrics, 12:55-67
Hoerl, A.E. and Kennard, R.W. (1970b) Ridge regression: applications to nonorthogonal problems.
Technometrics, 12:69-82
Holland, P.W. and Welsch, R.E. (1977) Robust regression using iteratively reweighted least squares.
Communications in Statistics A, 6:813-888
Holliday, R. (1960) Plant population and crop yield: Part I. Field Crop Abstracts, 13:159-167
Hoshmand, A.R. (1994) Experimental Research Design and Analysis. CRC Press, Boca Raton, FL
Hsiao, A.I., Liu, S.H. and Quick, W.A. (1996) Effect of ammonium sulfate on the phytotoxicity, foliar
uptake, and translocation of imazamethabenz in wild oat. Journal of Plant Growth Regulation,
15:115-120
Huber, P.J. (1981) Robust Statistics. John Wiley & Sons, New York
Huber, P.J. (1964) Robust estimation of a location parameter. Annals of Mathematical Statistics, 35:
73-101
Huber, P.J. (1973) Robust regression: asymptotics, conjectures, and Monte Carlo. Annals of Statistics,
1:799-821
Hurvich, C.M. and Simonoff, J.S. (1998) Smoothing parameter selection in nonparametric regression
using an improved Akaike information criterion. Journal of the Royal Statistical Society (B),
60:271-293
Huxley, J.S. (1932) Problems of Relative Growth. Dial Press, New York
Huynh, H. and Feldt, L.S. (1970) Conditions under which mean square ratios in repeated measurements
designs have exact F-distributions. Journal of the American Statistical Association, 65:1582-1589
Huynh, H. and Feldt, L.S. (1976) Estimation of the Box correction for degrees of freedom from sample
data in the randomized block and split plot designs. Journal of Educational Statistics, 1:69-82
Isaaks, E. and Srivastava, R. (1989) An Introduction to Applied Geostatistics. Oxford University Press,
New York
Jansen, J. (1990) On the statistical analysis of ordinal data when extravariation is present. Applied
Statistics, 39:75-84
Jennrich, R.J. and Schluchter, M.D. (1986) Unbalanced repeated-measures models with structured
covariance matrices. Biometrics, 42:805-820
Jensen, D.R. and Ramirez, D.E. (1998) Some exact properties of Cook's D_I. In: Handbook of Statistics,
Vol. 16 (Balakrishnan, N. and Rao, C.R., eds.), pp. 387-402. Elsevier Science Publishers,
Amsterdam
Jensen, D.R. and Ramirez, D.E. (1999) Recovered errors and normal diagnostics in regression. Metrica,
49:107-119
Johnson, N.L., Kotz, S., and Kemp, A.W. (1992) Univariate Discrete Distributions, 2nd. ed., John
Wiley & Sons, New York
Johnson, N.L., Kotz, S. and Balakrishnan, N. (1995) Univariate Continuous Distributions, Vol. 2, 2nd
ed. Wiley and Sons, New York
Jones, R.H. (1993) Longitudinal Data with Serial Correlation: A State-space Approach. Chapman and
Hall, New York
Jones, R.H. and Boadi-Boateng, F. (1991) Unequally spaced longitudinal data with AR(1) serial corre-
lation. Biometrics, 47:161-176
Journel, A.G. and Huijbregts, C.J. (1978) Mining Geostatistics. Academic Press, London
Kackar, R.N. and Harville, D.A. (1984) Approximations for standard errors of fixed and random effects
in mixed linear models. Journal of the American Statistical Association, 79:853-862
Kaluzny, S.P., Vega, S.C., Cardoso, T.P., and Shelly, A.A. (1998) S+ SpatialStats. User's Manual for
Windows® and Unix. Springer-Verlag, New York
Kempthorne, O. (1952) Design and Analysis of Experiments. John Wiley & Sons, New York
Kempthorne, O. (1955) The randomization theory of experimental inference. Journal of the American
Statistical Association, 50:946-967
Kempthorne, O. (1975) Fixed and mixed model analysis of variance. Biometrics, 31:473-486
Kempthorne, O. and Doerfler, T.E. (1969) The behaviour of some significance tests under randomiza-
tion. Biometrika, 56:231-248
Kendall, M.G. and Stuart, A. (1961) The Advanced Theory of Statistics, Vol 2. Griffin, London
Kenward, M.G. (1987) A method for comparing profiles of repeated measurements. Applied Statistics,
36:296-308
Kenward, M.G. and Roger, J.H. (1997) Small sample inference for fixed effects from restricted
maximum likelihood. Biometrics, 53:983-997
Kianifard, F. and Swallow, W. H. (1996) A review of the development and application of recursive
residuals in linear models. Journal of the American Statistical Association, 91:391-400
Kirby, E.J.M. (1974) Ear development in spring wheat. Journal of Agricultural Science, 82:437-447
Kirk, H.J., Haynes, F.L., and Monroe, R.J. (1980) Application of trend analysis to horticultural field
trials. Journal of the American Society of Horticultural Science, 105:189-193
Kirk, R.E. (1995) Experimental Design: Procedures for the Behavioral Sciences, 3rd ed., Duxbury
Press, Belmont, CA
Kitanidis, P.K. (1983) Statistical estimation of polynomial generalized covariance functions and hydro-
logical applications. Water Resources Research, 19:909-921
Kitanidis, P.K. and Lane, R.W. (1985) Maximum likelihood parameter estimation of hydrological spa-
tial processes by the Gauss-Newton method. Journal of Hydrology, 79:53-71
Kitanidis, P.K. and Vomvoris, E.G. (1983) A geostatistical approach to the inverse problem in ground-
water modeling (steady state) and one-dimensional simulations. Water Resources Research,
19:677-690
Knoebel, B.R., Burkhart, H.E., and Beck, D.E. (1984) Stem volume and taper functions for yellow-
poplar in the southern Appalachians. Southern Journal of Applied Forestry, 8:185-188
Korn, E.L. and Whittemore, A.S. (1979) Methods for analyzing panel studies of acute health effects of
air pollution. Biometrics, 35:795-802
Kvålseth, T.O. (1985) Cautionary note about R². The American Statistician, 39(4):279-285
Läärä, E. and Matthews, J. N. S. (1985) The equivalence of two models for ordinal data. Biometrika,
72:206-207.
Lærke, P.E. and Streibig, J.C. (1995) Foliar absorption of some glyphosate formulations and their
efficacy on plants. Pesticide Science, 44:107-116
Laird, A.K. (1965) Dynamics of relative growth. Growth, 29:249-263
Laird, N.M. (1988) Missing data in longitudinal studies. Statistics in Medicine, 7:305-315
Laird, N.M. and Louis, T.A. (1982) Approximate posterior distributions for incomplete data problems.
Journal of the Royal Statistical Society (B), 44:190-200
Laird, N.M. and Ware, J.H. (1982) Random-effects models for longitudinal data. Biometrics, 38:963-
974
Lee, K.R. and Kapadia, C.H. (1984) Variance component estimators for the balanced two-way mixed
model. Biometrics, 40:507-512
Lele, S. (1997) Estimating functions for semivariogram estimation. In: Selected Proceedings of the
Symposium on Estimating Functions (I.V. Basawa, V.P. Godambe, and R.L. Taylor, eds.),
Hayward, CA: Institute of Mathematical Statistics, pp. 381-396.
Lerman, P.M. (1980) Fitting segmented regression models by grid search. Applied Statistics, 29:77-84
Levenberg, K. (1944) A method for the solution of certain problems in least squares. Quarterly Journal
of Applied Mathematics, 2:164-168
Levene, H. (1960) Robust test for equality of variances. In Contributions to Probability and Statistics,
I. Olkin (ed.). pp. 278-292. Stanford University Press, Stanford, CA
Lewis, P.A.W. and Shedler, G.S. (1979) Simulation of non-homogeneous Poisson processes by
thinning. Naval Research Logistics Quarterly, 26:403-413
Liang, K.-Y. and Zeger, S.L. (1986) Longitudinal data analysis using generalized linear models.
Biometrika, 73:13-22
Liang, K.-Y., Zeger, S.L., and Qaqish, B. (1992) Multivariate regression analysis for categorical data.
Journal of the Royal Statistical Society (B), 54:3-40
Lindsay, B.G. (1988), Composite likelihood methods. Contemporary Mathematics, 80:221-239
Lindstrom, M.J. and Bates, D.M. (1988) Newton-Raphson and EM algorithms for linear mixed-effects
models for repeated measures data. Journal of the American Statistical Society, 83:1014-1022
Lindstrom, M.J. and Bates, D.M. (1990) Nonlinear mixed effects models for repeated measures data.
Biometrics, 46:673-687
Littell, R.C., Milliken, G.A., Stroup, W.W., and Wolfinger, R.D. (1996) SAS® System for Mixed
Models. SAS Institute Inc., Cary, NC
Little, R.J. and Rubin, D.B. (1987) Statistical Analysis with Missing Data. John Wiley & Sons, New
York
Longford, N.T. (1993) Random Coefficient Models. Clarendon Press, Oxford, UK
Lumer, H. (1937) The consequences of sigmoid growth for relative growth functions. Growth, 1:140-
154
Machiavelli, R.E. and Arnold, S.F. (1994) Variable order antedependence models. Communications in
Statistics - Theory and Methods, 23:2683-2699
Maddala, G.S. (1983) Limited-Dependent and Qualitative Variables in Econometrics. Cambridge
University Press, Cambridge, MA
Magee, L. (1990) R² measures based on Wald and likelihood ratio joint significance tests. The Ameri-
can Statistician, 44:250-253
Magnus, J.R. (1988) Matrix Differential Calculus with Applications in Statistics and Econometrics.
John Wiley & Sons, New York
Mallows, C.L. (1973) Some comments on Cp. Technometrics, 15:661-675
Marquardt, D.W. (1963) An algorithm for least squares estimation of nonlinear parameters, Journal of
the Society for Industrial and Applied Mathematics, 2:431-441
Matheron, G. (1962) Traite de Geostatistique Appliquee, Tome I. Memoires du Bureau de Recherches
Geologiques et Minieres, No. 14. Editions Technip, Paris
Matheron, G. (1963) Principles of geostatistics. Economic Geology, 58:1246-1266
Matheron, G. (1971) The theory of regionalized variables and its applications. Cahiers du Centre de
Morphologie Mathematique, No. 5. Fontainebleau, France
Mays, J., Birch, J.B., and Starnes, B. (2001) Model robust regression: combining parametric, nonpara-
metric, and semiparametric methods. Journal of Nonparametric Statistics, 13:245-277
McCullagh, P. (1980) Regression models for ordinal data. Journal of the Royal Statistical Society (B),
42:109-142.
McCullagh, P. (1983) Quasi-likelihood functions. The Annals of Statistics, 11:59-67
McCullagh, P. (1984) On the elimination of nuisance parameters in the proportional odds model. Jour-
nal of the Royal Statistical Society (B), 46:250-256.
McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd ed. Chapman and Hall, New
York
McKean, J.W. and Schrader, R.M. (1987) Least absolute errors analysis of variance. In: Statistical Data
Analysis Based on the L1-Norm and Related Methods (Dodge, Y., ed.), North-Holland, New York
McLean, R.A., Sanders, W.L., and Stroup, W.W. (1991) A unified approach to mixed linear models.
The American Statistician, 45:54-64
McPherson, G. (1990) Statistics in Scientific Investigation. Springer-Verlag, New York
McShane, L.M., Albert, P.S., and Palmatier, M.A. (1997) A latent process regression model for spatially
correlated count data. Biometrics, 53:698-706
Mead, R. (1967) A mathematical model for the estimation of inter-plant competition. Biometrics,
23:189-205
Mead, R. (1970) Plant density and crop yield. Applied Statistics, 19:64-81
Mead, R. (1979) Competition experiments. Biometrics, 35:41-54
Mead, R. Curnow, R.N. and Hasted, A.M. (1993) Statistical Methods in Agriculture and Experimental
Biology, 2nd ed. Chapman and Hall/CRC Press LLC, New York and Boca Raton, FL
Mercer, W.B. and Hall, A.D. (1911) The experimental error of field trials. Journal of Agricultural
Science, 4:107-132
Miller, M.D., Mikkelsen, D.S., and Huffaker, R.C. (1962) Effects of stimulatory and inhibitory levels of
2,4-D, iron, and chelate supplements on juvenile growth of field beans. Crop Science, 2:111-114
Milliken, G.A. and Johnson, D.E. (1992) Analysis of Messy Data. Volume 1: Designed Experiments.
Chapman and Hall, New York
Minot, C.S. (1908) The Problem of Age, Growth and Death: A Study of Cytomorphosis. Knickerbocker
Press, New York
Mitscherlich, E.A. (1909) Das Gesetz des Minimums und das Gesetz des Abnehmenden Bodenertrags.
Zeitschrift für Pflanzenernährung, Düngung und Bodenkunde, 12:273-282
Moore, E.H. (1920) On the reciprocal of the general algebraic matrix. Bulletin of the American Mathe-
matical Society, 26:394-395
Moran, P.A.P. (1948) The interpretation of statistical maps. Journal of the Royal Statistical Society (B),
10:243-251
Moran, P.A.P. (1950) Notes on continuous stochastic phenomena. Biometrika, 37:17-23
Moran, P.A.P. (1971) Estimating structural and functional relationships. Journal of Multivariate
Analysis, 1:232-255
Morgan, P.H., Mercer, L.P., and Flodin, N.W. (1975) General model for nutritional responses of higher
organisms. Proceedings of the National Academy of Science, USA, 72:4327-4331
Morris, G.L. and Odell, P.L. (1968) A characterization for generalized inverses of matrices. SIAM
Review, 10(2):208-211
Mueller, T.G. (1998) Accuracy of soil property maps for site-specific management. Ph.D. dissertation,
Michigan State University, East Lansing, MI (Diss Abstr. 99-22353, Diss Abstr. Int. 60B:0901)
Mueller, T.G., Pierce, F.J., Schabenberger, O., and Warncke, D.D. (2001) Map quality for site-specific
fertility management. Journal of the Soil Science Society of America, 65: 1547-1558
Myers, R.H. (1990) Classical and Modern Regression with Applications, 2nd ed. Duxbury Press,
Boston
Nadaraya, E.A. (1964). On estimating regression. Theory of Probability and its Applications, 10:186-
190
Nagelkerke, N.J.D. (1991) A note on a general definition of the coefficient of determination.
Biometrika, 78:691-692
Nelder, J.A. and Wedderburn, R.W.M. (1972) Generalized linear models. Journal of the Royal
Statistical Society (A), 135:370-384
Neter, J., Wasserman, W., and Kutner, M.H. (1990) Applied Linear Statistical Models. 3rd ed., Irwin,
Boston, MA
Neuman, S.P. and Jacobson, E.A. (1984) Analysis of nonintrinsic spatial variability by residual kriging
with applications to regional groundwater levels. Journal of the International Association of
Mathematical Geology, 16:499-521
Newberry, J.D. and Burk, T.E. (1985) SB distribution-based models for individual tree merchantable
volume-total volume ratios. Forest Science, 31:389-398
Neyman, J. and Scott, E.L. (1972) Processes of clustering and applications. In: Stochastic Point
Processes (P.A.W. Lewis, ed.). Wiley and Sons, New York, pp. 646-681
Nichols, M.A. (1974a) Effect of sowing rate and fertilizer application on the yield of dwarf beans. New
Zealand Journal of Experimental Agriculture, 2:155-158
Nichols, M.A. (1974b) A plant spacing study with sweet corn. New Zealand Journal of Experimental
Agriculture, 2:377-379
Nichols, M.A. and Nonnecke, I.L. (1974) Plant spacing studies with processing peas in Ontario, Canada.
Scientia Horticulturae 2:112-122
Nichols, M.A., Nonnecke, I.L., and Pathak, S.C. (1973) Plant density studies with direct seeded
tomatoes in Ontario, Canada. Scientiae Horticulturae, 1:309-320
Olkin, I., Gleser, L.J., and Derman, C. (1978) Probability Models and Applications. Macmillan
Publishing, New York
Ord, J.K. (1975) Estimation methods for models of spatial interaction. Journal of the American Statisti-
cal Association, 70:120-126
Papadakis, J.S. (1937) Méthode statistique pour des expériences sur champ. Bull. Inst. Amelior. Plant.
Thessalonique, 23
Patterson, H.D. and Thompson, R. (1971) Recovery of inter-block information when block sizes are un-
equal. Biometrika, 58:545-554
Pázman, A. (1993) Nonlinear Statistical Models. Kluwer Academic Publishers, London
Pearl, R. and Reed, L.J. (1924) The probable error of certain constraints of the population growth curve.
American Journal of Hygiene, 4(3):237-240
Pearson, E.S. (1931) The analysis of variance in case of non-normal variation. Biometrika, 23:114-133
Penrose, R. (1955) A generalized inverse for matrices. Proceedings of the Cambridge Philosophical
Society, 51:406-413
Petersen, R.G. (1994) Agricultural Field Experiments. Design and Analysis. Marcel Dekker, New York.
Pierce, F.J., Fortin, M.-C., and Staton, M.J. (1994) Periodic plowing effects on soil properties in a no-till
farming system. Journal of the American Soil Science Society, 58:1782-1787
Pierce, F.J. and Warncke, D.D. (2000) Soil and crop response to variable-rate liming in two Michigan
fields. Journal of the Soil Science Society of America, 64:774-780
Pinheiro, J.C. and Bates, D.M. (1995) Approximations to the log-likelihood function in the nonlinear
mixed-effects model. Journal of Computational and Graphical Statistics, 4:12-35.
Potthoff, R.F. and Roy, S.N. (1964) A generalized mutivariate analysis of variance model useful
especially for growth curve problems. Biometrika, 51:313-326
Prasad, N.G.N. and Rao, J.N.K. (1990) The estimation of the mean squared error of small-area
estimators. Journal of the American Statistical Association, 85:163-171
Prentice, R.L. (1988) Correlated binary regression with covariates specific to each binary observation.
Biometrics, 44:1044-1048
Press, W.H, Teukolsky, S.A., Vetterling, W.T., and Flannery, B.P. (1992) Numerical Recipes. The Art
of Scientific Computing. 2nd ed. Cambridge University Press, New York
Priebe, D.L. and Blackmer, A.M. (1989) Preferential movement of oxygen-18-labeled water and nitro-
gen-15-labeled urea through macropores in a Nicollet soil. Journal of Environmental Quality,
18:66-72
Quiring, D.P. (1941) The scale of being according to the power formula. Growth, 2:335-346
Radosevich, S.R. and Holt, J.S. (1984) Weed Ecology. John Wiley & Sons, New York
Ralston, M.L. and Jennrich, R.I. (1978) DUD, a derivative-free algorithm for nonlinear least squares.
Technometrics, 20:7-14
Rao, C.R. (1965) The theory of least squares when the parameters are stochastic and its application to
the analysis of growth curves. Biometrika, 52:447-458
Rao, C.R. and Mitra, S.K. (1971) Generalized Inverse of Matrices and its Applications. John Wiley &
Sons, New York
Rasse, D.P., Smucker, A.J.M., and Schabenberger, O. (1999) Modifications of soil nitrogen pools in res-
ponse to alfalfa root systems and shoot mulch. Agronomy Journal, 91:471-477
Ratkowsky, D.A. (1983) Nonlinear Regression Modeling. Marcel Dekker, New York
Ratkowsky, D.A. (1990) Handbook of Nonlinear Regression Models. Marcel Dekker, New York
Reed, R.R. (2000) Factors influencing biotite weathering. M.S. Thesis, Department of Crop and Soil
Environmental Sciences, Virginia Polytechnic Institute and State University (Available at
http://scholar.lib.vt.edu/theses)
Rennolls, K. (1993) Forest height growth modeling. In: Proceedings from the IUFRO Conference,
Copenhagen, June 14-17, 1993. Forskningsserien Nr. 3, 231-238
Richards, F.J. (1959) A flexible growth function for empirical use. Journal of Experimental Botany,
10:290-300
Rigas, A.G. (1991) Spectral analysis of stationary point processes using the fast Fourier transform
algorithm. Journal of Time Series Analysis. 13:441-450
Ripley, B.D. (1976) The second-order analysis of stationary point processes. Journal of Applied
Probability, 13:255-266
Ripley, B.D. (1977) Modeling spatial patterns. Journal of the Royal Statistical Society (B), 39:172-192
(with discussion, 192-212)
Ripley, B.D. (1981) Spatial Statistics. John Wiley & Sons, New York
Ripley, B.D. (1988) Statistical Inference for Spatial Processes. Cambridge University Press, Cambridge
Ripley, B.D. and Silverman, B.W. (1978) Quick tests for spatial interaction. Biometrika, 65:641-642
Roberts, H.A., Chancellor, R.J., and Hill, T.A. (1982) The biology of weeds. In: Weed Control
Handbook: Principles, 7th ed. (H.A. Roberts, ed.). Blackwell Scientific, Oxford, pp. 1-36
Robertson, T.B. (1923) The Chemical Basis of Growth and Senescence. J.P. Lippincott Co., Phila-
delphia and London
Robinson, G.K. (1991) That BLUP is a good thing: the estimation of random effects. Statistical Science,
6(1):15-51
Rohde, C.A. (1966) Some results on generalized inverses. SIAM Review, 8(2):201-205
Rubin, D.R. (1976) Inference and missing data. Biometrika, 63:581-592
Rubinstein, R.Y. (1981) Simulation and the Monte Carlo Method. John Wiley & Sons, New York
Russo, D. (1984) Design of an optimal sampling network for estimating the variogram. Journal of the
Soil Science Society of America, 48:708-716
Russo, D. and Bresler, E. (1981) Soil hydraulic properties as stochastic processes, 1. An analysis of
field spatial variability. Journal of the Soil Science Society of America, 45:682-687
Russo, D. and Jury, W.A. (1987a) A theoretical study of the estimation of the correlation scale in
spatially variable fields. 1. Stationary fields. Water Resources Research, 7:1257-1268
Russo, D. and Jury, W.A. (1987b) A theoretical study of the estimation of the correlation scale in
spatially variable fields. 2. Nonstationary fields. Water Resources Research, 7:1269-1279
Sahai, H. and Ageel, M.I. (2000) The Analysis of Variance. Fixed, Random and Mixed Models. Birk-
häuser, Boston
Sandland, R.L. (1983) Mathematics and the growth of organisms — some historical impressions.
Mathematical Scientist, 8:11-30
Sandland, R.L. and McGilchrist, C.A. (1979). Stochastic growth curve analysis. Biometrics, 35:255-272
Sandral, G.A., Dear, B.S., Pratley, J.E., and Cullis, B.R. (1997) Herbicide dose rate response curves in
subterranean clover determined by a bioassay. Australian Journal of Experimental Agriculture,
37:67-74
Satterthwaite, F.E. (1946) An approximate distribution of estimates of variance components. Biometrics,
2:110-114
Schabenberger, O. (1994) Nonlinear mixed effects growth models for repeated measures in ecology. In:
Proceedings of the Section on Statstics and the Environment, Annual Joint Statistical Meetings,
Toronto, Canada, Aug. 13-18, 1994, pp. 156-161
Schabenberger, O. (1995) The use of ordinal response methodology in forestry. Forest Science,
41(2):321-336.
Schabenberger, O. and Birch, J.B. (2001) Statistical dose-response models with hormetic effects.
International Journal of Human and Ecological Risk Assessment, 7(4):891-908
Schabenberger, O. and Gregoire, T.G. (1995) A conspectus on estimating function theory and its
applicability to recurrent modeling issues in forest biometry. Silva Fennica, 29(1):49-70
Schabenberger, O. and Gregoire, T.G. (1996) Population-averaged and subject-specific approaches for
clustered categorical data. Journal of Statistical Computation and Simulation, 54:231-253
Schabenberger, O., Gregoire, T.G., and Burkhart, H.E. (1995) Commentary: Multi-state models for
monitoring individual trees in permanent observation plots by Urfer, W., Schwarzenbach, F.H.
Kütting, J., and Müller, P. Journal of Environmental and Ecological Statistics, 1(3):171-199
Schabenberger, O., Gregoire, T.G., and Kong, F. (2000) Collections of simple effects and their relation-
ship to main effects and interactions in factorials. The American Statistician, 54:210-214
Schabenberger, O., Tharp, B.E., Kells, J.J., and Penner, D. (1999) Statistical tests for hormesis and
effective dosages in herbicide dose response. Agronomy Journal, 91:713-721
Schnute, J. and Fournier, D. (1980) A new approach to length-frequency analysis: growth structure.
Canadian Journal of Fisheries and Aquatic Science, 37:1337-1351
Schrader, R.M. and Hettmansberger, T.P. (1980) Robust analysis of variance based on a likelihood
criterion. Biometrika, 67:93-101
Schrader, R.M. and McKean, J.W. (1977) Robust analysis of variance. Communications in Statistics A,
6:879-894
Schulz, H. (1888) Über Hefegifte. Pflügers Archiv der Gesellschaft für Physiologie, 42:517-541
Schumacher, F.X. (1939) A new growth curve and its application to timber yield studies. Journal of
Forestry, 37:819-820
Schwarz, G. (1978) Estimating the dimension of a model. Annals of Statistics, 6:461-464
Schwarzbach, W. (1984) A new approach in the evaluation of field trials: The determination of the most
likely genetic ranking of varieties. Proceedings EUCARPIA Cer. Sect. Meet., Vortr. Pflanzen-
zucht, 6:249-259
Schwertman, N.C. (1996) A connection between quadratic-type confidence limits and fiducial limits.
The American Statistician, 50(3):242-243
Searle, S.R. (1971) Linear Models. John Wiley & Sons, New York
Searle, S.R. (1982) Matrix Algebra Useful for Statisticians. John Wiley & Sons, New York
Searle, S.R. (1987) Linear Models for Unbalanced Data. John Wiley & Sons, New York
Searle, S.R., Casella, G., and McCulloch, C.E. (1992) Variance Components. John Wiley & Sons, New
York
Seber, G.A.F. and Wild, C.J. (1989) Nonlinear Regression. John Wiley & Sons, New York
Seefeldt, S.S., Jensen, J.E., and Fuerst, P. (1995) Log-logistic analysis of herbicide dose-response
relationships. Weed Technology, 9:218-227
Shapiro, S.S. and Wilk, M.B. (1965) An analysis of variance test for normality (complete samples).
Biometrika, 52:591-612
Sharples, K. and Breslow, N. (1992) Regression analysis of correlated binary data: some small sample
results for the estimating equation approach. Journal of Statistical Computation and Simulation,
42:1-20
Sheiner, L.B. and Beal, S.L. (1980) Evaluation of methods for estimating population pharmacokinetic
parameters. I. Michaelis-Menten model: routine clinical pharmacokinetic data. Journal of
Pharmacokinetics and Biopharmaceutics, 8:553-571
Sheiner, L.B. and Beal, S.L. (1985) Pharmacokinetic parameter estimates from several least squares
procedures: Superiority of extended least squares. Journal of Pharmacokinetics and Biopharma-
ceutics, 13:185-201
Shinozaki, K. and Kira, T. (1956) Intraspecific competition among higher plants. VII. Logistic theory of
the C-D effect. J. Inst. Polytech. Osaka City University, D7:35-72
Snedecor, G.W. and Cochran, W.G. (1989) Statistical Methods, 8th ed. Iowa State University Press,
Ames, Iowa.
Solie, J.B., Raun, W.R., and Stone, M.L. (1999) Submeter spatial variability of selected soil and ber-
mudagrass production variables. Journal of the Soil Science Society of America, 63:1724-1733
Steel, R.G.D., Torrie, J.H., and Dickey, D.A. (1997) Principles and Procedures of Statistics. A Biomet-
rical Approach. McGraw-Hill, New York.
Stein, M.L. (1999) Interpolation of Spatial Data. Some Theory of Kriging. Springer-Verlag, New York
Stevens, W.L. (1951) Asymptotic regression. Biometrics, 7:247-267
Streibig, J.C. (1980) Models for curve-fitting herbicide dose response data. Acta Agriculturæ Scandina-
vica, 30:59-63
Streibig, J.C. (1981) A method for determining the biological effect of herbicide mixtures. Weed
Science, 29:469-473
Stroup, W.W., Baenziger, P.S., and Mulitze, D.K. (1994) Removing spatial variation from wheat yield
trials: a comparison of methods. Crop Science, 86:62-66.
Sweeting, T.J. (1980) Uniform asymptotic normality of the maximum likelihood estimator. Annals of
Statistics, 8:1375-1381
Swinton, S.M. and Lyford, C.P. (1996) A test for choice between hyperbolic and sigmoidal models of
crop yield response to weed density. Journal of Agricultural, Biological, and Environmental
Statistics, 1:97-106
Tanner, M.A. and Young, M.A. (1985) Modeling ordinal scale disagreement. Psychological Bulletin,
98:408-415
Tharp, B.E., Schabenberger, O., and Kells, J.J. (1999) Response of annual weed species to glufosinate
and glyphosate. Weed Technology, 13:542-547
Theil, H. (1971) Principles of Econometrics. John Wiley & Sons, New York
Thimann, K.V. (1956) Promotion and inhibition: twin themes of physiology. The American Naturalist,
40:145-162
Thompson, R. and Baker, R.J. (1981) Composite link functions in generalized linear models. Applied
Statistics, 30:125-131
Thornley, J.H.M. and Johnson, I.R. (1990) Plant and Crop Models. Clarendon Press, Oxford, UK
Tobler, W. (1970) A computer movie simulating urban growth in the Detroit region. Economic Geogra-
phy, 46:234-240
Tukey, J.W. (1949) One degree of freedom for nonadditivity. Biometrics, 5:232-242
Tukey, J.W. (1977) Exploratory Data Analysis. Addison-Wesley, Reading, MA
Tweedie, M.C.K. (1945) Inverse statistical variates. Nature, 155:453
Tweedie, M.C.K. (1957a) Statistical properties of inverse Gaussian distributions I. Annals of Mathemat-
ical Statistics, 28:362-377
Tweedie, M.C.K. (1957b) Statistical properties of inverse Gaussian distributions II. Annals of Mathe-
matical Statistics, 28:696-705
UNSCEAR (1958) Report of the United Nations Scientific Committee on the Effects of Atomic Radia-
tion. Official Records of the General Assembly, 13th Session, Supplement No. 17.
Upton, G.J.G. and Fingleton, B. (1985) Spatial Data Analysis by Example, Vol.1: Point Pattern and
Quantitative Data. John Wiley & Sons, New York
Urquhart, N.S. (1968) Computation of generalized inverse matrices which satisfy specified conditions.
SIAM Review, 10(2):216-218
Utomo, I.H. (1981) Weed competition in upland rice. In: Proceedings of the 8th Asian-Pacific Weed
Science Society Conference, Vol II: 101-107
Valentine, H.T. and Gregoire, T.G. (2001) A switching model of bole taper. Canadian Journal of Forest
Research. To appear
Van Deusen, P.C., Sullivan, A.D., and Matney, T.G. (1981) A prediction system for cubic foot volume
of loblolly pine applicable through much of its range. Southern Journal of Applied Forestry,
5:186-189
Verbeke, G. and Molenberghs, G. (1997) Linear Mixed Models in Practice: A SAS-oriented Approach.
Springer-Verlag, New York
Verbyla, A.P., Cullis, B.R., Kenward, M.G., and Welham S.J. (1999) The analysis of designed experi-
ments and longitudinal data by using smoothing splines. Applied Statistics, 48:269-311
Vitosh, M.L, Johnson, J.W., and Mengel, D.B. (1995) Tri-state fertilizer recommendations for corn,
soybeans, wheat and alfalfa. Michigan State University Extension Bulletin E-2567.
Von Bertalanffy, L. (1957) Quantitative laws in metabolism and growth. Quarterly Reviews in Biology,
32:217-231
Vonesh, E.F. and Carter, R.L. (1992) Mixed-effects nonlinear regression for unbalanced repeated meas-
ures. Biometrics, 48:1-17
Vonesh, E.F. and Chinchilli, V.M. (1997) Linear and Nonlinear Models for the Analysis of Repeated
Measurements. Marcel Dekker, New York
Wakeley, J.T. (1949) Annual Report of the Soils-Weather Project, 1948. University of North Carolina
(Raleigh) Institute of Statistics Mimeo Series, 19
Wallsten, T.S. and Budescu, D.V. (1981) Adaptivity and nonadditivity in judging MMPI profiles.
Journal of Experimental Psychology: Human Perception and Performance, 7:1096-1109
Walters, K.J., Hosfield, G.L., Uebersax, M.A., and Kelly, J.D. (1997) Navy bean canning quality:
correlations, heritability estimates, and randomly amplified polymorphic DNA markers associa-
ted with component traits. Journal of the American Society for Horticultural Sciences 122(3):
338-343
Wang, Y.H. (2000) Fiducial intervals: what are they? The American Statistician, 54(2):105-111
Warrick, A.W. and Myers, D.E. (1987) Optimization of sampling locations for variogram calculations.
Water Resources Research, 23:496-500
Watson, G.S. (1964). Smooth regression analysis. Sankhya (A), 26:359-372
Watts, D.G. and Bacon, D.W. (1974) Using an hyperbola as a transition model to fit two-regime
straight-line data. Technometrics, 16:369-373
Waugh, D.L., Cate Jr., R.B., and Nelson, L.A. (1973) Discontinuous models or rapid correlation, inter-
pretation, and utilization of soil analysis and fertilizer response data. International Soil Fertility
Evaluation and Improvement Program, Technical Bulletin No. 7, North Carolina State Univer-
sity, Raleigh, NC
Webster, R. and Oliver, M.A. (1992) Sample adequately to estimate variograms for soil properties.
Journal of Soil Science 43:177-192
Wedderburn, R.W.M. (1974) Quasilikelihood functions, generalized linear models and the Gauss-New-
ton method. Biometrika, 61:439-447
Welch, B.L. (1937) The significance of the difference between two means when the population varianc-
es are unequal. Biometrika, 29:350-362
White, H. (1980) A heteroskedasticity-consistent covariance matrix estimator and a direct test for
heteroskedasticity. Econometrica, 48:817-838
White, H. (1982) Maximum likelihood estimation of misspecified models. Econometrica, 50:1-25
Whittle, P. (1954) On stationary processes in the plane. Biometrika, 41:434-449
Wiebe, G.A. (1935) Variation and correlation among 1500 wheat nursery plots. Journal of Agricultural
Research, 50:331-357
Wiedman, S.J. and Appleby, A.P. (1972) Plant growth stimulation by sublethal concentrations of herbi-
cides. Weed Research, 12:65-74
Wilkinson, G.N., Eckert, S.R., Hancock, T.W., and Mayo, O. (1983) Nearest neighbor (NN) analysis of
field experiments (with discussion). Journal of the Royal Statistical Society (B), 45:152-212
Winer, B.J. (1971) Statistical Principles in Experimental Design. McGraw-Hill, New York
Wishart, J. (1938) Growth rate determinations in nutrition studies with the bacon pig, and their analysis.
Biometrika, 30:16-28
Wolfinger, R. (1993a) Covariance structure selection in general mixed models. Communications in
Statistics, Simulation and Computation, 22(4):1079-1106
Wolfinger, R. (1993b) Laplace's approximation for nonlinear mixed models. Biometrika, 80:791-795
Wolfinger, R. and O'Connell, M. (1993) Generalized linear mixed models: a pseudo-likelihood
approach. Journal of Statistical Computation and Simulation, 48:233-243
Wolfinger, R., Tobias, R., and Sall, J. (1994) Computing Gaussian likelihoods and their derivatives for general linear mixed models. SIAM Journal on Scientific Computing, 15:1294-1310
Xu, W., Tran, T., Srivastava, R., and Journel, A.G. (1992) Integrating seismic data in reservoir modeling: the collocated cokriging alternative. SPE Paper 24742, 67th Annual Technical Conference and Exhibition
Yandell, B.S. (1997) Practical Data Analysis for Designed Experiments. Chapman and Hall, New York
Yates, F. (1936) Incomplete randomized blocks. Annals of Eugenics, 7:121-140
Yates, F. (1940) The recovery of inter-block information in balanced incomplete block designs. Annals
of Eugenics, 10:317-325
Zeger, S.L. and Harlow, S.D. (1987) Mathematical models from laws of growth to tools for biological
analysis: fifty years of Growth. Growth, 51:1-21
Zeger, S.L. and Liang, K.-Y. (1986) Longitudinal data analysis for discrete and continuous outcomes.
Biometrics, 42:121-130
Zeger, S.L. and Liang, K.-Y. (1992) An overview of methods for the analysis of longitudinal data. Sta-
tistics in Medicine, 11:1825-1839
Zeger, S.L., Liang, K.-Y., and Albert, P.S. (1988) Models for longitudinal data: a generalized estimating
equation approach. Biometrics, 44:1049-1060
Zhao, L.P. and Prentice, R.L. (1990) Correlated binary regression using a quadratic exponential model.
Biometrika, 77:642-648
Zheng, L. and Silliman, S.E. (2000) Estimating the theoretical semivariogram from finite numbers of
measurements. Water Resources Research, 36:361-366
Zimdahl, R.L. (1980) Weed-Crop Competition: A Review. International Plant Protection Center, USA
Zimmerman, D.L. (1989) Computationally exploitable structure of covariance matrices and generalized
covariance matrices in spatial models. Journal of Statistical Computation and Simulation, 32:
1-15
Zimmerman, D.L. and Harville, D.A. (1991) A random field approach to the analysis of field-plot experiments and other spatial experiments. Biometrics, 47:223-239
Zimmerman, D.L. and Núñez-Antón, V. (1997) Structured antedependence models for longitudinal
data. In: Modelling Longitudinal and Spatially Correlated Data (Gregoire, T.G., Brillinger, D.R.,
Diggle, P.J., Russek-Cohen, E., Warren, W.G., and Wolfinger, R.D., eds). Springer-Verlag, New
York, pp. 63-76
Zimmerman, D.L. and Zimmerman, M.B. (1991) A comparison of spatial semivariogram estimators and
corresponding kriging predictors. Technometrics, 33:77-91